Pair-Trading with S&P500 Companies - Part II.

Today I'd like to share further results of my research into a statistical-arbitrage trading technique: pair-trading. In the first part of Pair-Trading with S&P500 Companies I used data downloaded from Yahoo to identify co-integrated pairs.

The next stage is to take a closer look at the results and identify potentially profitable pairs. As an example, take this chart:

[Figure: Bad spread]

The ADF test identified this spread as stationary. Obviously, this is not the type of spread I would like to trade. The question is: how do I select the most interesting pairs from the big pile?

First of all, here are a few numbers:

  • Total number of pairs:  124 251
  • Co-integrated pairs:  28 871

That's quite a lot! So how do we find the good ones?

This is the tricky part, and I am not entirely satisfied with the solution I found. Your comments are VERY welcome!

There is a fantastic function in R called summary() which calculates the minimum, first quartile, median, mean, third quartile and maximum. What I expect from a "good" spread is that its 1st quartile is greater than -1 and its 3rd quartile is less than 1; in other words, at least half of the spread's values lie within [-1, 1]. This little trick discards cases like the one displayed in the chart above.
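
To make the filter concrete, here is a minimal sketch, assuming sprd holds one pair's spread series as in the full source at the bottom of this post:

# summary() of a numeric vector returns, in this order:
# Min., 1st Qu., Median, Mean, 3rd Qu., Max.
s <- summary(as.vector(coredata(sprd)))
isGood <- (s[2] > -1) & (s[5] < 1)   # 1st quartile above -1, 3rd quartile below 1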

The next step is ordering the pair spreads according to their standard deviation. The higher the standard deviation, the better for us, because a highly volatile spread creates more opportunities to trade it. I also calculated a z-score (I hope it is calculated correctly...) but did not use it. And what is the result? After the elimination we have 1210 pairs! Sounds good.
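
In code, the ranking is a one-liner; the sketch below mirrors the full source at the bottom of this post, where column 3 of rscore holds each spread's standard deviation, and also shows how I computed the z-score (the mean absolute standardized deviation of the spread):

zscore <- sum(abs(scale(coredata(sprd))))/length(sprd)   # mean |z|; computed but not used
order_id <- order(rscore[, 3], decreasing = TRUE)        # column 3 = sd of each spread
rscore <- rscore[order_id, ]                             # most volatile spreads first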

Here is an example of a "good" spread. (The top plot is price j vs. beta[j,i] * price i; the middle one is the spread over the learning period and the bottom one is the spread over the testing period.)

[Figure: Good spread]

What I am going to do next is test the best candidates over the testing period and see whether this strategy is profitable or not. Of course, the only information used to pick the best candidates is the spread over the learning period. If you have any questions or suggestions, please feel free to comment. See you in the next part!

Here is the source code. It's not efficient to run the double for loop again, but I use it here for better illustration. Enjoy!

 

# load the libraries
library(quantmod)
library(tseries)
library(timeDate)
library(fUnitRoots)
 
# co-integration results from Part I (ht, beta, z, zTest, nrStocks)
load(file = "/home/robo/Dropbox/work/FX/PairTrading/cointeg.RData")
 
# prepare variables
zscore <- 0
rscore <- matrix(data = NA, ncol = 4, nrow = (nrStocks^2)/2)       # columns: j, i, sd, z-score
pairSummary <- matrix(data = NA, ncol = 6, nrow = (nrStocks^2)/2)  # the six summary() statistics
 
ii <- 1
 
# lets evaluate the spreads
for (j in 1:(nrStocks-1)) {
	for (i in (j+1):nrStocks) {
 
                # if no data, skip
		if (is.na(ht[j, i])) {
			next
		}
 
                # is the spread stationary (i.e. pair co-integrated)?
                # ht holds the ADF p-values from Part I
		if (ht[j, i] < 0.02) {
 
			sprd <- z[,j] - beta[j, i]*z[,i]
			sprd <- na.omit(sprd)
 
                        # "z-score": mean absolute standardized deviation of the spread
			zscore <- sum(abs(scale((sprd))))/length(sprd)
                        rscore[ii, 3] <- sd((sprd))
			rscore[ii, 4] <- zscore
			rscore[ii, 1] <- j
			rscore[ii, 2] <- i
 
                        pairSummary[ii, ] = summary(coredata(sprd))[1:6]
			ii <- ii + 1
		}
	}
 
	cat("Calculating ", j, "\n")
}
 
#save(list = ls(all=TRUE), file = "/home/robo/Dropbox/work/FX/PairTrading/analysis4.RData")
#load(file = "/home/robo/Dropbox/work/FX/PairTrading/analysis4.RData")
 
# set up boundaries for 1st and 3rd quartiles
badSprd_up <- 1
badSprd_down <- -1
 
# drop the unused (all-NA) rows
rscore <- na.remove(rscore)
pairSummary <- na.remove(pairSummary)
 
# re-order spreads by decreasing standard deviation
order_id <- order(rscore[,3], decreasing = TRUE)
rscore <- rscore[order_id,]
pairSummary <- pairSummary[order_id,]
 
# keep pairs whose 1st quartile (summary index 2) is above badSprd_down
# and whose 3rd quartile (index 5) is below badSprd_up
goodSprd_id <- (pairSummary[, 2] > badSprd_down) & (pairSummary[, 5] < badSprd_up)
 
rscore <- rscore[goodSprd_id, ]
pairSummary <- pairSummary[goodSprd_id, ]
 
sddist <- 2       # trading bands at +/- 2 standard deviations
boundary <- 4.5   # y-axis limits at +/- 4.5 standard deviations
 
for (pos in 1:nrow(rscore)) {
	j <- rscore[pos, 1]
	i <- rscore[pos, 2]
 
	sprd <- na.omit(z[,j] - beta[j, i]*z[,i])
	sprdTest <- na.omit(zTest[,j] - beta[j, i]*zTest[,i])
 
        sprd_mean = mean(sprd, na.rm = T)
        sprd_sd = sd(sprd, na.rm = T)
 
        lb = sprd_mean - boundary*sprd_sd
        ub = sprd_mean + boundary*sprd_sd
 
	par(mfrow=c(3,1))
        plot(z[, j], type = "l", col = "blue")
        title(main = paste(rscore[pos, 1:2], collapse = " - "))   # pair indices j, i
	points(beta[j, i]*z[, i], type = "l", col = "red")
 
        plot(sprd, ylim = c(lb, ub))
	abline(h = (sprd_mean - sddist*sprd_sd), col = "red")
	abline(h = (sprd_mean + sddist*sprd_sd), col = "red")
 
        plot(sprdTest, ylim = c(lb, ub))
	abline(h = (sprd_mean - sddist*sprd_sd), col = "red")
	abline(h = (sprd_mean + sddist*sprd_sd), col = "red")
 
	#Sys.sleep(1)
        readline()   # press <Enter> to advance to the next pair
}
 
#save(list = ls(all=TRUE), file = "/home/robo/workspace/R-Test/analysis2.RData")


  1. Abhay

    Hi,

    I am working on a pair strategy on the Indian market. I have run co-integration tests for all possible liquid pairs and found that hardly 5-10 pairs pass the ADF test.

    I wanted to replace the ADF test with a Variance Ratio Test and trade on the z-score for highly correlated pairs. How does this sound?

    The idea is that the Variance Ratio Test will tell us how much time it'll take to mean-revert, and with the help of the z-score we can enter the trade.

    Thanks


  2. David Stewart

    I may be missing something, but it seems to me that you need to be using percent gains and losses rather than stock prices for this exercise. At least with regard to the charting of co-integrated pairs, I think you are missing out because of the lack of a common metric. Using percent gains and losses would provide such a common metric, expanding the universe of stocks that could be considered co-integrated.


  3. Samo

    Hello. Thanks for sharing this and other (future) blog posts.

    Can you also share the exact code for how you arrived at cointeg.RData? I have read http://blog.quanttrader.org/2011/03/downloading-sp-500-data-to-r/ but could you please share the exact code which outputs cointeg.RData into some folder? Thanks.

    Best,
    Samo.


    1. QuantTrader

      Hi Samo,

      I am sure you can find the code here: http://blog.quanttrader.org/2011/03/pair-trading-with-sp500-companies-part-i/

      But if you want a more recent and updated version, you should download the code from the bottom of the post here:

      http://blog.quanttrader.org/2011/04/optimizing-my-r-code/


    2. agathocles

      Hi QuantTrader,

      Can you explain to me what you mean by the following?

      "What I expect from "good" spread is that its 1st and 3rd quartile to be less than -1 and 1, respectively."

      Very few spreads stay between -1 and 1, and in fact the one you display above does not. The good spread you plot above ranges between approximately -5 and 5. Do you instead mean that the z-scored spread stays within -1 and 1? In that case, your code does not do that.

      Additionally, how do you use summary() to get the statistics you need? For me, summary() outputs data in string format. Correct me if I'm wrong, but I think the function you are looking for is quantile(), which outputs a numeric vector.

      Lastly, it seems you might be getting aberrant phenomena and weird spreads as a result of using late-2008 and early-2009 data. Markets were extremely distorted during those times, and it is likely throwing off your results. One thought is that you might want to use intraday data to boost the number of observations but stay within the current market regime. There are several vendors that provide intraday minute-bar data on the cheap.

      Awesome stuff! I look forward to reading more...


      1. QuantTrader

        hi agathocles!

        thank you for your comment! That quote means that I expect a "good" spread to be more concentrated around zero. The graph examples are a better way to understand that. Secondly, it means that at least half of the spread's values lie inside that interval. It doesn't mean that the WHOLE spread must be within -1 and 1.

        The output of summary() is not in string format. It's a structured variable, and I extract only the numeric values from it: summary(coredata(sprd))[1:6] (note the [1:6] after the summary() call). You can use the quantile() function, but this way you have the min, max, 1st and 3rd quartiles, mean and median in one place.
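
        A quick illustration, with rnorm() as a stand-in for coredata(sprd):

        x <- rnorm(100)        # stand-in for coredata(sprd)
        s <- summary(x)[1:6]   # named numeric vector: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
        s["1st Qu."] > -1      # the kind of check used in the post
        s["3rd Qu."] < 1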

        Yes, 2008 and 2009 were pretty weird years, but as you can see, there are lots of co-integrated pairs anyway.


      2. kafka

        I think there is one problem here. You derived the beta parameter from the whole period and then computed the spread for each day backwards. In real life you don't know the future spread, so you need to update the beta parameter every day. In that case the results can be different.
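
        Something like this, perhaps (a hypothetical sketch reusing z, i and j from the post's code; the 100-day minimum window is an arbitrary choice):

        zj <- coredata(z[, j]); zi <- coredata(z[, i])
        n <- length(zj)
        sprdOOS <- rep(NA, n)
        for (t in 100:(n - 1)) {
            b <- coef(lm(zj[1:t] ~ zi[1:t] + 0))[1]       # beta from data up to day t only
            sprdOOS[t + 1] <- zj[t + 1] - b * zi[t + 1]   # next day's out-of-sample spread
        }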


        1. QuantTrader

          hi kafka,

          thank you for your comment! That's the reason why I used a "learning" and a "testing" period. This way I am "simulating" one day of trading: I identify co-integrated pairs over the learning period and then I am going to test trading them over the testing period. The bottom chart is just for illustration, to show what the spread looked like over the testing period with the beta from the learning period. All I am doing here is finding the "most suitable" pairs for trading (from a historical point of view), not testing the trading itself.

          PS: I read a few posts from your blog and it seems very interesting!


        2. Andreas

          I think I would try to establish some "null hypothesis", e.g. perform the ADF test or even backtesting on some randomly generated stock charts. These could be generated using stochastic chaos or a geometric random walk, or you could resample daily stock price gains to form new charts.

          Then you take the resulting distributions and compare them with the pairs you are generating. That should give you an idea of what to expect when testing a lot of stocks without any spread effect at all. Probably you can't avoid getting some "false positives", even if you compare longer time spans. Many traders seem to use "common sense" as a way to avoid overfitting and choosing false positives: only trade pairs if the pairing makes some sense at all. The combined probability of the spread making sense and achieving high profitability would let you sleep better. I don't know if that's a good way to go about it. You can see sense where there is none, and overlook it where it does exist...
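
          For example, a minimal sketch of that experiment, assuming independent geometric random walks and the post's 0.02 p-value threshold:

          library(tseries)
          set.seed(1)
          nSim <- 500; nDays <- 250
          pvals <- replicate(nSim, {
              x <- cumsum(rnorm(nDays, sd = 0.01))   # log-price random walk
              y <- cumsum(rnorm(nDays, sd = 0.01))   # independent random walk
              b <- coef(lm(x ~ y + 0))[1]            # spurious hedge ratio
              adf.test(x - b * y, alternative = "stationary")$p.value
          })
          mean(pvals < 0.02)   # share of spurious "co-integrated" pairs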


          1. QuantTrader

            Hi Andreas,

            thanks for your visit and comment! The ADF testing against the null hypothesis of a unit root is performed in Part I of this tutorial series.

            Well, that would require generating 499 random walks with corresponding correlations. Could you please explain in more detail what the purpose of this idea is?

            I am sure that this way we will identify pairs which are "physically" co-integrated and best suited for pair-trading.


            1. HoRoWo

              Is your code on rows 62 and 63 correct?


              1. QuantTrader

                Hi, it was not correct. I have replaced the symbol codes with the symbols themselves. Thank you.

