Today I'm going to share further results of my research into the statistical arbitrage technique of pair trading. In the first part of pair trading with S&P 500 companies I used data downloaded from Yahoo to identify cointegrated pairs.
The next stage is to take a closer look at the results and identify potentially profitable pairs. As an example, take this chart:
The ADF test identified this spread as stationary. Obviously this is not the type of spread I would like to trade. The question is: how do we select the most interesting pairs from the big pile?
First of all, here are a few numbers:
 Total number of pairs: 124 251
 Cointegrated pairs: 28 871
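As a quick sanity check on these counts, the total matches the number of unordered pairs among 499 tickers. The figure 499 is inferred from the total and is not stated in the post, so treat it as an assumption.

```r
# Assumption: 499 tickers had usable data (inferred from the total, not stated).
n_stocks    <- 499
total_pairs <- choose(n_stocks, 2)       # unordered pairs with j < i
coint_pairs <- 28871                     # count quoted in the post
share       <- coint_pairs / total_pairs # fraction flagged as cointegrated
cat(total_pairs, round(100 * share, 1), "\n")
```

So roughly 23% of all candidate pairs pass the stationarity test, which is why a further filter is needed.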
That's quite a lot! So how do we find the interesting ones?
This is the tricky question, and I am not very satisfied with the solution I found. Your comments are VERY welcome!
There is a fantastic function in R called summary() which calculates the minimum, first quartile, median, mean, third quartile and maximum. What I expect from a "good" spread is that its 1st quartile is greater than -1 and its 3rd quartile is less than 1. This little trick discards cases like the one displayed on the chart above.
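The quartile check can be sketched on its own as follows. This is a minimal illustration on simulated data, not the post's spreads: the names `passes_quartile_filter`, `tight` and `wide` are hypothetical, and quantile() is used here because it returns a plain numeric vector (with default settings it gives the same quartiles that summary() reports, up to display rounding).

```r
set.seed(1)
# Hypothetical spreads (not the post's data): one tight, one wide.
tight <- rnorm(500, mean = 0, sd = 0.8)
wide  <- rnorm(500, mean = 0, sd = 6)

# TRUE when the 1st quartile is above `lower` and the 3rd quartile is below `upper`
passes_quartile_filter <- function(sprd, lower = -1, upper = 1) {
  q <- quantile(sprd, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  q[1] > lower && q[2] < upper
}

keep_tight <- passes_quartile_filter(tight)  # spread mostly inside (-1, 1)
keep_wide  <- passes_quartile_filter(wide)   # spread mostly outside (-1, 1)
```

Note that the bounds are in price units, so the filter favours spreads between low-priced stocks; that is a property of the rule as described, not something this sketch corrects.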
The next step is ordering the pair spreads according to their standard deviation. The higher the standard deviation the better for us, because a highly volatile spread creates more opportunities to trade it. I also calculated the z-score (I hope it is calculated this way...) but did not use it. And what is the result? After the elimination we have 1210 pairs! Sounds good.
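A small sketch of the ranking step, again on simulated spreads rather than the post's data (the matrix `spreads` and its column names are made up for illustration). It also shows what the score computed in the source code amounts to: the mean absolute standardised deviation, a single number per spread. For roughly normal data that number sits near sqrt(2/pi), about 0.8, regardless of the spread, which may be why it did not help with ranking. The per-observation z-score at the end is the form usually compared against entry thresholds.

```r
set.seed(42)
# Hypothetical spreads for three pairs (one per column); not the post's data.
spreads <- cbind(a = rnorm(250, sd = 0.5),
                 b = rnorm(250, sd = 1.5),
                 c = rnorm(250, sd = 1.0))

sprd_sd <- apply(spreads, 2, sd)               # volatility of each spread
rank_id <- order(sprd_sd, decreasing = TRUE)   # most volatile first
ranked  <- colnames(spreads)[rank_id]

# Score as computed in the post: mean absolute standardised deviation
post_score <- apply(spreads, 2, function(s) mean(abs(scale(s))))

# Per-observation z-score of one spread
z_b <- (spreads[, "b"] - mean(spreads[, "b"])) / sd(spreads[, "b"])
```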
Here is an example of a "good" spread. (The top plot is price j vs. beta[j,i] * price i; the middle one is the spread over the learning period and the bottom one is the spread over the testing period.)
What I am going to do next is to test the best candidates over the testing period and see whether this strategy is profitable or not. Of course, the only information used to determine the best candidates is the spread over the learning period. If you have any questions or suggestions, please feel free to comment. See you in the next part!
Here is the source code. It's not efficient to run the double for loop again, but I use it here for better illustration. Enjoy!
    # load the libraries
    library(quantmod)
    library(tseries)
    library(timeDate)
    library(fUnitRoots)

    load(file = "/home/robo/Dropbox/work/FX/PairTrading/cointeg.RData")

    # prepare variables
    zscore <- 0
    rscore <- matrix(data = NA, ncol = 4, nrow = (nrStocks^2)/2)
    pairSummary <- matrix(data = NA, ncol = 6, nrow = (nrStocks^2)/2)
    ii <- 1

    # lets evaluate the spreads
    for (j in 1:(nrStocks - 1)) {
      for (i in (j + 1):nrStocks) {

        # if no data, skip
        if (is.na(ht[j, i])) {
          next
        }

        # is spread stationary (i.e. pair is cointegrated)
        if (ht[j, i] < 0.02) {

          sprd <- z[, j] - beta[j, i] * z[, i]
          sprd <- na.omit(sprd)

          # calculate zscore
          zscore <- sum(abs(scale(sprd))) / length(sprd)
          rscore[ii, 3] <- sd(sprd)
          rscore[ii, 4] <- zscore
          rscore[ii, 1] <- j
          rscore[ii, 2] <- i

          pairSummary[ii, ] <- summary(coredata(sprd))[1:6]
          ii <- ii + 1
        }
      }
      cat("Calculating ", j, "\n")
    }

    #save(list = ls(all=TRUE), file = "/home/robo/Dropbox/work/FX/PairTrading/analysis4.RData")
    #load(file = "/home/robo/Dropbox/work/FX/PairTrading/analysis4.RData")

    # set up boundaries for 1st and 3rd quartiles
    badSprd_up <- 1
    badSprd_down <- -1

    # reorder spreads
    rscore <- na.remove(rscore)
    pairSummary <- na.remove(pairSummary)

    order_id <- order(rscore[, 3], decreasing = TRUE)
    rscore <- rscore[order_id, ]
    pairSummary <- pairSummary[order_id, ]

    # keep spreads whose 1st quartile (col 2) and 3rd quartile (col 5) are within bounds
    goodSprd_id <- (pairSummary[, 2] > badSprd_down) & (pairSummary[, 5] < badSprd_up)
    rscore <- rscore[goodSprd_id, ]
    pairSummary <- pairSummary[goodSprd_id, ]

    sddist <- 2
    boundary <- 4.5

    for (pos in 1:length(rscore[, 1])) {
      j <- rscore[pos, 1]
      i <- rscore[pos, 2]

      sprd <- na.omit(z[, j] - beta[j, i] * z[, i])
      sprdTest <- na.omit(zTest[, j] - beta[j, i] * zTest[, i])

      sprd_mean <- mean(sprd, na.rm = TRUE)
      sprd_sd <- sd(sprd, na.rm = TRUE)
      lb <- sprd_mean - boundary * sprd_sd
      ub <- sprd_mean + boundary * sprd_sd

      par(mfrow = c(3, 1))

      # top: price j vs. beta-scaled price i
      plot(z[, j], type = "l", col = "blue")
      title(main = rscore[pos, 1:2])
      points(beta[j, i] * z[, i], type = "l", col = "red")

      # middle: spread over the learning period
      plot(sprd, ylim = c(lb, ub))
      abline(h = (sprd_mean - sddist * sprd_sd), col = "red")
      abline(h = (sprd_mean + sddist * sprd_sd), col = "red")

      # bottom: spread over the testing period
      plot(sprdTest, ylim = c(lb, ub))
      abline(h = (sprd_mean - sddist * sprd_sd), col = "red")
      abline(h = (sprd_mean + sddist * sprd_sd), col = "red")

      #Sys.sleep(1)
      readline()
    }

    #save(list = ls(all=TRUE), file = "/home/robo/workspace/RTest/analysis2.RData")

Hi,
I am working on a pair strategy for the Indian market. I have tested cointegration for all possible liquid pairs and found that hardly 5-10 pairs pass the ADF test.
I want to replace ADF with the Variance Ratio Test and trade on the z-score for highly correlated pairs. How does this sound?
The idea is that the Variance Ratio Test will tell us how much time it will take to mean revert, and with the help of the z-score we can enter the trade.
Thanks

I may be missing something, but it seems to me that you need to be using the percent gain and loss rather than the stock prices for this exercise. At least with regards to the charting of cointegrated pairs, I think you are missing out because of the lack of a common metric. The use of percent gain and loss would provide such a common metric, expanding the universe of stocks that could be considered cointegrated.

Hello. Thanks for sharing this and other (future) blog posts.
Can you also share the exact code you used to produce cointeg.RData? I have read http://blog.quanttrader.org/2011/03/downloadingsp500datator/ but could you please share the exact code which writes cointeg.RData into some folder? Thnx.
Best,
Samo.
Hi QuantTrader,
Can you explain to me what you mean by the following?
"What I expect from "good" spread is that its 1st and 3rd quartile to be less than -1 and 1, respectively."
Very few spreads stay between -1 and 1, and in fact the one you display above does not: the good spread you plot above ranges between approximately -5 and 5. Do you instead mean that the z-scored spread stays within -1 and 1? In which case, your code does not do that.
Additionally, how do you use summary() to get the statistics you need? For me, summary() outputs data in string format. Correct me if I'm wrong, but I think the function you are looking for is quantile(), which outputs a numeric vector.
Lastly, it seems you might be getting aberrant phenomena and weird spreads as a result of using late 2008 and early 2009 data. Markets were extremely distorted during these times, and this is likely throwing off your results. One thought is that you might want to use intraday data to boost the number of observations while staying within the current market regime. There are several vendors that provide intraday minute-bar data on the cheap.
Awesome stuff! I look forward to reading more...

I think there is one problem here. You derived the beta parameter from the whole period and then computed the spread for each day backwards. In real life you don't know the future spread, so you need to update the beta parameter every day. In that case the results can be different.
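One way to sketch the daily update suggested here is a rolling regression that only ever sees past observations. Everything below is an assumption made for illustration: the simulated series `x` and `y`, the 60-day `window`, and the use of lm() with an intercept (the post does not show how its beta matrix was estimated).

```r
set.seed(7)
n <- 300
# Hypothetical price series: y tracks 1.5 * x plus stationary noise.
x <- 50 + cumsum(rnorm(n))
y <- 1.5 * x + rnorm(n, sd = 0.5)

window    <- 60                       # look-back length (an arbitrary choice)
beta_roll <- rep(NA_real_, n)
for (k in (window + 1):n) {
  idx <- (k - window):(k - 1)         # strictly past observations only
  beta_roll[k] <- coef(lm(y[idx] ~ x[idx]))[2]
}

# Spread at day k uses only the beta estimated from data before day k
sprd_roll <- y - beta_roll * x
```

The first `window` values of `beta_roll` stay NA because there is not yet enough history to estimate a hedge ratio.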

I think I would try to establish some "null hypothesis", e.g. perform ADF Test or even backtesting on some randomly generated stock charts. These could be done using stochastic chaos, or geometric random walk, or you could resample daily stock price gains to form new charts.
Then you take the resulting distributions and compare them with the pairs you are generating. That should give you an idea of what to expect when testing a lot of stocks without any spread effect at all. You probably can't avoid getting some "false positives", even if you compare longer time spans. Many traders seem to use "common sense" as a way to avoid overfitting or choosing false positives: only trade pairs if the pairing makes some sense at all. The combined probability of the spread making sense and achieving high profitability would make you sleep better. I don't know if that's a good way to go about it. You can see sense where there is none and overlook it where it does indeed exist...
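A rough sketch of such a null experiment, using only base R: simulate pairs of independent random walks, fit a hedge ratio, and look at the distribution of a simple Dickey-Fuller statistic on the fitted spread. The helper `df_stat` is a hand-rolled, no-lag version written for this illustration and is not the ADF routine used in the post; the series length, number of simulations and the -2.86 cutoff (the approximate 5% Dickey-Fuller critical value with a constant) are likewise assumptions.

```r
set.seed(123)
n_obs  <- 250   # length of each simulated series (arbitrary)
n_sims <- 300   # number of simulated pairs (arbitrary)

# Dickey-Fuller t-statistic (no lags): regress diff(s) on the lagged level
df_stat <- function(s) {
  ds    <- diff(s)
  s_lag <- s[-length(s)]
  summary(lm(ds ~ s_lag))$coefficients["s_lag", "t value"]
}

null_stats <- replicate(n_sims, {
  a <- cumsum(rnorm(n_obs))        # two independent random walks,
  b <- cumsum(rnorm(n_obs))        # so there is no true cointegration
  sprd <- residuals(lm(a ~ b))     # spread from a fitted hedge ratio
  df_stat(sprd)
})

# Empirical 5% quantile of the statistic when no relationship exists
crit_5 <- quantile(null_stats, 0.05, names = FALSE)
# Share of unrelated pairs that look "stationary" at the textbook cutoff
false_pos <- mean(null_stats < -2.86)
```

Because the hedge ratio is fitted before testing, the spread looks more stationary than a raw series would, so the empirical cutoff `crit_5` ends up more negative than the textbook value and `false_pos` exceeds 5%. A real pair would need a statistic beyond `crit_5` to stand out from this null.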

