Pair-Trading with S&P500 Companies - Part I.

In my recent post I wrote the code to download historical data for the companies included in the S&P500 index. Today I would like to run a statistical procedure to identify whether a given pair of stocks is co-integrated or not.

Since there are approximately 500 companies, I will need to test $\binom{500}{2} = 124\,750$ pairs. First of all, I will divide my data into two parts: a learning period and a testing period. I will find co-integrated pairs in the learning period and test the pair-trading strategy in the testing period.
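As a quick sanity check, the pair count and the split can be sketched in base R. The names `nStocks`, `learnIdx` and `testIdx` and the assumed history length of 1000 days are illustrative only; the 500/250-day window lengths mirror the script further down.

```r
# Number of distinct pairs among 500 stocks
nStocks <- 500
nPairs <- choose(nStocks, 2)
print(nPairs)  # 124750

# Enumerate the pairs for a small universe of 4 hypothetical stocks
pairs <- t(combn(4, 2))   # one row per pair, columns: j, i with j < i
print(pairs)

# Split the day indices into a learning window followed by a testing window
nDays <- 1000             # assumed length of the price history
testPeriod <- 250
learningPeriod <- 500
testIdx <- (nDays - testPeriod + 1):nDays
learnIdx <- (nDays - testPeriod - learningPeriod + 1):(nDays - testPeriod)
```

Keeping the two index ranges disjoint means no day used for finding pairs is reused when the strategy is evaluated.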

Secondly, I will use linear regression to calculate the spread (the residuals) of the two corresponding stocks. To find out whether these two stocks are co-integrated or not, I will perform the Augmented Dickey-Fuller (ADF) unit root test on the spread. The null hypothesis of this test is that the spread has a unit root, i.e. that it is non-stationary, so a small p-value rejects the null and points to a stationary spread and hence a co-integrated pair. I will save the p-value of this test in a matrix. I will use the ADF test from the fUnitRoots package.
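For a single pair, the two steps can be sketched as follows. The prices are simulated, not real stocks: `x` is a random walk and `y` is built to follow it with a hedge ratio of 2, so the pair is co-integrated by construction. The variable names are assumptions; `adfTest()` with `type = "nc"` is the call the script further down uses.

```r
library(fUnitRoots)  # provides adfTest()

set.seed(42)
n <- 500

# Simulated prices (made-up data): x is a random walk,
# y follows x with hedge ratio 2 plus stationary noise
x <- 100 + cumsum(rnorm(n))
y <- 2 * x + rnorm(n)

# Regression through the origin: y = beta * x + spread
m <- lm(y ~ x + 0)
beta <- unname(coef(m)[1])
sprd <- resid(m)

# ADF test without constant or trend. H0: the spread has a unit root.
# suppressWarnings() hides the warning adfTest() raises when the
# p-value lies outside its tabulated range.
pValue <- suppressWarnings(adfTest(sprd, type = "nc")@test$p.value)

cat("beta:", round(beta, 3), " p-value:", pValue, "\n")
```

A p-value below the chosen significance level (0.05, say) rejects the unit root, which is what one would hope to see for a tradable pair.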

And finally, I will apply this procedure to all the pairs. Uff... that means a lot of computational time, so take a break and have a cup of great coffee. See you in the next part of this tutorial.
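Part of that computing time can probably be saved. For a regression through the origin the slope has the closed form beta = sum(y * x) / sum(x^2), so all hedge ratios can be read off one cross-product matrix instead of calling lm() once per pair. The sketch below uses simulated data and assumes a complete price matrix without missing values; `Z`, `G` and `betaAll` are hypothetical names. The ADF test still has to be run pair by pair.

```r
set.seed(1)
nDays <- 500
nStocks <- 5

# Simulated price matrix: one column per hypothetical stock, no NAs
Z <- 50 + apply(matrix(rnorm(nDays * nStocks), nDays, nStocks), 2, cumsum)

# G[j, i] = sum over t of Z[t, j] * Z[t, i]
G <- crossprod(Z)

# betaAll[j, i] = slope of lm(Z[, j] ~ Z[, i] + 0) = G[j, i] / G[i, i]
betaAll <- sweep(G, 2, diag(G), "/")

# Check one pair against lm()
j <- 2; i <- 4
betaLm <- unname(coef(lm(Z[, j] ~ Z[, i] + 0))[1])
print(all.equal(betaAll[j, i], betaLm))

# Spread of that pair, ready to be passed to the ADF test
sprd <- Z[, j] - betaAll[j, i] * Z[, i]
```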

Feel free to use the source code. Your valuable comments are more than welcome!

 

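# start from a clean workspace
rm(list = ls(all = TRUE))

library(quantmod)
library(tseries)
library(timeDate)
library(fUnitRoots)

# the .RData file has to provide the price matrix z (one column per stock)
# and the number of stocks nrStocks
load(file = "/home/robo/workspace/R-Test/abcdeXXX.RData")

# ht[j, i]   : p-value of the ADF test on the spread of pair (j, i)
# beta[j, i] : hedge ratio from regressing stock j on stock i
ht <- matrix(data = NA, ncol = nrStocks, nrow = nrStocks)
beta <- matrix(data = NA, ncol = nrStocks, nrow = nrStocks)

z_old <- z
nDays <- nrow(z)

# setting learning and testing periods (in trading days)
testPeriod <- 250
learningPeriod <- 500

# the last testPeriod days are used for testing, the learningPeriod days
# before them for learning; the two index ranges do not overlap
testDates <- (nDays - testPeriod + 1):nDays
learningDates <- (nDays - testPeriod - learningPeriod + 1):(nDays - testPeriod)

data <- z
z <- data[learningDates, ]
zTest <- data[testDates, ]

# here we go! let's find the cointegrated pairs
for (j in 1:(nrStocks - 1)) {
	for (i in (j + 1):nrStocks) {

		cat("Calculating ", j, " - ", i, "\n")

		# skip pairs where one series has no data at all;
		# beta[j, i] and ht[j, i] then stay NA
		if (length(na.omit(z[, i])) == 0 || length(na.omit(z[, j])) == 0) {
			next
		}

		# regression through the origin: z[, j] = beta * z[, i] + spread
		m <- lm(z[, j] ~ z[, i] + 0)
		beta[j, i] <- coef(m)[1]

		# the spread is the residual series of the regression
		sprd <- resid(m)

		# ADF test without constant or trend; a small p-value rejects the unit root
		ht[j, i] <- adfTest(na.omit(coredata(sprd)), type = "nc")@test$p.value

	}
}

#save(list = ls(all=TRUE), file = "/home/robo/Dropbox/work/FX/PairTrading/cointeg.RData")
#######################################################################################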
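The script relies on two objects from the loaded .RData file: `z` and `nrStocks`. Their exact layout is not shown in this post. A minimal sketch, assuming `z` holds one row per trading day and one column per stock and `nrStocks` is its column count, could build a stand-in file like this. The tickers, the simulated prices and the temporary file path are made up for illustration.

```r
set.seed(7)
tickers <- c("AAA", "BBB", "CCC")   # hypothetical tickers
nDays <- 800                        # must exceed learningPeriod + testPeriod = 750

# Simulated prices: one row per trading day, one column per stock
z <- 100 + apply(matrix(rnorm(nDays * length(tickers)), nDays), 2, cumsum)
colnames(z) <- tickers
nrStocks <- ncol(z)

# Save both objects and load them back, as the script does with load()
rdataFile <- file.path(tempdir(), "pairs_demo.RData")
save(z, nrStocks, file = rdataFile)

rm(z, nrStocks)
load(file = rdataFile)
str(z)
```

Since the script calls coredata(), the original `z` is probably a zoo or xts object; a plain matrix is used here only to keep the sketch free of extra dependencies.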


  1. coiler

    Hi,

    How do you prepare the data series? Could you give an example of how the CSV files have to be formatted to run the script?

    Regards.

    1. buddha

      Could you share the formatting for the data required for this code?

      Maybe even a small sample?

      Thanks a lot!

    2. Max

      Hey you,

      I appreciate your nice work here. But reading it, two questions come up:

      (1) Why do you choose to do this:
      m <- lm(z[, j] ~ z[, i] + 0) instead of m <- lm(z[, j] ~ z[, i] )

      (2) As I am an R newbie: what does the [1] mean in this expression: beta[j,i] <- coef(m)[1]

      Thx and all the best
      Max

    3. john

      Hi,

      Could you please tell me how I need to change line 9 (load(file = "/home/robo/workspace/R-Test/abcdeXXX.RData")) to make your code run on my machine?

      I changed it to load(file = "c:\\abcdeXXX.RData") but I am getting the error message " Error in if (!grepl("RD[AX]2\n", magic)) { : argument is of length zero ".

      Thanks!

    4. Mariachi

      You can try the optimized versions of R available at http://www.revolutionanalytics.com/

      Much faster computation

      1. QuantTrader

        Hi Mariachi,

        thanks for your comment. I do my research on Linux (Ubuntu), and the community version of Revolution R is available only for Mac and Windows. On the other hand, I can at least try it under Windows and see if I get the results faster.

      2. Aris

        5 years of data takes about 15-20 mins. Thanks... keep doing this stuff, you may stumble upon the most amazing stuff.

      3. QuantTrader

        Hello Aris!

        Well, the code above is quite inefficient and not yet optimized. It takes around one hour to calculate all the pairs. What was the time horizon you were taking into consideration? (I took 500 working days ~ 2 years of data).

        Btw. thank you for your comment. I'm a regular reader of your wonderful blog.

      4. aris david

        I've done this test before in Matlab; it takes 15 minutes, I'd say. How long does it take with R?
