Optimizing My R Code

Thanks to gappy3000 I optimized my code in several ways. The original script took ~30 minutes to run, and that was already with pure vectors instead of zoo objects.

First of all, I replaced all lm calls with lm.fit. Although lm.fit cannot handle data with NAs, so I had to deal with those separately, this small change brought the runtime down to ~25 minutes.
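For readers who have not used lm.fit before: it takes a design matrix and a response directly, skipping the formula machinery, but it errors out on missing values, so NA rows have to be dropped first. A minimal sketch with toy data (not my actual trading script):

```r
set.seed(1)
x <- matrix(rnorm(100), ncol = 1)
y <- 2 * x[, 1] + rnorm(100, sd = 0.1)
y[c(3, 17)] <- NA                       # simulate missing observations

ok <- !is.na(y) & !is.na(x[, 1])        # lm.fit cannot handle NAs itself
fit <- lm.fit(x[ok, , drop = FALSE], y[ok])
coef(fit)                               # close to the true slope of 2
```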

Secondly, gappy3000 assumed that I had optimized BLAS and LAPACK libraries.. which I did, BUT I had installed R on my Linux system from binaries and had not compiled it myself. So.. I reinstalled R.

If you are wondering how to get an optimized BLAS on Linux, just run this small piece of code in a terminal:

 

 sudo apt-get install libatlas-base-dev

 

This will install the base ATLAS libraries on your system. ATLAS is an acronym for Automatically Tuned Linear Algebra Software, and you can find more info here. OK, so the ATLAS libraries are installed; what next?

Download the R source code here and read the installation instructions here; I ended up with this:

 

./configure --with-blas="-lf77blas -latlas"
make
make check
sudo make install

 

Or, if you have multiple processors, change the first line to:

 

./configure --with-blas="-lptf77blas -lpthread -latlas"

 

And if you want to run R in Eclipse or RStudio, you should also add:

 

./configure --with-blas="-lf77blas -latlas" --enable-R-shlib

 

After these steps, the "cointegration" R script finished in 20 minutes.
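A quick sanity check that the new BLAS is actually being picked up: time a dense matrix multiply, which goes straight to the BLAS. After the reinstall it should be noticeably faster (the matrix size here is arbitrary):

```r
set.seed(123)
n <- 1000
m <- matrix(rnorm(n * n), n, n)

# dense matrix multiplication is handed off directly to the linked BLAS,
# so its timing reflects which BLAS your R build uses
system.time(m %*% m)
```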

To see the results from a different angle, I ran this code (from gappy3000) before and after reinstalling R:

 

require(rbenchmark)

x <- matrix(rnorm(5e6), ncol = 5)
beta <- matrix(1:5, ncol = 1)
y <- x %*% beta + rnorm(nrow(x), 0, 5)  # one noise draw per row, not per column
A <- data.frame(y, x)

benchmark(
  lm.fit(x, y),
  lm(y ~ x + 0),
  lm(y ~ . + 0, data = A),
  columns = c("test", "replications", "elapsed", "relative"),
  order = "relative",
  replications = 1
)

BEFORE:

                     test replications elapsed relative
1            lm.fit(x, y)            1    3.71 1.000000
3 lm(y ~ . + 0, data = A)            1   11.20 3.018868
2           lm(y ~ x + 0)            1   12.40 3.342318

AFTER:

                     test replications elapsed relative
1            lm.fit(x, y)            1   1.815 1.000000
3 lm(y ~ . + 0, data = A)            1   3.547 1.954270
2           lm(y ~ x + 0)            1   5.340 2.942149

If you have any suggestions on installing R or improving its performance, please leave a comment and I will update this post accordingly. Thank you.

 

I am also pasting the final code (you will need this file; modify the path to it):

 

rm(list = ls(all = TRUE))
 
require(quantmod)
require(tseries)
require(timeDate)
require(fUnitRoots)
 
 
load(file = "/home/robo/Dropbox/work/FX/PairTrading/sp500data.RData")
#load(file = "F:\\work\\FX\\PairTrading\\sp500data.RData")
#load(file = "C:\\Users\\robo\\Documents\\My Dropbox\\work\\FX\\PairTrading\\sp500data.RData")
 
# nrStocks and z are loaded from the .RData file above
ht <- matrix(data = NA, ncol = nrStocks, nrow = nrStocks)
beta <- matrix(data = NA, ncol = nrStocks, nrow = nrStocks)

z_old <- z
nDays <- length(z[,1])
 
# setting the learning and testing periods
testPeriod <- 10
learningPeriod <- 252*5
 
testDates <- (nDays-testPeriod):nDays
learningDates <- (nDays - testPeriod - learningPeriod):(nDays - testPeriod)
 
data <- z
zLearn <- z[learningDates,]
zTest <- data[testDates,]

z <- coredata(zLearn)  # plain matrix instead of zoo/xts for speed

zscore <- 0
rscore <- matrix(data = NA, ncol = 4, nrow = (nrStocks^2)/2)
pairSummary <- matrix(data = NA, ncol = 6, nrow = (nrStocks^2)/2)
ii <- 1
 
# here we go! let's find the cointegrated pairs
for (j in 1:(nrStocks-1)) {
 
  cat("Calculating ", j, "\n")
 
  zjNA <- is.na(z[,j])
  if (sum(zjNA) > learningPeriod/2) {
    next  
  }
 
  for (i in (j+1):nrStocks) {

    ziNA <- is.na(z[,i])
    if (sum(ziNA) > learningPeriod/2) {
      next
    }

    notNA <- !(zjNA | ziNA)

    # regress stock j on stock i; lm.fit needs a matrix of regressors
    m <- lm.fit(matrix(z[notNA, i], ncol = 1), z[notNA, j])
    beta[j,i] <- coef(m)[1]
    sprd <- resid(m)

    ht[j,i] <- adfTest(sprd, type = "nc")@test$p.value

    if (ht[j,i] < 0.02) {

      # calculate the z-score (mean absolute standardized spread)
      zscore <- sum(abs(scale(sprd)))/length(sprd)

      rscore[ii, 3] <- sd(sprd)
      rscore[ii, 4] <- zscore
      rscore[ii, 1] <- j
      rscore[ii, 2] <- i

      pairSummary[ii, ] <- summary(sprd)[1:6]
      ii <- ii + 1
    }

  }
}
 
#save(list = ls(all=TRUE), file = "/home/robo/Dropbox/work/FX/PairTrading/cointeg5.RData")
#######################################################################################


  1. wahid’s avatar

    Hello,

    First, thank you for this very enriching blog.

    I am currently doing an internship in which I have to write algorithms that find cointegrated pairs and design trading strategies based on them, using R.

    I have three questions concerning your post on pair trading:

    1. Why don't you verify that the two stocks of a pair are integrated of order 1? And can this be done using the ADF test?

    2. The ADF test is implemented in many R packages; which one would best suit pair trading?

    3. Finally, can we use the Johansen test instead of the ADF test, and why would that be, or not be, better?

    I apologize for this rather long message and thank you in advance for your answer.

    Sincerely,


  2. Adam’s avatar

    If you are interested in speeding up your loops, why not execute them in parallel? I use the mclapply function from the "multicore" package. Whilst I don't find the apply family of functions as intuitive to use as for loops (being an ex-Matlab person as well), the speed-up is pretty good.
    When you optimized the BLAS, does it use all the cores of your machine?

    Also, nice site. I am going to do something similar in Australia, so I will definitely be keeping track.

    Adam


    1. QuantTrader’s avatar

      I would like to update my code to run in parallel but haven't found time for it. But I will definitely do it and post it here.
      Re BLAS: I had a few problems installing the "multicore" library, so I'm not using it at the moment.

      And thank you for your comment.
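      Update: a rough sketch of how the outer loop could run in parallel with mclapply (in current R, mclapply lives in the base parallel package, which absorbed multicore; the pair list and the fit function below are placeholders, not the real cointegration test):

```r
library(parallel)  # mclapply (successor to the "multicore" package)

# placeholder candidate pairs and per-pair computation; the real code
# would run the regression + ADF test for each pair of stocks
pairs <- combn(10, 2, simplify = FALSE)
test_pair <- function(p) p[1] + p[2]

# mc.cores > 1 forks worker processes on Linux/macOS
# (on Windows, mclapply requires mc.cores = 1)
res <- mclapply(pairs, test_pair, mc.cores = 2)
length(res)  # one result per candidate pair
```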


    2. sam’s avatar

      hi QuantTrader,

      line 28 was previously zLearn <- z[learningDates,] and in the above post z <- z[learningDates,]. So the testing data are now zLearn where it was previously z. Then I believe you have to update line 57 as well:

      m <- lm.fit(matrix(zLearn[notNA, i], ncol = 1), zLearn[notNA, j])

      Otherwise, your backtests are going to look amazingly great...


      1. QuantTrader’s avatar

        Hi sam!

        Yes, I changed it, but you should look at line 31 (z <- coredata(zLearn)). It's because I didn't want to work with zoo/xts objects.


      2. sam’s avatar

        forget my comment, I skipped line 31 z <- coredata(zLearn) ... my bad.


      3. Stefan’s avatar

        Interestingly, I find the function "lsfit" to be even faster than "lm.fit" (despite the fact they are both using .Fortran(dqrls) internally). Maybe that helps to additionally speed up your calculation!?
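        A quick base-R check that the two agree on toy data (lsfit fits an intercept by default, so it is switched off here to match lm.fit, which fits none):

```r
set.seed(7)
x <- matrix(rnorm(200), ncol = 2)
y <- drop(x %*% c(1, -1)) + rnorm(100, sd = 0.2)

f1 <- lm.fit(x, y)
f2 <- lsfit(x, y, intercept = FALSE)  # match lm.fit: no intercept

# both solve the same least-squares problem, so the coefficients agree
all.equal(unname(f1$coefficients), unname(f2$coefficients))
```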


        1. QuantTrader’s avatar

          Hi Stefan,

          thank you for the tip. Actually, I have already found an even better function: fastLm from the RcppArmadillo package. It is super fast. I am going to write about it in an upcoming post, but here are some results:


                                     test replications elapsed  relative
          2            fastLm(y, x)           10   1.384  1.000000
          1            lm.fit(x, y)           10   5.529  3.994942
          4             lsfit(x, y)           10   5.900  4.263006
          5 lm(y ~ . + 0, data = A)           10  20.842 15.059249
          3           lm(y ~ x + 0)           10  23.637 17.078757

          As you can see the results are amazing!


        2. Michael’s avatar

          I would only consider pairs that are from the same economic sector. In the loop above, if you only test stocks from the same sector for cointegration, instead of testing all possible pairs, that will speed up the calculations a lot.

          The sector information for stocks in S&P500 is published at
          http://www.s-p-500.com/


          1. QuantTrader’s avatar

            Hello Michael,

            thank you for the recommendation. Actually, I was thinking about this earlier but decided to do it this way and see whether the best candidates would be from the same sector. But later on I will modify my code to look for cointegrated pairs from the same sectors.
            Thanks for the link.
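            The pre-filtering could look something like this (the sector labels below are made up for illustration; the real ones would come from the S&P 500 constituent list):

```r
# hypothetical sector labels, one per stock
sectors <- c("Tech", "Tech", "Energy", "Energy", "Tech", "Financials")

# all candidate pairs, then keep only the same-sector ones
pairs <- combn(length(sectors), 2, simplify = FALSE)
same_sector <- Filter(function(p) sectors[p[1]] == sectors[p[2]], pairs)

length(pairs)        # 15 candidate pairs in total
length(same_sector)  # only 4 need the (expensive) cointegration test
```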


          2. CeePlusPlus’s avatar

            While I find R to be a nice way to crank out cool graphics and prototype models quickly, it certainly isn't the right platform when you need speed (or RAM). Writing some custom code using a C extension (and a code profiler) could possibly get this down to seconds rather than 20 minutes. It's a pain to do, but as always there is a cost/benefit to consider.

