Zoo Slows Down Your Linear Model Function

I was a bit frustrated when I read Aris's comment on this post about the speed of his calculations in Matlab. So I extended the time span of my dataset to 5 years and reran the whole code. It was VERY disappointing to get the results after more than 5 hours! That's really REALLY bad! I migrated from Matlab to R because it is FREE and has wide community support... but this? What's wrong?

I tried to find the most time-consuming piece of code, and after a few rounds of trial and error I finally identified the Bugbear in my code:

[Image: Bugbear]

 

Alright, that's me when I saw the slow piece of code... exactly this line:

 

m <- lm(z[, j] ~ z[, i] + 0)

 

The thing is that the variable z is a ZOO object! When I applied coredata() to it before the big for-loop, I got MUCH better results: ~5 hours before, ~30 minutes now. That's a huge difference. Moreover, the 5 hours covered JUST the code from the first part of the cointegration tutorial, while the 30 minutes covered both the first and the second part (or rather their combination).
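To make the change concrete, here is a minimal sketch of the idea (the variable names and data below are purely illustrative, not my actual code): strip the zoo index once with coredata() and run the regressions on the resulting plain matrix inside the loop.

library(zoo)

# illustrative zoo object holding the price series
z <- zoo(matrix(rnorm(1000 * 10), ncol = 10), as.Date("2010-01-01") + 0:999)

# slow: lm() has to dispatch on the zoo class in every iteration
# m <- lm(z[, j] ~ z[, i] + 0)

# faster: drop the index once, outside the loop, and regress plain numeric vectors
v <- coredata(z)
m <- lm(v[, 2] ~ v[, 1] + 0)   # example pair (i = 1, j = 2)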

Have you experienced similar issues when using zoo objects?

I am going to try running this code using the Revolution Analytics version of R.


  1. QuantTrader

    Hi gappy3000 and G. Grothendieck,

    thank you for your insightful comments. I have already implemented your suggestions and obtained somewhat better results again.

    First of all, I replaced lm(y ~ x + 0) with lm.fit(x, y). This required changing a bit of my code because lm.fit does not handle NA data. But after this change the results were ready after ~25 minutes (down from ~30 minutes).
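    Roughly, the substitution looks like this (a sketch with illustrative names, not my exact code; lm.fit() expects a design matrix and does not tolerate NAs, hence the complete.cases() step):

    xy <- cbind(x = v[, i], y = v[, j])        # v: plain matrix from coredata(), (i, j): the pair
    xy <- xy[complete.cases(xy), ]             # lm.fit() cannot handle missing values

    fit <- lm.fit(x = xy[, "x", drop = FALSE], y = xy[, "y"])
    fit$coefficients                           # same slope as lm(y ~ x + 0) on complete cases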

    Because I had installed the pre-compiled Linux distribution of R, I did not have optimized BLAS and LAPACK. So I reinstalled R from source and ran my code again. The result? ~20 minutes.

    Lastly, running my code in parallel. This is a little tricky because the changes I made introduced a dependency that currently prevents the code from being run in parallel. But I will look at it, change it accordingly, and post my results.

    Thank you again! By the way... I did not intend to compare Matlab and R; I was just curious why my code was so slow. I was 100% sure it was because of my own mistake :)

    gappy3000, what exactly do you suggest correcting? Do you have a specific paper in mind? Thanks!


    1. gappy3000

      > gappy3000, what exactly do you suggest correcting? Do you have a specific paper in mind? Thanks!

      Saying it bluntly, performing pairwise comparisons is shameless data snooping, no matter how you try to correct it. Say you're testing at the 5% significance level. If you don't apply a correction, you'll end up with hundreds of false positives. A naive Bonferroni correction will result in a significance level of 4e-5%. I'd be surprised if you rejected the null at all, but that is too conservative a test. Nor is it appropriate/justified to take the "best" pairs (as per s.j. of any test) and backtest them. I suggest you start with this recent paper (http://www-stat.stanford.edu/~romano/fwe2.pdf), then go back to the references in the paper (White, Benjamini-Hochberg) as well as the papers that cite it.
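      Just to illustrate the scale of the problem in R itself (the numbers are simulated, purely for illustration), p.adjust() lets you compare the raw, Bonferroni and Benjamini-Hochberg decisions side by side:

      set.seed(42)
      p <- runif(1e4)                                   # ~10,000 pairwise tests, null true everywhere
      alpha <- 0.05

      sum(p < alpha)                                    # raw: roughly 500 false positives
      sum(p.adjust(p, method = "bonferroni") < alpha)   # Bonferroni: essentially never rejects
      sum(p.adjust(p, method = "BH") < alpha)           # Benjamini-Hochberg FDR control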


    2. G. Grothendieck

      I suggest you carefully read the ?lm help page. Below we get a reduction in elapsed time if we use na.action = NULL, as recommended on the ?lm page for time series. xts has more C code than zoo and so is marginally faster here, and using plain vectors is faster still. lm.fit is mentioned on ?lm but its speed advantage is not really emphasized, although it's much faster, as gappy3000 has pointed out. Although the fastest alternative, lm.fit, gives an order-of-magnitude speedup, I would imagine that Matlab, with its byte-code compiler, is overall currently faster than R.
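      For reference, the transcript below omits how z, x and v were set up; one plausible way to create comparable inputs (illustrative only, not the exact data behind these timings) is:

      set.seed(1)
      z <- zoo(rnorm(1e5), as.Date("2000-01-01") + 0:(1e5 - 1))   # zoo series
      x <- as.xts(z)                                              # same series as xts
      v <- coredata(z)                                            # plain numeric vector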


      > library(zoo)
      > library(xts)
      > library(rbenchmark)
      > benchmark(order = "elapsed",
      + lm.fit = lm.fit(matrix(v), v),
      + zoo = lm(z ~ z + 0),
      + coredata = lm(v ~ v + 0),
      + xts = lm(x ~ x + 0),
      + zoo.na = lm(z ~ z + 0, na.action = NULL),
      + coredata.na = lm(v ~ v + 0, na.action = NULL),
      + xts.na = lm(x ~ x + 0, na.action = NULL))
               test replications elapsed  relative user.self sys.self user.child sys.child
      1      lm.fit          100    0.77  1.000000      0.75     0.00         NA        NA
      6 coredata.na          100    4.96  6.441558      4.71     0.06         NA        NA
      3    coredata          100    5.34  6.935065      5.20     0.00         NA        NA
      7      xts.na          100    5.65  7.337662      5.48     0.01         NA        NA
      5      zoo.na          100    6.55  8.506494      6.08     0.05         NA        NA
      4         xts          100    6.57  8.532468      6.38     0.02         NA        NA
      2         zoo          100    8.36 10.857143      8.09     0.03         NA        NA
      There were 50 or more warnings (use warnings() to see the first 50)
      >
      > R.version.string
      [1] "R version 2.12.2 Patched (2011-03-13 r54787)"
      > win.version()
      [1] "Windows Vista (build 6002) Service Pack 2"
      > packageVersion("xts")
      [1] ‘0.8.0’
      > packageVersion("zoo")
      [1] ‘1.7.0’
      >


    3. gappy3000

      I am commenting on this just because when I see MATLAB and R compared I get defensive, even if I bitch about R's performance all the time.

      I assume you have optimized BLAS and LAPACK on your workstation. Yet, sticking to your code, I believe you can speed it up greatly, more than the 10x you get. First, for computational purposes it is always best to revert to objects of class matrix and to the functions that operate on them directly (e.g., lm.fit()). Even if you have categorical variables, you can easily create appropriate matrix objects (this is what R does under the hood). Subsetting matrices is also much faster than subsetting data frames. Using objects derived from data.frame makes things worse. Here is a quick benchmark:

      require(rbenchmark)
      require(zoo)

      x <- matrix(rnorm(5e6), ncol = 5)            # 1e6 x 5 design matrix
      beta <- matrix(1:5, ncol = 1)
      y <- x %*% beta + rnorm(nrow(x), 0, 5)       # one noise term per observation
      A <- data.frame(y, x)                        # same data as a data frame
      Z <- zoo(A, Sys.Date() + 1:nrow(A))          # ... and as a zoo object

      benchmark(
        lm.fit(x, y),
        lm(y ~ ., data = A),
        lm(y ~ ., data = Z),
        columns = c("test", "replications", "elapsed", "relative"),
        order = "relative",
        replications = 1
      )

                       test replications elapsed  relative
      1        lm.fit(x, y)            1    0.73  1.000000
      2 lm(y ~ ., data = A)            1    4.14  5.671233
      3 lm(y ~ ., data = Z)            1   27.27 37.356164

      That's a 37x improvement right there. Second, your code is embarrassingly parallel and can be sped up using the package multicore. On a 4-core machine, expect another 2-3x improvement.
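      As a rough sketch of what the parallel version could look like (the loop structure here is assumed, not taken from your code; the multicore functionality now ships with R in the parallel package):

      library(parallel)   # mclapply() was originally provided by the multicore package

      # assumed structure: 'pairs' is a 2-column matrix of (i, j) column indices
      # into the plain numeric matrix v obtained via coredata()
      fit_pair <- function(k) {
        i <- pairs[k, 1]; j <- pairs[k, 2]
        lm.fit(x = v[, i, drop = FALSE], y = v[, j])$coefficients
      }

      betas <- mclapply(seq_len(nrow(pairs)), fit_pair, mc.cores = 4)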

      Having said that, I don't believe that your design is useful. Multiple hypothesis tests need to be corrected for. The traditional methods, such as Bonferroni, are usually too conservative; you can rest assured that you will not reject the null. The better approaches use a different design (check the papers by Halbert White and Joe Romano).

