Following my previous post about rewriting my code to run in parallel, I have modified the code for downloading the S&P 500 prices from Yahoo to run in parallel as well. To be honest, I quite enjoy writing code that runs in parallel. It's fun for various reasons, but some theoretical background is highly recommended (e.g. here, here or here).
The good news is that it takes very little time (148 seconds) to download the data from Yahoo; the bad news is that the merge function still takes far too long to complete. To be more specific, on average 80% of the time is spent on merge and 20% on downloading the actual data from Yahoo. It's faster than the original code, but I don't like the idea of spending so much time in the merge function.
    rm(list = ls(all = TRUE))
    require(doMC)
    require(quantmod)
    require(tseries)
    require(timeDate)

    registerDoMC(20)

    symbols = read.csv("/home/robo/Dropbox/work/FX/PairTrading/sp500.csv", header = F, stringsAsFactors = F)
    nrStocks = length(symbols[,1])

    dateStart = "2007-01-01"

    z = foreach(i = 1:nrStocks, .combine = merge.zoo) %dopar% {
      cat("Downloading ", i, " out of ", nrStocks , "\n")
      x = get.hist.quote(instrument = symbols[i,], start = dateStart, quote = "AdjClose", retclass = "zoo", quiet = T)
      colnames(x) = symbols[i,1]
      x
    }

    z = as.xts(z)

    registerDoMC()
One intuitive solution is to preallocate the memory and save the results there. However, I couldn't find a way to modify a variable that lives outside the foreach scope when the loop runs in parallel. I understand that concurrent writes could corrupt the data, but locking the update would solve this issue (updating doesn't take much time). I tried to google/yahoo/duckduckgo/bing a solution, but without luck. Do you know the answer?
The foreach download has yet another drawback: missing data.
Then I noticed the one line of code where I convert my data from type "zoo" to "xts". xts is implemented largely in C, whereas zoo is written in pure R (I have read some articles about intentions to merge these packages, but who knows when that will happen). So why not convert the variable to xts right after the download? Simple.
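A minimal sketch of why writing to a preallocated variable from inside the loop has no effect, assuming a Unix-like system where the workers are forked processes (as with doMC). It uses `mclapply` from the base `parallel` package as a stand-in for the foreach loop; the vector `results` and the values written to it are made up for illustration. Each forked worker gets its own copy of the parent's memory, so its writes never reach the parent, and only the return value travels back.

```r
library(parallel)  # ships with R; mclapply forks on Unix-like systems

# Preallocated vector in the parent process (hypothetical example data).
results <- numeric(4)

# Each worker tries to write into the parent's vector with <<-.
returned <- mclapply(1:4, function(i) {
  results[i] <<- i * 10   # changes the worker's private copy only
  i * 10                  # the return value is what travels back
}, mc.cores = 2)

print(results)            # the parent's vector is still all zeros
print(unlist(returned))   # the computed values arrive via the return value
```

This is why foreach is built around return values and a `.combine` function rather than shared state; truly shared memory would need an additional package and is not covered here.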
And the result? On average, 43 seconds!
    rm(list = ls(all = TRUE))
    require(doMC)
    require(quantmod)
    require(tseries)
    require(timeDate)

    registerDoMC(20)

    symbols = read.csv("/home/robo/Dropbox/work/FX/PairTrading/sp500.csv", header = F, stringsAsFactors = F)
    nrStocks = length(symbols[,1])

    dateStart = "2007-01-01"

    z = foreach(i = 1:nrStocks, .combine = merge.xts) %dopar% {
      cat("Downloading ", i, " out of ", nrStocks , "\n")
      x = get.hist.quote(instrument = symbols[i,], start = dateStart, quote = "AdjClose", retclass = "zoo", quiet = T)
      colnames(x) = symbols[i,1]
      x = as.xts(x)
    }

    registerDoMC()
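Since most of the time goes into merging, another variant worth trying is to collect the per-symbol series in a list and merge them once at the end. The sketch below is untimed and uses synthetic data instead of Yahoo downloads; the helper `make_series` and the symbols "AAA", "BBB" and "CCC" are hypothetical. In the foreach version this would mean dropping the `.combine` argument, so that foreach returns a plain list, and calling `do.call(merge, z)` afterwards.

```r
library(xts)

# Synthetic stand-in for one downloaded price series (hypothetical values).
make_series <- function(symbol, start, n) {
  x <- xts(seq_len(n) + 0.5, order.by = as.Date(start) + 0:(n - 1))
  colnames(x) <- symbol
  x
}

# Three series with different start dates and lengths.
series <- list(
  make_series("AAA", "2007-01-01", 5),
  make_series("BBB", "2007-01-03", 5),
  make_series("CCC", "2007-01-02", 3)
)

# One n-way merge instead of a chain of pairwise merges.
# merge() dispatches to merge.xts because the first element is an xts object.
merged <- do.call(merge, series)
print(dim(merged))
```

The merged object covers the union of all dates, with NA wherever a series has no observation for a given day.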
-
thanks! i got it to work on 2.10 32bit as well, just had to reduce the number of concurrent tasks (running this on a small laptop)
-
I haven't tested this, but it might be even faster if you stored the results of your foreach call in a list and only called merge.xts once (via do.call(merge, z)).
Also, some of the xts C code has already been merged into zoo (coredata and lag).
-
i suspect either my SP500.csv or my R version (2.10) is out of date as i get a ton of errors running this script -- can you share your R version and sp500.csv? Thanks in advance


