R - Parse multiple XBRL files stored in a zip file
I have downloaded multiple zip files from a website. Each zip file contains multiple files with html and xml extensions (~100k in each).
It is possible to manually extract the files and then parse them. However, I would like to be able to do this within R (if possible).
Example file (sorry, it is a bit big), using code from a previous question to download one zip file:
    library(XML)

    pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
    doc <- htmlParse(pth)
    myfiles <- doc["//a[contains(text(),'accounts_monthly_data')]", fun = xmlAttrs][[1]]
    fileurls <- file.path("http://download.companieshouse.gov.uk", myfiles)[[1]]

    # Create temp/ and the cache directory used by the XBRL package
    dir.create(file.path("temp", "hmrccache"), recursive = TRUE)
    download.file(fileurls, destfile = file.path("temp", myfiles))

I can parse the files using the XBRL package if I manually extract them. This can be done as follows:
    library(XBRL)

    inst <- file.path("temp", "prod224_0004_00000121_20130630.html")
    out <- xbrlDoAll(inst, cache.dir = "temp/hmrccache", prefix.out = NULL, verbose = TRUE)

I am struggling with how to extract these files from the zip folder and parse each one, say, in a loop using R, without manually extracting them. I tried making a start, but I don't know how to progress from here. Any advice would be appreciated.
    # Names of the files in the archive
    lst <- unzip(file.path("temp", myfiles), list = TRUE)
    dim(lst) # 118626

    # Unzip and extract the first file
    nms <- lst$Name[1] # prod224_0004_00000121_20130630.html
    lst2 <- unz(file.path("temp", myfiles), filename = nms)

I am using Windows 8.1
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
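A note on the `unz()` attempt above: `unz()` only returns a connection to one archive member, so nothing is read until that connection is passed to a reader such as `readLines()`. The sketch below shows that, alongside extracting a single named member with `unzip(files = ...)` for parsers that need a path on disk. The helper names `read_member` and `extract_member` are made up for illustration, and it is an assumption here that `xbrlDoAll()` needs a file path and not a connection.

```r
# Read one member's text straight from the archive, without extracting it.
# unz() only creates the connection; readLines() does the actual reading.
read_member <- function(zip_path, member) {
  con <- unz(zip_path, member)
  on.exit(close(con), add = TRUE)
  readLines(con, warn = FALSE)
}

# Extract a single member into a scratch directory and return its path.
# (Hypothetical helper; the default exdir is a fresh directory under tempdir().)
extract_member <- function(zip_path, member, exdir = tempfile("unz_")) {
  dir.create(exdir, showWarnings = FALSE, recursive = TRUE)
  # unzip() returns the paths of the files it extracted
  path <- utils::unzip(zip_path, files = member, exdir = exdir)
  path
}

# Possible use with the objects from the question (not run here):
# zp   <- file.path("temp", myfiles)
# nms  <- unzip(zp, list = TRUE)$Name[1]
# inst <- extract_member(zp, nms)
# out  <- xbrlDoAll(inst, cache.dir = "temp/hmrccache", prefix.out = NULL, verbose = TRUE)
```

Extracting members one at a time keeps disk use small, at the cost of reopening the archive for every file.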
Using the suggestion from Karsten in the comments, I unzipped the files to a temporary directory and then parsed each file. I used the snow package to speed things up.
    library(snow)
    library(XBRL)

    # Parse one zip file to start ("temp" is the download directory created earlier)
    fls <- list.files("temp", pattern = "\\.zip$")[[1]]

    # Unzip into the session's temporary directory
    tmp <- tempdir()
    lst <- unzip(file.path("temp", fls), exdir = tmp)

    # Parse the first 10 records
    inst <- lst[1:10]

    # Start the cluster and load XBRL on every worker
    cl <- makeCluster(parallel::detectCores())
    clusterCall(cl, function() library(XBRL))

    # Time the parallel parse
    st <- Sys.time()
    out <- parLapply(cl, inst, function(i)
      xbrlDoAll(i, cache.dir = "temp/hmrccache", prefix.out = NULL, verbose = TRUE))
    stopCluster(cl)
    Sys.time() - st

(I am not sure I am using tempdir() correctly, as it seems to save large amounts of data to my Local\Temp directory. Comments are welcome if I have approached this incorrectly, thanks.)
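On the `tempdir()` concern: `tempdir()` is the per-session temporary directory, and everything unzipped into it stays on disk until it is deleted or the R session ends, so extracting ~100k files at once will fill it up. One possible way around this, sketched below, is to extract the archive in batches into a dedicated scratch folder and remove that folder as soon as each batch has been parsed. The function name `process_zip_in_batches`, its arguments, and the `xbrl_batch_` prefix are made up for illustration; the sketch uses the base `parallel` package in place of snow and assumes member base names are unique within a batch.

```r
library(parallel)

# Extract a zip archive batch by batch, apply `fun` to every extracted file
# in parallel, and delete each batch's scratch directory afterwards.
process_zip_in_batches <- function(zip_path, fun, batch_size = 1000L, cores = 2L) {
  members <- utils::unzip(zip_path, list = TRUE)$Name
  members <- members[!grepl("/$", members)]  # skip directory entries
  batches <- split(members, ceiling(seq_along(members) / batch_size))

  cl <- makeCluster(cores)
  on.exit(stopCluster(cl), add = TRUE)

  results <- list()
  for (batch in batches) {
    scratch <- tempfile("xbrl_batch_")  # fresh folder under tempdir()
    dir.create(scratch)
    batch_out <- tryCatch({
      paths <- utils::unzip(zip_path, files = batch, exdir = scratch)
      # Name each result after the file it came from
      stats::setNames(parLapply(cl, paths, fun), basename(paths))
    }, finally = unlink(scratch, recursive = TRUE))  # clean up even on error
    results <- c(results, batch_out)
  }
  results
}

# Possible use with the XBRL package (not run here):
# out <- process_zip_in_batches(
#   file.path("temp", fls),
#   function(p) XBRL::xbrlDoAll(p, cache.dir = "temp/hmrccache",
#                               prefix.out = NULL, verbose = FALSE),
#   batch_size = 1000L, cores = parallel::detectCores())
```

Calling the parser as `XBRL::xbrlDoAll` inside the worker function avoids a separate `clusterCall()` to load the package, and the batch size bounds how much extracted data sits on disk at any one time.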