R - Parse multiple XBRL files stored in a zip file


I have downloaded multiple zip files from a website. Each zip file contains multiple files with html and xml extensions (~100k in each).

It is possible to manually extract the files and parse them. However, I would like to be able to do this within R (if possible).

Example file (sorry, it is a bit big), using code from a previous question to download one zip file:

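    library(XML)

    pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
    doc <- htmlParse(pth)

    # attributes of the first link whose text contains 'accounts_monthly_data'
    myfiles <- doc["//a[contains(text(),'accounts_monthly_data')]", fun = xmlAttrs][[1]]
    fileurls <- file.path("http://download.companieshouse.gov.uk", myfiles)[[1]]

    # create temp/ and the cache directory temp/hmrccache
    dir.create(file.path("temp", "hmrccache"), recursive = TRUE)
    # mode = "wb" keeps the zip file intact on Windows
    download.file(fileurls, destfile = file.path("temp", myfiles), mode = "wb")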

I can parse the files using the XBRL package if I manually extract them. This can be done as follows:

    library(XBRL)

    inst <- file.path("temp", "prod224_0004_00000121_20130630.html")
    out <- xbrlDoAll(inst, cache.dir = "temp/hmrccache", prefix.out = NULL, verbose = TRUE)

I am struggling with how to extract these files from the zip folder and parse each one, say in a loop using R, without manually extracting them. I tried making a start, but I don't know how to progress from here. Thanks for any advice.

    # names of files
    lst <- unzip(file.path("temp", myfiles), list = TRUE)
    dim(lst) # 118626

    # unzip and extract first file
    nms <- lst$Name[1] # prod224_0004_00000121_20130630.html
    lst2 <- unz(file.path("temp", myfiles), filename = nms)
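Note that unz() returns a connection to one member of the archive rather than an extracted file, so it has to be read from (and closed) before anything can be parsed. The following is a minimal sketch of two ways to get at a single member: reading its text through the connection, or extracting only that member with the files argument of unzip(). The helper names read_zip_member and extract_zip_member are hypothetical and are not part of any package.

```r
# Hypothetical helper: read one archive member as text through a unz()
# connection, without writing it to disk.
read_zip_member <- function(zip_path, member) {
  con <- unz(zip_path, filename = member, open = "r")
  on.exit(close(con))  # always release the connection
  readLines(con, warn = FALSE)
}

# Hypothetical helper: extract only the named member and return its path,
# for parsers that are given a file path rather than text.
extract_zip_member <- function(zip_path, member, exdir = tempdir()) {
  unzip(zip_path, files = member, exdir = exdir)
}

# Usage with the objects from the question (not run here):
# zip_path <- file.path("temp", myfiles)
# txt  <- read_zip_member(zip_path, nms)
# path <- extract_zip_member(zip_path, nms)
```

Since xbrlDoAll is called with a file path in the question, the second helper is probably the more useful of the two for this task.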

I am using Windows 8.1

R version 3.1.2 (2014-10-31)

platform: x86_64-w64-mingw32/x64 (64-bit)

Using the suggestion from Karsten in the comments, I unzipped the files to a temporary directory and then parsed each file. I used the snow package to speed things up.

    library(snow)

    # parse one zip file to start
    fls <- list.files("temp", pattern = "\\.zip$")[[1]]

    # unzip
    tmp <- tempdir()
    lst <- unzip(file.path("temp", fls), exdir = tmp)

    # parse first 10 records
    inst <- lst[1:10]

    # start parse - in parallel
    cl <- makeCluster(parallel::detectCores())
    clusterCall(cl, function() library(XBRL))

    # start timer
    st <- Sys.time()

    out <- parLapply(cl, inst, function(i)
      xbrlDoAll(i, cache.dir = "temp/hmrccache", prefix.out = NULL, verbose = TRUE))

    stopCluster(cl)

    Sys.time() - st

(I am not sure if I am using tempdir() correctly, as it seems to save large amounts of data to the Local\Temp directory. I welcome comments if I have approached this incorrectly, thanks.)
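On the tempdir() concern: unzip() with exdir set writes every member of the archive to that directory, and the session's temporary directory is normally only removed when the R session ends, so the extracted files stay there until then. One possible way to keep disk use bounded is to extract and parse the archive in batches, deleting each batch once it has been parsed. The sketch below assumes this approach; parse_zip_in_batches, batch_size, and the file-name pattern are hypothetical choices, not something from the XBRL package.

```r
# Hypothetical helper: extract and parse a zip archive in batches, removing
# each batch from disk before the next one is extracted.
parse_zip_in_batches <- function(zip_path, parse_fun, batch_size = 100,
                                 pattern = "\\.(html|xml)$") {
  # List archive members without extracting anything
  members <- unzip(zip_path, list = TRUE)$Name
  members <- members[grepl(pattern, members, ignore.case = TRUE)]

  # Split the member names into consecutive batches
  batches <- split(members, ceiling(seq_along(members) / batch_size))

  results <- list()
  for (batch in batches) {
    # Fresh scratch directory used for this batch only
    exdir <- tempfile("xbrl_batch_")
    dir.create(exdir)
    extracted <- unzip(zip_path, files = batch, exdir = exdir)

    # Parse each file; keep the error object instead of aborting the loop
    parsed <- lapply(extracted, function(f) {
      tryCatch(parse_fun(f), error = function(e) e)
    })
    names(parsed) <- basename(extracted)
    results <- c(results, parsed)

    # Delete the extracted files so only one batch is on disk at a time
    unlink(exdir, recursive = TRUE)
  }
  results
}

# Usage with the paths from the question (not run; assumes XBRL is installed):
# library(XBRL)
# out <- parse_zip_in_batches(
#   file.path("temp", fls),
#   function(f) xbrlDoAll(f, cache.dir = "temp/hmrccache",
#                         prefix.out = NULL, verbose = FALSE))
```

The lapply() call inside the loop could be swapped for parLapply() on an existing cluster if the parallel speed-up is still wanted; the batching and cleanup would stay the same.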

