hadoop - How to read multiple files from a file in Apache Pig?

April 15, 2011

I have one file named "filelist.txt", and its content is the list of files I want to read in my Pig script. For example, it can be organized as:

file1.txt file2.txt ... filen.txt

Some solutions try to use a regular expression, but there is no particular format in the filenames. The only thing I can do is read the filenames from filelist.txt.

Each of those files holds the actual data I want to read. For example, file1 can contain:

value1 value2 value3

So how can I read the values from these files in a Pig script?

There is no way to do this in pure Pig. The best you can do in pure Pig is to use the built-in globbing. It is fairly flexible, but it doesn't sound flexible enough for your purposes.

The other solution I can think of, if you can get the file into your local environment, is to use some sort of wrapper (I recommend Python). In that script you can read the file and generate the Pig statements that load each listed file. Here is how the logic would work:
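To see why wildcards fall short here, the sketch below uses Python's fnmatch module as a stand-in for glob matching. It is not Hadoop's glob implementation, and the file names are invented for illustration; it only shows that a wildcard pattern selects names sharing a common shape, which arbitrarily named files do not have.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, invented for illustration only.
NAMES = ['file1.txt', 'file2.txt', 'sales_2010.csv', 'notes']


def matched_by(pattern, candidates):
    """Return the candidates that the wildcard pattern matches."""
    return [name for name in candidates if fnmatchcase(name, pattern)]


# A pattern only helps when the names share a prefix or suffix.
regular_names = matched_by('file*.txt', NAMES)

# Names with nothing in common can only be covered by matching everything,
# which would also pick up files that are not in the list.
everything = matched_by('*', NAMES)
```

When the names have no common shape, as in the question, the list itself has to be read by something outside the pattern.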

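    def addloads(filestoread, schema, delim='\\t'):
        # Build one LOAD statement per file listed in filestoread, then UNION them.
        newlines = []
        with open(filestoread, 'r') as infile:
            # Number the relations from 1 so they match the names built in to_union.
            for n, f in enumerate(infile, start=1):
                # strip() drops the trailing newline that would otherwise end up in the path.
                newlines.append("input{} = LOAD '{}' USING PigStorage('{}') AS {};".format(n, f.strip(), delim, schema))
        to_union = ['input{}'.format(i) for i in range(1, len(newlines) + 1)]
        newlines.append('loaded_lines = UNION {} ;'.format(', '.join(to_union)))
        return '\n'.join(newlines)

Prepend the generated statements to the beginning of your Pig script when you load it from disk, and make sure the rest of the script uses loaded_lines as its starting relation.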
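The prepending step could look roughly like the sketch below. The helper name prepend_loads, the example statements, and the example script body are all hypothetical and not part of the original answer. It assumes the generated LOAD/UNION text is already available as a string.

```python
import os
import tempfile


def prepend_loads(load_statements, script_path, out_path):
    """Write load_statements followed by the contents of script_path to out_path."""
    with open(script_path, 'r') as src:
        body = src.read()
    with open(out_path, 'w') as dst:
        # A blank line separates the generated statements from the original script.
        dst.write(load_statements.rstrip('\n') + '\n\n' + body)
    return out_path


# Hypothetical generated statements and script body, for illustration only.
EXAMPLE_LOADS = (
    "input1 = LOAD 'file1.txt' USING PigStorage('\\t') AS (v:chararray);\n"
    "input2 = LOAD 'file2.txt' USING PigStorage('\\t') AS (v:chararray);\n"
    "loaded_lines = UNION input1, input2 ;"
)
EXAMPLE_BODY = "grouped = GROUP loaded_lines ALL;\nDUMP grouped;\n"


def demo():
    """Combine the example pieces in a temporary directory and return the result."""
    workdir = tempfile.mkdtemp()
    script_path = os.path.join(workdir, 'analysis.pig')
    out_path = os.path.join(workdir, 'combined.pig')
    with open(script_path, 'w') as fh:
        fh.write(EXAMPLE_BODY)
    prepend_loads(EXAMPLE_LOADS, script_path, out_path)
    with open(out_path, 'r') as fh:
        return fh.read()
```

Writing to a separate output file leaves the hand-written script untouched, so the combined file can be regenerated whenever filelist.txt changes.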
