hadoop - How to read multiple files from a file in Apache Pig?
I have one file named "filelist.txt", and its content is the list of files I want to read in my Pig script. For example, it can be organized as:

file1.txt
file2.txt
...
filen.txt

Some of the solutions I have seen try to use a regular expression, but there is no particular format in the filenames; the only thing I can do is read the filenames from filelist.txt.

Each of those files holds the actual data I want to read. For example, in file1, I can have:

value1
value2
value3

So how should I be able to read the values in these files in my Pig script?
There is no way to do this in pure Pig. The best you can do in pure Pig is to use the built-in globbing, that is, Hadoop-style glob patterns in the load path. It is flexible, but it doesn't sound flexible enough for your purposes.
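To illustrate what the globbing route can express: Hadoop glob syntax includes brace alternation such as {a,b}, so a fixed set of names can be written as one load pattern. The sketch below builds such a statement in Python. It is only a rough illustration: build_glob_load, its parameters, and the loaded_lines alias are hypothetical names, and it assumes the file names contain no commas or braces.

```python
def build_glob_load(filenames, schema, delim='\\t', alias='loaded_lines'):
    """Return one Pig LOAD statement covering all given files via a brace glob.

    Hypothetical helper; assumes names contain no ',' '{' or '}' characters.
    """
    # Drop surrounding whitespace/newlines and skip blank entries.
    names = [name.strip() for name in filenames if name.strip()]
    if not names:
        raise ValueError('no file names given')
    for name in names:
        if any(ch in name for ch in ',{}'):
            raise ValueError('cannot put {!r} inside a brace glob'.format(name))
    # A single name needs no braces; several names become {a,b,c}.
    pattern = '{' + ','.join(names) + '}' if len(names) > 1 else names[0]
    return "{} = LOAD '{}' USING PigStorage('{}') AS {};".format(
        alias, pattern, delim, schema)
```

Whether a very long pattern built this way is practical depends on the cluster and the number of files, so treat it as an option to test rather than a replacement for the wrapper approach.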
The other solution I can think of, if you can get the file into your local environment, is to use some sort of wrapper (I recommend Python). In that script you can read the file and generate the Pig statements that load the files it lists. Here is how the logic would work:
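    def addloads(filestoread, schema, delim='\\t'):
        newlines = []
        with open(filestoread, 'r') as infile:
            # Strip trailing newlines and skip blank lines in the list file.
            names = [line.strip() for line in infile if line.strip()]
        # One load statement per listed file, aliased input0, input1, ...
        for n, f in enumerate(names):
            newlines.append("input{} = load '{}' using PigStorage('{}') as {};".format(n, f, delim, schema))
        # Aliases start at 0, matching enumerate() above.
        to_union = ['input{}'.format(i) for i in range(len(newlines))]
        newlines.append('loaded_lines = union {};'.format(', '.join(to_union)))
        return '\n'.join(newlines)

Prepend the returned text to the Pig script you load from disk, and make sure the rest of the script uses loaded_lines as its starting relation. Note that union expects at least two relations, so a list with a single file would need separate handling.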
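The last step, putting the generated load statements at the beginning of the Pig script read from disk, could look roughly like this. It is a minimal sketch: prepend_to_script and its parameter names are hypothetical, and it assumes the generated statements are already available as a string.

```python
def prepend_to_script(load_block, script_path, out_path):
    """Write a copy of the Pig script with the load statements placed first.

    Hypothetical helper: load_block is the generated text, script_path the
    existing Pig script, out_path the combined script to hand to Pig.
    """
    with open(script_path, 'r') as src:
        body = src.read()
    with open(out_path, 'w') as dst:
        # Generated loads first, then a blank line, then the original script.
        dst.write(load_block.rstrip('\n') + '\n\n' + body)
    return out_path
```

Writing to a separate output file leaves the original script untouched, so the wrapper can be rerun whenever filelist.txt changes.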