nltk - How to cycle through the files in a corpus (Python)
I have other methods that need to work on each individual txt file within the corpus. How can I cycle through them?
    import nltk
    from nltk.corpus import PlaintextCorpusReader as pcr

    def main():
        cor = corpus()
        # for every text file in corpus:
        #     do method

    def corpus():
        corpus_root = 'corpus/'
        corp = pcr(corpus_root, '.*\.txt')
        corp = corp.raw()
        return corp

    main()
The NLTK corpus readers have a method, fileids(), that you should use:
    mycorpus = pcr(corpus_root, r'.*\.txt')
    for fname in mycorpus.fileids():
        text = mycorpus.raw(fname)
        sents = mycorpus.sents(fname)
        # or whatever

When you call raw(), sents(), words(), tagged_words(), etc. with a filename, you get the contents of just the file you specify. You can also pass a list of filenames, if you ever want a multi-file subset of the corpus.
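To make the per-file pattern concrete, here is a minimal, self-contained sketch. It assumes NLTK is installed. The helper names (word_count, build_demo_corpus, count_all) and the two sample files are hypothetical, invented only so the example runs without an existing corpus directory. It uses raw() and words() only, since sents() additionally needs the Punkt tokenizer data to be downloaded.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader


def word_count(reader, fileid):
    """Hypothetical per-file method: count the tokens in one file."""
    return len(reader.words(fileid))


def build_demo_corpus(root):
    """Write two tiny text files so the sketch needs no existing corpus."""
    samples = {"a.txt": "One two three.", "b.txt": "Four five."}
    for name, content in samples.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as fh:
            fh.write(content)


def count_all(root):
    """Apply word_count to every file the reader finds under root."""
    reader = PlaintextCorpusReader(root, r".*\.txt")
    # fileids() lists every file matched by the pattern
    return {fid: word_count(reader, fid) for fid in reader.fileids()}


with tempfile.TemporaryDirectory() as tmp:
    build_demo_corpus(tmp)
    counts = count_all(tmp)

    # Passing a list of file ids selects a multi-file subset of the corpus.
    reader = PlaintextCorpusReader(tmp, r".*\.txt")
    subset_raw = reader.raw(["a.txt", "b.txt"])
```

Passing the reader and the file id into the helper, rather than the raw text, lets each method choose whichever view (raw(), words(), and so on) it needs.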
PS. It doesn't make a difference here, but you should use raw strings for regexps (see above).
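A short sketch of why the raw-string advice matters, using only the standard re module. The variable names and the sample sentence are made up for illustration. In '.*\.txt' the sequence \. is not a recognized string escape, so Python keeps the backslash and the pattern comes out the same either way (recent Python versions do warn about it). With an escape such as \b, the two spellings diverge.

```python
import re

raw_pattern = r".*\.txt"      # raw string: the backslash is kept literally
escaped_pattern = ".*\\.txt"  # ordinary string with the backslash doubled

# Both spellings produce the same pattern text.
same = raw_pattern == escaped_pattern

# Where it matters: "\b" is a backspace character in an ordinary string,
# but a word-boundary assertion when the pattern is a raw string.
plain = "\bcat\b"
raw = r"\bcat\b"
plain_hit = re.search(plain, "a cat sat") is not None
raw_hit = re.search(raw, "a cat sat") is not None
```

Here only the raw-string version finds "cat", because the ordinary string asks the regex engine to match literal backspace characters.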