nltk - How to cycle through the files in a corpus: Python

July 15, 2010

I have other methods that need to work on each individual txt file within the corpus. How can I cycle through them?

    import nltk
    from nltk.corpus import PlaintextCorpusReader as pcr

    def main():
        cor = corpus()
        # for every text file in corpus:
            # do method

    def corpus():
        corpus_root = 'corpus/'
        corp = pcr(corpus_root, '.*\.txt')
        corp = corp.raw()
        return corp

    main()

The NLTK corpus readers have a method, fileids(), that you should use:

    mycorpus = pcr(corpus_root, r'.*\.txt')

    for fname in mycorpus.fileids():
        text = mycorpus.raw(fname)
        sents = mycorpus.sents(fname)
        # or whatever

When you call raw(), sents(), words(), tagged_words(), etc. with a filename, you get the contents of just the file you specify. You can also pass a list of filenames, if you ever want a multi-file subset of your corpus.
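To make the per-file and multi-file calls concrete, here is a minimal sketch. It assumes NLTK is installed and builds a throwaway corpus in a temporary directory; the file names, sample text, and the helpers build_demo_corpus and per_file_word_counts are made up for illustration. It sticks to raw() and words(), since sents() may additionally need NLTK's sentence tokenizer data to be downloaded.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader


def build_demo_corpus(root):
    """Write three tiny text files into root (hypothetical sample data)."""
    samples = {
        "a.txt": "First file.",
        "b.txt": "Second file here.",
        "c.txt": "Third file.",
    }
    for name, body in samples.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as fh:
            fh.write(body)


def per_file_word_counts(reader, fileids=None):
    """Map each file id to its number of word tokens (hypothetical helper)."""
    ids = reader.fileids() if fileids is None else fileids
    return {fid: len(reader.words(fid)) for fid in ids}


with tempfile.TemporaryDirectory() as root:
    build_demo_corpus(root)
    reader = PlaintextCorpusReader(root, r".*\.txt")

    # One entry per file that matched the pattern.
    file_ids = reader.fileids()

    # Per-file processing: each call only sees the file it is given.
    counts = per_file_word_counts(reader)

    # A list of file ids selects a multi-file subset of the corpus.
    subset_raw = reader.raw(["a.txt", "b.txt"])

print(file_ids)
print(counts)
print(subset_raw)
```

Everything is read inside the with block, so the temporary files are still on disk when the reader touches them.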

PS. It doesn't make a difference here, but you should use raw strings for regexps (see above).
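A short sketch of why the raw-string habit is worth having, using only the standard re module. The sample strings are invented for illustration. For this particular pattern the raw form equals the normal string with a doubled backslash, but with an escape like \b a normal string silently means something else.

```python
import re

# The pattern from the answer, written as a raw string.
pattern = r".*\.txt"

# The equivalent normal string needs the backslash doubled.
escaped = ".*\\.txt"
same_pattern = pattern == escaped

# The pattern accepts names ending in ".txt" and rejects others.
matches_txt = re.fullmatch(pattern, "notes.txt") is not None
matches_other = re.fullmatch(pattern, "notes_txt") is not None

# Where it starts to matter: in a normal string "\b" is a backspace
# character, while in a raw string the regex engine sees a word boundary.
raw_hits = re.findall(r"\bcat\b", "cat concat cat")
plain_hits = re.findall("\bcat\b", "cat concat cat")

print(same_pattern, matches_txt, matches_other)
print(raw_hits, plain_hits)
```

Recent Python versions also warn about unrecognized escapes such as \. in normal strings, which is one more reason to write regexps as raw strings.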
