gensim - Combining LSA/LSI with Naive Bayes for document classification


I'm new to the gensim package and to vector space models in general, and I'm unsure what I should do with my LSA output.

To give a brief overview of my goal, I'd like to enhance a Naive Bayes classifier using topic modeling to improve the classification of reviews (positive or negative). Here's a great paper I've been reading that has shaped my ideas but left me still confused about the implementation.

I've got working code for Naive Bayes; currently, I'm using unigram bag-of-words features, and the labels are either positive or negative.

Here's my gensim code:

from pprint import pprint  # pretty printer
import gensim as gs

# tutorial sample documents
docs = ["human machine interface lab abc computer applications",
        "a survey of user opinion of computer system response time",
        "the eps user interface management system",
        "system and human system engineering testing of eps",
        "relation of user perceived response time error measurement",
        "the generation of random binary unordered trees",
        "the intersection graph of paths in trees",
        "graph minors iv widths of trees and quasi ordering",
        "graph minors survey"]

# stoplist removal, tokenization
stoplist = set('for a of the and to in'.split())
# for each document: lowercase it, split on whitespace,
# and add all words not in the stoplist to texts
texts = [[word for word in doc.lower().split() if word not in stoplist]
         for doc in docs]

# create dict (note: this name shadows the built-in dict)
dict = gs.corpora.Dictionary(texts)
# create corpus
corpus = [dict.doc2bow(text) for text in texts]

# tf-idf
tfidf = gs.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# latent semantic indexing with 10 topics
lsi = gs.models.LsiModel(corpus_tfidf, id2word=dict, num_topics=10)

for topic in lsi.print_topics():
    print(topic)

Here's the output:

0.400*"system" + 0.318*"survey" + 0.290*"user" + 0.274*"eps" + 0.236*"management" + 0.236*"opinion" + 0.235*"response" + 0.235*"time" + 0.224*"interface" + 0.224*"computer"
0.421*"minors" + 0.420*"graph" + 0.293*"survey" + 0.239*"trees" + 0.226*"paths" + 0.226*"intersection" + -0.204*"system" + -0.196*"eps" + 0.189*"widths" + 0.189*"quasi"
-0.318*"time" + -0.318*"response" + -0.261*"error" + -0.261*"measurement" + -0.261*"perceived" + -0.261*"relation" + 0.248*"eps" + -0.203*"opinion" + 0.195*"human" + 0.190*"testing"
0.416*"random" + 0.416*"binary" + 0.416*"generation" + 0.416*"unordered" + 0.256*"trees" + -0.225*"minors" + -0.177*"survey" + 0.161*"paths" + 0.161*"intersection" + 0.119*"error"
-0.398*"abc" + -0.398*"lab" + -0.398*"machine" + -0.398*"applications" + -0.301*"computer" + 0.242*"system" + 0.237*"eps" + 0.180*"testing" + 0.180*"engineering" + 0.166*"management"

Any suggestions or general comments would be appreciated.

I just started working on the same problem, but with an SVM instead. AFAIK, after training the model you need to do something like this:

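new_text = 'here document'
# doc2bow expects a list of tokens, so tokenize the same way as the training texts
text_bow = dict.doc2bow(new_text.lower().split())
# lsi was trained on the tf-idf corpus, so apply tfidf before lsi
vector = lsi[tfidf[text_bow]]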

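where vector is the topic distribution of the document (gensim returns it as a list of (topic_id, weight) pairs), with length equal to the number of topics you chose for training, 10 in this case. So you need to represent all your documents as topic distributions and then feed them to the classification algorithm.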
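As a rough sketch of that last step: the model output is sparse and can omit topics, so each vector has to be expanded to a fixed length before it can serve as a classifier feature row. The helper name to_dense, the sample numbers, and the labels below are all made up for illustration and are not real model output. gensim's own gensim.matutils.sparse2full does the same expansion and returns a NumPy array.

```python
def to_dense(sparse_vec, num_topics):
    """Expand a sparse [(topic_id, weight), ...] list into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse_vec:
        dense[topic_id] = float(weight)
    return dense


# Hypothetical LSI outputs for three documents (illustrative numbers only).
sparse_docs = [
    [(0, 0.62), (1, -0.31)],
    [(0, 0.10), (2, 0.45)],
    [],  # a document with no known words gives an empty vector
]
labels = ["positive", "negative", "negative"]  # hypothetical labels

num_topics = 3
# One fixed-length feature row per document, aligned with labels.
features = [to_dense(vec, num_topics) for vec in sparse_docs]
```

The rows in features can then go to any classifier that takes real-valued input. LSI weights can be negative, so a multinomial Naive Bayes would probably reject them as they are. A Gaussian Naive Bayes or an SVM seems like a safer fit.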

P.S. I know it's kind of an old question, but I keep seeing it in Google results every time I search :)

