python - Word2Vec and Gensim parameters equivalence
Gensim is an optimized Python port of word2vec (see http://radimrehurek.com/2013/09/deep-learning-with-word2vec-and-gensim/).
I am currently using these vectors: http://clic.cimec.unitn.it/composes/semantic-vectors.html
I am going to rerun the model training with gensim because there are noisy tokens in the models, so I need to find out the gensim equivalents of the parameters used by word2vec.
The parameters used to train the word2vec models are:
- 2-word context window, PMI weighting, no compression, 300k dimensions
What is the gensim equivalent when I train a word2vec model? Is it:

>>> model = Word2Vec(sentences, size=300000, window=2, min_count=5, workers=4)

Is there a PMI weighting option in gensim? What is the default min_count used in Word2Vec?
There is another set of word2vec parameters, such as:
- 5-word context window, 10 negative samples, subsampling, 400 dimensions.
Is there a negative samples parameter in gensim? What is the equivalent parameter for subsampling in gensim?
The linked paper compares word embeddings obtained under a number of schemes, including continuous bag of words (CBOW). CBOW is one of the models implemented in gensim's Word2Vec class. The paper also discusses word embeddings obtained by singular value decomposition with various weighting schemes, some involving PMI. There is no equivalence between SVD and word2vec; if you want SVD in gensim, it is called "LSA" or "latent semantic analysis", which is the usual name for SVD when it is applied in natural language processing.
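As an aside, here is a minimal sketch of SVD-based vectors in gensim through its LsiModel class. Note that gensim has no built-in PMI weighting, so TF-IDF stands in for it below, and the tiny corpus is invented purely for illustration:

    from gensim import corpora, models

    # Tokenized corpus; stand-in data for illustration only.
    texts = [["human", "computer", "interaction"],
             ["graph", "minors", "survey"],
             ["graph", "trees", "computer"]]

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(text) for text in texts]

    # Gensim ships no PMI transform; TF-IDF weighting is used instead here.
    tfidf = models.TfidfModel(bow_corpus)

    # Truncated SVD over the weighted corpus; num_topics sets the dimensionality.
    lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=300)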
The min_count parameter is set to 5 by default, as can be seen here. Negative sampling and hierarchical softmax are two approximate inference methods for estimating a probability distribution over a discrete space (used when the normal softmax is computationally expensive). Gensim's Word2Vec implements both: it uses hierarchical softmax by default, but you can use negative sampling instead by setting the hyperparameter negative to a value greater than zero. This is documented in the comments in gensim's code here as well.
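For your second set of parameters, a rough sketch of the mapping (using gensim's parameter names; the exact subsampling threshold below is a guess, and sentences is your training corpus):

>>> from gensim.models import Word2Vec
>>> model = Word2Vec(sentences,
...                  size=400,     # 400 dimensions
...                  window=5,     # 5-word context window
...                  negative=10,  # 10 negative samples
...                  hs=0,         # turn off hierarchical softmax
...                  sample=1e-5,  # subsampling of frequent words; threshold is a guess
...                  min_count=5, workers=4)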