python - Word2Vec and Gensim parameter equivalence
Gensim is an optimized Python port of word2vec (see http://radimrehurek.com/2013/09/deep-learning-with-word2vec-and-gensim/).

I am using these vectors: http://clic.cimec.unitn.it/composes/semantic-vectors.html

I am going to rerun the model training with gensim because there are noisy tokens in the models, so I need to find out the gensim equivalents of the parameters used by word2vec. The parameters used for word2vec were:

- 2-word context window, PMI weighting, no compression, 300k dimensions

What is the gensim equivalent when I train a word2vec model? Is it:
>>> model = Word2Vec(sentences, size=300000, window=2, min_count=5, workers=4)
Is there a PMI weighting option in gensim?

What is the default min_count used in word2vec?

There is also another set of word2vec parameters, such as:

- 5-word context window, 10 negative samples, subsampling, 400 dimensions.

Is there a negative-samples parameter in gensim?

What is the parameter equivalent of subsampling in gensim?
The paper you link to compares word embeddings produced by a number of schemes, including continuous bag of words (CBOW). CBOW is one of the models implemented in gensim's Word2Vec class. The paper also discusses word embeddings obtained by singular value decomposition under various weighting schemes, some involving PMI. There is no equivalence between SVD and word2vec; if you want SVD, note that it is called "LSA" or "latent semantic analysis" when done in natural language processing.
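To make the distinction concrete, here is a minimal sketch of what PMI-weighted SVD embeddings look like, using numpy/scipy rather than gensim (gensim's own SVD implementation is LsiModel, which factorizes a term-document matrix instead of a word co-occurrence matrix). The function name, the windowing details, and the dense-matrix shortcut are illustrative assumptions, not the exact pipeline used to build the linked vectors; a real 300k-dimensional count matrix would need sparse storage.

    import numpy as np
    from scipy.sparse.linalg import svds  # truncated SVD

    def ppmi_svd_embeddings(sentences, window=2, dim=300):
        """Toy PPMI + truncated-SVD embeddings from tokenized sentences."""
        vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s}))}
        counts = np.zeros((len(vocab), len(vocab)))
        # Count word/context co-occurrences within a symmetric window.
        for s in sentences:
            for i, w in enumerate(s):
                for j in range(max(0, i - window), min(len(s), i + window + 1)):
                    if j != i:
                        counts[vocab[w], vocab[s[j]]] += 1
        # Positive PMI weighting: max(0, log P(w,c) / (P(w) P(c))).
        total = counts.sum()
        pw = counts.sum(axis=1, keepdims=True) / total
        pc = counts.sum(axis=0, keepdims=True) / total
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log((counts / total) / (pw * pc))
        ppmi = np.maximum(np.nan_to_num(pmi, neginf=0.0), 0.0)
        # Truncated SVD (what NLP calls "LSA") yields dense word vectors.
        k = min(dim, min(ppmi.shape) - 1)
        u, s, _ = svds(ppmi, k=k)
        return u * s, vocab

This is a count-then-factorize pipeline; word2vec instead learns its vectors by prediction, which is why there is no parameter-for-parameter mapping between the two.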
The min_count parameter is set to 5 by default, as can be seen in gensim's code.

Negative sampling and hierarchical softmax are two approximate inference methods for estimating a probability distribution over a discrete space (used when the full softmax is computationally expensive). gensim's Word2Vec implements both. It uses hierarchical softmax by default, but you can switch to negative sampling by setting the hyperparameter negative to a value greater than zero. This is documented in the comments in gensim's code as well.
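Putting that together, a minimal sketch of the second configuration in gensim would look like the following. Parameter names follow the older gensim API this answer refers to (size was renamed vector_size in gensim 4.0); sample is gensim's knob for subsampling frequent words, and the 1e-5 threshold shown here is an assumed value, not something specified in the question.

    from gensim.models import Word2Vec

    # sentences: an iterable of tokenized sentences, e.g. [["the", "cat", "sat"], ...]
    model = Word2Vec(
        sentences,
        size=400,      # 400 dimensions (vector_size in gensim >= 4.0)
        window=5,      # 5-word context window
        negative=10,   # 10 negative samples; > 0 switches on negative sampling
        hs=0,          # turn off hierarchical softmax when using negative sampling
        sample=1e-5,   # subsampling threshold for frequent words (assumed value)
        min_count=5,   # the default, shown explicitly: drop words seen < 5 times
        workers=4,
    )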