python - Word2Vec and Gensim parameters equivalence


Gensim is an optimized Python port of word2vec (see http://radimrehurek.com/2013/09/deep-learning-with-word2vec-and-gensim/).

I am using these vectors: http://clic.cimec.unitn.it/composes/semantic-vectors.html

I am going to rerun the model training with gensim because there are noisy tokens in the models, but I can't work out what the equivalent word2vec parameters are in gensim.

The parameters used to train the word2vec models were:

  • 2-word context window, PMI weighting, no compression, 300k dimensions

What is the gensim equivalent when training a Word2Vec model?

Is it:

>>> from gensim.models import Word2Vec
>>> model = Word2Vec(sentences, size=300000, window=2, min_count=5, workers=4)

Is there a PMI weighting option in gensim?

What is the default min_count used in Word2Vec?

There is another set of word2vec parameters, such as:

  • 5-word context window, 10 negative samples, subsampling, 400 dimensions.

Is there a negative samples parameter in gensim?

What is the parameter equivalent of subsampling in gensim?

  1. The linked paper compares word embeddings obtained by a number of schemes, including continuous bag of words (CBOW). CBOW is one of the models implemented in gensim's Word2Vec class. The paper also discusses word embeddings obtained by singular value decomposition with various weighting schemes, including PMI. There is no equivalence between SVD and word2vec; if you want SVD in gensim, it is available as LSA, or "latent semantic analysis", as the technique is called in natural language processing (see the first sketch after this list).

  2. The min_count parameter is set to 5 by default, as can be seen in gensim's source code.

  3. Negative sampling and hierarchical softmax are two approximate inference methods for estimating a probability distribution over a discrete space (used when the normal softmax is computationally expensive). Gensim's Word2Vec implements both: it uses hierarchical softmax by default, but you can switch to negative sampling by setting the hyperparameter negative to a value greater than zero. This is documented in the comments in gensim's source code as well (see the second sketch after this list).
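As a minimal sketch of the SVD/LSA route from point 1, using gensim's LsiModel: the toy corpus and the number of kept dimensions here are illustrative assumptions, and since gensim has no built-in PMI weighting, TF-IDF is used as the closest available reweighting.

    from gensim import corpora, models

    # Toy corpus; in practice these would be your tokenized sentences.
    texts = [["human", "computer", "interaction"],
             ["graph", "trees", "survey"],
             ["graph", "minors", "trees"]]

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(text) for text in texts]

    # Reweight the term-document matrix; gensim offers TF-IDF, not PMI.
    tfidf = models.TfidfModel(bow_corpus)
    weighted_corpus = tfidf[bow_corpus]

    # Truncated SVD over the weighted matrix: this is LSA/LSI.
    # num_topics is the number of latent dimensions to keep (tiny here;
    # a few hundred is typical for a real corpus).
    lsi = models.LsiModel(weighted_corpus, id2word=dictionary, num_topics=2)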
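And for points 2 and 3, a hedged sketch of how the second parameter set (5-word window, 10 negative samples, subsampling, 400 dimensions) would map onto gensim's Word2Vec hyperparameters. The subsampling threshold of 1e-5 is taken from the word2vec paper rather than being a gensim default, and note that in gensim 4.0+ the size argument was renamed vector_size:

    from gensim.models import Word2Vec

    # Toy corpus standing in for your real tokenized sentences.
    sentences = [["the", "quick", "brown", "fox"],
                 ["jumps", "over", "the", "lazy", "dog"]]

    model = Word2Vec(
        sentences,
        size=400,     # embedding dimensionality ("vector_size" in gensim >= 4.0)
        window=5,     # 5-word context window
        hs=0,         # turn hierarchical softmax off ...
        negative=10,  # ... and draw 10 negative samples per positive example
        sample=1e-5,  # subsampling threshold for frequent words (paper's value)
        min_count=1,  # gensim's default is 5; lowered only so the toy corpus survives
        workers=4)    # parallel training threads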

