TODO: an interesting paper with important references:

A word embedding is an assignment of a vector to each word in a language: $ W: \textrm{words} \rightarrow \mathbb{R}^n $. Typically, the assignment is learned from a large corpus, and the vectors are dense and have a relatively small dimensionality (for example, 200 to 500) compared to the vectors used in distributional semantics models.

Good practices for training word embeddings, as summarized by Lai et al. (2016)[1]:

  1. "First, we discover that corpus domain is more important than corpus size. We recommend choosing a corpus in a suitable domain for the desired task, after that, using a larger corpus yields better results.
  2. Second, we find that faster models provide sufficient performance in most cases, and more complex models can be used if the training corpus is sufficiently large.
  3. Third, the early stopping metric for iterating should rely on the development set of the desired task rather than the validation loss of training embedding"

Characteristics

Proximity of similar words


Figure: t-SNE visualizations of word embeddings. Left: number region; right: jobs region. From Turian et al. (2010)[2].

In the embedding space, words tend to form clusters of related meaning, and synonymous words tend to lie closest to each other.
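Proximity is usually measured with cosine similarity. The following is a minimal sketch, not taken from any of the cited papers: the vectors in `W` are invented 3-dimensional toy values, and the helper names `cosine` and `nearest` are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
W = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
    "truck": [0.2, 0.1, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embeddings, k=1):
    """Return the k words most similar to `word`, excluding the word itself."""
    scored = [(cosine(embeddings[word], vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]

print(nearest("cat", W))
print(nearest("car", W))
```

With real embeddings the same lookup is normally done with a vectorized or approximate nearest-neighbour search, since a vocabulary can contain hundreds of thousands of words.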

Algebraic relation

Some simple relations are found to be represented by a constant difference vector across pairs of words. For example:

$ W(\textrm{woman}) - W(\textrm{man}) \approx W(\textrm{queen}) - W(\textrm{king}) $

Similar observations were made for capital-country, celebrity-job, president-country, chairman-company, and other relations[3].
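The relation above is often used to answer analogy questions of the form "man is to woman as king is to ?" by searching for the word whose vector is closest to $ W(\textrm{king}) - W(\textrm{man}) + W(\textrm{woman}) $. The sketch below assumes invented toy vectors (axis 0 loosely stands for royalty, axis 1 for gender, axis 2 for fruit) and a hypothetical helper `analogy`; real embeddings satisfy the relation only approximately.

```python
import math

# Toy vectors, invented for illustration only.
W = {
    "king":  [0.9, 0.1, 0.0],
    "queen": [0.9, 0.9, 0.0],
    "man":   [0.1, 0.1, 0.0],
    "woman": [0.1, 0.9, 0.0],
    "apple": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, embeddings):
    """Answer "a is to b as c is to ?" using the offset W(b) - W(a) + W(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    # The three query words are excluded from the candidates, as is common practice.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

print(analogy("man", "woman", "king", W))
```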

Basis or the usage of sub-word features

TODO: Bian (2014)[4], Qing Cui et al. (2014)[5].

Sources of information

Text


Knowledge graph

Many methods combine text with a knowledge graph to obtain better word embeddings: retrofitting (Faruqui et al. 2015)[6], Liu et al. (2016)[7], and Xu et al. (2014)[8].


  • M. Yu, M. Dredze, Improving lexical embeddings with semantic knowledge, in: ACL (2), 2014, pp. 545–550.
  • J. Bian, B. Gao, T.-Y. Liu, Knowledge-powered deep learning for word embedding, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2014, pp. 132–148.
  • C. Xu, Y. Bai, J. Bian, B. Gao, G. Wang, X. Liu, T.-Y. Liu, RC-NET: A general framework for incorporating knowledge into word representations, in: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, ACM, 2014, pp. 1219–1228.
  • Q. Liu, H. Jiang, S. Wei, Z.-H. Ling, Y. Hu, Learning semantic word embeddings based on ordinal knowledge constraints, in: Proceedings of ACL, 2015, pp. 1501–1511.
  • [9][10]

TODO: comparisons between methods???

Retrofitting

Faruqui et al. (2015)[6]: "we first train the word vectors independent of the information in the semantic lexicons and then retrofit them".
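The post-processing step can be sketched as an iterative update that pulls each word vector toward its neighbours in the lexicon while keeping it close to its original value. The code below is a simplified illustration, not the authors' implementation: it assumes the weights $ \alpha_i = 1 $ and $ \beta_{ij} = 1 / \textrm{degree}(i) $ and 10 iterations, and the toy vectors and lexicon are invented.

```python
def retrofit(original, lexicon, iterations=10):
    """Return new vectors that stay near `original` but move toward lexicon neighbours.

    original: dict mapping word -> list of floats (pre-trained vectors)
    lexicon:  dict mapping word -> list of related words (e.g. synonyms)
    """
    new = {word: list(vec) for word, vec in original.items()}
    for _ in range(iterations):
        for word, related in lexicon.items():
            neighbours = [n for n in related if n in new]
            if word not in new or not neighbours:
                continue  # nothing to retrofit for this word
            alpha = 1.0                    # weight of the original vector (assumed)
            beta = 1.0 / len(neighbours)   # weight of each neighbour (assumed)
            updated = []
            for d in range(len(new[word])):
                neighbour_sum = sum(beta * new[n][d] for n in neighbours)
                total_weight = beta * len(neighbours) + alpha
                updated.append((neighbour_sum + alpha * original[word][d]) / total_weight)
            new[word] = updated
    return new

# Toy data, invented for illustration only.
original = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "sad": [-1.0, 0.0]}
lexicon = {"happy": ["glad"], "glad": ["happy"]}
retrofitted = retrofit(original, lexicon)
print(retrofitted)
```

In this toy run the vectors of "happy" and "glad" move toward each other, while "sad", which has no lexicon entry, is left unchanged.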

Inequality

From Liu et al. (2016): "the knowledge constraints are formulized as semantic similarity inequalities between two word pairs...

semantic inequalities from WordNet: 1) Similarities between a word and its synonymous words are larger than similarities between the word and its antonymous words. A typical example is similarity(happy, glad) > similarity(happy, sad). 2) Similarities of words that belong to the same semantic category would be larger than similarities of words that belong to different categories. 3) Similarities between words that have shorter distances in a semantic hierarchy should be larger than similarities of words that have longer distances."
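One plausible way to turn such an inequality into a training signal is a hinge penalty that is zero when the inequality holds and grows with the size of the violation. The sketch below illustrates that idea only; whether it matches the exact objective used in the paper is an assumption, and the function name `hinge_penalty`, the `margin` parameter, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def hinge_penalty(anchor, closer, farther, embeddings, margin=0.0):
    """Penalty for violating similarity(anchor, closer) > similarity(anchor, farther)."""
    sim_close = cosine(embeddings[anchor], embeddings[closer])
    sim_far = cosine(embeddings[anchor], embeddings[farther])
    return max(0.0, margin + sim_far - sim_close)

# Toy vectors, invented for illustration only.
satisfying = {"happy": [1.0, 0.2], "glad": [0.9, 0.3], "sad": [-0.8, 0.4]}
violating = {"happy": [1.0, 0.2], "glad": [-0.8, 0.4], "sad": [0.9, 0.3]}

# similarity(happy, glad) > similarity(happy, sad) holds in the first set only.
print(hinge_penalty("happy", "glad", "sad", satisfying))
print(hinge_penalty("happy", "glad", "sad", violating))
```

Summing such penalties over all extracted inequalities would give a term that can be added to the usual embedding training loss.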

Models

  • CBOW
  • Skip-gram
  • CLOW (continuous list of words): Trask et al. 2015[11]
  • PENN (partitioned embedding neural network): Trask et al. 2015[11]
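The difference between CBOW and Skip-gram lies in what is predicted from what: Skip-gram uses the centre word to predict each surrounding word, while CBOW uses the surrounding words together to predict the centre word. The sketch below only builds the training examples for each model from a tokenized sentence; the function names and the `window` parameter are hypothetical, and the actual training (softmax approximation, subsampling, and so on) is omitted.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding position i itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_pairs(tokens, window=2):
    """(centre, context) pairs: the centre word predicts each context word."""
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]

def cbow_examples(tokens, window=2):
    """(context words, centre) examples: the context jointly predicts the centre word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

sentence = ["the", "cat", "sat"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```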

Applications

Adding word embeddings to a system


  • Chunking: Turian et al. (2010)[12]: "We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines."
  • NER: Turian et al. (2010)[12]


  • Dependency parsing: Bansal et al. (2014)[13]: "We compare several popular embeddings to Brown clusters, via multiple types of features, in both news and web domains. We find that all embeddings yield significant parsing gains, including some recent ones that can be trained in a fraction of the time of others."
  • Named-entity recognition:
    • Tweets: Cherry and Guo (2015)[14]: "we build Brown clusters and word vectors, enabling generalizations across distributionally similar words [...] Taken all together, we establish a new state-of-the-art on two common test sets"
    • CVs: Tosik et al. (2015)[15]: "The best results on the extraction task are obtained by the model which integrates the word embeddings together with a number of hand-crafted features."
  • Implicit discourse relation classification: Braud and Denis (2015)[16]: "Our main finding is that denser representations systematically outperform sparser ones and give state-of-the-art performance or above without the need for additional hand-crafted features."
  • Word sense disambiguation: Iacobacci (2016)[17]: "We show how a WSD system that makes use of word embeddings alone, if designed properly, can provide significant performance improvement over a state-of-the-art WSD system that incorporates several standard WSD features."

Word embeddings with a neural network

Evaluation

Most intrinsic evaluation datasets fail to predict extrinsic performance; SimLex-999 is the exception (Chiu et al. 2016)[18]. Rogers et al. (2018)[19] study how intrinsic factors relate to performance on many different tasks.

Intrinsic evaluation


  • Wordsim-353 (Finkelstein et al. 2001), MC-30 (Miller and Charles 1991), RG-65: small, old datasets that shouldn't be used any more. They also mix up similarity and relatedness.
  • WS-Rel and WS-Sim (Agirre et al. 2009)
  • MEN (Bruni et al. 2012)
  • SimLex-999 (Hill et al. 2015)
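Word-similarity datasets such as these are typically used by computing the similarity of each word pair under the embedding and reporting Spearman's rank correlation with the human ratings. The sketch below shows that procedure in plain Python; the toy vectors and the ratings in `rated_pairs` are invented and do not come from any of the datasets above, and the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def evaluate(embeddings, rated_pairs):
    """Correlate model similarities with human ratings for (word1, word2, rating) triples."""
    model = [cosine(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs]
    human = [rating for _, _, rating in rated_pairs]
    return spearman(model, human)

# Toy vectors and ratings, invented for illustration only.
W = {"cat": [0.9, 0.8, 0.1], "dog": [0.8, 0.9, 0.2],
     "car": [0.1, 0.2, 0.9], "truck": [0.2, 0.1, 0.8]}
rated_pairs = [("cat", "dog", 9.0), ("car", "truck", 8.5),
               ("cat", "car", 2.0), ("dog", "truck", 1.5)]
print(evaluate(W, rated_pairs))
```

Word pairs containing out-of-vocabulary words have to be skipped or assigned a default score in practice; this sketch assumes every word has a vector.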

Extrinsic evaluation

Nayak et al. (2016)[20] proposed a suite of tasks to evaluate word embeddings.

External links

Source code

  • Retrofitting: github
  • gensim implementation of Word2vec

References

  1. Lai, S., Liu, K., He, S., & Zhao, J. (2016). How to generate a good word embedding. IEEE Intelligent Systems.
  2. Turian, J., Ratinov, L., & Bengio, Y. (2010, July). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 384-394). Association for Computational Linguistics.
  3. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  4. Bian, J., Gao, B., & Liu, T. Y. (2014). Knowledge-powered deep learning for word embedding. In Machine Learning and Knowledge Discovery in Databases (pp. 132-148). Springer Berlin Heidelberg.
  6. Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., & Smith, N. A. (2015). Retrofitting Word Vectors to Semantic Lexicons. In NAACL 2015 (pp. 1606–1615). Denver, Colorado: ACL.
  7. Liu, Q., Jiang, H., Ling, Z.-H., Zhu, X., Wei, S., & Hu, Y. (2016). Commonsense Knowledge Enhanced Embeddings for Solving Pronoun Disambiguation Problems in Winograd Schema Challenge.
  8. Xu, C., Bai, Y., Bian, J., Gao, B., Wang, G., Liu, X., & Liu, T. Y. (2014). RC-NET: A General Framework for Incorporating Knowledge into Word Representations.
  9. Weston, Jason, et al. "Connecting language and knowledge bases with embedding models for relation extraction." arXiv preprint arXiv:1307.7973 (2013).
  10. Yu, Mo, and Mark Dredze. "Improving lexical embeddings with semantic knowledge." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Vol. 2. 2014.
  11. Trask, A., Gilmore, D., & Russell, M. (2015). Modeling Order in Neural Word Embeddings at Scale. arXiv preprint arXiv:1506.02338.
  12. Turian, J., Ratinov, L., & Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 384–394.
  13. Bansal, M., Gimpel, K., & Livescu, K. (2014). Tailoring continuous word representations for dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (Vol. 2, pp. 809-815).
  14. Cherry, C., & Guo, H. (2015). The Unreasonable Effectiveness of Word Representations for Twitter Named Entity Recognition. Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, 735–745.
  15. Tosik, M., Rotaru, M., Goossen, G., & Hansen, C. L. (2015). Word Embeddings vs Word Types for Sequence Labeling : the Curious Case of CV Parsing. Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 123–128.
  16. Braud, C., & Denis, P. (2015). Comparing word representations for implicit discourse relation classification. Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, 2201–2211.
  17. Iacobacci, I., Pilehvar, M. T., & Navigli, R. (2016). Embeddings for Word Sense Disambiguation: An Evaluation Study. In ACL 2016 (pp. 897–907). Association for Computational Linguistics.
  18. Chiu, B., Korhonen, A., & Pyysalo, S. (2016). Intrinsic Evaluation of Word Vectors Fails to Predict Extrinsic Performance. Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, 1–6.
  19. Rogers, A., Ananthakrishna, S. H., & Rumshisky, A. (2018). What’s in Your Embedding, And How It Predicts Task Performance. Proceedings of the 27th International Conference on Computational Linguistics, 2690–2703.
  20. Nayak, N., Angeli, G., & Manning, C. D. (2016). Evaluating Word Embeddings Using a Representative Suite of Practical Tasks. Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, 19–23.