Natural Language Understanding Wiki

Main idea: initialize the embedding layer with pre-trained vectors, so that similar words start training with similar vectors instead of unrelated random ones (see Goldberg (2015)[1], Sections 5.1-5.3).

Chen and Manning (2014)[2]: "Figure 4 (middle) shows that using pre-trained word embeddings can obtain around 0.7% improvement on PTB and 1.7% improvement on CTB, compared with using random initialization within (−0.01, 0.01)."
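The two initialization schemes compared in that quote can be sketched as follows. This is a minimal illustration, not the parser's actual code: the function name `build_embedding_table`, the toy vectors, and the choice to fall back to random values for words without a pre-trained vector are all assumptions made for the example.

```python
import random

def build_embedding_table(vocab, pretrained, dim, scale=0.01, seed=0):
    """Return one vector per word in `vocab`, in vocabulary order.

    Words found in `pretrained` copy their pre-trained vector; all other
    words are drawn uniformly from the range (-scale, scale).
    """
    rng = random.Random(seed)
    table = []
    for word in vocab:
        if word in pretrained:
            vec = list(pretrained[word])  # copy so later training cannot alter the source
            if len(vec) != dim:
                raise ValueError(f"expected {dim} dimensions for {word!r}")
        else:
            vec = [rng.uniform(-scale, scale) for _ in range(dim)]
        table.append(vec)
    return table

# Hypothetical toy data; real vectors would be trained on a large unlabeled corpus.
pretrained = {"cat": [0.8, 0.1, 0.3], "dog": [0.7, 0.2, 0.3]}
vocab = ["cat", "dog", "zyzzyva"]
table = build_embedding_table(vocab, pretrained, dim=3)
```

Here "cat" and "dog" start out close to each other, while the unseen word "zyzzyva" gets small random values. Passing an empty `pretrained` dictionary gives the purely random baseline.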

Lebret et al. (2013)[3]: "This task confirms the importance of embedding fine-tuning for NLP tasks with a high semantic component. We note that our tuned embeddings leads to a performance gain of about 1% or 2% for NER, while the gain is between about 4% and 8% for the movie review."
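Fine-tuning here means treating the embedding table as trainable parameters, so the task loss keeps adjusting the vectors after initialization. A minimal sketch of one gradient step, assuming plain SGD; the function name `sgd_update_embeddings` and the numbers are hypothetical and not taken from the cited report.

```python
def sgd_update_embeddings(table, word_ids, grads, lr=0.1, fine_tune=True):
    """Apply one SGD step to the rows of `table` used by the current example.

    word_ids[i] is the row looked up at position i, and grads[i] is the loss
    gradient with respect to that looked-up vector.
    """
    if not fine_tune:
        return table  # frozen: the pre-trained vectors stay fixed
    for idx, grad in zip(word_ids, grads):
        row = table[idx]
        for j, g in enumerate(grad):
            row[j] -= lr * g  # only rows that occurred in the example move
    return table

# Hypothetical two-word table; only row 1 occurs in this training example.
table = [[0.5, 0.5], [0.1, -0.1]]
sgd_update_embeddings(table, word_ids=[1], grads=[[1.0, -1.0]], lr=0.1)
```

With `fine_tune=False` the table is returned unchanged, which corresponds to using the embeddings as fixed features.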

Pei et al. (2014)[4]: "Previous work found that the performance can be improved by pre-training the character embeddings on large unlabeled data and using the obtained embeddings to initialize the character lookup table instead of random initialization (Mansur et al., 2013; Zheng et al., 2013). [...] We pre-train the embeddings on the Chinese Giga-word corpus (Graff and Chen, 2005). As shown in Table 5 (last three rows), both the F-score and OOV recall of our model boost by using pre-training."

See also

Re-embedding words (Labutov and Lipson, 2013)[5]: an early idea from the period when word embeddings were first being explored; rarely used in later work.


  1. Goldberg, Yoav. "A Primer on Neural Network Models for Natural Language Processing." 2015, pp. 1–76.
  2. Chen, Danqi, and Christopher D. Manning. "A Fast and Accurate Dependency Parser using Neural Networks." EMNLP. 2014.
  3. Lebret, Rémi, Joël Legrand, and Ronan Collobert. "Is deep learning really necessary for word embeddings?" No. EPFL-REPORT-196986. Idiap, 2013.
  4. Pei, Wenzhe, Tao Ge, and Baobao Chang. "Max-Margin Tensor Neural Network for Chinese Word Segmentation." ACL (1). 2014.
  5. Labutov, Igor, and Hod Lipson. "Re-embedding words." ACL (2). 2013.