
TODO: an interesting paper with important references: https://arxiv.org/pdf/1702.01417.pdf

Word embedding is an assignment of a vector to each word in a language: $W: \textrm{words} \rightarrow \mathbb{R}^n$. The assignment is typically learned from a large corpus; the vectors are dense and of relatively low dimensionality (for example, 200 to 500) compared with distributional semantics models.
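
The mapping $W$ can be pictured as a simple lookup table from words to dense vectors. A minimal sketch (the 4-dimensional vectors below are invented for illustration; trained embeddings would have hundreds of dimensions):

```python
# Toy embedding table illustrating W: words -> R^n (here n = 4).
# The values are made up; a real W is learned from a corpus.
W = {
    "king":  [0.8, 0.6, 0.1, 0.9],
    "queen": [0.8, 0.6, 0.9, 0.1],
    "man":   [0.2, 0.1, 0.1, 0.9],
    "woman": [0.2, 0.1, 0.9, 0.1],
}

def embed(word):
    """Look up the dense vector assigned to a word."""
    return W[word]

print(embed("king"))  # a dense, low-dimensional vector
```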

Good practices for training word embeddings, from Lai et al. (2016)[1]:

1. "First, we discover that corpus domain is more important than corpus size. We recommend choosing a corpus in a suitable domain for the desired task, after that, using a larger corpus yields better results.
2. Second, we find that faster models provide sufficient performance in most cases, and more complex models can be used if the training corpus is sufficiently large.
3. Third, the early stopping metric for iterating should rely on the development set of the desired task rather than the validation loss of training embedding"

## Characteristics

### Proximity of similar words

Words in high-dimensional space tend to form clusters of related meaning and synonymous words are closest to each other.
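
Proximity is usually measured with cosine similarity. A minimal sketch with invented 3-dimensional vectors, where the synonym pair ends up closest:

```python
import math

# Toy vectors (invented for illustration): synonymous words point in
# nearby directions, so their cosine similarity is highest.
vecs = {
    "happy": [0.9, 0.8, 0.1],
    "glad":  [0.85, 0.75, 0.15],
    "sad":   [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(word):
    """Return the other word with the highest cosine similarity."""
    return max((w for w in vecs if w != word),
               key=lambda w: cosine(vecs[word], vecs[w]))

print(nearest("happy"))  # -> "glad"
```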

### Algebraic relation

Some simple relations are found to be represented by a roughly constant difference vector across pairs of words. For example:

$W(\textrm{woman}) - W(\textrm{man}) \approx W(\textrm{queen}) - W(\textrm{king})$

Similar observations have been made for capital–country, celebrity–job, president–country, and chairman–company pairs.[3]
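
The relation above can be exploited for analogy queries: compute $W(\textrm{king}) - W(\textrm{man}) + W(\textrm{woman})$ and search for the nearest word vector. A sketch with toy 4-dimensional vectors constructed so the relation holds exactly (real embeddings satisfy it only approximately, which is why the nearest-neighbour step is needed):

```python
# Toy vectors: the last two dimensions loosely encode "gender",
# the first two "royalty". Values are invented for illustration.
W = {
    "king":  [0.8, 0.6, 0.1, 0.9],
    "queen": [0.8, 0.6, 0.9, 0.1],
    "man":   [0.2, 0.1, 0.1, 0.9],
    "woman": [0.2, 0.1, 0.9, 0.1],
}

def add(u, v): return [a + b for a, b in zip(u, v)]
def sub(u, v): return [a - b for a, b in zip(u, v)]

# W(king) - W(man) + W(woman) should land near W(queen).
target = add(sub(W["king"], W["man"]), W["woman"])

def closest(vec, exclude):
    """Nearest word by squared Euclidean distance, skipping query words."""
    return min((w for w in W if w not in exclude),
               key=lambda w: sum((a - b) ** 2 for a, b in zip(vec, W[w])))

print(closest(target, {"king", "man", "woman"}))  # -> "queen"
```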

## Sub-word features

TODO: Bian (2014)[4], Qing Cui et al. (2014)[5].

## Sources of information

Word2vec

### Knowledge graph

Many methods combine text and a knowledge graph to obtain better word embeddings: retrofitting (Faruqui et al. 2015)[6], Liu et al. (2016)[7], and Xu et al. (2014)[8].

TODO:

• M. Yu, M. Dredze, Improving lexical embeddings with semantic knowledge, in: ACL (2), 2014, pp. 545–550.
• J. Bian, B. Gao, T.-Y. Liu, Knowledge-powered deep learning for word embedding, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2014, pp. 132–148.
• C. Xu, Y. Bai, J. Bian, B. Gao, G. Wang, X. Liu, T.-Y. Liu, RC-NET: A general framework for incorporating knowledge into word representations, in: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, ACM, 2014, pp. 1219–1228.
• Q. Liu, H. Jiang, S. Wei, Z.-H. Ling, Y. Hu, Learning semantic word embeddings based on ordinal knowledge constraints, in: Proceedings of ACL, 2015, pp. 1501–1511.
• [9][10]

TODO: comparisons between methods?

#### Retrofitting

Faruqui et al. (2015)[6]: "we first train the word vectors independent of the information in the semantic lexicons and then retrofit them".
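
A sketch of the iterative retrofitting update from Faruqui et al. (2015): each vector is pulled toward its lexicon neighbours while staying anchored to its pre-trained value. The weights $\alpha_i = 1$ and $\beta_{ij} = 1/\textrm{degree}(i)$ follow a common setting of the paper; the toy vectors and lexicon below are invented:

```python
# Toy pre-trained vectors and a tiny semantic lexicon (both invented).
original = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "joyful": [0.5, 0.5]}
lexicon = {"happy": ["glad", "joyful"], "glad": ["happy"], "joyful": ["happy"]}

def retrofit(vectors, lexicon, iterations=10):
    """Iteratively average each vector with its lexicon neighbours,
    keeping a pull (alpha = 1) toward the original pre-trained vector."""
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            beta = 1.0 / len(neighbours)
            # numerator: alpha * original vector + sum of beta * neighbours
            num = list(vectors[word])
            denom = 1.0  # alpha
            for n in neighbours:
                num = [x + beta * y for x, y in zip(num, new[n])]
                denom += beta
            new[word] = [x / denom for x in num]
    return new

fitted = retrofit(original, lexicon)
# "happy" and "glad" end up closer together than in the original space.
```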

#### Inequality

From Liu et al. (2016): "the knowledge constraints are formulized as semantic similarity inequalities between two word pairs...

semantic inequalities from WordNet: 1) Similarities between a word and its synonymous words are larger than similarities between the word and its antonymous words. A typical example is similarity(happy, glad) > similarity(happy, sad). 2) Similarities of words that belong to the same semantic category would be larger than similarities of words that belong to different categories. 3) Similarities between words that have shorter distances in a semantic hierarchy should be larger than similarities of words that have longer distances."
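
One plausible way to score such an inequality is a hinge penalty that is zero when the constraint holds (an illustrative formulation, not necessarily Liu et al.'s exact objective; the toy vectors are invented):

```python
import math

# Toy vectors (invented): "happy" is close to "glad", far from "sad".
vecs = {
    "happy": [0.9, 0.8, 0.1],
    "glad":  [0.85, 0.75, 0.15],
    "sad":   [0.1, 0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def inequality_hinge(sim_greater, sim_lesser, margin=0.0):
    """Zero penalty when sim_greater >= sim_lesser + margin holds."""
    return max(0.0, margin + sim_lesser - sim_greater)

# Constraint: similarity(happy, glad) > similarity(happy, sad).
loss = inequality_hinge(cosine(vecs["happy"], vecs["glad"]),
                        cosine(vecs["happy"], vecs["sad"]))
print(loss)  # 0.0: the constraint is already satisfied
```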

## Models

• CBOW
• Skip-gram
• CLOW (continuous list of words): Trask et al. 2015[11]
• PENN (partitioned embedding neural network): Trask et al. 2015[11]

## Applications

### Adding word embeddings to a system

Pre-word2vec:

• Chunking: Turian et al. (2010)[12]: "We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines."
• NER: Turian et al. (2010)[12]

Post-word2vec:

• Dependency parsing: Bansal et al. (2014)[13]: "We compare several popular embeddings to Brown clusters, via multiple types of features, in both news and web domains. We find that all embeddings yield significant parsing gains, including some recent ones that can be trained in a fraction of the time of others."
• Named-entity recognition:
• Tweets: Cherry and Guo (2015)[14]: "we build Brown clusters and word vectors, enabling generalizations across distributionally similar words [...] Taken all together, we establish a new state-of-the-art on two common test sets"
• CVs: Tosik et al. (2015)[15]: "The best results on the extraction task are obtained by the model which integrates the word embeddings together with a number of hand-crafted features."
• Implicit discourse relation classification: Braud and Denis (2015)[16]: "Our main finding is that denser representations systematically outperform sparser ones and give state-of-the-art performance or above without the need for additional hand-crafted features."
• Word sense disambiguation: Iacobacci (2016)[17]: "We show how a WSD system that makes use of word embeddings alone, if designed properly, can provide significant performance improvement over a state-of-the-art WSD system that incorporates several standard WSD features."

## Evaluation

Most intrinsic evaluation datasets fail to predict extrinsic performance, with the exception of SimLex-999 (Chiu et al. 2016)[18]. Rogers et al. (2018)[19] study the relationship of intrinsic factors with performance on many different tasks.

### Intrinsic evaluation

Datasets:

• Wordsim-353 (Finkelstein et al. 2001), MC-30 (Miller and Charles 1991), RG-65 (Rubenstein and Goodenough 1965): small, old datasets that shouldn't be used any more. They also conflate similarity and relatedness.
• WS-Rel and WS-Sim (Agirre et al. 2009)
• MEN (Bruni et al. 2012)
• SimLex-999 (Hill et al. 2015)
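
These datasets pair human similarity ratings with word pairs; the standard protocol scores each pair with the model and reports the Spearman rank correlation against the human ratings. A sketch with invented ratings and model scores:

```python
# Intrinsic word-similarity evaluation sketch: correlate model scores
# with human ratings via Spearman's rho. Data below is invented;
# a real evaluation would use e.g. the SimLex-999 pairs.
human = [9.5, 8.0, 2.1, 1.0]      # gold ratings for four word pairs
model = [0.92, 0.85, 0.30, 0.15]  # cosine similarities from the model

def ranks(xs):
    """Rank positions (1-based); assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def spearman(xs, ys):
    """Spearman rho for untied data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman(human, model))  # perfectly monotone toy data -> 1.0
```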

### Extrinsic evaluation

Nayak et al. (2016)[20] proposed a suite of tasks to evaluate word embeddings.
