Natural Language Understanding Wiki

According to Herbelot and Baroni (2017)[1], there are at least two reasons for acquiring word vectors from small data: rare but important words (the long tail) and cognitive plausibility.


From Herbelot and Baroni (2017)[1]: "One way to deal with data sparsity issues when learning word vectors is to use morphological structure as a way to overcome the lack of primary data (Lazaridou et al., 2013; Luong et al., 2013; Kisselew et al., 2015; Padó et al., 2016). Whilst such work has shown promising result, it is only applicable when there is transparent morphology to fall back on. Another strand of research has been started by Lazaridou et al. (2017), who recently showed that by using simple summation over the (previously learnt) contexts of a nonce word, it is possible to obtain good correlation with human judgments in a similarity task."
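The summation idea mentioned in the quote can be sketched in a few lines. This is only an illustration: the toy embeddings, the helper name `additive_nonce_vector`, and the choice to skip unknown words are assumptions, not details taken from either paper.

```python
# Toy 3-dimensional embeddings with made-up values. In practice these would be
# vectors previously learnt on a large corpus.
EMBEDDINGS = {
    "screen": [0.9, 0.1, 0.0],
    "text": [0.8, 0.3, 0.1],
    "editor": [0.7, 0.2, 0.2],
    "unix": [0.6, 0.0, 0.4],
}


def additive_nonce_vector(sentence, nonce, embeddings):
    """Sum the known vectors of the words that co-occur with `nonce`.

    Hypothetical helper: tokens without an embedding (and the nonce itself)
    are simply skipped.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for token in sentence.split():
        if token == nonce or token not in embeddings:
            continue
        total = [t + v for t, v in zip(total, embeddings[token])]
    return total


sentence = "___ is a screen oriented text editor originally created for the unix operating system"
nonce_vec = additive_nonce_vector(sentence, "___", EMBEDDINGS)
```

Here `nonce_vec` is the sum of the vectors for "screen", "text", "editor" and "unix". A real setup would probably also filter out very frequent function words before summing.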

Herbelot and Baroni (2017)[1] simply tune Word2vec (initialization, parameter choice, window size, subsampling, selective training). They compare only against the original Word2vec and against the summation of the context-word embeddings, not against morphological approaches. They do not report:

  • which hyperparameters are the most useful
  • the stability of the results (which can be quite poor for small datasets)
  • a comparison of the new learning policy with the original policy starting from the same initialization (i.e. the sum)


There are at least two very interesting nonce datasets: one from Herbelot and Baroni and another from Lazaridou et al.

Definitional nonce dataset

Reference: Herbelot and Baroni (2017)[1]


Statistics: 700 training, 300 testing

The dataset includes first sentences from Wikipedia, such as: "vi is a screen oriented text editor originally created for the unix operating system".

The task is to approximate, as closely as possible, the vectors trained on the full Wikipedia (the "gold" vectors). Note that "gold" here does not mean annotated by humans.
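One plausible way to measure how close a learned vector is to its gold vector is cosine similarity, together with the rank of the gold word among the neighbours of the learned vector. This page does not state which metric the paper uses, so the sketch below is an assumption; the toy vectors and the helper names `cosine` and `rank_of_gold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_of_gold(learned, gold_word, gold_space):
    """1-based rank of `gold_word` when all words are sorted by similarity to `learned`."""
    ranked = sorted(
        gold_space,
        key=lambda word: cosine(learned, gold_space[word]),
        reverse=True,
    )
    return ranked.index(gold_word) + 1


# Made-up "gold" vectors standing in for vectors trained on the full Wikipedia.
GOLD_SPACE = {
    "vi": [0.9, 0.2, 0.1],
    "emacs": [0.8, 0.3, 0.3],
    "granada": [0.0, 0.9, 0.4],
}

# Made-up vector standing in for one learned from a single definition.
learned_vi = [1.0, 0.2, 0.2]

similarity = cosine(learned_vi, GOLD_SPACE["vi"])
rank = rank_of_gold(learned_vi, "vi", GOLD_SPACE)
```

A rank of 1 means the gold vector of the target word is the nearest neighbour of the learned vector in the gold space.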


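Some instances look problematic, because a slot replaces a word that is not the target concept:

  • "granada ___ is a city and the capital of the province of ___ in the autonomous community of andalusia spain": the two occurrences refer to different things (the city and the province), so the second one shouldn't have been replaced by a slot.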
  • "hove ___ is a town on the south coast of england immediately to the west of its larger neighbour brighton with which it forms the unitary authority brighton and ___": similar to the previous one but a bit more subtle. The second "Hove" is the same thing but considered as part of something bigger.
  • "southwark ___ is a district of central london and part of the london borough of ___": there seems to be a pattern here
  • "mclaren ___ racing limited trading as ___ honda is a british formula one team based at the ___ technology centre woking surrey england": name of a team and name of a region
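The over-replacement in the examples above is what would result if every token equal to the target word were replaced by the slot. The sketch below illustrates that guess; it is not the authors' documented preprocessing, and `mask_all` and `mask_first` are hypothetical helpers.

```python
def mask_all(tokens, target, slot="___"):
    """Replace every token equal to `target` with the slot (assumed behaviour)."""
    return [slot if tok == target else tok for tok in tokens]


def mask_first(tokens, target, slot="___"):
    """Replace only the first occurrence of `target` with the slot."""
    masked = list(tokens)
    if target in masked:
        masked[masked.index(target)] = slot
    return masked


sentence = (
    "granada is a city and the capital of the province of granada "
    "in the autonomous community of andalusia spain"
)
tokens = sentence.split()

all_masked = " ".join(mask_all(tokens, "granada"))
first_masked = " ".join(mask_first(tokens, "granada"))
```

With `mask_all` the province name is hidden as well, which reproduces the pattern seen in the dataset; `mask_first` keeps it. Masking only the first occurrence is itself just a heuristic: in the McLaren example some later occurrences arguably do refer to the same team.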

Chimera dataset

Reference: Lazaridou et al. (2017)[2]


Statistics: 220 training and 110 testing

A description by Herbelot and Baroni: "This dataset was specifically constructed to simulate a nonce situation where a speaker encounters a word for the first time in naturally-occurring (and not necessarily informative) sentences. Each instance in the data is a nonce, associated with 2-6 sentences showing the word in context. The novel concept is created as a ‘chimera’, i.e. a mixture of two existing and somewhat related concepts (e.g., a buffalo crossed with an elephant)."


  1. Herbelot, A., & Baroni, M. (2017). High-risk learning: acquiring new word vectors from tiny data. EMNLP 2017, 304–309.
  2. Angeliki Lazaridou, Marco Marelli, Roberto Zamparelli, and Marco Baroni. 2013. Compositionally Derived Representations of Morphologically Complex Words in Distributional Semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), pages 1517–1526, Sofia, Bulgaria.