The simplest approach in NLP is to look for individual words. Because it is simple and requires few resources, it has always been very common; for example, the best search engines are word-based.
Some projects that employ the word-based approach, according to Cambria & White (2014)[1]:
- Ortony’s Affective Lexicon (Ortony, Clore, & Collins, 1988), which groups words into affective categories
- Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1994), a corpus consisting of over 4.5 million words of American English annotated for part-of-speech (POS) information
- PageRank (Page, Brin, Motwani, & Winograd, 1999), the famous ranking algorithm of Google
- LexRank (Günes & Radev, 2004), a stochastic graph-based method for computing the relative importance of textual units for NLP
- TextRank (Mihalcea & Tarau, 2004), a graph-based ranking model for text processing, based on two unsupervised methods for keyword and sentence extraction
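The graph-based ranking idea behind the last three entries can be illustrated with a minimal sketch. It is a simplification in the spirit of TextRank keyword extraction, not the published algorithm: words that occur near each other are linked, and a PageRank-style score is iterated over that graph. The function name `rank_words`, the tokenizer, the `window` size, and the fixed iteration count are all assumptions made for illustration; there is no stop-word or part-of-speech filtering.

```python
import re
from collections import defaultdict


def rank_words(text, window=2, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score over a co-occurrence graph."""
    # Crude tokenizer (an assumption): lowercase alphabetic runs only.
    words = re.findall(r"[a-z]+", text.lower())

    # Undirected graph: link words that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate score(w) = (1 - d) + d * sum(score(n) / degree(n)) over neighbors n.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest-scoring words first.
    return sorted(scores, key=scores.get, reverse=True)


print(rank_words("cat dog cat bird cat fish cat mouse", window=1))
```

A word that co-occurs with many different words collects score from all of them, so it rises to the top of the ranking even though the method only ever looks at surface word forms.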
Drawback
- Reliance on surface features: a document about dogs may never use the word "dog" because it refers to specific breed names instead.
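A small sketch makes this drawback concrete. The helper `contains_keyword` and the sample sentence are hypothetical, chosen only to illustrate plain keyword matching; real word-based systems are more elaborate but face the same limitation.

```python
import re


def contains_keyword(document, keyword):
    """Return True if `keyword` occurs as a whole word in `document` (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return keyword.lower() in tokens


# Hypothetical document that is clearly about dogs but only names breeds.
doc = "Our beagle and our poodle love chasing the labrador next door."

print(contains_keyword(doc, "dog"))     # False: the word "dog" never appears
print(contains_keyword(doc, "poodle"))  # True: the breed name is on the surface
```

A query for "dog" misses the document entirely, because nothing in the matching step knows that a beagle, a poodle, or a labrador is a dog.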
References
- [1] Cambria, E., & White, B. (2014). Jumping NLP curves: A review of natural language processing research. IEEE Computational Intelligence Magazine, 9(2), 48-57.