The difficulties one faces when performing NER for speech are well-known (DARPA, 1998).

Problem

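One difference from text is that speech transcripts carry no capitalization. This makes the task somewhat similar to German NER, where all nouns are capitalized so capitalization is a weak cue for names, and which is known to be harder (Faruqui and Padó, 2010)[1].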
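
To illustrate the missing cue, here is a minimal sketch that maps written text to an upper-case, punctuation-free form, roughly what an ASR-style transcript looks like. The function name `to_transcript_like` and the exact normalization rules are assumptions made for illustration; spelling numbers out as words is deliberately left out.

```python
import re


def to_transcript_like(text: str) -> str:
    """Upper-case the text and drop punctuation (apostrophes are kept).

    Hypothetical helper: real transcript conventions also spell numbers
    out as words, which this sketch does not attempt.
    """
    text = text.upper()
    # Replace every run of characters that is not a word character,
    # an apostrophe, or a space with a single space.
    text = re.sub(r"[^\w' ]+", " ", text)
    # Collapse repeated whitespace.
    return " ".join(text.split())


normalized = to_transcript_like("Mr. Brown bought a brown coat.")
print(normalized)
```

After normalization the surname "Brown" and the adjective "brown" are the same token, so a tagger can no longer separate them by case and has to rely on context.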

Hatmi et al. (2013)[2] summarize the problem in their work on incorporating NER into speech recognition:

"Recognizing named entities in automatic transcripts is difficult due to the presence of transcription errors and the absence of some important NER clues, such as capitalization and punctuation. In this paper, we describe a methodology for speech NER which consists of incorporating NER into the ASR process so that the ASR system generates transcripts annotated with named entities."

Another problem is out-of-vocabulary (OOV) words. From Cheng et al. (2015)[3]:
"Most spoken language processing or dialogue systems are based on a finite vocabulary, so occasionally a word used will be out of the vocabulary (OOV), in which case the automatic speech recognition (ASR) system chooses the best matching in-vocabulary sequence of words to cover that region (where acoustic match dominates the decision). The most difficult OOV words to cover are names, since they are less likely to be covered by morpheme-like subword fragments and they often result in anomalous recognition output, e.g.
REF: what can we get at Litanfeeth
HYP: what can we get it leaks on feet
While these errors are rare, they create major problems for language processing, since names tend to be important for many applications. Thus, it is of interest to automatically detect such error regions for additional analysis or human correction."
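
When a reference transcript is available, for example while preparing evaluation data, the error region in the example above can be located by aligning the two word sequences. The sketch below uses Python's standard `difflib` for the alignment; it is only an illustration and is not the detection method of Cheng et al., which has to work without a reference. The function name `error_regions` is hypothetical.

```python
from difflib import SequenceMatcher


def error_regions(ref: str, hyp: str):
    """Return (ref_words, hyp_words) pairs for each span where REF and HYP differ."""
    ref_words, hyp_words = ref.split(), hyp.split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    regions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'equal' spans are correctly recognized; everything else is an error region.
        if tag != "equal":
            regions.append((ref_words[i1:i2], hyp_words[j1:j2]))
    return regions


ref = "what can we get at Litanfeeth"
hyp = "what can we get it leaks on feet"
regions = error_regions(ref, hyp)
print(regions)
```

For this pair the alignment yields a single region in which "at Litanfeeth" was replaced by "it leaks on feet", showing how one OOV name can corrupt several neighbouring words.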

From DARPA (1998):

"... spelling numbers out as words, and upper case in SNOR (Speech Normalized Orthographic Representation) format."

Approaches

Incorporating ASR features into NER

Hatmi et al. (2013)[2]: "The first is to incorporate ASR features into the NER tagger. In [7], an ASR confidence feature is employed to indicate whether each word has been correctly recognized. Automatic transcriptions tagged with named entities are used to model ASR errors. The goal is to reject named entities with ASR errors thereby increasing NER precision. Experiments show a gain in precision of 7.46 %. Recent work [8] has proposed to include features indicative OOV words. A CRF-based tagger exploits the output of an OOV detector in order to identify and ignore regions containing incorrectly transcribed named entities. This allows an improvement in F-measure from 58.5 to 60.7 %."
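
A minimal sketch of what an ASR confidence feature might look like when added to the per-token features of a sequence tagger. The feature names, the bucket thresholds, and the confidence values are all assumptions made for illustration; the cited systems may encode confidence differently.

```python
def token_features(words, confidences, i, low=0.5, high=0.9):
    """Build a feature dictionary for token i, including a bucketed ASR confidence.

    The thresholds `low` and `high` are hypothetical and would need tuning.
    """
    conf = confidences[i]
    if conf < low:
        bucket = "low"
    elif conf < high:
        bucket = "mid"
    else:
        bucket = "high"
    return {
        "word": words[i].lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "</s>",
        "asr_conf": bucket,  # the ASR-derived feature
    }


# Invented confidences: the misrecognized tail gets low scores.
words = ["what", "can", "we", "get", "it", "leaks", "on", "feet"]
confidences = [0.98, 0.97, 0.99, 0.95, 0.41, 0.33, 0.38, 0.45]
features = [token_features(words, confidences, i) for i in range(len(words))]
print(features[5])
```

Bucketing the score rather than passing the raw value keeps the feature categorical, which suits taggers that work with sparse indicator features. A tagger trained on such features can learn that entity labels on low-confidence words are unreliable.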

Consider more alternatives

Hatmi et al. (2013)[2]: "The second approach consists of exploiting intermediate ASR outputs in order to broaden the search space. In [9], an NER system based on maximum entropy is used to annotate the N-best ASR hypothesis. Then a weighted voting based on ASR and NER scores is made to select the most probable named entities, even if they do not occur in the 1-best ASR hypothesis. Experimental results show an improvement of 1.7 % in F-measure. Other work [10] has proposed directly to recognize named entities in the word lattice. The used named entity grammars integrate the words belonging to the ASR lexicon and exploit the whole ASR word lattice in order to extract the N-best list of named entity hypotheses. The ASR and NER scores are attached to each named entity hypothesis. Experimental results show an improvement of 1 % in F-measure."
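
The weighted voting over N-best hypotheses can be sketched as follows. The quoted passage does not give the exact scoring formula, so the linear interpolation of ASR and NER scores, the weight `asr_weight`, and all scores in the example are assumptions.

```python
from collections import defaultdict


def vote_entities(nbest, asr_weight=0.5):
    """Rank entities by accumulated evidence across N-best hypotheses.

    nbest: list of (asr_score, [(entity_text, entity_type, ner_score), ...]).
    Each occurrence of an entity adds an interpolated ASR/NER score
    (an assumed combination, not the formula of the cited work).
    """
    totals = defaultdict(float)
    for asr_score, entities in nbest:
        for text, etype, ner_score in entities:
            totals[(text, etype)] += asr_weight * asr_score + (1 - asr_weight) * ner_score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Invented example: the 1-best hypothesis contains "paris",
# but two lower-ranked hypotheses agree on "harris".
nbest = [
    (0.5, [("paris", "LOC", 0.6)]),
    (0.3, [("harris", "PER", 0.7)]),
    (0.2, [("harris", "PER", 0.8)]),
]
ranked = vote_entities(nbest)
print(ranked[0][0])
```

In this toy input the entity that is absent from the 1-best hypothesis wins the vote because it accumulates support from two alternatives, which is the effect the approach relies on.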

Annotating NER at ASR level

Hatmi et al. (2013)[2]: "The third approach consists of annotating named entities at the ASR level by using an extremely large vocabulary lexicon [11]. Named entities are incorporated as compound words into the lexicon and the language model. This considerably increases the size of the vocabulary (1.8 million words). A one-pass ASR system is used to transcribe the annotated named entities. 500 Japanese spoken queries for a question-answering system are used for evaluation. Results shows an improvement of 2.4 % in F-measure."
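
One way to picture named entities being incorporated as compound words is to rewrite annotated training text so that each entity becomes a single tagged token, which can then enter the lexicon and the language model. The sketch below does this rewrite; the token format `new_york<LOC>` and the function name are assumptions, not the notation of the cited system.

```python
def to_compound_tokens(words, spans):
    """Merge each annotated entity span into one tagged compound token.

    spans: list of (start, end, type) with end exclusive, end > start,
    and no overlaps (assumed, not checked).
    """
    starts = {start: (end, etype) for start, end, etype in spans}
    out, i = [], 0
    while i < len(words):
        if i in starts:
            end, etype = starts[i]
            # Join the entity's words and attach its type as a suffix.
            out.append("_".join(words[i:end]) + "<" + etype + ">")
            i = end
        else:
            out.append(words[i])
            i += 1
    return out


words = "flights from new york to tokyo".split()
spans = [(2, 4, "LOC"), (5, 6, "LOC")]
tokens = to_compound_tokens(words, spans)
print(tokens)
```

Because every distinct entity becomes its own vocabulary entry, the lexicon grows quickly, which is consistent with the very large vocabulary mentioned in the quote.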

Datasets

ESTER 2 (French)

Hatmi et al. (2013)[2]:

  • "The audio resources containing 26 French broadcasts, recorded from January to February 2008. Most of these are the news from four different sources: France Inter, Radio France International (RFI), Africa 1 and TVME.
  • The textual resources consisting of manual transcriptions of audio resources (72,534 words). Named entities were annotated manually according to a taxonomy consisting of 7 main categories: Person, Location, Organization, Human Product, Amount, Time and Function. There are 5,123 named entities in these manual transcriptions."

BOLT (Chinese and Arabic)

The DARPA Broad Operational Language Translation (BOLT) program provides conversational data in Chinese and Arabic. The data is annotated with coreference (among other phenomena) but not with named-entity classes.

"Annotators link together names, pronouns and definite descriptions that refer to the same entity, providing information that is crucial for systems doing semantic interpretation. Noun phrase mentions of events are also linked to verb phrases that describe the event."

Hub-4

From NIST (1998): "The Hub-4 Broadcast News Evaluation this year will include a new "Information Extraction" Spoke which will involve the implementation and evaluation of automatic Named Entity tagging as applied to the Hub-4 Reference and recognizer-produced transcriptions. This new spoke is based on the Message Understanding Conference (MUC) Named Entity task which involved the tagging of person, organization, location names and other entities in newswire text."

REPERE (French)

The REPERE dataset concerns "people recognition", including the recognition of person names in speech (Kahn et al., 2012)[4].

ETAPE (French)

From Hatmi et al. (2013)[5]:

"The ETAPE evaluation campaign aimed to measure the performance of speech technologies for the French language [10]. Three main tasks were considered in this campaign: segmentation, transcription and information extraction."

NER with OOV for speech

This corpus was constructed by Can et al. (2009)[6] and later annotated with named entities. From Parada et al. (2011)[7]:

"To focus attention on the OOV problem, we used the data set constructed by Can et al. [16], originally designed to evaluate Spoken Term Detection (STD) of OOVs (OOVCORP.) The corpus contains 100 hours of transcribed English broadcast news speech emphasizing OOVs.
The 48 hours set aside for named entity training and evaluation did not contain named entity annotations. We annotated this data using Amazon Mechanical Turk (MTurk)"

Text corpora

Some text corpora contain transcribed conversations with annotated named entities, for example OntoNotes.

References

  1. Faruqui, Manaal, and Sebastian Padó. "Training and Evaluating a German Named Entity Recognizer with Semantic Generalization." KONVENS. 2010.
  2. Hatmi, Mohamed, et al. "Incorporating named entity recognition into the speech transcription process." Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech'13). 2013.
  3. Cheng, Hao, Hao Fang, and Mari Ostendorf. "Open-Domain Name Error Detection using a Multi-Task RNN." EMNLP. 2015.
  4. Kahn, Juliette, et al. "A presentation of the REPERE challenge." Content-Based Multimedia Indexing (CBMI), 2012 10th International Workshop on. IEEE, 2012.
  5. Hatmi, Mohamed, et al. "Named Entity Recognition in Speech Transcripts following an Extended Taxonomy." SLAM@ INTERSPEECH. 2013.
  6. Can, Dogan, et al. "Effect of pronunciations on OOV queries in spoken term detection." ICASSP. 2009.
  7. Parada, Carolina, Mark Dredze, and Frederick Jelinek. "OOV Sensitive Named-Entity Recognition in Speech." INTERSPEECH. 2011.