From Wang et al. (2011)[1]: "A frame-based SLU system is often limited to a specific domain, which has a well-defined, relatively small semantic space."

From Coppola et al. (2009)[2]: "The good performance achieved for Italian dialogs shows that FrameNet-based parsing is viable for labeling conversational speech in any language using a few training data."

From Favre et al. (2010)[3]: "On the speech processing side, SRL has become an important component of spoken language understanding systems[4][5]."

Challenges

From Wang et al. (2011)[1]: "... challenges for spoken language understanding, including:

  • Extra-grammaticality – spoken languages are not as well-formed as written languages. People are in general less careful with speech than with writing. They often do not comply with rigid syntactic constraints.
  • Disfluencies – false starts, repairs, and hesitations are pervasive, especially in conversational speech.
  • Speech recognition errors – Speech recognition technology is far from perfect. Environment noises, speaker’s accent, domain specific terminologies, all make speech recognition errors inevitable. It is common to see that a generic speech recognizer has over 30% word error rates on domain specific data.
  • Out-of-domain utterances – a dialog system can never restrict a user from saying anything out of a specific domain, even in a system-initiated dialog, where users are prompted for answers to specific questions. Because the frame-based SLU focuses on a specific application domain, out-of-domain utterances are not well modeled and can often be confused as an in-domain utterance. Detecting the out-of-domain utterances is not an easy task – it is complicated by the extra-grammaticality, disfluencies and ASR errors of the in-domain utterances.

In summary, robustness is one of the most important issues in SLU."
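
The word error rate mentioned in the quote is conventionally the word-level edit distance between the recognizer output and a reference transcript, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and equal cost for substitutions, deletions, and insertions (the function name and the example utterance are illustrative, not taken from the cited work):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("flights" -> "flight") and one deletion ("to")
# over seven reference words.
wer = word_error_rate(
    "show me flights from boston to denver",
    "show me flight from boston denver",
)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.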

Another challenge is modeling context. From Wang et al. (2011)[1]: "In practical spoken dialog systems, however, users seldom specify all the important information in a single utterance. They are often engaged in a dialogue with the system such that important pieces of information (slots) can be accumulated over multiple dialogue turns." Users sometimes also discard earlier information, for example: "the mention of a new departure and arrival city often ... signals that the user has switched to another task of finding a different flight."
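
The accumulate-or-reset behavior described here can be sketched as a small frame update rule. This is only an illustration under stated assumptions: the slot names, the `TASK_SWITCH_SLOTS` set, and the reset heuristic (start over whenever a departure or arrival city is given a different value) are hypothetical and are not the method of Wang et al.

```python
from typing import Dict, List

# Hypothetical slots whose change is taken to signal a new flight search.
TASK_SWITCH_SLOTS = {"departure_city", "arrival_city"}


def update_frame(frame: Dict[str, str], turn_slots: Dict[str, str]) -> Dict[str, str]:
    """Merge the slots of the current turn into the accumulated frame.

    If the turn gives a task-switch slot a value that differs from the one
    already stored, the earlier context is discarded before merging.
    """
    switched = any(
        name in frame and frame[name] != value
        for name, value in turn_slots.items()
        if name in TASK_SWITCH_SLOTS
    )
    merged = {} if switched else dict(frame)
    merged.update(turn_slots)
    return merged


def track_dialogue(turns: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the accumulated frame after each turn."""
    frame: Dict[str, str] = {}
    history = []
    for turn_slots in turns:
        frame = update_frame(frame, turn_slots)
        history.append(frame)
    return history


history = track_dialogue([
    {"departure_city": "Boston"},
    {"arrival_city": "Denver"},
    {"date": "Monday"},
    {"departure_city": "Seattle", "arrival_city": "Austin"},  # new task
])
```

In this example the first three turns accumulate into one frame, and the fourth turn drops the date because the cities changed. A real system would need a less brittle signal, since the quote only says that new cities "often" indicate a task switch.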

Datasets

  • ATIS
  • Communicator
  • MEDIA (French)
  • LUNA (multilingual)
  • OntoNotes


From Wang et al. (2011)[1]: "The ATIS corpus (Dahl et al., 1994; Hemphill et al., 1990) is one of the few data sets available in the public domain that has been broadly used by SLU researchers. The data is more realistic compared with the previous speech corpus – it is the spontaneous spoken language instead of the read speech, therefore it contains disfluencies, corrections, and colloquial pronunciations. It was collected in a normal office setting, with a “Wizard of Oz” interaction between a system and a subject who issued spoken queries for air travel information."

OntoNotes

OntoNotes includes both written and spoken (transcribed) data. It has been used to evaluate SRL on spoken data by Favre et al. (2010)[3].

Approaches

Stehwien and Vu (2016)[6] show that prosody, specifically pitch accents, correlates with semantic slots and therefore carries potentially useful information for SLU.

Evaluation metrics

  • Sentence/utterance level semantic accuracy (SLSA)
  • Slot error rate
  • Slot P/R/F1
  • End-to-end evaluation (extrinsic)
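
The intrinsic metrics above can be sketched as follows. This is a minimal illustration, assuming each utterance is annotated with a flat mapping from slot name to value; exact definitions vary between papers, and the slot error rate here uses the common (substitutions + deletions + insertions) / reference-slots form, which is an assumption rather than something the cited sources specify. The function and key names are hypothetical.

```python
from typing import Dict, List

Frame = Dict[str, str]  # slot name -> slot value for one utterance


def slot_metrics(refs: List[Frame], hyps: List[Frame]) -> Dict[str, float]:
    """Compute utterance-level accuracy, slot error rate, and slot P/R/F1."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of utterances")

    correct = subs = dels = ins = exact = 0
    for ref, hyp in zip(refs, hyps):
        for name, value in ref.items():
            if name not in hyp:
                dels += 1        # reference slot missing from hypothesis
            elif hyp[name] == value:
                correct += 1
            else:
                subs += 1        # slot found but with the wrong value
        ins += sum(1 for name in hyp if name not in ref)  # spurious slots
        exact += int(ref == hyp)  # whole frame must match

    n_ref = correct + subs + dels
    n_hyp = correct + subs + ins
    precision = correct / n_hyp if n_hyp else 0.0
    recall = correct / n_ref if n_ref else 0.0
    total = precision + recall
    return {
        "sentence_accuracy": exact / len(refs) if refs else 0.0,
        "slot_error_rate": (subs + dels + ins) / n_ref if n_ref else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / total if total else 0.0,
    }


scores = slot_metrics(
    refs=[{"from": "Boston", "to": "Denver"}, {"from": "Seattle"}],
    hyps=[{"from": "Boston", "to": "Dallas"}, {"from": "Seattle"}],
)
```

In this example one of three reference slots is substituted, so the slot error rate is 1/3 and precision, recall, and F1 are all 2/3, while only one of the two utterances is fully correct.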

References

  1. Wang, Ye‐Yi, Li Deng, and Alex Acero. "Semantic Frame‐Based Spoken Language Understanding." Spoken Language Understanding: Systems for Extracting Semantic Information from Speech (2011): 41-91.
  2. Coppola, B., Moschitti, A., & Riccardi, G. (2009). Shallow Semantic Parsing for Spoken Language Understanding. Proceedings of NAACL HLT, (June), 85–88.
  3. Favre, B., Bohnet, B., & Hakkani-Tur, D. (2010). Evaluation of semantic role labeling and dependency parsing of automatic speech recognition output. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 5342–5345.
  4. N. Gupta, G. Tur, D. Hakkani-Tur, S. Bangalore, G. Riccardi, and M. Gilbert, “The AT&T spoken language understanding system,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 213–222, 2006.
  5. R. De Mori, F. Bechet, D. Hakkani-Tür, M. McTear, G. Riccardi, and G. Tur, “Spoken Language Understanding for Conversational Systems,” SPM Special Issue on Spoken Language Technologies, vol. 25, no. 3, pp. 50–58, May 2008.
  6. Stehwien, Sabrina, and Ngoc Thang Vu. "Exploring the Correlation of Pitch Accents and Semantic Slots for Spoken Language Understanding." Interspeech 2016 (2016): 730-734.