Natural Language Understanding Wiki

Reference: Ruppenhofer et al. (2010)[1]

An extensive discussion of the dataset (published three years later!): Ruppenhofer et al. (2013)[2]

Data format


From file Semeval2010Task10TrainingFN+PB/Semeval2010Task10TrainingPB/Semeval2010.Task10.TrainingData.PB.readme:

  • The Tiger training data from Conan Doyle is available in a CoNLL-inspired format.
  • The 9 columns we use are ordered in the following way: sentence id, token id, word, lemma, pos, headless syntax, syntax with heads, local roles, non-local roles
  • The following example displays sentence 9 of the text:
 9	1	"	"	PUNC``	(S(SBAR(WHNP	(S:10(SBAR:2(WHNP:2	_	_
 9	2	What	What	WP	*)	*)	_	_
 9	3	's	be	VBZ	(VP	(VP:3	_	_
 9	4	the	the	DT	(NPB	(NPB:5	_	_
 9	5	matter	matter	NN	*	*	_	_
 9	6	,	,	PUNC,	*)))	*)))	_	_
 9	7	Walters	Walters	NNP	(NPB	(NPB:7	coref.01{A0_OVE=(s9_7)}	coref.01{A1_OVE=(s8_18)}
 9	8	?	?	PUNC.	*	*	_	_
 9	9	"	"	PUNC''	*)	*)	_	_
 9	10	asked	ask	VBD	(VP	(VP:10	ask.01{A0_OVE=(s9_11);A1_OVE=(s9_1,s9_2,s9_3,s9_4,s9_5,s9_6,s9_7,s9_8,s9_9);A2_DNI=(s9_7)}	ask.01{}
 9	11	Baynes	Baynes	NNP	*	*	coref.01{A0_OVE=(s9_11)}	coref.01{A1_OVE=(s6_10)}
 9	12	sharply	sharply	RB	(ADVP	(ADVP:12	_	_
 9	13	.	.	PUNC.	*)))	*)))	_	_	* 

Note the following:

  • In the syntax with heads column, the head of each non-terminal is added to the phrase type label with a colon as separator. For instance, the head of the S(entence) that opens on token 1 is token 10, which is the main verb "asked". Similarly, the head of the noun phrase that begins with token 4 is token 5.
  • Coreference annotation is provided as an "honorary" frameset coref. As with regular framesets, the local arguments appear in the 8th column and the non-local ones in the 9th. The line for token 7 "Walters" shows that there is an earlier coreferent mention of this referent in sentence 8, namely token 18 there. Since the antecedent is in a different sentence, it is captured in the non-local column.
  • Arguments are represented as the set of terminals they cover. For instance, argument A0 of the ask.01 frameset on token 10 covers terminal 11, "Baynes". Argument 1 of the same predicate covers terminals 1 through 9, that is the whole stretch of direct speech including the quote symbols.
  • Arguments carry either the marking OVE for "overt" or DNI for "definite null instantiation".  Where appropriate, the terminals of an antecedent that explicitly refers to the correct filler of the role are given as the resolution of a DNI-argument. For instance, argument 2 of "asked", the addressee of the question, is not expressed as a syntactic argument of ask but is understood to be the person Walters addressed in the direct quote. This is captured through the notation A2_DNI=(s9_7).
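Putting the notation together, a role-column cell such as ask.01{A0_OVE=(s9_11);A2_DNI=(s9_7)} can be parsed with a small sketch like the following (the function name and regex are my own; I assume the markings are OVE, DNI, and INI, and that terminals always have the form s<sentence>_<token>):

```python
import re

# One role assignment: role label, marking, and a parenthesized terminal list.
ROLE_RE = re.compile(r"(?P<role>\w+?)_(?P<marking>OVE|DNI|INI)=\((?P<terms>[^)]*)\)")

def parse_role_cell(cell):
    """Parse a role-column cell into (frameset, {role: (marking, terminals)}).

    Returns None for an unannotated cell ("_"). Each terminal is returned
    as a (sentence, token) pair of ints, e.g. "s9_11" -> (9, 11).
    """
    if cell == "_":
        return None
    frameset, _, body = cell.partition("{")
    body = body.rstrip("}")
    roles = {}
    for m in ROLE_RE.finditer(body):
        terminals = [tuple(map(int, t.lstrip("s").split("_")))
                     for t in m.group("terms").split(",") if t]
        roles[m.group("role")] = (m.group("marking"), terminals)
    return frameset, roles
```

For example, parse_role_cell("coref.01{A0_OVE=(s9_7)}") yields ("coref.01", {"A0": ("OVE", [(9, 7)])}), and an empty frameset body such as ask.01{} yields an empty role dictionary.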


The quality of PropBank/NomBank annotations is not guaranteed. Ruppenhofer et al. (2013)[2] says: "we note that we have nothing to say about the quality of the PropBank/NomBank-data [...] we lacked the resources or expertise to evaluate the generated annotations on the token-level as to their quality or usefulness within the PropBank framework. Since there were no participants for the PropBank version of the SemEval task, we also did not receive any feedback on that point from researchers who might have inspected our PropBank-style training data more closely"


The number of frame types is wrong in the original paper and some following papers (e.g. Gorinski et al., 2013[3]). The test set is not any more diverse than the training set in terms of frame types. See discussion.

There are some unexpected differences between (A) the test data without null instantiations and (B) the gold data:

  • A has no coreference annotation while B does.
  • Some overt roles are missing in A. For example, in sentence 163, A is missing the frame assistance.01 on token 17 together with two of its overt roles (A0 and A1).

The official scorer is buggy:

In correctedTiger.PB.txt, some INIs have a filler! And when I created a file that had INIs without any filler, the scorer broke. There are also two sentences without a syntactic analysis: sentences #20 and #392 have NONE as their parse tree.


  1. Ruppenhofer, J., Sporleder, C., Morante, R., Baker, C., & Palmer, M. (2010). SemEval-2010 Task 10: Linking Events and Their Participants in Discourse. In Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010 (pp. 45–50). Uppsala, Sweden.
  2. Ruppenhofer, J., Lee-Goldman, R., Sporleder, C., & Morante, R. (2013). Beyond sentence-level semantic role labeling: Linking argument structures in discourse. Language Resources and Evaluation, 47(3), 695–721.
  3. Gorinski, P., Ruppenhofer, J., & Sporleder, C. (2013). Towards Weakly Supervised Resolution of Null Instantiations. Proceedings of IWCS, 1–11.