Natural Language Understanding Wiki
Advertisement

From Chang et al. (2015)[1]: learning to search "solves the problem by:

  1. converting structured prediction into a search problem with specified search space and actions;
  2. defining structured features over each state to capture the interdependency between output variables;
  3. constructing a reference policy based on training data;
  4. learning a policy that imitates the reference policy."

SEARN originally stresses the ability to apply standard classifier:

From Daum (2006)[2]: "Searn operates by transforming structured prediction problems into a collection of classification problems, to which any standard binary classifier may be applied (for instance, a support vector machine or decision tree). In fact, Searn represents a family of structured prediction algorithms depending on the classifier and search space used."

However, this conceptually advantage quickly turns into a drawback when one converts the cost-sensitive training examples into standard training examples. The number of standard training examples scales quadratically with the number of classes (using Weighted-All-Pair reduction, see Daum (2006)[2], page 28). For example, in a labeled dependency parsing problem using Stanford formalism, there are about 90 possible transitions, corresponding to 90 classes. The number of combinations becomes too large.

Later development avoids this problem by using a regressor. Chang et al. (2015, Supplementary material, Section B)[1] use what they call "Cost-Sensitive One Against All (CSOAA) classification technique", that is a regressor outputting a cost for each class. The class that receives the smallest cost is output as the "classifier" prediction. They train this regressor to predict the given costs, using ridge regression or squared loss.

Another problem is with computing the costs themselves. This is an expensive operation since it involve running a policy until terminated. As Daum (2006)[2] stated:

"the complexity of training Searn for sequence labeling scales as O(TLk) where T is the sequence length, L is the number of labels and k is the Markov order on the features."

In the previous example, training a labeled parser will be ~20 times more expensive than an unlabeled parser. Straka et al. (2015)[3] overcome this difficulti by ignoring some transitions but it's questionable whether this trick can be applied to other problems.

"As many transitions differ only in the label of the arc being added, to improve oracle speed, we employ the following heuristic: when choosing a transition to follow, we consider only those arc-adding transitions that assign the label appearing in the gold tree. This effectively reduces the number of possible transitions from tens to at most five (e.g., from 96 to 4 transitions in the swap system for English)."

Applications

Dependency parsing (TODO: REF), AMR parsing (Goodman et al., 2016)[4].

See also

  • Section 3 in Goodman et al. (2016)[4] -- very good summarization and explanation.

References

  1. 1.0 1.1 Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III and John Langford. Learning to search better than your teacher. International Conference on Machine Learning (ICML), 2015
  2. 2.0 2.1 2.2 Daum, H. C. (2006). Practical Structured Learning Techniques for Natural Language Processing. University of Southern California.
  3. Straka, M., Hajič, J., Straková, J., & Jan Hajič, J. (2015). Parsing Universal Dependency Treebanks using Neural Networks and Search-Based Oracle. In TLT14 (pp. 208–220). IPIPAN.
  4. 4.0 4.1 James Goodman, Andreas Vlachos and Jason Naradowsky. 2016. Noise reduction and targeted exploration in imitation learning for Abstract Meaning Representation parsing. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. PDF
Advertisement