From Chang et al. (2015): learning to search "solves the problem by:

1. converting structured prediction into a search problem with specified search space and actions;
2. defining structured features over each state to capture the interdependency between output variables;
3. constructing a reference policy based on training data;
4. learning a policy that imitates the reference policy."

## Computational properties

SEARN originally stresses the ability to apply standard classifier:

From Daum (2006): "Searn operates by transforming structured prediction problems into a collection of classification problems, to which any standard binary classifier may be applied (for instance, a support vector machine or decision tree). In fact, Searn represents a family of structured prediction algorithms depending on the classifier and search space used."

However, this conceptually advantage quickly turns into a drawback when one converts the cost-sensitive training examples into standard training examples. The number of standard training examples scales quadratically with the number of classes (using Weighted-All-Pair reduction, see Daum (2006), page 28). For example, in a labeled dependency parsing problem using Stanford formalism, there are about 90 possible transitions, corresponding to 90 classes. The number of combinations becomes too large.

Later development avoids this problem by using a regressor. Chang et al. (2015, Supplementary material, Section B) use what they call "Cost-Sensitive One Against All (CSOAA) classification technique", that is a regressor outputting a cost for each class. The class that receives the smallest cost is output as the "classifier" prediction. They train this regressor to predict the given costs, using ridge regression or squared loss.

Another problem is with computing the costs themselves. This is an expensive operation since it involve running a policy until terminated. For a derivation of T transitions, training with regular supervised learning takes O(T). Learning to search would require going through each transition (there are T) and roll-out (run the policy till the end, takes O(T) steps) for each possible action (there are L). Therefore the whole procedure takes O(T^2L). If T ~ 20, L ~ 90 then it is 1800 times slower than supervised training. Straka et al. (2015) alleviate this problem by ignoring some transitions (effectively reducing L) but it's questionable whether this trick can be applied to other problems.

"As many transitions differ only in the label of the arc being added, to improve oracle speed, we employ the following heuristic: when choosing a transition to follow, we consider only those arc-adding transitions that assign the label appearing in the gold tree. This effectively reduces the number of possible transitions from tens to at most five (e.g., from 96 to 4 transitions in the swap system for English)."

Notice: Daum (2006) says training with SEARN takes O(TLk) but I doubt that. I'm not sure what k is so ignore it for now. O(TL) is only to compute the cost of one action while we have O(T) actions for each sentence. So the overall complexity should be in the range of O(T^2L).

"the complexity of training Searn for sequence labeling scales as O(TLk) where T is the sequence length, L is the number of labels and k is the Markov order on the features."

## Applications

Dependency parsing (Chang et al. 2015).

AMR parsing (Goodman et al., 2016).