Natural Language Understanding Wiki

This page documents the steps necessary to reproduce the results of Chen & Manning (2014)[1] for English (including a re-implementation) and makes explicit the decisions that aren't covered in the paper.

  1. Obtain data: WSJ part of the Penn Treebank. Sections 02-21 for training, 22 for development, 23 for testing.
  2. NP-bracketing: No.
    • Tried to run against WSJ with and without NP-bracketing and found a difference of 3%
    • Apparently, the reported result was produced without NP-bracketing, which makes the task artificially easier and the output less meaningful.
  3. Assign POS tags using Stanford POS tagger with ten-way jackknifing of the training data
    • Reported accuracy: ≈ 97.3%
    • I used version 3.6.0 downloaded here and followed instructions in the JavaDoc.
    • Reused english-bidirectional-distsim.tagger.props (which uses the bidirectional5words model). Downloaded word clusters. Fixed a crash.
    • Instructions say: "The part-of-speech tags used as input for training and testing were generated by the Stanford POS Tagger (using the bidirectional5words model)."
    • It's not clear how to divide the folds, which can make a difference. I divided by sentences and the accuracy is 97.18%. I also tried dividing by documents and it wasn't better.
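A minimal sketch of the ten-way jackknifing by sentences described in step 3. The function name jackknife_folds and the choice of contiguous blocks of sentences are assumptions made for illustration; the paper does not say how the folds were formed. For each fold, the tagger would be trained on `train` and used to tag `held_out`, so every training sentence gets tags from a model that never saw it.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def jackknife_folds(sentences: Sequence[T], k: int = 10) -> List[Tuple[List[T], List[T]]]:
    """Split sentences into k contiguous folds (an assumed fold layout).

    Returns k (train, held_out) pairs; each sentence is held out exactly once.
    """
    n = len(sentences)
    folds = []
    for i in range(k):
        start = i * n // k        # first sentence of the held-out block
        end = (i + 1) * n // k    # one past the last sentence of the block
        held_out = list(sentences[start:end])
        train = list(sentences[:start]) + list(sentences[end:])
        folds.append((train, held_out))
    return folds
```

Dividing by documents instead would only change what the items of `sentences` are (whole documents rather than single sentences).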
  4. Constituent-to-dependency conversion:
    1. LTH Constituent-to-Dependency Conversion Tool
      • Downloaded pennconverter
      • The paper doesn't specify the command-line options or the type of conversion
      • Head-finding rules matter
      • The default is CoNLL-2008 conventions and CoNLL-X file format
      • I tried -oldLTH and -conll2007 but they don't split tokens with slashes (unlike footnote 6, page 745)
      • Tried -rightBranching=false and the performance of MaltParser was low: around 80% instead of 90%.
      • Command: java -jar pennconverter.jar
      • Error in one sentence, which I skipped. I submitted a question on Stack Overflow.
    2. Stanford Basic Dependencies
      • Use Stanford parser v3.3.0 (page 745), downloaded here under the name stanford-parser-full-2013-11-12.
      • Convert the Penn Treebank to Stanford Basic Dependencies using: java -cp stanford-parser-full-2014-10-31/stanford-parser.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -conllx -originalDependencies -treeFile xxx
  5. Measure statistics: sentences, words, POS tags, labels, percentage of projective trees (Table 3)
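For the projectivity statistic in step 5, a sketch along these lines may be useful. It assumes each sentence is given as a list of head indices (token positions are 1-based, 0 is the artificial root) and treats a tree as projective when no two arcs cross. The function names are hypothetical, and the quadratic check is chosen for clarity, not speed.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """heads[i] is the head of token i+1; 0 denotes the artificial root."""
    # Represent each arc by its (left, right) endpoints, ignoring direction.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # Two arcs cross when exactly one endpoint of the second
            # lies strictly inside the first.
            if l1 < l2 < r1 < r2:
                return False
    return True


def projective_percentage(treebank: List[List[int]]) -> float:
    """Percentage of sentences whose dependency tree is projective."""
    if not treebank:
        return 0.0
    return 100.0 * sum(is_projective(h) for h in treebank) / len(treebank)
```

For example, heads [2, 0, 2] form a projective tree, while [3, 4, 0, 3] do not, because the arcs 3→1 and 4→2 cross.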
  6. Evaluation tool:
    • Downloaded MaltEval
    • Should I use the CoNLL-X eval script instead? What is the difference between them?
    • Stanford also provides an evaluation tool: "The package includes a tool for scoring of generic dependency parses, in a class edu.stanford.nlp.trees.DependencyScoring. This tool measures scores for dependency trees, doing F1 and labeled attachment scoring. The included usage message gives a detailed description of how to use the tool."
    • Counter-intuitive observation: counting punctuation actually decreases UAS and LAS by ~3%, i.e. the parser makes mistakes on punctuation more often than on average tokens.
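To make the punctuation effect in step 6 concrete, here is a rough sketch of UAS/LAS computed with and without punctuation tokens. The token layout (POS tag, head, label), the function name, and in particular the PUNCT_TAGS set are assumptions for illustration; the actual set of excluded tags depends on the evaluation tool and should be checked there.

```python
from typing import FrozenSet, List, Tuple

Token = Tuple[str, int, str]  # (gold POS tag, head index, dependency label)

# Assumed set of PTB punctuation tags; verify against the evaluation tool used.
PUNCT_TAGS: FrozenSet[str] = frozenset({"``", "''", ".", ",", ":"})


def attachment_scores(
    gold: List[List[Token]],
    pred: List[List[Token]],
    skip_tags: FrozenSet[str] = frozenset(),
) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent, skipping tokens whose gold tag is in skip_tags."""
    total = uas_hits = las_hits = 0
    for g_sent, p_sent in zip(gold, pred):
        for (g_tag, g_head, g_label), (_, p_head, p_label) in zip(g_sent, p_sent):
            if g_tag in skip_tags:
                continue
            total += 1
            if g_head == p_head:
                uas_hits += 1
                # LAS requires both the head and the label to match.
                if g_label == p_label:
                    las_hits += 1
    if total == 0:
        return 0.0, 0.0
    return 100.0 * uas_hits / total, 100.0 * las_hits / total
```

Calling attachment_scores(gold, pred) counts every token, while attachment_scores(gold, pred, PUNCT_TAGS) gives the non-punctuation scores.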
  7. Run Stanford neural parser on the data and measure results.
    • Download as instructed here
  8. Run off-the-shelf MaltParser and MSTParser on dev and test sets.
  9. Implement oracle
  10. Implement parser
  11. Implement neural net
    • Dropout: the paper doesn't make clear where dropout is applied: to the output of the embedding layer or of the hidden layer? It is applied to the hidden layer units.
    • The paper implies that the learning rate was varied during training ("initial learning rate of Adagrad α = 0.01.") but doesn't reveal the method (annealing/linear/etc.) or the amount. In practice, they didn't do it.
    • The paper says "A slight variation is that we compute the softmax probabilities only among the feasible transitions in practice" but the implementation actually computes all probabilities. They implemented this by marking all invalid transitions with a "-1".
    • They measure development-set performance using UAS instead of negative log-likelihood. This makes a lot of sense because the NN is just a component in a complicated system and better performance of the NN alone doesn't necessarily translate into better performance of the system.
    • It was not just UAS but non-punctuation UAS. This can have an effect since the same evaluation scheme will be used to evaluate the system.
    • Note from the source code: output layer doesn't have bias terms which is consistent with the paper -- there is no bias in feature templates.
    • Difference in AdaGrad: Chen & Manning use epsilon=1e-6, added before taking the square root, while torch7 uses 1e-10, added after the square root. I experimented with the two methods and they are very different, at least in training. With torch7's adagrad, the UAS jumps to 81.1% after the first epoch (i.e. ~200 batch updates), while with Stanford's adagrad it is only 19.1%. Perhaps lowering the learning rate will have the same effect but I'm not sure. Will it have an effect on performance?
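The two AdaGrad variants from the last note can be written side by side as scalar updates. This is only a sketch based on the description above (epsilon=1e-6 inside the square root versus 1e-10 outside), not code taken from either implementation; the function names are made up for illustration.

```python
import math
from typing import Tuple


def adagrad_step_eps_inside(
    grad: float, accum: float, lr: float = 0.01, eps: float = 1e-6
) -> Tuple[float, float]:
    """Epsilon added before the square root (the variant attributed to Chen & Manning)."""
    accum += grad * grad
    return -lr * grad / math.sqrt(accum + eps), accum


def adagrad_step_eps_outside(
    grad: float, accum: float, lr: float = 0.01, eps: float = 1e-10
) -> Tuple[float, float]:
    """Epsilon added after the square root (the variant attributed to torch7)."""
    accum += grad * grad
    return -lr * grad / (math.sqrt(accum) + eps), accum


# First update for a small and a large gradient, starting from accum = 0.
for g in (1e-4, 1.0):
    inside, _ = adagrad_step_eps_inside(g, 0.0)
    outside, _ = adagrad_step_eps_outside(g, 0.0)
    print(f"grad={g:g}  eps inside: {inside:.3e}  eps outside: {outside:.3e}")
```

For gradients much smaller than sqrt(1e-6) = 1e-3, the first variant takes far smaller early steps, while for large gradients the two nearly coincide. This may be one reason the early training curves differ so much, though I have not verified that it accounts for the whole gap.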


  1. Chen, D., & Manning, C. (2014). A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 740–750). Doha, Qatar: Association for Computational Linguistics.