This page documents the steps necessary to reproduce the English results of Chen & Manning (2014)[1] (including a re-implementation) and makes explicit the decisions that aren't covered in the paper.

  1. Obtain data: WSJ part of PENN Treebank. Sections 02-21 for training, 22 for development, 23 for testing.
    • I first tried the revised version LDC2015T13 (https://catalog.ldc.upenn.edu/LDC2015T13), but the LTH converter doesn't work with it.
    • I used PENN Treebank 3 (https://catalog.ldc.upenn.edu/LDC99T42).
  2. NP-bracketing: No.
    • Tried running against WSJ with and without NP-bracketing and found a difference of about 3%.
    • Apparently, the reported result was produced without NP-bracketing, which makes the task artificially easier and the output less meaningful.
  3. Assign POS tags using Stanford POS tagger with ten-way jackknifing of the training data
    • Reported accuracy: ≈ 97.3%
    • I used version 3.6.0 downloaded here and followed instructions in the JavaDoc.
    • Reused english-bidirectional-distsim.tagger.props, downloaded the word clusters, and fixed a crash. (This used the bidirectional5words model.)
    • Instructions say: "The part-of-speech tags used as input for training and testing were generated by the Stanford POS Tagger (using the bidirectional5words model)."
    • It's not clear how to divide the folds, which can make a difference. I divided by sentences and got 97.18% accuracy; dividing by documents instead wasn't better. (A sketch of the jackknifing procedure appears after this list.)
  4. Constituent-to-dependency conversion:
    1. LTH Constituent-to-Dependency Conversion Tool
      • Downloaded pennconverter
      • The paper didn't specify command-line options or reference type of conversion
      • Head-finding rules matter
      • The default is CoNLL-2008 conventions and CoNLL-X file format
      • I tried -oldLTH and -conll2007 but they don't split tokens with slashes (different from footnote 6, page 745)
      • Tried -rightBranching=false and the performance of MaltParser was low: around 80% instead of 90%.
      • Command: java -jar pennconverter.jar
      • One sentence produced an error and was skipped; I submitted a question on Stack Overflow.
    2. Stanford Basic Dependencies
      • Use Stanford parser v3.3.0 (page 745), downloaded here under the name stanford-parser-full-2013-11-12.
      • Convert PENN Treebank to Stanford Basic Dependency using: java -cp stanford-parser-full-2014-10-31/stanford-parser.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -conllx -originalDependencies -treeFile xxx
  5. Measure statistics: sentences, words, POS's, labels, projective percentage (Table 3)
  6. Evaluation tool:
    • Downloaded MaltEval
    • Should I use the CoNLL-X eval script instead? What is the difference between them?
    • Stanford also provides evaluation tool: "The package includes a tool for scoring of generic dependency parses, in a class edu.stanford.nlp.trees.DependencyScoring. This tool measures scores for dependency trees, doing F1 and labeled attachment scoring. The included usage message gives a detailed description of how to use the tool."
    • Counter-intuitive observation: counting punctuation actually decreases UAS and LAS by ~3% --> the parser makes mistakes on punctuation more often than on average tokens. (A small sketch of punctuation-excluded UAS appears after this list.)
  7. Run Stanford neural parser on the data and measure results.
    • Download stanford-parser-full-2014-10-31.zip as instructed here
  8. Run off-the-shelf MaltParser and MSTParser on dev and test sets.
  9. Implement oracle
  10. Implement parser
  11. Implement neural net
    • Dropout: it wasn't clear whether dropout is applied to the output of the embedding layer or to the hidden layer; the source code (Classifier.java#L197 in CoreNLP) shows it is applied to the hidden-layer units.
    • The paper implies that the learning rate was varied during training ("initial learning rate of Adagrad α = 0.01.") but the source code (DependencyParser.java#L688) shows no explicit schedule. This turns out to be fine: AdaGrad naturally decreases the effective step size as a function of time, so no explicit annealing scheme is needed.
    • The paper says "A slight variation is that we compute the softmax probabilities only among the feasible transitions in practice". The implementation does this by marking all invalid transitions with a "-1" (DependencyParser.java#L285) rather than computing all probabilities. (A rough sketch covering this and the dropout note appears after this list.)
    • They measure development-set performance using UAS instead of negative log-likelihood. This makes a lot of sense because the NN is just a component in a complicated system and better performance of the NN alone doesn't necessarily translate into better performance of the system.
    • It was not just UAS but non-punctuation UAS. This can have an effect since the same evaluation scheme will be used to evaluate the system.
    • Note from the source code (Classifier.java#L126): the output layer doesn't have bias terms, which is consistent with the paper -- there is no bias in the feature templates.
    • Difference in AdaGrad: Chen & Manning use epsilon = 1e-6 (DependencyParser.java#L1211) added before taking the square root (Classifier.java#L575), while torch7 (optim/adagrad.lua#L42) adds 1e-10 after the square root. I experimented with both and they behave very differently, at least in training: with torch7's AdaGrad the UAS jumps to 81.1% after the first epoch (i.e. ~200 batch updates), whereas with Stanford's AdaGrad it is only 19.1%. Perhaps lowering the learning rate would have the same effect, but I'm not sure. Will it affect final performance? (See the AdaGrad sketch after this list.)
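
A minimal sketch of the ten-way jackknifing in step 3, assuming the folds are split by sentence as described above: each fold is tagged by a model trained on the other nine folds, so no sentence is tagged by a model that saw it during training. train_tagger() and tag_sentences() are hypothetical wrappers around the Stanford POS tagger command line, not a real API.

  def jackknife_pos_tags(sentences, num_folds=10):
      # Split by sentence into ten folds (round-robin); splitting by document
      # would group sentences differently, but the loop is the same.
      folds = [sentences[i::num_folds] for i in range(num_folds)]
      tagged = []
      for i, held_out in enumerate(folds):
          train = [s for j, fold in enumerate(folds) if j != i for s in fold]
          model = train_tagger(train)                    # train on the other 9 folds
          tagged.extend(tag_sentences(model, held_out))  # tag the held-out fold
      return tagged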
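
For the punctuation question in step 6 (and the non-punctuation UAS used for model selection in step 11), the metric is roughly the following; the punctuation tag set below is a common WSJ convention and an assumption on my part, not taken from the paper.

  # Unlabeled attachment score, optionally skipping punctuation tokens,
  # identified here by their gold POS tag (assumed tag set).
  PUNCT_TAGS = {"``", "''", ".", ",", ":"}

  def uas(gold_heads, pred_heads, gold_pos, skip_punct=True):
      kept = [(g, p) for g, p, t in zip(gold_heads, pred_heads, gold_pos)
              if not (skip_punct and t in PUNCT_TAGS)]
      return sum(g == p for g, p in kept) / len(kept)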
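
To make the dropout and feasible-transition notes in step 11 concrete, here is a rough numpy sketch of the scoring step; it is my own illustration of the idea, not the CoreNLP code (which marks invalid transitions with -1 and skips them, rather than masking with -inf).

  import numpy as np

  def score_transitions(hidden, W_out, feasible_mask, dropout_p=0.5, training=True):
      # Dropout is applied to the hidden-layer units, not to the embedding
      # outputs (rescaling omitted for brevity).
      if training:
          hidden = hidden * (np.random.rand(*hidden.shape) > dropout_p)
      scores = W_out @ hidden                            # no bias on the output layer
      # Softmax only over the feasible transitions: infeasible ones are masked out.
      scores = np.where(feasible_mask, scores, -np.inf)
      exp = np.exp(scores - scores[feasible_mask].max())
      return exp / exp.sum()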
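
The AdaGrad difference in the last bullet comes down to where the smoothing constant enters the denominator. A minimal sketch of the two variants (my own illustration, not either codebase):

  import numpy as np

  def adagrad_update(param, grad, hist, lr=0.01, style="stanford"):
      hist += grad ** 2                             # accumulated squared gradients
      if style == "stanford":                       # epsilon inside the square root
          step = lr * grad / np.sqrt(hist + 1e-6)
      else:                                         # torch7 style: epsilon outside
          step = lr * grad / (np.sqrt(hist) + 1e-10)
      return param - step, hist

Early in training, while the accumulated history is still tiny, the torch7 placement gives steps of roughly size lr (the ratio is close to the gradient's sign), whereas 1e-6 inside the square root damps small gradients much more strongly; this could account for the faster first-epoch progress, though I haven't verified it.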

References

  1. Chen, D., & Manning, C. (2014). A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 740–750). Doha, Qatar: Association for Computational Linguistics.