This page documents the steps necessary to reproduce the results of Chen & Manning (2014)[1] for English (including a re-implementation) and makes explicit the decisions that aren't covered in the paper.

  1. Obtain data: the WSJ part of the Penn Treebank. Sections 02-21 for training, section 22 for development, section 23 for testing.
  2. Assign POS tags using Stanford POS tagger with ten-way jackknifing of the training data
    • Reported accuracy: ≈ 97.3%
    • I used version 3.6.0, downloaded here, and followed the instructions in the JavaDoc.
    • Reused english-bidirectional-distsim.tagger.props, downloaded the word clusters, and fixed a crash. (This used the bidirectional5words model.)
    • Instructions say: "The part-of-speech tags used as input for training and testing were generated by the Stanford POS Tagger (using the bidirectional5words model)."
    • It's not clear how to divide the folds, which can make a difference. I divided by sentences and the accuracy was 97.18%; I also tried dividing by documents, which wasn't better. (A fold-splitting sketch follows this list.)
  3. Constituent-to-dependency conversion:
    1. LTH Constituent-to-Dependency Conversion Tool
      • Downloaded pennconverter
      • The paper didn't specify command-line options or reference type of conversion
      • Head-finding rules matter
      • The default is CoNLL-2008 conventions and CoNLL-X file format
      • I tried -oldLTH and -conll2007, but they don't split tokens with slashes (different from footnote 6, page 745)
      • I tried -rightBranching=false and the performance of MaltParser was low: around 80% instead of 90%.
      • Command: java -jar pennconverter.jar
      • There was an error in one sentence, which I skipped. I submitted a question on Stack Overflow.
    2. Stanford Basic Dependencies
      • Use Stanford parser v3.3.0 (page 745), downloaded here under the name stanford-parser-full-2013-11-12.
      • Convert PENN Treebank to Stanford Basic Dependency using: java -cp stanford-parser-full-2014-10-31/stanford-parser.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -conllx -originalDependencies -treeFile xxx
  4. Measure statistics: sentences, words, POS tags, labels, projective percentage (Table 3). (See the statistics sketch after this list.)
    • TODO
  5. Evaluation tool:
    • Downloaded MaltEval
    • Should I use the CoNLL-X eval script instead? What is the difference between them? (A sketch of the UAS/LAS computation follows this list.)
    • Stanford also provides an evaluation tool: "The package includes a tool for scoring of generic dependency parses, in a class edu.stanford.nlp.trees.DependencyScoring. This tool measures scores for dependency trees, doing F1 and labeled attachment scoring. The included usage message gives a detailed description of how to use the tool."
  6. Run Stanford neural parser on the data and measure results.
    • Download stanford-parser-full-2014-10-31.zip as instructed here
  7. Run off-the-shelf MaltParser and MSTParser on dev and test sets.
  8. Implement the oracle (see the arc-standard oracle sketch after this list)
  9. Implement the neural net (a classifier sketch follows this list)
    • Dropout: it isn't clear where they applied dropout: to the output of the embedding layer or to the hidden layer? In the implementation, it is applied to the hidden layer units.
    • The paper implies that the learning rate was varied during training ("initial learning rate of Adagrad α = 0.01.") but doesn't reveal the method (annealing/linear/etc.) or by how much.
    • The paper says "A slight variation is that we compute the softmax probabilities only among the feasible transitions in practice." but the implementation (https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/parser/nndep/Classifier.java#L247) actually computes all probabilities.
    • Note from the source code (https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/parser/nndep/Classifier.java#L126): the output layer doesn't have bias terms, which is consistent with the paper -- there is no bias in the feature templates.
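
Sketches

Sketch for step 2 (ten-way jackknifing): a minimal sketch of a sentence-level fold split, under the assumption that the training sections are concatenated into one blank-line-separated file. The file name wsj_02-21.tokens is a placeholder, and the actual tagger training/tagging calls are omitted, so this is not what was literally run.

  # Sentence-level ten-way jackknifing for POS tagging (step 2).
  # Assumes one token per line, sentences separated by blank lines.
  def read_sentences(path):
      """Return a list of sentences, each a list of token lines."""
      sentences, current = [], []
      with open(path, encoding="utf-8") as f:
          for line in f:
              if line.strip():
                  current.append(line.rstrip("\n"))
              elif current:
                  sentences.append(current)
                  current = []
      if current:
          sentences.append(current)
      return sentences

  def jackknife_folds(sentences, k=10):
      """Yield (train, heldout) pairs: fold i is to be tagged by a tagger
      trained on the other k-1 folds."""
      n = len(sentences)
      for i in range(k):
          lo, hi = i * n // k, (i + 1) * n // k
          yield sentences[:lo] + sentences[hi:], sentences[lo:hi]

  if __name__ == "__main__":
      sents = read_sentences("wsj_02-21.tokens")   # placeholder path
      for i, (train, heldout) in enumerate(jackknife_folds(sents)):
          # Training the Stanford POS tagger on `train` and tagging `heldout`
          # would happen here (external calls, omitted).
          print(f"fold {i}: {len(train)} train, {len(heldout)} heldout sentences")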
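
Sketch for step 4 (corpus statistics): a sketch of how the Table-3-style numbers could be computed from a CoNLL-X file, assuming tab-separated columns; the input file name dev.conll is a placeholder. Projectivity is checked via the no-crossing-arcs condition.

  # Count sentences, words, POS tags, dependency labels and the percentage
  # of projective trees in a CoNLL-X file (step 4).
  from itertools import combinations

  def read_conll(path):
      """Yield sentences as lists of (id, form, pos, head, label)."""
      sent = []
      with open(path, encoding="utf-8") as f:
          for line in f:
              line = line.rstrip("\n")
              if not line:
                  if sent:
                      yield sent
                  sent = []
                  continue
              # CoNLL-X columns: ID FORM LEMMA CPOS POS FEATS HEAD DEPREL ...
              cols = line.split("\t")
              sent.append((int(cols[0]), cols[1], cols[4], int(cols[6]), cols[7]))
      if sent:
          yield sent

  def is_projective(sent):
      """A tree is projective iff no two arcs cross (arcs to root included)."""
      arcs = [(min(i, h), max(i, h)) for i, _, _, h, _ in sent]
      for (a, b), (c, d) in combinations(arcs, 2):
          if a < c < b < d or c < a < d < b:
              return False
      return True

  if __name__ == "__main__":
      n_sent = n_proj = n_words = 0
      pos_tags, labels = set(), set()
      for sent in read_conll("dev.conll"):          # placeholder path
          n_sent += 1
          n_words += len(sent)
          n_proj += is_projective(sent)
          pos_tags.update(t[2] for t in sent)
          labels.update(t[4] for t in sent)
      print(f"{n_sent} sentences, {n_words} words, {len(pos_tags)} POS tags, "
            f"{len(labels)} labels, {100.0 * n_proj / n_sent:.1f}% projective")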
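
Sketch for step 5 (evaluation): a sketch of the unlabeled/labeled attachment scores that MaltEval and the CoNLL-X eval script report, to make explicit what is being measured. One known difference between tools is punctuation handling; this sketch scores every token, which is an assumption rather than necessarily what the paper does. File names are placeholders.

  # UAS/LAS over a gold and a predicted CoNLL-X file of equal length.
  def attachments(path):
      """Yield (head, label) per token line, skipping sentence breaks."""
      with open(path, encoding="utf-8") as f:
          for line in f:
              if line.strip():
                  cols = line.rstrip("\n").split("\t")
                  yield int(cols[6]), cols[7]       # HEAD, DEPREL columns

  def score(gold_path, pred_path):
      total = uas = las = 0
      for (gh, gl), (ph, pl) in zip(attachments(gold_path), attachments(pred_path)):
          total += 1
          if gh == ph:
              uas += 1
              if gl == pl:
                  las += 1
      return 100.0 * uas / total, 100.0 * las / total

  if __name__ == "__main__":
      print("UAS %.2f  LAS %.2f" % score("gold.conll", "pred.conll"))  # placeholders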
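
Sketch for step 8 (oracle): a static arc-standard oracle for projective gold trees, following the transition system used by Chen & Manning (2014): LEFT-ARC and RIGHT-ARC operate on the top two stack items, SHIFT moves the next buffer word onto the stack. The function and variable names are mine, not taken from the implementation.

  def oracle(heads):
      """heads maps token id (1..n) to its gold head id (0 = root).
      Returns the gold transition sequence for a projective tree."""
      n = len(heads)
      n_deps = {i: 0 for i in range(n + 1)}          # gold dependents per head
      for d, h in heads.items():
          n_deps[h] += 1
      attached = {i: 0 for i in range(n + 1)}        # dependents attached so far

      stack, buf, actions = [0], list(range(1, n + 1)), []
      while buf or len(stack) > 1:
          if len(stack) >= 2:
              s1, s2 = stack[-1], stack[-2]
              # LEFT-ARC: second-from-top is a dependent of the top.
              if s2 != 0 and heads[s2] == s1:
                  actions.append("LEFT-ARC")
                  attached[s1] += 1
                  stack.pop(-2)
                  continue
              # RIGHT-ARC: top is a dependent of the second-from-top and
              # has already collected all of its own dependents.
              if heads[s1] == s2 and attached[s1] == n_deps[s1]:
                  actions.append("RIGHT-ARC")
                  attached[s2] += 1
                  stack.pop()
                  continue
          actions.append("SHIFT")
          stack.append(buf.pop(0))
      return actions

  # Example: "He likes dogs" with heads {1: 2, 2: 0, 3: 2} gives
  # SHIFT, SHIFT, LEFT-ARC, SHIFT, RIGHT-ARC, RIGHT-ARC.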
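
Sketch for step 9 (neural net): a numpy sketch of the classifier details discussed above: the cube activation, dropout applied to the hidden layer, no bias on the output layer, Adagrad with initial learning rate 0.01, and a test-time softmax restricted to feasible transitions. The layer sizes follow the paper's reported setup (50-dim embeddings, 200 hidden units, dropout 0.5); the number of transitions, the feasibility mask, and the inverted-dropout scaling are simplifying assumptions, not necessarily what the released implementation does.

  import numpy as np

  rng = np.random.default_rng(0)
  d_in, d_h, n_trans = 48 * 50, 200, 3          # 48 features x 50-dim embeddings; n_trans is illustrative
  W1 = rng.normal(0, 0.01, (d_h, d_in))
  b1 = np.zeros(d_h)
  W2 = rng.normal(0, 0.01, (n_trans, d_h))      # no bias on the output layer

  def forward(x, dropout=0.5, train=True):
      h = (W1 @ x + b1) ** 3                    # cube activation
      if train:                                 # dropout on the hidden units
          h *= (rng.random(d_h) > dropout) / (1.0 - dropout)   # inverted dropout (assumption)
      return W2 @ h

  def softmax_feasible(scores, feasible):
      """Test time: normalise only over transitions that are legal in the
      current parser state (`feasible` is a boolean mask)."""
      s = np.where(feasible, scores, -np.inf)
      e = np.exp(s - s.max())
      return e / e.sum()

  # Adagrad: per-parameter accumulated squared gradients, alpha = 0.01.
  alpha, eps = 0.01, 1e-6
  grad_sq = {name: np.zeros_like(p) for name, p in {"W1": W1, "b1": b1, "W2": W2}.items()}

  def adagrad_update(name, param, grad):
      grad_sq[name] += grad ** 2
      param -= alpha * grad / (np.sqrt(grad_sq[name]) + eps)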

References

  1. Chen, D., & Manning, C. (2014). A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 740–750). Doha, Qatar: Association for Computational Linguistics.