A Dependency Parser for Tweets

@inproceedings{Kong2014ADP,
  title={A Dependency Parser for Tweets},
  author={Lingpeng Kong and Nathan Schneider and Swabha Swayamdipta and Archna Bhatia and Chris Dyer and Noah A. Smith},
  booktitle={EMNLP},
  year={2014}
}
We describe a new dependency parser for English tweets, TWEEBOPARSER. The parser builds on several contributions: new syntactic annotations for a corpus of tweets (TWEEBANK), with conventions informed by the domain; adaptations to a statistical parsing algorithm; and a new approach to exploiting out-of-domain Penn Treebank data. Our experiments show that the parser achieves over 80% unlabeled attachment accuracy on our new, high-quality test set and measure the benefit of our contributions. Our… 
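The accuracy figure above refers to unlabeled attachment score (UAS): the fraction of tokens whose predicted syntactic head matches the gold-standard head, ignoring dependency labels. As a point of reference, here is a minimal sketch of how UAS is computed; it is our own illustration, not code from the paper, and the head indices in the example are hypothetical.

# A minimal sketch (ours, not from the paper) of unlabeled attachment
# score (UAS): the fraction of tokens whose predicted head matches the
# gold head, ignoring dependency labels.

def unlabeled_attachment_score(gold_heads, pred_heads):
    """Each argument is a list of head indices, one per token
    (0 conventionally marks the root)."""
    assert len(gold_heads) == len(pred_heads)
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

# Hypothetical example: a 5-token tweet where the parser gets
# 4 of 5 heads right, i.e. UAS = 0.8, the level reported above.
gold = [2, 0, 2, 5, 3]
pred = [2, 0, 2, 5, 2]
print(unlabeled_attachment_score(gold, pred))  # 0.8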

Citations

Parsing Tweets into Universal Dependencies
TLDR
This work shows that it is challenging to deliver consistent annotation due to ambiguity in understanding and explaining tweets, and proposes a new method to distill an ensemble of 20 transition-based parsers into a single one that achieves an improvement of 2.2 LAS over the un-ensembled baseline and outperforms parsers that are state-of-the-art on other treebanks in both accuracy and speed. (A minimal sketch of this distillation idea appears after this list.)
tweeDe – A Universal Dependencies treebank for German tweets
TLDR
This paper introduces the first German treebank for Twitter microtext, annotated within the framework of Universal Dependencies; it describes the data selection and annotation process and presents baseline parsing results for the new test suite.
Dependency Parsing for Tweets
TLDR
Experimental results show that the neural tweet parser is over 15 times faster than Tweeboparser (Kong et al., 2014), the previous state-of-the-art parser for tweets, and that with tri-training data the parser also outperforms Tweeboparser.
The Denoised Web Treebank: Evaluating Dependency Parsing under Noisy Input Conditions
We introduce the Denoised Web Treebank: a treebank including a normalization layer and a corresponding evaluation metric for dependency parsing of noisy text, such as tweets. This benchmark enables…
Foreebank: Syntactic Analysis of Customer Support Forums
TLDR
A new treebank of English and French technical forum content, annotated for grammatical errors and phrase structure, is presented in order to empirically measure the effect of errors on parsing performance.
Parse Imputation for Dependency Annotations
TLDR
This work describes a method for imputing missing dependencies from sentences that have been partially annotated using the Graph Fragment Language, such that a standard dependency parser can then be trained on all annotations.
An Evaluation of Parser Robustness for Ungrammatical Sentences
TLDR
This paper compares the performance of eight state-of-the-art dependency parsers on two domains of ungrammatical sentences, learner English and machine translation outputs, and develops an evaluation metric that may help practitioners choose an appropriate parser for their tasks and help developers improve parser robustness against ungrammatical sentences.
Arabic Tweets Treebanking and Parsing: A Bootstrapping Approach
TLDR
Experimental results show that this bootstrapping method can improve both the speed of training the parser and the accuracy of the resulting parsers, and that it can create a dependency treebank from unlabelled tweets without any manual intervention.
Dependency Parsing for Weibo: An Efficient Probabilistic Logic Programming Approach
TLDR
This work presents a new GFL/FUDG-annotated Chinese treebank with more than 18K tokens from Sina Weibo, and formulates the dependency parsing problem as many small, parallelizable arc prediction tasks: for each task, a programmable probabilistic first-order logic is used to infer the dependency arc of a token in the sentence.
EmpiriST Corpus 2.0: Adding Manual Normalization, Lemmatization and Semantic Tagging to a German Web and CMC Corpus
TLDR
A manually tokenized and part-of-speech tagged corpus of approximately 23,000 tokens of German Web and CMC data is extended with manually created annotation layers for word form normalization, lemmatization and lexical semantics.
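The ensemble-distillation idea summarized in the first citation above can be made concrete with a short sketch. This is our own illustration, not code from the cited paper: it assumes each ensemble member exposes a probability distribution over transition-parser actions at a given state, averages those distributions, and trains a single student parser against the averaged soft targets.

# A minimal, self-contained sketch (ours, not from the cited paper) of
# parser-ensemble distillation: average the action distributions of the
# ensemble members and train one student parser on the soft targets.
import numpy as np

def ensemble_soft_targets(member_probs):
    """member_probs: array of shape (n_members, n_actions), each row a
    probability distribution over parser actions for one parse state."""
    return np.mean(member_probs, axis=0)

def distillation_loss(student_probs, soft_targets, eps=1e-12):
    """Cross-entropy of the student's action distribution against the
    ensemble's averaged distribution (minimized during training)."""
    return -np.sum(soft_targets * np.log(student_probs + eps))

# Hypothetical example: 3 ensemble members scoring 4 possible actions
# (e.g. SHIFT, LEFT-ARC, RIGHT-ARC, SWAP) at one transition state.
members = np.array([
    [0.6, 0.2, 0.1, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.7, 0.1, 0.1, 0.1],
])
targets = ensemble_soft_targets(members)        # [0.6, 0.2, 0.1, 0.1]
student = np.array([0.55, 0.25, 0.1, 0.1])
print(distillation_loss(student, targets))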

References

Showing 1-10 of 69 references
From News to Comment: Resources and Benchmarks for Parsing the Language of Web 2.0
TLDR
It is found that the Wall-Street-Journal-trained statistical parsers have a particular problem with tweets and that a substantial part of this problem is related to POS tagging accuracy.
#hardtoparse: POS Tagging and Parsing the Twitterverse
TLDR
Retraining Malt on dependency trees produced by a state-of-the-art phrase structure parser, itself self-trained on Twitter material, results in a significant improvement; the improvement is analysed by examining in detail the effect of the retraining on individual dependency types.
Simple Semi-supervised Dependency Parsing
TLDR
This work focuses on the problem of lexical representation, introducing features that incorporate word clusters derived from a large unannotated corpus, and shows that the cluster-based features yield substantial gains in performance across a wide range of conditions. (A minimal sketch of such cluster features appears after this list.)
Training Parsers on Incompatible Treebanks
TLDR
Two simple adaptation methods are presented: the first is based on using a shared feature representation when parsing multiple treebanks, and the second on guided parsing, where the output of one parser provides features for a second one.
Building a Treebank for French
TLDR
A treebank project for French has annotated a newspaper corpus of 1 million words with part-of-speech, inflection, compound, lemma and constituency information, and presents some uses of the corpus.
Stacking Dependency Parsers
TLDR
Experiments on twelve languages show that stacking transition-based and graph-based parsers improves performance over existing state-of-the-art dependency parsers.
Three New Probabilistic Models for Dependency Parsing: An Exploration
TLDR
Preliminary empirical results from evaluating the three models' parsing performance on annotated Wall Street Journal training text (derived from the Penn Treebank) suggest that the generative model performs significantly better than the others, and does about equally well at assigning part-of-speech tags.
Overview of the 2012 Shared Task on Parsing the Web
TLDR
This paper describes a shared task on parsing web text from the Google Web Treebank, whose goal was to build a single parsing system that is robust to domain changes and can handle the noisy text commonly encountered on the web.
Experiments with a Higher-Order Projective Dependency Parser
TLDR
In the multilingual exercise of the CoNLL-2007 shared task (Nivre et al., 2007), the system obtains the best accuracy for English, and the second best accuracies for Basque and Czech.
Building a Large Annotated Corpus of English: The Penn Treebank
TLDR
As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
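The cluster-based features from "Simple Semi-supervised Dependency Parsing" above lend themselves to a short illustration. This is our own sketch under assumed inputs, not code from the cited paper: the Brown-cluster bit strings below are hypothetical, and the sketch shows how prefixes of those bit strings serve as coarser word classes in arc features for a linear parser.

# A minimal sketch (ours, not from the cited paper) of cluster-based
# features for dependency parsing: Brown clustering assigns each word a
# bit string, and prefixes of that string act as coarser word classes.

# Hypothetical Brown-cluster bit strings for a few words.
BROWN_CLUSTERS = {
    "lol": "110100",
    "haha": "110101",   # shares the prefix 1101 with "lol"
    "parser": "001110",
}

def cluster_features(head, dependent, prefix_lengths=(4, 6)):
    """Emit head/dependent cluster-prefix features for one candidate arc."""
    feats = []
    for n in prefix_lengths:
        h = BROWN_CLUSTERS.get(head, "")[:n]
        d = BROWN_CLUSTERS.get(dependent, "")[:n]
        feats.append(f"hc{n}={h}")            # head cluster prefix
        feats.append(f"dc{n}={d}")            # dependent cluster prefix
        feats.append(f"hc{n}+dc{n}={h}+{d}")  # conjoined arc feature
    return feats

print(cluster_features("parser", "lol"))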