Corpus ID: 14429136

Web-scale Surface and Syntactic n-gram Features for Dependency Parsing

Dominick Ng, Mohit Bansal, James R. Curran
We develop novel first- and second-order features for dependency parsing based on the Google Syntactic Ngrams corpus, a collection of subtree counts of parsed sentences from scanned books. We also extend previous work on surface $n$-gram features from Web1T to the Google Books corpus and from first-order to second-order, comparing and analysing performance over newswire and web treebanks. Surface and syntactic $n$-grams both produce substantial and complementary gains in parsing accuracy… 

Improving Dependency Parsing on Clinical Text with Syntactic Clusters from Web Text
This paper proposes deriving syntactic knowledge from web text in the form of syntactic cluster features to improve dependency parsing on clinical text. It clusters words according to their distributed representations and uses these syntactic cluster features to address the data sparseness problem.
Evaluating Parsers with Dependency Constraints
A constraint-based evaluation for dependency and Combinatory Categorial Grammar (CCG) parsers is developed, based on enforcing the presence of certain dependencies during parsing while allowing the parser to choose the remainder of the analysis according to its grammar and model.
A Topological Approach to Compare Document Semantics Based on a New Variant of Syntactic N-grams
A new variant of sn-grams, named generalized phrases (GPs), is proposed together with a topological approach, named DSCoH, for computing document semantic similarities; the approach has been extensively tested on document semantics comparison and document clustering tasks.
Research Statement Mohit Bansal
This research addresses the various deep and subtle semantic ambiguities in natural language by learning novel weakly-labeled and cross-modal semantic representations with accurate, well-formulated disambiguation models, achieving the state-of-the-art on various core NLP tasks and multimodal applications.


Web-Scale Features for Full-Scale Parsing
This work first presents a method for generating web count features that address the full range of syntactic attachments, and integrates these features into full-scale dependency and constituent parsers.
Improving Dependency Parsing with Subtrees from Auto-Parsed Data
First, a baseline parser is used to parse large-scale unannotated data, then subtrees from dependency parse trees in the auto-parsed data are extracted, and new subtree-based features for parsing algorithms are constructed.
Overview of the 2012 Shared Task on Parsing the Web
A shared task on parsing web text from the Google Web Treebank is described; the goal is to build a single parsing system that is robust to domain changes and can handle the noisy text commonly encountered on the web.
Using Web-scale N-grams to Improve Base NP Parsing Performance
The web-scale N-grams are used in a base NP parser that correctly analyzes 95.4% of the base NPs in natural text and improves performance log-linearly with the number of parameters in the model.
Semi-Supervised Feature Transformation for Dependency Parsing
This paper proposes a novel semi-supervised approach to the data sparseness problem: base features are transformed into high-level features (i.e. meta features) with the help of a large amount of automatically parsed data.
A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books
A dataset of syntactic-ngrams (counted dependency-tree fragments) based on a corpus of 3.5 million English books includes temporal information, facilitating new kinds of research into lexical semantics over time.
Extended Constituent-to-Dependency Conversion for English
A new method for converting English constituent trees in the Penn Treebank annotation style into dependency trees is described; it uses a richer set of edge labels and introduces links to handle long-distance phenomena such as wh-movement and topicalization.
Syntactic Annotations for the Google Books NGram Corpus
A new edition of the Google Books Ngram Corpus, which describes how often words and phrases were used over a period of five centuries, in eight languages, is presented, which will facilitate the study of linguistic trends, especially those related to the evolution of syntax.
Attacking Parsing Bottlenecks with Unlabeled Data and Relevant Factorizations
By including unlabeled data features into a factorization of the problem which matches the representation of prepositions and conjunctions, this work achieves a new state-of-the-art for English dependencies with 93.55% correct attachments on the current standard.
Online Learning of Approximate Dependency Parsing Algorithms
In this paper we extend the maximum spanning tree (MST) dependency parsing framework of McDonald et al. (2005c) to incorporate higher-order feature representations and allow dependency structures