Linguistically Motivated Large-Scale NLP with C&C and Boxer

Abstract

The statistical modelling of language, together with advances in wide-coverage grammar development, have led to high levels of robustness and efficiency in NLP systems and made linguistically motivated large-scale language processing a possibility (Matsuzaki et al., 2007; Kaplan et al., 2004). This paper describes an NLP system which is based on syntactic and semantic formalisms from theoretical linguistics, and which we have used to analyse the entire Gigaword corpus (1 billion words) in less than 5 days using only 18 processors. This combination of detail and speed of analysis represents a breakthrough in NLP technology. The system is built around a wide-coverage Combinatory Categorial Grammar (CCG) parser (Clark and Curran, 2004b). The parser not only recovers the local dependencies output by treebank parsers such as Collins (2003), but also the long-range depdendencies inherent in constructions such as extraction and coordination. CCG is a lexicalized grammar formalism, so that each word in a sentence is assigned an elementary syntactic structure, in CCG’s case a lexical category expressing subcategorisation information. Statistical tagging techniques can assign lexical categories with high accuracy and low ambiguity (Curran et al., 2006). The combination of finite-state supertagging and highly engineered C++ leads to a parser which can analyse up to 30 sentences per second on standard hardware (Clark and Curran, 2004a). The C&C tools also contain a number of Maximum Entropy taggers, including the CCG supertagger, a POS tagger (Curran and Clark, 2003a), chunker, and named entity recogniser (Curran and Clark, 2003b). The taggers are highly efficient, with processing speeds of over 100,000 words per second. Finally, the various components, including the morphological analyser morpha (Minnen et al., 2001), are combined into a single program. The output from this program— a CCG derivation, POS tags, lemmas, and named entity tags — is used by the module Boxer (Bos, 2005) to produce interpretable structure in the form of Discourse Representation Structures (DRSs).

Extracted Key Phrases

2 Figures and Tables

0204020072008200920102011201220132014201520162017
Citations per Year

237 Citations

Semantic Scholar estimates that this publication has 237 citations based on the available data.

See our FAQ for additional information.

Cite this paper

@inproceedings{Curran2007LinguisticallyML, title={Linguistically Motivated Large-Scale NLP with C&C and Boxer}, author={James R. Curran and Stephen Clark and Johan Bos}, booktitle={ACL}, year={2007} }