Learn More
The Natural Language Toolkit is a suite of program modules, data sets, tutorials and exercises, covering symbolic and statistical natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past three years, NLTK has become popular in teaching and research. We describe the toolkit and report on its(More)
Linguistic annotation" covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions – audio, video and/or physiological recordings – or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense(More)
The ACL Anthology is a digital archive of conference and journal papers in natural language processing and computational linguistics. Its primary purpose is to serve as a reference repository of research results, but we believe that it can also be an object of study and a platform for research in its own right. We describe an enriched and standardized(More)
We describe a formal model for annotating linguistic artifacts, from which we derive an application programming interface (API) to a suite of tools for manipulating these annotations. The abstract logical model provides for a range of storage formats and promotes the reuse of tools that interact through this API. We focus first on “Annotation Graphs,” a(More)
The process of documenting and describing the world’s languages is undergoing radical transformation with the rapid uptake of new digital technologies for capture, storage, annotation and dissemination. However, uncritical adoption of new tools and technologies is leading to resources that are difficult to reuse and which are less portable than the(More)
When phonological rules are regarded as declarative descriptions, it is possible to construct a model of phonology in which rules and representations are no longer distinguished and such procedural devices as rule-ordering are absent. In this paper we present a finite-state model of phonology in which automata are the descriptions and tapes (or strings) are(More)
This book is a revised and expanded version of the author's Ph.D. thesis (Edinburgh University, 1990), entitled Constraint-based phonology. The field of computational phonology is rather immature as a whole and there have been only a handful of publications on constraint-based theories of phonology. Bird's thesis, and its revision, stand as conceptual and(More)
The task of identifying the language in which a given document (ranging from a sentence to thousands of pages) is written has been relatively well studied over several decades. Automated approaches to written language identification are used widely throughout research and industrial contexts, over both oral and written source materials. Despite this(More)