The Unreasonable Effectiveness of Data

@article{Halevy2009TheUE,
  title={The Unreasonable Effectiveness of Data},
  author={Alon Halevy and Peter Norvig and Fernando C. N. Pereira},
  journal={IEEE Intelligent Systems},
  year={2009},
  volume={24},
  pages={8--12}
}
At Brown University, there is excitement about having access to the Brown Corpus, containing one million English words. [...] So, this corpus could serve as the basis of a complete model for certain tasks - if only we knew how to extract the model from the data.
Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl
TLDR
An index of all sentences and their linguistic metadata is built, enabling quick search across the corpus; the utility of this corpus is demonstrated on the verb similarity task by showing that a distributional model trained on it yields better results than models trained on smaller corpora, like Wikipedia.
Jenny Rose Finkel Research Statement
The field of natural language processing (NLP) is already responsible for several widely used technologies, including machine translation and automatic speech recognition, and with the rise of [...]
Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing
TLDR
It is demonstrated that derived constraints aid grammar induction by training Klein and Manning's Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-the-art by more than 5%.
A Method for Extracting Word Lists from the 361-billion Token Google Books Corpus
TLDR
The process of creating a software tool for extracting a new word frequency list from the Google Books corpus using lists prepared and released by Michel et al. (2011) is described.
Baby Steps: How “Less is More” in Unsupervised Dependency Parsing
TLDR
An empirical study of two very simple approaches to unsupervised grammar induction based on Klein and Manning's Dependency Model with Valence, which requires no initialization and bootstraps itself via iterated learning of increasingly longer sentences.
An Economic Approach to Big Data in a Minority Language
TLDR
An innovative and economical approach to large-scale n-gram system creation is applied to the Croatian language; instead of using the Web as the world's biggest text repository, it relies on the Croatian academic online spellchecker Hascheck, a language service publicly available since 1993 and popular worldwide.
Corpus Analysis of the World Wide Web
TLDR
The World Wide Web has become a primary meeting place for information and recreation, for communication and commerce, for a quarter of the world's population, and it serves as a source of machine-readable texts for corpus linguists and researchers in complementary fields like natural language processing, information retrieval, and text mining.
Methods for Sentence Compression
TLDR
The three papers discussed here take different approaches to identifying important content, determining which sentences are grammatical, and jointly optimizing these objectives; the discussion concludes with ideas for future work in this area.
PELESent: Cross-Domain Polarity Classification Using Distant Supervision
TLDR
The methods obtained very competitive results on five annotated corpora from mixed domains (Twitter and product reviews), which demonstrates the domain-independent property of the approach and suggests that the combination of emoticons and emojis can properly capture the sentiment of a message.
Unsupervised Learning of Lexical Information for Language Processing Systems
TLDR
This thesis attempts to answer the question of which lexical units should be used for these applications by acquiring them through unsupervised learning, and presents models for lexical learning for speech recognition and machine translation.

References

SHOWING 1-10 OF 22 REFERENCES
Scaling Textual Inference to the Web
TLDR
The Holmes system utilizes textual inference (TI) over tuples extracted from text to scale TI to a corpus of 117 million Web pages; its runtime is linear in the size of the input corpus.
WebTables: exploring the power of tables on the web
TLDR
The WEBTABLES system develops new techniques for keyword search over a corpus of tables, and shows that they can achieve substantially higher relevance than solutions based on a traditional search engine.
Organizing and searching the world wide web of facts -- step two: harnessing the wisdom of the crowds
TLDR
Inherently noisy search queries are shown to be a highly valuable, albeit unexplored, resource for Web-based information extraction, in particular for the task of class attribute extraction.
Max-Margin Parsing
TLDR
A novel discriminative approach to parsing inspired by the large-margin criterion underlying support vector machines is presented, which allows one to efficiently learn a model that discriminates among the entire space of parse trees, as opposed to reranking the top few candidates.
Learning to create data-integrating queries
TLDR
A system is presented with which a non-expert user can author new query templates and Web forms, to be reused by anyone with related information needs; it learns to assign costs to sources and associations according to the user's specific information need, thereby changing the ranking of the queries used to generate results.
Translating Queries into Snippets for Improved Query Expansion
TLDR
It is shown that the combination of a query-to-snippet translation model with a large n-gram language model trained on queries achieves improved contextual query expansion compared to a system based on term correlations.
The Unreasonable Effectiveness of Mathematics in the Natural Sciences
There is a story about two friends, who were classmates in high school, talking about their jobs. One of them became a statistician and was working on population trends. He showed a reprint to his [...]
Towards a Quantitative, Platform-Independent Analysis of Knowledge Systems
TLDR
Examining the failure analysis and initial empirical use of the taxonomy provides quantitative insights into the strengths and weaknesses of individual systems and raises some issues shared by all three, implying the need to improve both its granularity and precision.
Scene completion using millions of photographs
What can you do with a million images? In this paper we present a new image completion algorithm powered by a huge database of photographs gathered from the Web. The algorithm patches up holes in [...]
TLDR
A new image completion algorithm powered by a huge database of photographs gathered from the Web that can generate a diverse set of image completions and allow users to select among them.