Automatic Identification of Facts in Real Internet Texts in Spanish Using Lightweight Syntactic Constraints: Problems, Their Causes, and Ways for Improvement



By a "fact" in a given text we understand a triple <first argument, relation, second argument>, where all three components are fragments of the text. For example, in the text "The policeman saw a boy who crossed the street" two facts can be identified: <The policeman, saw, a boy> and <a boy, crossed, the street>. Humans identify such facts easily, but for a computer the task is far from straightforward. Existing approaches based on deep automatic linguistic analysis are too slow for web-size corpora and too fragile to be of practical use on real Internet texts, while machine learning approaches are too computationally complex and imprecise. It was previously shown that a very simple and fast approach based on lightweight syntactic constraints can achieve comparable performance with much lower computational and implementation complexity. However, this approach is prone to errors specific to it, and these errors have not been analyzed previously. In this paper, we analyze and classify the main types of errors in the fact extractions performed by ExtrHech, a state-of-the-art fact extraction system for Spanish based on lightweight syntactic constraints. We also identify the causes of these errors and propose possible solutions, assessing the cost and scale of impact of each.
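To make the notion of a fact and of a lightweight syntactic constraint concrete, here is a minimal Python sketch. It is not ExtrHech's actual rule set: the tag names, the one-letter codes, and the single pattern "noun phrase, optional relative pronoun, verb, noun phrase" are assumptions chosen only to reproduce the English example above, and the input is tagged by hand rather than by a real part-of-speech tagger.

```python
import re
from typing import List, NamedTuple, Tuple


class Fact(NamedTuple):
    arg1: str
    relation: str
    arg2: str


# Hypothetical tag set: each POS tag maps to one character, so that a
# regular expression over the code string works token by token.
TAG_CODES = {"DET": "D", "NOUN": "N", "VERB": "V", "REL": "R"}

# Illustrative constraint (an assumption, not ExtrHech's grammar):
# noun phrase, optional relative pronoun, verb, noun phrase.
# The second noun phrase sits in a lookahead so that it can also
# serve as the first argument of the next fact.
NP = "D?N"
PATTERN = re.compile(f"({NP})R?(V)(?=({NP}))")


def extract_facts(tagged: List[Tuple[str, str]]) -> List[Fact]:
    """Return every <arg1, relation, arg2> triple matched by PATTERN."""
    words = [word for word, _ in tagged]
    # Tags outside TAG_CODES become "x", which no part of PATTERN matches.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    facts = []
    for m in PATTERN.finditer(codes):
        # One character per token, so regex offsets are token indices.
        parts = [" ".join(words[m.start(g):m.end(g)]) for g in (1, 2, 3)]
        facts.append(Fact(*parts))
    return facts


# Hand-tagged version of the example sentence from the abstract.
SENTENCE = [
    ("The", "DET"), ("policeman", "NOUN"), ("saw", "VERB"),
    ("a", "DET"), ("boy", "NOUN"), ("who", "REL"),
    ("crossed", "VERB"), ("the", "DET"), ("street", "NOUN"),
]

facts = extract_facts(SENTENCE)
```

On the hand-tagged sentence, `facts` holds the two triples named in the abstract, <The policeman, saw, a boy> and <a boy, crossed, the street>. The sketch involves no parsing, only a tagged token sequence and one pattern, which illustrates both why such approaches are fast and why they can misfire on noisy Internet text where tags or word order deviate from the pattern.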



This file shows a preliminary version that may differ from the final version

  • 2015
