Automatic Identification of Facts in Real Internet Texts in Spanish Using Lightweight Syntactic Constraints: Problems, Their Causes, and Ways for Improvement

By a "fact" in a given text we understand a triple <first argument, relation, second argument>, where all three components are fragments of the text. For example, in the text "The policeman saw a boy who crossed the street" two facts can be identified: <The policeman, saw, a boy> and <a boy, crossed, the street>. While humans identify such facts easily, for a computer the task is far from straightforward. Existing approaches based on deep automatic linguistic analysis are too slow for web-size corpora and too fragile to be of practical use on real Internet texts, while machine learning approaches are too computationally complex and imprecise. It was previously shown that a very simple and fast approach based on lightweight syntactic constraints can achieve comparable performance with much lower computational and implementation complexity. However, this approach is prone to certain errors that are specific to it, and these errors have not been analyzed previously. In this paper, we analyze and classify the main types of errors in the fact extractions performed by ExtrHech, a state-of-the-art fact extraction system for Spanish based on lightweight syntactic constraints. We also identify the causes of these errors and suggest possible solutions, with a corresponding analysis of their cost and scale of impact.
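
To make the notion of a fact triple concrete, the following minimal Python sketch extracts the two facts from the example sentence using a toy constraint over part-of-speech tags. It is an illustration only and not the ExtrHech rule set: the tag names (DET, ADJ, NOUN, VERB, REL), the noun-phrase pattern, the hand-assigned tags, and the names Fact, noun_phrase_end, and extract_facts are all assumptions introduced here.

```python
from typing import List, NamedTuple, Tuple


class Fact(NamedTuple):
    """A fact triple whose components are fragments of the text."""
    arg1: str
    relation: str
    arg2: str


Token = Tuple[str, str]  # (word, coarse POS tag); the tag set is hypothetical


def noun_phrase_end(tags: List[str], start: int) -> int:
    """Match DET? ADJ* NOUN+ at `start`; return the exclusive end index,
    or `start` itself when no noun phrase begins there."""
    i = start
    if i < len(tags) and tags[i] == "DET":
        i += 1
    while i < len(tags) and tags[i] == "ADJ":
        i += 1
    first_noun = i
    while i < len(tags) and tags[i] == "NOUN":
        i += 1
    return i if i > first_noun else start


def extract_facts(tokens: List[Token]) -> List[Fact]:
    """Toy constraint: NP (REL)? VERB NP yields <NP, VERB, NP>."""
    words = [word for word, _ in tokens]
    tags = [tag for _, tag in tokens]

    # Collect noun-phrase spans as (start, end) index pairs.
    spans = []
    i = 0
    while i < len(tags):
        end = noun_phrase_end(tags, i)
        if end > i:
            spans.append((i, end))
            i = end
        else:
            i += 1

    facts = []
    for v, tag in enumerate(tags):
        if tag != "VERB":
            continue
        # The first argument ends right before the verb, or right before
        # a relative pronoun that immediately precedes the verb.
        left_edge = v - 1 if v > 0 and tags[v - 1] == "REL" else v
        left = [s for s in spans if s[1] == left_edge]
        # The second argument starts right after the verb.
        right = [s for s in spans if s[0] == v + 1]
        if left and right:
            (ls, le), (rs, re_) = left[0], right[0]
            facts.append(Fact(" ".join(words[ls:le]), words[v],
                              " ".join(words[rs:re_])))
    return facts


# Hand-tagged example sentence from the abstract (tags assigned manually).
SENTENCE: List[Token] = [
    ("The", "DET"), ("policeman", "NOUN"), ("saw", "VERB"),
    ("a", "DET"), ("boy", "NOUN"), ("who", "REL"),
    ("crossed", "VERB"), ("the", "DET"), ("street", "NOUN"),
]

if __name__ == "__main__":
    for fact in extract_facts(SENTENCE):
        print(f"<{fact.arg1}, {fact.relation}, {fact.arg2}>")
    # <The policeman, saw, a boy>
    # <a boy, crossed, the street>
```

A pattern this shallow needs only POS tags and a single pass over the sentence, which suggests why such constraints are fast, and also why they can misfire on constructions the pattern does not anticipate.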

