Multilingual Media Monitoring and Text Analysis - Challenges for Highly Inflected Languages

Abstract

We present the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items, and extracts named entities and various other information from them. We also give an overview of EMM’s text mining tool set, focusing on how the software deals with highly inflected languages such as those of the Slavic and Finno-Ugric language families. We ask three questions: How can extraction patterns be adapted to such languages? How can extracted named entities be de-inflected? And will document categorisation benefit from lemmatising the texts?

DOI: 10.1007/978-3-642-40585-3_3

Cite this paper

@inproceedings{Steinberger2013MultilingualMM,
  title={Multilingual Media Monitoring and Text Analysis - Challenges for Highly Inflected Languages},
  author={Ralf Steinberger and Maud Ehrmann and J{\'u}lia Pajzs and Mohamed Ebrahim and Josef Steinberger and Marco Turchi},
  booktitle={TSD},
  year={2013}
}