• Corpus ID: 14890580

Cross-lingual Information Extraction from Web pages : the use of a general-purpose Text Engineering Platform

Georgios Petasis, Vangelis Karkaletsis, Constantine D. Spyropoulos
In this paper we present how the use of a general-purpose text engineering platform has facilitated the development of a cross-lingual information extraction system and its adaptation to new domains and languages. Our approach to cross-lingual information extraction from the Web covers the whole process, from the identification of Web sites of interest, to the location of the domain-specific Web pages, to the extraction of specific information from the Web pages and its presentation to the end-user…

Information Retrieval and Extraction from the Web: the CROSSMARC approach
The paper presents the CROSSMARC approach for the complex task of identification of interesting web sites and web pages and the extraction of information from them by adopting and implementing an open, multi-lingual and multi-agent architecture and providing an infrastructure that facilitates customization of its components to new domains and languages.
Automated ontology instantiation from tabular web sources - The AllRight system
The AllRight ontology instantiation system is presented, which supports the full instantiation life-cycle and addresses the above-mentioned challenges through a combination of new and existing techniques.
AllRight: Automatic Ontology Instantiation from Tabular Web Documents
The techniques implemented in ALLRIGHT are designed for application scenarios in which the desired instance information is given in the form of tables and for which existing Information Extraction approaches based on statistical or natural language processing methods are not directly applicable.
Ellogon: A Natural Language Engineering Infrastructure
This paper presents Ellogon, a multi-lingual, cross-operating system, general-purpose natural language engineering infrastructure. Ellogon was designed in order to aid both researchers in natural
Text Area Identification in Web Images
This paper proposes a novel Web image processing algorithm that aims to locate text areas and prepare them for OCR so that recognition yields better results, and demonstrates the efficiency of the methodology.
A Hybrid Method for Domain Ontology Construction from the Web
The proposed approach can effectively construct a cancer domain ontology from unstructured text documents and outperforms approaches that rely purely on statistical or purely on semantic relationships among concepts.
Using the Ellogon Natural Language Engineering Infrastructure
Ellogon is a multi-lingual, cross-operating system, general-purpose natural language engineering infrastructure. Ellogon has been used extensively in various NLP applications. It is currently
Enhancing Ontological Knowledge Through Ontology Population and Enrichment
An incremental ontology maintenance methodology which exploits ontology population and enrichment methods to enhance the knowledge captured by the instances of the ontology and their various lexicalizations is proposed.


Learning to Extract Text-Based Information from the World Wide Web
Webfoot, a preprocessor that parses web pages into logically coherent segments based on page layout cues, is introduced; its output is passed on to CRYSTAL, an NLP system that learns text extraction rules from examples.
Ellogon: A New Text Engineering Platform
Ellogon provides a powerful TIPSTER-based infrastructure for managing, storing and exchanging textual data, embedding and managing text processing components as well as visualising textual data and their associated linguistic information.
STALKER: Learning Extraction Rules for Semistructured, Web-based Information Sources
A wrapper-induction algorithm that generates extraction rules for Web-based information sources that are expressed as simple landmark grammars, which are a class of landmark automata that is more expressive than the existing extraction languages.
Learning to construct knowledge bases from the World Wide Web
The goal of the research described here is to automatically create a computer-understandable knowledge base whose content mirrors that of the World Wide Web. Several machine learning algorithms for this task are described, along with promising initial results from a prototype system that has created a knowledge base describing university people, courses, and research projects.
Domain-specific Web site identification: the CROSSMARC focused Web crawler
Techniques for identifying domain-specific web sites, implemented as part of the EC-funded R&D project CROSSMARC, are presented.
Ontology integration in a multilingual e-retail system
Inside CROSSMARC (a European research project supporting development of an agent-based multilingual information extraction system from web pages), an ontology architecture has been developed in order to organize the information provided by different resources in several languages.
BBN: Description of the SIFT System as Used for MUC-7
For MUC-7, BBN has for the first time fielded a fully-trained system for NE, TE, and TR; results are all the output of statistical language models trained on annotated data, rather than
C4.5: Programs for Machine Learning
A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.