• Corpus ID: 15881557

Cross-lingual Information Management from the Web

Vangelis Karkaletsis, Constantine D. Spyropoulos
This paper presents a methodology for cross-lingual information management from the Web. The methodology covers the full process: identifying Web sites of interest (i.e. sites that contain Web pages relevant to a specific domain) in various languages, locating the domain-specific Web pages, extracting specific information from those pages, and presenting it to the end user. The methodology has been implemented and evaluated in the context of the IST project CROSSMARC… 

Annotating Web pages for the needs of Web Information Extraction Applications
This paper outlines our approach to the creation of annotated corpora for the purposes of Web Information Extraction, and presents the Web Annotation tool. This tool enables the annotation of Web
Learning to Extract Text-Based Information from the World Wide Web
Webfoot, a preprocessor that parses web pages into logically coherent segments based on page layout cues, is introduced. Its output is passed on to CRYSTAL, an NLP system that learns text extraction rules from examples.
Ontology integration in a multilingual e-retail system
Inside CROSSMARC (a European research project supporting development of an agent-based multilingual information extraction system from web pages), an ontology architecture has been developed in order to organize the information provided by different resources in several languages.
STALKER: Learning Extraction Rules for Semistructured, Web-based Information Sources
A wrapper-induction algorithm that generates extraction rules for Web-based information sources. The rules are expressed as simple landmark grammars, a class of landmark automata that is more expressive than existing extraction languages.
Learning to construct knowledge bases from the World Wide Web
The goal of the research described here is to automatically create a computer-understandable knowledge base whose content mirrors that of the World Wide Web. Several machine learning algorithms for this task are described, along with promising initial results from a prototype system that has created a knowledge base describing university people, courses, and research projects.
Wrapper Induction for Information Extraction
This work introduces wrapper induction, a method for automatically constructing wrappers, and identifies HLRT, a wrapper class that is efficiently learnable, yet expressive enough to handle 48% of a recently surveyed sample of Internet resources.
Learning Information Extraction Rules for Semi-Structured and Free Text
WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences, and can also handle extraction from free text such as news stories.
Multilingual XML-Based Named Entity Recognition for E-Retail Domains
XML is used as the common exchange format and the monolingual NERC components use a combination of rule-based and machine-learning techniques to process web pages which contain heavily structured data where text is intermingled with HTML and other code.
Boosted Wrapper Induction
This work describes an algorithm that learns simple, low-coverage wrapper-like extraction patterns, which it then applies to conventional information extraction problems using boosting, resulting in BWI, a trainable information extraction system with a strong precision bias and F1 performance better than state-of-the-art techniques in many domains.