Proper Name Extraction from Non-Journalistic Texts

@inproceedings{Poibeau2000ProperNE,
  title={Proper Name Extraction from Non-Journalistic Texts},
  author={T. Poibeau and Leila Kosseim},
  booktitle={The Clinician},
  year={2000}
}
This paper discusses the influence of the corpus on the automatic identification of proper names in texts. Techniques developed for the newswire genre are generally not sufficient to deal with larger corpora containing texts that do not follow strict writing constraints (for example, e-mail messages, transcriptions of oral conversations, etc.). After a brief review of the research performed on news texts, we present some of the problems involved in the analysis of two different corpora: e-mails…

MODERN STATISTICAL AND LINGUISTIC APPROACHES TO PROCESSING TEXTS IN NATURAL LANGUAGES

The aim of this paper is to provide an overview of modern approaches to text processing, using as examples the tasks of named entity recognition and the identification of relationships between entities.

Unsupervised Extraction of Keywords from News Archives

A comparison of four unsupervised algorithms for automatically acquiring the set of keywords that best characterise a particular multimedia archive, the Belga News Archive, shows that the most successful algorithm is TextRank, derived from Google's PageRank.

Names, Right or Wrong: Named Entities in an OCRed Historical Finnish Newspaper Collection

Evaluation results for NER on data from the digitized Finnish historical newspaper collection Digi are reported, and FiNER, a rule-based tagger for Finnish provided by the FIN-CLARIN consortium, is evaluated.

“A Novel of Character”: Towards the Automatic Annotation of Characters in a Large Corpus of French Novels

It is shown that the automatic annotation of large literary corpora makes it possible to check whether traditional classifications exhibit specific structural patterns that could be identified automatically.

A Method for Proper Noun Extraction in Kurdish

An application that recognizes Kurdish person names, based on an architecture comprising a number of name lists, a set of rules, and a set of processes, can help advance the study of Information Retrieval (IR) in Kurdish and can also be used in Kurdish machine translation.

Old Content and Modern Tools - Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

The first large-scale trials and evaluation of NER on data from the digitized Finnish historical newspaper collection Digi are reported; these are the first published large-scale NER results for a historical Finnish OCRed newspaper collection.

Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

First trials and evaluation of NER on data from the digitized Finnish historical newspaper collection Digi show that at best about half of the named entities can be recognized, even in quite erroneous OCRed text.

Kalpa Publications in Computing

This paper describes some of the main ideas behind a method for associating locations with geographical data while removing possible confusion between entities with the same name, and describes a research proposal focusing on ambiguity detection.

Name identification and extraction with formal concept analysis

  • K. Taghva
  • Computer Science
    Int. J. Mach. Learn. Cybern.
  • 2017
This paper describes how FCA identifies and extracts personal names as units of thought, similar to the decoding of text sequences by the Viterbi algorithm as used with Hidden Markov Models.

Name identification and extraction with formal concept analysis

  • K. Taghva
  • Computer Science
    International Journal of Machine Learning and Cybernetics
  • 2016
This paper describes how FCA identifies and extracts personal names as units of thought, similar to the decoding of text sequences by the Viterbi algorithm as used with Hidden Markov Models.

NAMED ENTITY EXTRACTION FROM SPEECH

A hidden Markov model is used to extract information from broadcast news, with the encouraging result that a language-independent, trainable information extraction algorithm degraded on speech input by at most the word error rate of the recognizer.

Using Collocation Statistics in Information Extraction

The main objective in participating in MUC-7 is to investigate and experiment with the use of collocation statistics in information extraction, where collocation statistics are the frequency counts of the collocational relations extracted from a parsed corpus.

Named Entity Extraction from Broadcast News

This paper explores the effects of word error rate, loss of textual clues, amount of training data, changes in guidelines, and out-of-vocabulary errors in the context of the Hub4e-IE evaluation.

Locating Noun Phrases with Finite State Transducers

We present a method for constructing, maintaining and consulting a database of proper nouns. We describe noun phrases composed of a proper noun and/or a description of a human occupation. They are

Combining words and prosody for information extraction from speech

In experiments on the Broadcast News corpus, it is found that prosodic cues alone allow sentence and topic segmentation that is at least as good as word-based methods alone, and that combining both types of cues gives significant wins.

FASTUS: A Finite-state Processor for Information Extraction from Real-world Text

FASTUS has been evaluated on several blind tests that demonstrate that state-of-the-art performance on information-extraction tasks is obtainable with surprisingly little computational effort.

exibum : Un systeme experimental d'extraction d'information bilingue

The rapid results obtained through this experiment demonstrate the great advantage of system re-use in this domain, and leave us optimistic for the future development of multilingual information extraction systems.

The context of oral and written language: A framework for mode and medium switching

ABSTRACT This article demonstrates that our descriptions of orality and literacy – from the traditional dichotomy to the more recent continuum – are inadequate, largely because they are grounded in

Electric language : A new variety of English

The authors analyze the lexical and grammatical features of a large corpus of Computer-Mediated Communication (CMC) messages sent to an electronic bulletin board system at

MITRE: description of the Alembic system used for MUC-6

As with several other veteran MUC participants, MITRE's Alembic system has undergone a major transformation in the past two years. The genesis of this transformation occurred during a dinner