An Unsupervised Approach to Biography Production Using Wikipedia

@inproceedings{Biadsy2008AnUA,
  title={An Unsupervised Approach to Biography Production Using Wikipedia},
  author={Fadi Biadsy and Julia Hirschberg and Elena Filatova},
  booktitle={ACL},
  year={2008}
}
We describe an unsupervised approach to multi-document sentence-extraction based summarization for the task of producing biographies. We utilize Wikipedia to automatically construct a corpus of biographical sentences and TDT4 to construct a corpus of non-biographical sentences. We build a biographical-sentence classifier from these corpora and an SVM regression model for sentence ordering from the Wikipedia corpus. We evaluate our work on the DUC2004 evaluation data and with human judges… 
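The pipeline described in the abstract lends itself to a short sketch: a binary classifier separates biographical from non-biographical sentences, and a regression model scores each selected sentence for ordering. The code below is illustrative only, assuming scikit-learn, TF-IDF features, a logistic-regression stand-in for the paper's biographical-sentence classifier, toy stand-ins for the Wikipedia and TDT4 sentence corpora, and a normalized-position target for the SVR ordering model; the authors' actual features and classifier differ.

```python
# Illustrative sketch (not the authors' exact pipeline): train a binary
# biographical-sentence classifier from two sentence corpora and an SVR
# model that scores sentences for ordering. Corpus contents, feature
# choices, and the logistic-regression classifier are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

# Toy stand-ins for the Wikipedia (biographical) and TDT4 (non-biographical) corpora.
bio_sents = ["He was born in 1952 in Cairo.", "She graduated from Oxford in 1978."]
non_bio_sents = ["The storm caused flooding across the region.", "Markets fell sharply on Tuesday."]

X = bio_sents + non_bio_sents
y = [1] * len(bio_sents) + [0] * len(non_bio_sents)

# Biographical-sentence classifier: TF-IDF features + logistic regression.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Sentence-ordering model: SVR regressing a normalized position
# (0 = first, 1 = last), trained here on the biographical sentences only.
positions = [0.0, 1.0]  # toy normalized positions within their articles
order_model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), SVR(kernel="linear"))
order_model.fit(bio_sents, positions)

# At summarization time: keep sentences the classifier deems biographical,
# then order the selected sentences by predicted position.
candidates = ["He studied law at Harvard.", "Rain is expected tomorrow."]
selected = [s for s in candidates if clf.predict([s])[0] == 1]
summary = sorted(selected, key=lambda s: order_model.predict([s])[0])
print(summary)
```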

Citations

Recognizing Biographical Sections in Wikipedia
TLDR
This work investigates the task of recognizing biographical sections in Wikipedia articles about persons, models it as a sequence classification problem, and proposes a supervised setting in which the training data are acquired automatically.
Extraction of Biographical Information from Wikipedia Texts
TLDR
This paper proposes to segment documents into sequences of sentences and then classify each sentence as describing either a specific type of biographical fact or content not related to biographical data.
Biographical Semi-Supervised Relation Extraction Dataset
TLDR
Biographical, the first semi-supervised dataset for relation extraction (RE), is developed; its effectiveness is demonstrated by training a state-of-the-art neural model to classify relation pairs and evaluating it on a manually annotated gold-standard set.
Unsupervised Discovery of Biographical Structure from Text
TLDR
In a quantitative evaluation on the task of predicting a person's age for a given event, the generative model outperforms a strong linear regression baseline, as well as simpler variants of the model that ablate some features.
Learning Named Entity Recognition from Wikipedia
TLDR
Wikipedia is viable as a source of automatically annotated training corpora, which have wide domain coverage applicable to a broad range of NLP applications and can outperform manually annotated corpora on this cross-corpus evaluation task.
A classifier to determine which Wikipedia biographies will be accepted
TLDR
This paper presents and analyzes a set of simple indicators that can be used to predict which articles will eventually be accepted on Wikipedia, achieving high predictive performance.
Re-ranking Summaries Based on Cross-Document Information Extraction
TLDR
A method is described for automatically incorporating cross-document information extraction (IE) results into sentence ranking, which can significantly improve a high-performing multi-document summarization system.
Summarization- and learning-based approaches to information distillation
TLDR
This paper investigates two approaches that use shallow language processing for answering open-ended distillation queries, and examines the merit of the ROUGE metric, for its ability to evaluate redundancy, alongside the conventionally used F-measure for evaluating distillation systems.
TWAG: A Topic-Guided Wikipedia Abstract Generator
TLDR
A two-stage model, TWAG, is proposed that guides abstract generation with topical information; it outperforms various existing baselines and is capable of generating comprehensive abstracts.
Learning Simple Wikipedia: A Cogitation in Ascertaining Abecedarian Language
TLDR
The potential of Simple Wikipedia to assist automatic text simplification is investigated by building a statistical classification system that discriminates simple English from ordinary English; the system can also be applied as a tool to help writers craft simple text.
...
...

References

Showing 1-10 of 20 references
Multi-Document Biography Summarization
TLDR
A biography summarization system using sentence classification and ideas from information retrieval to generate multi-document biographies is described; it was among the top performers in DUC task 5 (short summaries focused by person questions).
Statistical Acquisition of Content Selection Rules for Natural Language Generation
TLDR
This paper presents a method to acquire content selection rules automatically from a corpus of text and associated semantics; the method is evaluated by comparing its output with information selected by human authors in unseen texts, where it was able to filter out half the input data set without loss of recall.
The Automatic Content Extraction (ACE) Program - Tasks, Data, and Evaluation
The objective of the ACE program is to develop technology to automatically infer from human language data the entities being mentioned, the relations among these entities that are directly expressed, and the events in which these entities participate.
The Automated Acquisition of Topic Signatures for Text Summarization
TLDR
A method for automatically training topic signatures (sets of related words, with associated weights, organized around head topics) is described and illustrated with signatures the authors created from 6,194 TREC collection texts over 4 selected topics.
Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization
TLDR
An effective knowledge-lean method for learning content models from unannotated documents is presented, utilizing a novel adaptation of algorithms for Hidden Markov Models; the models are applied to two complementary tasks: information ordering and extractive summarization.
References to Named Entities: a Corpus Study
TLDR
A corpus study is performed to derive a statistical model for the syntactic realization of referential expressions; interpretation of the probabilistic data helps to gain insight into how extractive summaries can be rewritten efficiently to produce more fluent and easy-to-read text.
Multi-document Summarization Using Support Vector Regression
TLDR
A Support Vector Regression (SVR) model is used to automatically combine features and score sentences in multi-document summarization systems, where various features are selected and combined into different feature sets to be tested.
Sentence Ordering in Multidocument Summarization
TLDR
An integrated strategy for ordering information is presented, combining constraints from chronological order of events and cohesion, derived from empirical observations based on experiments asking humans to order information.
Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics
TLDR
The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations, based on various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results (a minimal unigram-overlap sketch follows this reference list).
Tell Me What You Do and I’ll Tell You What You Are: Learning Occupation-Related Activities for Biographies
TLDR
This work uses the extracted information as features for a multi-class SVM classifier, which is then used to automatically identify the occupation of a previously unseen individual, and shows that the approach accurately identifies general and occupation-specific activities and assigns unseen individuals to the correct occupations.
...
...
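As a companion to the n-gram co-occurrence reference above, the snippet below sketches a unigram-overlap recall score in the spirit of ROUGE-1. It is not the official ROUGE toolkit; the whitespace tokenization and recall-only scoring are simplifying assumptions.

```python
# Minimal sketch of unigram co-occurrence scoring in the spirit of ROUGE-1
# recall: the fraction of reference unigrams that also appear in the
# candidate summary, with clipped counts.
from collections import Counter

def unigram_recall(candidate: str, reference: str) -> float:
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

print(unigram_recall("the cat sat on the mat", "the cat lay on a mat"))  # ~0.67
```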