Evaluating Text Segmentation

Abstract

This thesis investigates the evaluation of automatic and manual text segmentation. Text segmentation is the process of placing boundaries within text to create segments according to some task-dependent criterion. An example of text segmentation is topical segmentation, which aims to segment a text according to the subjective definition of what constitutes a topic. A number of automatic segmenters have been created to perform this task, and the question that this thesis answers is how to select the best automatic segmenter for such a task. This requires choosing an appropriate segmentation evaluation metric, confirming the reliability of a manual solution, and then finally employing an evaluation methodology that can select the automatic segmenter that best approximates human performance. A variety of comparison methods and metrics exist for comparing segmentations (e.g., WindowDiff, Pk), and all save a few are able to award partial credit for nearly missing a boundary. Those comparison methods that can award partial credit unfortunately lack consistency, symmetricity, intuition, and a host of other desirable qualities. This work proposes a new comparison method named boundary similarity (B) which is based upon a new minimal boundary edit distance to compare two segmentations. Near misses are frequent, even among manual segmenters (as is exemplified by the low inter-coder agreement reported by many segmentation studies). This work adapts some inter-coder agreement coefficients to award partial credit for near misses using the new metric proposed herein, B. The methodologies employed by many works introducing automatic segmenters evaluate them simply in terms of a comparison of their output to one manual segmentation of a text, and often only by presenting nothing other than a series of mean performance values (along with no standard deviation, standard error, or little if any statistical hypothesis testing). This work asserts that one segmentation of a text cannot constitute a “true” segmentation; specifically, one manual segmentation is simply one sample of the population of all possible segmentations of a text and of that subset of desirable segmentations. This work further asserts that an adapted inter-coder agreement statistics proposed herein should be used to determine the reproducibility and reliability of a coding scheme and set of manual codings, and then statistical hypothesis testing using the specific comparison methods and methodologies demonstrated herein should be used to select the best automatic segmenter. This work proposes new segmentation evaluation metrics, adapted inter-coder agreement coefficients, and methodologies. Most important, this work experimentally compares

105 Figures and Tables

Cite this paper

@inproceedings{Fournier2013EvaluatingTS, title={Evaluating Text Segmentation}, author={Christopher Fournier}, year={2013} }