J. Scott McCarley

IBM's story segmentation uses a combination of decision tree and maximum entropy models. They take a variety of lexical, prosodic, semantic, and structural features as their inputs. Both types of models are source-specific, and we substantially lower Cseg by combining them. IBM's topic detection system introduces a minimal hierarchy into the clustering: each...
Previous comparisons of document and query translation suffered difficulty due to differing quality of machine translation in these two opposite directions. We avoid this difficulty by training identical statistical translation models for both translation directions using the same training data. We investigate information retrieval between English and...
This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation and it uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources. No parallel text is needed after the training phase. Monolingual, unannotated text can be...
We investigate the effect of speech-recognition errors on a system for the unsupervised, nearly synchronous clustering of broadcast news stories, using the TDT (Topic Detection and Tracking) Corpora. Two questions are addressed: (1) Are speech recognition errors detrimental to the performance of the system? (2) Can a background collection of contemporaneous...
Models of word alignment built as sequences of links have limited expressive power, but are easy to decode. Word aligners that model the alignment matrix can express arbitrary alignments, but are difficult to decode. We propose an alignment matrix model as a correction algorithm to an underlying sequence-based aligner. Then a greedy decoding algorithm...
In this paper we present algorithms for story segmentation and topic detection. Both algorithms are online algorithms and use a combination of machine learning, statistical natural language processing, and information retrieval techniques. The story segmentation algorithm is a two-stage algorithm that uses a decision-tree-based probabilistic model in the first...