Learn More
Previous comparisons of document and query translation suffered difficulty due to differing quality of machine translation in these two opposite directions. We avoid this difficulty by training identical statistical translation models for both translation directions using the same training data. We investigate information retrieval between En-glish and(More)
This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation and it uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources. No parallel text is needed after the training phase. Mono-lingual, unannotated text can be(More)
We investigate the effect of speech-recognition errors on a system for the unsupervised, nearly synchronous clustering of broadcast news stories, using the TDT (Topic Detection and Tracking) Corpora. Two questions are addressed: (1) Are speech recognition errors detrimental to the performance of the system? (2) Can a background collection of contemporaneous(More)
Models of word alignment built as sequences of links have limited expressive power, but are easy to decode. Word aligners that model the alignment matrix can express arbitrary alignments , but are difficult to decode. We propose an alignment matrix model as a correction algorithm to an underlying sequence-based aligner. Then a greedy decoding algorithm(More)
We investigate important differences between two styles of document clustering in the context of Topic Detection and Tracking. Converting a Topic Detection system into a Topic Tracking system exposes fundamental differences between these two tasks that are important to consider in both the design and the evaluation of TDT systems. We also identify features(More)
In this paper we present algorithms for story segmenta-tion, topic detection, and topic tracking. The algorithms use a combination of machine learning, statistical natural language processing and information retrieval techniques. The story segmentation algorithm is a two stage algorithm that uses a decision tree based probabilistic model in the rst stage(More)