Previous comparisons of document and query translation suffered difficulty due to differing quality of machine translation in these two opposite directions. We avoid this difficulty by training identical statistical translation models for both translation directions using the same training data. We investigate information retrieval between En-glish and… (More)
This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation and it uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources. No parallel text is needed after the training phase. Mono-lingual, unannotated text can be… (More)
We investigate the effect of speech-recognition errors on a system for the unsupervised, nearly synchronous clustering of broadcast news stories, using the TDT (Topic Detection and Tracking) Corpora. Two questions are addressed: (1) Are speech recognition errors detrimental to the performance of the system? (2) Can a background collection of contemporaneous… (More)
We investigate important differences between two styles of document clustering in the context of Topic Detection and Tracking. Converting a Topic Detection system into a Topic Tracking system exposes fundamental differences between these two tasks that are important to consider in both the design and the evaluation of TDT systems. We also identify features… (More)
Our English-Chinese cross-language IR system is trained from parallel corpora; we investigate its performance as a function of training corpus size for three different training corpora. We find that the performance of the system as trained on the three parallel corpora can be related by a simple measure, namely the out-of-vocabulary rate of query words.