Integrated multi-strategic Web document pre-processing for sentence and word boundary detection

@article{Shim2002IntegratedMW,
  title={Integrated multi-strategic Web document pre-processing for sentence and word boundary detection},
  author={Junhyeok Shim and Dongseok Kim and Jeongwon Cha and Gary Geunbae Lee and Jungyun Seo},
  journal={Inf. Process. Manage.},
  year={2002},
  volume={38},
  pages={509--527}
}
Most work in NLP requires texts that have already been segmented into sentences and words. Segmenting a text into sentences and words, however, is a complex task because many punctuation marks and spaces are ambiguous. Furthermore, Web texts such as HTML documents are harder to turn into well-refined, segmented text because they are written in a freer style, with many sentence boundary and spacing errors. This paper introduces a multi-strategic…