Elephant: Sequence Labeling for Word and Sentence Segmentation

@inproceedings{Evang2013ElephantSL,
  title={Elephant: Sequence Labeling for Word and Sentence Segmentation},
  author={Kilian Evang and Valerio Basile and Grzegorz Chrupala and Johan Bos},
  booktitle={EMNLP},
  year={2013}
}
Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules are language-specific. We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.
This paper has 34 citations.

Key Quantitative Results

  • We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.
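
To make the idea of segmentation as character-level sequence labeling concrete, here is a minimal sketch of the decoding step only: given one label per character, recover sentences and tokens. The four-label scheme used here (S, T, I, O) and the function name `decode_segmentation` are assumptions for illustration; this page does not spell out the label inventory, and the sketch does not include the supervised labeler or the unsupervised feature learning mentioned in the abstract.

```python
from typing import List


def decode_segmentation(text: str, labels: str) -> List[List[str]]:
    """Turn per-character labels into a list of sentences, each a list of tokens.

    Assumed (hypothetical) label scheme:
      S = first character of a sentence (and of its first token)
      T = first character of any other token
      I = character inside a token
      O = character outside any token (e.g. whitespace)
    """
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")

    sentences: List[List[str]] = []
    token = ""

    def flush() -> None:
        # Append the token collected so far to the current sentence.
        nonlocal token
        if token:
            if not sentences:
                sentences.append([])
            sentences[-1].append(token)
            token = ""

    for ch, lab in zip(text, labels):
        if lab in ("S", "T"):
            flush()                   # close the previous token, if any
            if lab == "S":
                sentences.append([])  # open a new sentence
            token = ch                # start a new token
        elif lab == "I":
            token += ch               # extend the current token
        elif lab == "O":
            flush()                   # characters labeled O are dropped
        else:
            raise ValueError(f"unknown label: {lab!r}")

    flush()
    return sentences


if __name__ == "__main__":
    text = "It didn't matter. Go on."
    labels = "SIOTIITIIOTIIIIITOSIOTIT"
    print(decode_segmentation(text, labels))
    # [['It', 'did', "n't", 'matter', '.'], ['Go', 'on', '.']]
```

Because token and sentence boundaries are both read off the same label sequence, a single labeler can handle cases such as splitting "didn't" into "did" and "n't" without hand-written, language-specific rules; only the training data changes between languages.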
