Corpus ID: 14311013

Automatically Annotated Turkish Corpus for Named Entity Recognition and Text Categorization using Large-Scale Gazetteers

@article{Sahin2017AutomaticallyAT,
  title={Automatically Annotated Turkish Corpus for Named Entity Recognition and Text Categorization using Large-Scale Gazetteers},
  author={H. Bahadir Sahin and Caglar Tirkaz and Eray Yildiz and Mustafa Tolga Eren and Omer Ozan Sonmez},
  journal={ArXiv},
  year={2017},
  volume={abs/1702.02363}
}
  • H. Bahadir Sahin, Caglar Tirkaz, Eray Yildiz, Mustafa Tolga Eren, Omer Ozan Sonmez
  • Published in ArXiv 2017
  • Computer Science
  • The Turkish Wikipedia Named-Entity Recognition and Text Categorization (TWNERTC) dataset is a collection of automatically categorized and annotated sentences obtained from Wikipedia. We constructed large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The constructed gazetteers contain approximately 300K entities with thousands of fine-grained entity types under 77 different domains. Since automated…
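
As a rough illustration of how a gazetteer can drive automatic annotation, the sketch below tags tokens by greedy longest-match lookup. It is not the authors' pipeline: the `GAZETTEER` entries, the label names, and the `annotate` helper are all hypothetical, and the BIO tagging scheme is an assumption made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical toy gazetteer: lower-cased token tuples -> (domain, entity type).
# Entries and label names are illustrative only, not taken from TWNERTC.
GAZETTEER: Dict[Tuple[str, ...], Tuple[str, str]] = {
    ("mustafa", "kemal", "atatürk"): ("people", "person"),
    ("ankara",): ("location", "city"),
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def annotate(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag tokens with BIO labels using greedy longest-match gazetteer lookup."""
    tags: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate span first, then shrink.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            # str.lower() is not Turkish-locale aware (I/İ); fine for this toy data.
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                match_len = n
                break
        if match_len == 0:
            tags.append((tokens[i], "O"))
            i += 1
            continue
        _, entity_type = GAZETTEER[key]
        for offset in range(match_len):
            prefix = "B" if offset == 0 else "I"
            tags.append((tokens[i + offset], f"{prefix}-{entity_type}"))
        i += match_len
    return tags


tags = annotate(["Mustafa", "Kemal", "Atatürk", "Ankara", "geldi"])
```

A real gazetteer of roughly 300K entities would need disambiguation between overlapping entries and proper Turkish case folding, which this sketch leaves out.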

    Citations

    Publications citing this paper.

    Semi-Automatic Formatting of Spelled Out Numbers

    Document classification of SuDer Turkish news corpora
