Automatic Collecting of Text Data for Cantonese Language Modeling


For many minority languages, it is hard to collect the corpora needed to train good language models. Cantonese, one of the most widely spoken Chinese dialects, is one such language: it lacks the text material needed for language model training. This is a major obstacle to Cantonese language processing. Unlike many other languages, there are great…


