Languages under the influence: Building a database of Uralic languages


Most Uralic languages lack systematically collected, consistently transcribed, and morphologically annotated text corpora. This paper summarizes the steps, preliminary results, and future directions of building a linguistic corpus of four Uralic languages: Tundra Nenets, Udmurt, Synya Khanty, and Surgut Khanty. It discusses the experience of building a corpus that contains both old and modern, and both written and oral, data samples, as well as the principles guiding data collection for languages with different levels of vitality and endangerment. The methodologies and challenges of data processing and the levels of linguistic annotation are also described in detail.

Cite this paper

@inproceedings{Simon2017LanguagesUT, title={Languages under the influence: Building a database of Uralic languages}, author={Eszter Simon and Nikolett Mus}, year={2017} }