László Németh

The paper provides an overview of the open-source Hungarian language resources that the SzóSzablya 'WordSword' project is creating. An extensive crawl of the .hu domain yielded a raw dataset of over 18 million web pages. We discuss the methods used to detect and remove duplicate, low-quality, foreign-language, and mixed-language documents, and describe the resulting…