Focused web crawling in the acquisition of comparable corpora

Tuomas Talvensaari, Ari Pirkola, Kalervo Järvelin, Martti Juhola, Jorma Laurikkala
Information Retrieval
Cross-Language Information Retrieval (CLIR) resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora in the genomics domain. The acquired corpora were used to statistically translate domain-specific words. The same words…




