A Statistical Study of the WPT-03 Corpus

Abstract

This report presents a statistical study of WPT-03, a text corpus built from the pages of the “Portuguese Web” collected in the repository of the tumba! search engine. We give a statistical analysis of the textual contents available in the Portuguese Web, including size distributions, the language of the pages, and the terms they contain.

DOI: 10.1007/978-3-540-30228-5_34

13 Figures and Tables

Cite this paper

@inproceedings{Martins2004ASS,
  title={A Statistical Study of the WPT-03 Corpus},
  author={Bruno Martins and M{\'a}rio J. Silva},
  booktitle={EsTAL},
  year={2004}
}