Using online electronic newspapers in modern English-Language press corpora: Benefits and pitfalls

  • Tobias Rademann
  • Published 1998


Newspapers have always provided a welcome basis for linguistic analysis. Not only do considerable parts of major corpora such as the BNC, LOB or The Bank of English draw heavily on newspaper articles; in recent years, the number of smaller projects that employ newspaper-based corpora to investigate certain aspects of language (change) has risen considerably as well. Although newspapers obviously do not present a reliable sample of an entire language, there are several reasons for their popularity as a source for linguistic corpora: besides the fact that they are easily accessible, their most important advantage is that hardly any other domain offers such a broad range of linguistically distinctive varieties (Crystal 1994: 388), because they contain, among numerous other text types, leaders, essays, reviews, columns and even advertisements and cartoons. It is this significant variety within the genre itself that makes a collection of newspaper articles a much more representative sample of a given language than most others. In addition, newspapers can usually be classified according to social aspects (eg quality vs popular press) and regional aspects (eg UK vs US), thus enabling researchers to conduct synchronic studies of social and regional differences in language use. Furthermore, since at least the major newspapers tend to have very large target audiences, the language used in newspaper articles is often assumed to be characteristic of the period and society in which they are published. Thus diachronic comparisons, ie studies investigating how far a given linguistic feature has changed over time, typically rely on historical newspaper corpora containing selected editions from different periods. This means, conversely, that today’s newspapers present a valuable database for anyone interested in current language usage. And


