The PAISÀ Corpus of Italian Web Texts
TLDR
This paper provides an overview of the PAISÀ corpus of Italian web texts and an introductory description of the motivation, procedures and facilities for its creation and delivery.
  • 66
  • 9
Harvesting Relations from the Web - Quantifiying the Impact of Filtering Functions
TLDR
In this paper, we systematically compare different filtering functions proposed across the literature with respect to seven datasets.
  • 43
  • 3
KoKo: an L1 Learner Corpus for German
TLDR
We introduce the KoKo corpus, a collection of German L1 learner texts annotated with learner errors, along with the methods and tools used in its construction and evaluation.
  • 18
  • 3
A Report on the 2020 VUA and TOEFL Metaphor Detection Shared Task
TLDR
We report on the shared task on metaphor identification on the VU Amsterdam Metaphor Corpus and on a subset of the TOEFL Native Language Identification Corpus.
  • 8
  • 2
FIASCO: Filtering the Internet by Automatic Subtree Classification, Osnabrück
TLDR
The FIASCO system implements a machine-learning approach for the automatic removal of boilerplate (navigation bars, link lists, page headers and footers) from Web pages in order to make them available as a clean and useful corpus for linguistic purposes.
  • 9
  • 2
Anaphoric Annotation of Wikipedia and Blogs in the Live Memories Corpus
TLDR
The Live Memories Corpus contains texts from the Italian Wikipedia about the region Trentino/Süd Tirol and from blog sites with users’ comments.
  • 39
  • 1
Using Language Learner Data for Metaphor Detection
TLDR
This article describes the system that participated in the shared task on metaphor detection on the Vrije University Amsterdam Metaphor Corpus (VUA).
  • 12
  • 1
Rapid Adaptation of NE Resolvers for Humanities Domains using Active Annotation
TLDR
The entities mentioned in collections of scholarly articles in the Humanities belong to different types from those familiar from news corpora, hence new resources need to be annotated to create supervised taggers for tasks such as NE extraction.
  • 6
  • 1
Challenges of building a CMC corpus for analyzing writer's style by age: The DiDi project
TLDR
This paper introduces the project DiDi, in which we collect and analyze German data of computer-mediated communication (CMC) written by internet users from the Italian province of Bolzano – South Tyrol.
  • 4
  • 1
Structure-Preserving Pipelines for Digital Libraries
TLDR
We argue that identifying the structure of documents is essential in digital library and other types of applications, and show that it is relatively straightforward to extend existing pipelines to achieve ones in which the structure is preserved.
  • 9