This paper reports on the procedure and learning models we adopted for the 'PAN 2011 Author Identification' challenge targeting real-world email messages. The novelty of our approach lies in a design which combines shallow characteristics of the emails (word and trigram frequencies) with a large number of ad hoc linguistically rich features addressing…
Query difficulty can be linked to a number of causes. Some of these causes relate to the query expression itself and can therefore be detected through a linguistic analysis of the query text. Using 16 different linguistic features, automatically computed on TREC queries, we looked for significant correlations between these features and the average…
We describe the Annodis corpus of discourse structures for French. The corpus combines two perspectives on discourse across a variety of textual genres: a bottom-up approach and a top-down approach. The bottom-up view incrementally builds a structure from elementary discourse units, while the top-down view focuses on the selective annotation of multi-level…
Most information retrieval systems transform users' natural language queries into bags of words that are matched to documents, which are also represented as bags of words. In this process, the richness of the query is lost. In this paper we show that linguistic features of a query are good indicators for predicting a system's failure to answer it. The experiments…
This article presents the first results of a large-scale corpus annotation campaign carried out as part of the ANNODIS project. These results concern the top-down part of the annotation setup, and more specifically enumerative structures. We focus on enumerative structuring as a basic strategy of…
Information retrieval systems aim at answering users' needs. Their performance is evaluated using benchmark collections such as the TREC (Text REtrieval Conference) collections. Evaluation is generally global, computing average results over a set of fifty queries. In doing so, the added value of the different techniques…