• Corpus ID: 237303821

A Statutory Article Retrieval Dataset in French

Antoine Louis, Gerasimos Spanakis, Gijs van Dijck
Statutory article retrieval is the task of automatically retrieving law articles relevant to a legal question. While recent advances in natural language processing have sparked considerable interest in many legal tasks, statutory article retrieval remains primarily untouched due to the scarcity of large-scale and high-quality annotated datasets. To address this bottleneck, we introduce the Belgian Statutory Article Retrieval Dataset (BSARD), which consists of 1,100+ French native legal… 

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
The Legal General Language Understanding Evaluation (LexGLUE) benchmark is introduced, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. Several generic and legal-oriented models are also provided and evaluated, demonstrating that the latter consistently offer performance improvements across multiple tasks.


COLIEE 2020: Methods for Legal Document Retrieval and Entailment
The approaches, the official evaluation, and analysis on the data and submission results of the 7th Competition on Legal Information Extraction and Entailment are described.
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
Approximate nearest neighbor Negative Contrastive Estimation (ANCE) is presented, a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is updated in parallel with the learning process to select more realistic negative training instances.
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
This work extensively analyzes different retrieval models and provides several suggestions that it believes may be useful for future work, finding that performing well consistently across all datasets is challenging.
CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review
It is found that Transformer models have nascent performance, but that this performance is strongly influenced by model design and training dataset size, so there is still substantial room for improvement.
Datasets: A Community Library for Natural Language Processing
After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks.
A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering
To investigate the performance of natural language understanding approaches on statutory reasoning, a dataset is introduced together with a legal-domain text corpus. Straightforward application of machine reading models exhibits low out-of-the-box performance on the questions, whether or not the models have been fine-tuned to the legal domain.
CamemBERT: a Tasty French Language Model
This paper investigates the feasibility of training monolingual Transformer-based language models for languages other than English, taking French as an example, and evaluates the resulting models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks.
Contract Discovery: Dataset and a Few-shot Semantic Retrieval Challenge with Competitive Baselines
It is shown that state-of-the-art pretrained encoders fail to provide satisfactory results on the task proposed, and Language Model-based solutions perform better, especially when unsupervised fine-tuning is applied.
Dense Passage Retrieval for Open-Domain Question Answering
This work shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.
ETC: Encoding Long and Structured Inputs in Transformers
A new Transformer architecture, Extended Transformer Construction (ETC), is presented that addresses two key challenges of standard Transformer architectures, namely scaling input length and encoding structured inputs.