EDGAR-CORPUS: Billions of Tokens Make The World Go Round

Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, Prodromos Malakasiotis
We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, which are WORD2VEC embeddings for the financial domain. We… 

FiNER: Financial Numeric Entity Recognition for XBRL Tagging

Subword fragmentation of numeric expressions is shown to harm BERT’s performance, allowing word-level BiLSTMs to perform better. Two simple and effective solutions are proposed that replace numeric expressions with pseudo-tokens reflecting original token shapes and numeric magnitudes.

FLUEnT: Financial Language Understandability Enhancement Toolkit

The FLUEnT toolkit consists of eight tools for tasks such as hypernym detection, numeral claim analysis, readability assessment, and sustainability assessment. It is open-source under the MIT license and openly accessible from Colab and HuggingFace.

An Empirical Survey on Long Document Summarization: Datasets, Models, and Metrics

This survey provides a comprehensive overview of the research on long document summarization and a systematic evaluation across the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics.

Next-Year Bankruptcy Prediction from Textual Data: Benchmark and Baselines

It is found that a lightweight bag-of-words model based on static in-domain word representations obtains surprisingly good results, especially when taking textual data from several years into account.

KPI-EDGAR: A Novel Dataset and Accompanying Metric for Relation Extraction from Financial Documents

A new way of measuring the success of relation extraction is proposed: it incorporates a word-level weighting scheme into the conventional F1 score to better model the inherently fuzzy borders of the entity pairs of a relation in this domain.

Financial misstatement detection: a realistic evaluation

The evaluation process for the task of detecting financial reports with a high risk of containing a misstatement is examined, and a new, realistic evaluation framework is proposed that focuses on the misstatement class and its rarity.

Graph-based Keyword Planning for Legal Clause Generation from Topics

This paper proposes a controllable graph-based mechanism that can generate legal clauses using only the topic or type of the legal clauses, and illustrates the effectiveness of this two-stage approach on a broad set of clause topics in contracts.



CoFiF: A Corpus of Financial Reports in French Language

CoFiF, the first corpus comprising company reports in the French language, is introduced, containing over 188 million tokens in 2655 reports, covering reference documents, annual, semestrial and trimestrial reports on the 60 largest French companies listed in France’s main stock indices CAC40 and CAC Next 20.

FinSim-3: The 3rd Shared Task on Learning Semantic Similarities for the Financial Domain

FinSim-3 is the third edition of the FinSim shared task on Learning Semantic Similarities for the Financial Domain, held online in conjunction with IJCAI 2021 as part of the FinNLP-2021 workshop.

Software Framework for Topic Modelling with Large Corpora

This work describes a Natural Language Processing software framework based on the idea of document streaming, i.e. processing corpora one document at a time in a memory-independent fashion. It implements several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size.

Discovering Finance Keywords via Continuous-Space Language Models

The continuous bag-of-words (CBOW) model is applied to the textual information in 10-K financial reports to discover new finance keywords and is effective for discovering predictability keywords for post-event volatility, stock volatility, abnormal trading volume, and excess return predictions.

A Corpus of Corporate Annual and Social Responsibility Reports: 280 Million Tokens of Balanced Organizational Writing

We introduce JOCO, a novel text corpus for NLP analytics in the field of economics, business and management. This corpus is composed of corporate annual and social responsibility reports of the top

WWW'18 Open Challenge: Financial Opinion Mining and Question Answering

This challenge focuses on advancing the state-of-the-art of aspect-based sentiment analysis and opinion-based Question Answering for the financial domain.

Deal or No Deal: Predicting Mergers and Acquisitions at Scale

This paper explores what can be learned about M&A activity from a firm’s annual Form 10-K SEC filing and trains a classifier to predict acquirers and targets, which is used to forecast the most likely M&As of 2019.

Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed, and these vectors are shown to provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

Distributed Representations of Words and Phrases and their Compositionality

This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.

Predicting Risk from Financial Reports with Regression

This work applies well-known regression techniques to a large corpus of freely available financial reports, constructing regression models of volatility for the period following a report; the models rival past volatility in predicting the target variable.