Exploring Diachronic Lexical Semantics with JeSemE

Johannes Hellrich and Udo Hahn
Recent advances in distributional semantics, combined with the availability of large-scale diachronic corpora, offer new research avenues for the Digital Humanities. JeSemE, the Jena Semantic Explorer, helps a non-technical audience investigate diachronic semantic topics. JeSemE runs as a website with query options and interactive visualizations of results, and also offers a REST API for access to the underlying diachronic data sets.

JeSemE: Interleaving Semantics and Emotions in a Web Service for the Exploration of Language Change Phenomena

We here introduce a substantially extended version of JeSemE, an interactive website for visually exploring computationally derived time-variant information on word meanings and lexical emotions

Diachronic Analysis of Entities by Exploiting Wikipedia Page revisions

This paper introduces a new resource for performing the diachronic analysis of named entities built upon Wikipedia page revisions, by analysing the whole history of Wikipedia internal links.

Word embeddings: reliability & semantic change

The JeSemE website was created to make word-embedding-based diachronic research more accessible, and the applicability of these methods is examined through the historical understanding of electricity as well as words connected to Romanticism.

Some steps towards the generation of diachronic WordNets

It is shown that starting from simple lists of word pairs it is possible to build diachronic hierarchical semantic spaces which allow us to model a process towards specialization for selected scientific fields.

DUKweb, diachronic word representations from the UK Web Archive corpus

DUKweb, a set of large-scale resources designed for the diachronic analysis of contemporary English, is presented and the reuse potential of DUKweb and its quality standards are shown via a case study on word meaning change detection.

DiaSense at SemEval-2020 Task 1: Modeling Sense Change via Pre-trained BERT Embeddings

DiaSense, a system developed for Task 1 ‘Unsupervised Lexical Semantic Change Detection’ of SemEval 2020, uses contextualized word embeddings to model word sense changes.

A Method to Automatically Identify Diachronic Variation in Collocations.

A novel method is introduced for tracking collocational variation in diachronic corpora; it can identify several changes undergone by these phraseological combinations and propose alternative solutions found in later periods.

A Methodology for Building a Diachronic Dataset of Semantic Shifts and its Application to QC-FR-Diac-V1.0, a Free Reference for French

This work introduces a methodology for the construction of a reference dataset for the evaluation of semantic shift detection, that is, a list of words where the authors know for sure whether they present a word meaning change over a period of interest.

Deep Neural Models of Semantic Shift

This paper proposes a deep neural network diachronic distributional model that represents time as a continuous variable and models a word’s usage as a function of time, and creates a novel synthetic task that measures a model's ability to capture the semantic trajectory.

The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study

A new, extended version of the Royal Society Corpus is presented, a diachronic corpus of scientific English now covering 300+ years of scientific writing (1665–1996), and its value for linguistic and humanistic study is elaborated on.

Text: now in 2D! A framework for lexical expansion with contextual similarity

A new metaphor of two-dimensional text for data-driven semantic modeling of natural language is proposed, which provides an entirely new angle on the representation of text: not only syntagmatic

DiaCollo: On the trail of diachronic collocations

This paper presents DiaCollo, a software tool developed in the context of CLARIN for the efficient extraction, comparison, and interactive visualization of collocations from a diachronic text corpus.

Syntactic Annotations for the Google Books NGram Corpus

A new edition of the Google Books Ngram Corpus is presented, which describes how often words and phrases were used over a period of five centuries in eight languages; it will facilitate the study of linguistic trends, especially those related to the evolution of syntax.

Making Google Books n-grams useful for a wide range of research on language change

This paper discusses an alternative “advanced” architecture and interface for these n-grams, which allows for a wide range of research on lexical, phraseological, syntactic, and semantic changes in English, in ways that would not be possible with the standard interface.

You Can't Beat Frequency (Unless You Use Linguistic Knowledge) - A Qualitative Evaluation of Association Measures for Collocation and Term Extraction

It is shown that purely statistics-based measures reveal virtually no difference compared with frequency of occurrence counts, while linguistically more informed metrics do reveal such a marked difference.

Extracting semantic representations from word co-occurrence statistics: A computational study

This article presents a systematic exploration of the principal computational possibilities for formulating and validating representations of word meanings from word co-occurrence statistics and finds that, once the best procedures are identified, a very simple approach is surprisingly successful and robust over a range of psychologically relevant evaluation measures.

Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change

A robust methodology for quantifying semantic change is developed by evaluating word embeddings against known historical changes and it is revealed that words that are more polysemous have higher rates of semantic change.

The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets - Reconstructing the composition of the German corpus in times of WWII

It is shown that, without proper metadata, it is unclear whether the results actually reflect any kind of censorship at all, which implies that observed changes in this period of time can only be linked directly to World War II to a certain extent.

Bad Company—Neighborhoods in Neural Embedding Spaces Considered Harmful

The overall low reliability of (neural) word embeddings casts doubt on the suitability of word neighborhoods in embedding spaces as a basis for qualitative conclusions on synchronic and diachronic lexico-semantic matters, an issue currently high on the agenda of Digital Humanities.

Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution

Overall, the findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.