Multilingual Clustering of Streaming News

Sebastião Miranda, Arturs Znotins, Shay B. Cohen, Guntis Barzdins
Clustering news across languages enables efficient media monitoring by aggregating articles from multilingual sources into coherent stories. Doing so in an online setting allows scalable processing of massive news streams. To this end, we describe a novel method for clustering an incoming stream of multilingual documents into monolingual and crosslingual clusters. Unlike typical clustering approaches that report results on datasets with a small and known number of labels, we tackle the problem… 
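The abstract describes assigning each incoming document to an existing story cluster or opening a new one, online. A minimal greedy sketch of that setting, with cosine similarity over normalised document vectors and a fixed threshold as illustrative assumptions (the paper's actual model is richer):

```python
import numpy as np

def online_cluster(docs, threshold=0.5):
    """Greedy online clustering sketch: each incoming document vector joins
    the most similar cluster centroid, or starts a new cluster when no
    centroid is similar enough. The cosine similarity and fixed threshold
    are illustrative assumptions, not the paper's exact model."""
    centroids, sizes, labels = [], [], []
    for v in docs:
        v = v / np.linalg.norm(v)
        sims = [float(c @ v) for c in centroids]
        if sims and max(sims) >= threshold:
            best = int(np.argmax(sims))
            # update the running-mean centroid and renormalise it
            c = centroids[best] * sizes[best] + v
            centroids[best] = c / np.linalg.norm(c)
            sizes[best] += 1
            labels.append(best)
        else:
            centroids.append(v)
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels
```

Each document is processed once and never revisited, which is what makes the approach scale to a continuous stream.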


Simplifying Multilingual News Clustering Through Projection From a Shared Space

This work models the clustering process as a set of linear classifiers that aggregate similar documents, corrects closely related multilingual clusters through online merging, and achieves state-of-the-art results on a multilingual news stream clustering dataset.

Batch Clustering for Multilingual News Streaming

This work introduces a novel "replaying" strategy to link monolingual local topics into stories and proposes a new fine-tuned multilingual embedding, built with SBERT, to create crosslingual stories.

Dense vs. Sparse Representations for News Stream Clustering

The evaluation results on a standard dataset show a sizeable improvement over the state of the art, both for the standard F1 and for a BCubed variant of it, which the authors argue is more suitable for the task.

Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings

It is shown that the use of a suitable fine-tuning objective and external knowledge in pre-trained transformer models yields significant improvements in the effectiveness of contextual embeddings for clustering.

NewsEmbed: Modeling News through Pre-trained Document Representations

A novel approach is proposed to mine semantically relevant fresh documents, and their topic labels, with little human supervision, and a multitask model called NewsEmbed is designed that alternately trains a contrastive-learning objective and a multi-label classification objective to derive a universal document encoder.

Topic Detection and Tracking with Time-Aware Document Embeddings

This work designs a neural method that fuses temporal and textual information into a single representation of news documents for event detection, and conducts ablation studies on the time representation and fusion strategies, showing that the proposed model outperforms the alternatives.
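The paper learns its time representation and fusion; as a rough illustration of the idea, a document's timestamp can be given a fixed sinusoidal encoding (in the style of transformer positional encodings) and fused with the text embedding by simple concatenation. Both choices here are assumptions for the sketch, not the paper's method:

```python
import numpy as np

def time_encoding(days, dim=8, max_period=365.0):
    """Sinusoidal encoding of a timestamp measured in days. Low-frequency
    components keep documents published far apart distinguishable."""
    i = np.arange(dim // 2)
    freqs = 1.0 / (max_period ** (2 * i / dim))
    ang = days * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

def fuse(text_vec, days):
    # simplest possible fusion: concatenate text and time features
    return np.concatenate([text_vec, time_encoding(days)])
```

Documents published close together get nearby time encodings, so a downstream clustering step can use temporal proximity alongside textual similarity.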

Unsupervised Key Event Detection from Massive Text Corpora

EvMine, an unsupervised key event detection framework, extracts temporally frequent peak phrases using a novel ttf-itf score and merges them into event-indicative feature sets by detecting communities in a peak phrase graph that captures document co-occurrence, semantic similarity, and temporal closeness signals.
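Going by its name, ttf-itf is a temporal analogue of tf-idf: a phrase scores highly in a time window where it is frequent but is discounted if it appears in many windows. The sketch below is built on that analogy only; EvMine's exact weighting may differ:

```python
import math
from collections import Counter

def ttf_itf(phrase_counts_per_window):
    """phrase_counts_per_window: list of Counter objects, one per time
    window. Returns {(window, phrase): score}. A tf-idf-style sketch:
    phrase frequency within a window times inverse frequency across
    windows; the precise formula in the EvMine paper may differ."""
    n_windows = len(phrase_counts_per_window)
    window_presence = Counter()
    for counts in phrase_counts_per_window:
        for phrase in counts:
            window_presence[phrase] += 1
    scores = {}
    for w, counts in enumerate(phrase_counts_per_window):
        total = sum(counts.values())
        for phrase, c in counts.items():
            tf = c / total
            itf = math.log(n_windows / window_presence[phrase])
            scores[(w, phrase)] = tf * itf
    return scores
```

A burst-specific phrase like "earthquake" outscores a phrase like "the" that occurs in every window, whose inverse-window factor is zero.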

Using Generative Pretrained Transformer-3 Models for Russian News Clustering and Title Generation tasks

The paper presents a methodology for news clustering and news headline generation based on the zero-shot approach and minimal tuning of the RuGPT-3 architecture (Generative Pretrained Transformer 3 for Russian) which requires no training or model fine-tuning.

BERT for Russian news clustering

This paper reports results of participation in the Russian News Clustering task within Dialogue Evaluation 2021 and proposes two BERT-based methods for news clustering, one of which shows competitive results in the Dialogue 2021 evaluation.

Topic-time Heatmaps for Human-in-the-loop Topic Detection and Tracking

A human-in-the-loop method is presented that helps users iteratively fine-tune TDT algorithms so that both the algorithms and the users themselves better understand the nature of the events.



Unified analysis of streaming news

This paper presents a unified framework to group incoming news articles into temporary but tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve.

News Across Languages - Cross-Lingual Document Similarity and Event Tracking

This work addresses the problem of tracking events in a large multilingual stream within the recently developed Event Registry system and shows that there are methods which scale well and can compute a meaningful similarity between articles from languages with little or no direct overlap in the training data.

The SUMMA Platform: A Scalable Infrastructure for Multi-lingual Multi-media Monitoring

The open-source SUMMA Platform is a highly scalable distributed architecture for monitoring a large number of media broadcasts in parallel, with a lag behind actual broadcast time of at most a few

Massively Multilingual Word Embeddings

New methods for estimating and evaluating embeddings of words in more than fifty languages in a single shared embedding space are introduced and a new evaluation method is shown to correlate better than previous ones with two downstream tasks.

Software Framework for Topic Modelling with Large Corpora

This work describes a Natural Language Processing software framework based on the idea of document streaming, i.e. processing corpora document after document in a memory-independent fashion; it implements several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size.
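The framework's central idea, iterating over a corpus one document at a time so that memory use is independent of corpus size, can be illustrated in plain Python with a re-iterable generator-backed corpus. This is a sketch of the pattern only; the framework's own Dictionary and corpus APIs differ:

```python
from collections import Counter

class StreamingCorpus:
    """Memory-independent corpus in the spirit of document streaming:
    documents are tokenised and converted to bag-of-words one at a time,
    so the full corpus never resides in RAM. Illustrative sketch; the
    real framework's API differs."""

    def __init__(self, doc_iter_factory):
        # a factory returning a fresh iterator over raw documents,
        # so the corpus can be streamed through multiple times
        self.factory = doc_iter_factory
        self.vocab = {}

    def __iter__(self):
        for line in self.factory():
            tokens = line.lower().split()
            for t in tokens:
                self.vocab.setdefault(t, len(self.vocab))
            # yield one sparse bag-of-words vector at a time
            yield Counter(self.vocab[t] for t in tokens)
```

Any model that consumes the corpus through `__iter__` can train on arbitrarily large collections, since only one document is materialised at any moment.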

A Framework for Clustering Massive Text and Categorical Data Streams

This work presents an online approach for clustering massive text and categorical data streams with the use of a statistical summarization methodology and presents results illustrating the effectiveness of the technique.

Distributed Representations of Sentences and Documents

Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents, and its construction gives the algorithm the potential to overcome the weaknesses of bag-of-words models.

Translation Invariant Word Embeddings

This work proposes a simple and scalable method inspired by the notion that learned vector representations should be invariant to translation between languages, and shows empirically that the method outperforms prior work on multilingual tasks, matches the performance of prior work on monolingual tasks, and scales linearly with the size of the input data.

Unsupervised Deep Embedding for Clustering Analysis

Deep Embedded Clustering is proposed, a method that simultaneously learns feature representations and cluster assignments using deep neural networks and learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective.
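DEC's clustering objective is built from two published quantities: a Student's t-kernel soft assignment q between embedded points and cluster centres, and a sharpened auxiliary target distribution p; training minimises KL(p || q). The two distributions in NumPy (the encoder producing the embeddings z is omitted):

```python
import numpy as np

def soft_assign(z, mu):
    """DEC's Student's t soft assignment between embedded points z (n, d)
    and cluster centres mu (k, d): q_ij ∝ (1 + ||z_i - mu_j||^2)^-1."""
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(1, keepdims=True)

def target_distribution(q):
    """Auxiliary target p_ij ∝ q_ij^2 / f_j, where f_j = sum_i q_ij.
    Squaring sharpens confident assignments; dividing by the soft
    cluster frequency f_j counteracts imbalanced clusters."""
    w = q ** 2 / q.sum(0)
    return w / w.sum(1, keepdims=True)
```

In the full method, gradients of KL(p || q) flow into both the cluster centres and the deep encoder, so representations and assignments improve jointly.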

Topic Detection and Tracking Pilot Study Final Report

Topic Detection and Tracking (TDT) is a DARPA-sponsored initiative to investigate the state of the art in finding and following new events in a stream of broadcast news stories. The TDT problem