Dataset for Automatic Summarization of Russian News

  • Ilya Gusev
  • Published 19 June 2020
  • Computer Science
  • ArXiv
Automatic text summarization has been studied in a variety of domains and languages. However, this does not hold for the Russian language. To overcome this issue, we present Gazeta, the first dataset for summarization of Russian news. We describe the properties of this dataset and benchmark several extractive and abstractive models. We demonstrate that the dataset is a valid task for methods of text summarization for Russian. Additionally, we prove the pretrained mBART model to be useful for… 

Automatic Summarization of Russian Texts: Comparison of Extractive and Abstractive Methods

The Russian-language news corpus Gazeta and the Russian-language parts of the MLSUM and XL-Sum corpora are used to investigate methods of creating summaries with extractive and abstractive approaches.

A template for the arxiv style

This paper showcases ruGPT3's ability to summarize texts by fine-tuning it on corpora of Russian news with their corresponding human-written summaries, and employs hyperparameter tuning so that the model's output becomes less random and more closely tied to the original text.

A template for the arxiv style

Improvements reducing the computational cost of the original deep learning model are presented, and the possibility of adding a local-search phase to further improve performance is explored.

WikiOmnia: generative QA corpus on the whole Russian Wikipedia

The WikiOmnia dataset is presented, a new publicly available set of QA-pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generative pipeline, which includes every available article from Wikipedia for the Russian language.

WikiOmnia: filtration and evaluation of the generated QA corpus on the whole Russian Wikipedia

The WikiOmnia dataset is presented, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generation and distribution pipeline.

Controllable Abstractive Summarization Using Multilingual Pretrained Language Model

This work shows that CTRLSum improves baseline summarization system in four languages: English, Indonesian, Spanish, and French by 1.57 in terms of average ROUGE-1, with the Indonesian model achieving state-of-the-art results.

DIALOG-22 RuATD Generated Text Detection

The pipeline for the two DIALOG-22 RuATD tasks (Shamardina et al., 2022) is described: detecting generated text (binary task) and classifying which model was used to generate a text (multiclass task). An ensemble of different pre-trained models based on the attention mechanism is proposed.

Findings of the RuATD Shared Task 2022 on Artificial Text Detection in Russian

This work presents the shared task on artificial text detection in Russian, which is organized as a part of the Dialogue Evaluation initiative, held in 2022, and provides count-based and BERT-based baselines, along with the human evaluation on the first sub-task.

Cross-lingual Fine-tuning for Abstractive Arabic Text Summarization

This paper presents the first corpus of human-written abstractive news summaries in Arabic, hoping to lay the foundation of this line of research for this important language.

SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents

We present SummaRuNNer, a Recurrent Neural Network (RNN) based sequence model for extractive summarization of documents, and show that it achieves performance better than or comparable to state-of-the-art models.

Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

The NEWSROOM dataset is presented, a summarization dataset of 1.3 million articles and summaries written by authors and editors in newsrooms of 38 major news publications between 1998 and 2017, and the summaries combine abstractive and extractive strategies.

Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model

This work introduces Multi-News, the first large-scale multi-document summarization (MDS) news dataset, and proposes an end-to-end model that combines a traditional extractive summarization model with a standard single-document summarization (SDS) model, achieving competitive results on MDS datasets.

Generic text summarization using relevance measure and latent semantic analysis

This paper proposes two generic text summarization methods that create summaries by ranking and extracting sentences from the original documents, using the latent semantic analysis technique to identify semantically important sentences for summary creation.
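The general idea behind LSA-based extractive summarization can be sketched as follows. This is a minimal illustration of the technique, not the paper's exact method: it assumes a raw term-frequency term-sentence matrix and ranks sentences by their singular-value-weighted loadings on the top latent topics.

```python
import numpy as np

def lsa_summarize(sentences, k=1, num_sentences=2):
    """Rank sentences by their weight in the top-k latent topics
    obtained from an SVD of the term-sentence matrix."""
    # Build vocabulary and a raw term-frequency term-sentence matrix.
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in s.lower().split():
            A[vocab.index(w), j] += 1.0
    # SVD: rows of Vt give each latent topic's loading per sentence.
    _, S, Vt = np.linalg.svd(A, full_matrices=False)
    # Score each sentence by its singular-value-weighted loadings
    # on the top-k topics.
    scores = np.sqrt(((S[:k, None] * Vt[:k]) ** 2).sum(axis=0))
    # Extract the highest-scoring sentences, kept in document order.
    top = sorted(np.argsort(scores)[::-1][:num_sentences])
    return [sentences[i] for i in top]
```

In practice the matrix entries would use TF-IDF or log-entropy weighting rather than raw counts, but the ranking mechanism is the same.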

Variations of the Similarity Function of TextRank for Automated Summarization

New alternatives to the similarity function of the TextRank algorithm for automatic text summarization achieve a significant improvement using the same metrics and dataset as the original publication.
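TextRank's pluggable similarity function is the piece this line of work varies. A minimal sketch of the baseline algorithm, assuming the original log-normalized word-overlap similarity (any alternative similarity can be swapped in via the `similarity` parameter):

```python
import numpy as np

def overlap_similarity(s1, s2):
    """Original TextRank similarity: word overlap normalized by
    the sum of the log sentence lengths."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:
        return 0.0  # guard against a zero denominator
    return len(w1 & w2) / (np.log(len(w1)) + np.log(len(w2)))

def textrank(sentences, similarity=overlap_similarity, d=0.85, iters=50):
    """Score sentences with PageRank over a sentence-similarity graph."""
    n = len(sentences)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i, j] = similarity(sentences[i], sentences[j])
    # Row-normalize edge weights (avoid division by zero for
    # sentences with no neighbors).
    row = W.sum(axis=1, keepdims=True)
    row[row == 0] = 1.0
    P = W / row
    # Damped power iteration, as in PageRank.
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - d) / n + d * P.T @ scores
    return scores
```

The highest-scoring sentences form the extractive summary; the cited paper's contribution is replacing `overlap_similarity` with better-performing alternatives.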

Extractive Summarization Using Supervised and Semi-Supervised Learning

This paper investigates co-training by combining labeled and unlabeled data and shows that this semi-supervised learning approach achieves comparable performance to its supervised counterpart and saves about half of the labeling time cost.

Self-Attentive Model for Headline Generation

This work applied the recent Universal Transformer architecture paired with a byte-pair encoding technique and achieved new state-of-the-art results on the New York Times Annotated Corpus, presenting the new RIA corpus and reaching a ROUGE-L F1-score of 36.81 and a ROUGE-2 F1-score of 22.15.

Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

A novel abstractive model is proposed which is conditioned on the article’s topics and based entirely on convolutional neural networks, outperforming an oracle extractive system and state-of-the-art abstractive approaches when evaluated automatically and by humans.

Text Summarization with Pretrained Encoders

This paper introduces a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences and proposes a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two.

Diverse Beam Search for Increased Novelty in Abstractive Summarization

A novel method is presented that relies on a diversity factor in computing the neural network loss to improve the diversity of the summaries generated by any neural abstractive model implementing beam search.
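For context, the standard decoding-time counterpart of this idea is diverse beam search with a Hamming diversity penalty (not the paper's loss-based formulation). A toy single-step sketch: each beam picks its best next token, penalized by how often earlier beams already chose that token in this step.

```python
import numpy as np

def diverse_beam_step(logprobs, beams, lam=1.0):
    """One decoding step of Hamming-diversity beam search.

    logprobs: per-token log-probabilities for the next position.
    beams:    list of (token_sequence, cumulative_score) pairs.
    lam:      strength of the diversity penalty between beams.
    """
    chosen, counts = [], {}
    for seq, score in beams:
        # Penalize tokens already selected by preceding beams
        # at this step (the Hamming diversity term).
        penalized = logprobs.copy()
        for tok, c in counts.items():
            penalized[tok] -= lam * c
        tok = int(np.argmax(penalized))
        counts[tok] = counts.get(tok, 0) + 1
        # Cumulative score uses the unpenalized log-probability.
        chosen.append((seq + [tok], score + float(logprobs[tok])))
    return chosen
```

With `lam=0` this degenerates to plain greedy expansion, where all beams pick the same top token; a positive `lam` forces later beams toward novel continuations.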