WueDevils at SemEval-2022 Task 8: Multilingual News Article Similarity via Pair-Wise Sentence Similarity Matrices

@inproceedings{Wangsadirdja2022WueDevilsAS,
  title={WueDevils at SemEval-2022 Task 8: Multilingual News Article Similarity via Pair-Wise Sentence Similarity Matrices},
  author={Dirk Wangsadirdja and Felix Heinickel and Simon Trapp and Albin Zehe and Konstantin Kobs and Andreas Hotho},
  booktitle={International Workshop on Semantic Evaluation},
  year={2022}
}
We present a system that creates pair-wise cosine and arccosine sentence similarity matrices using multilingual sentence embeddings obtained from pre-trained SBERT and Universal Sentence Encoder (USE) models, respectively. For each sentence in a news article, it searches for the most similar sentence in the other article and computes an average score. A convolutional neural network then calculates a total similarity score for the article pair from these matrices. Finally, a random forest regressor…
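The first stage described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes sentence embeddings (one row per sentence) are already available from a multilingual encoder, and `best_match_score` is a hypothetical helper name. It builds the pair-wise cosine and angular (arccosine-based) similarity matrices and the average best-match score the abstract mentions.

```python
import numpy as np

def best_match_score(emb_a: np.ndarray, emb_b: np.ndarray):
    """Given sentence embeddings for two articles (rows = sentences),
    return the cosine matrix, an angular-similarity matrix, and the
    average best-match score across both articles."""
    # L2-normalize so dot products become cosine similarities.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = a @ b.T  # pair-wise cosine similarity matrix

    # Angular similarity derived from arccosine, rescaled to [0, 1].
    ang = 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

    # For each sentence, take its most similar counterpart in the
    # other article; average over both directions.
    score = 0.5 * (cos.max(axis=1).mean() + cos.max(axis=0).mean())
    return cos, ang, score
```

The resulting matrices would then feed the CNN stage, and scalar features like `score` the random forest regressor; those later stages are not sketched here.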

References


SemEval-2022 Task 8: Multilingual news article similarity

A new dataset of nearly 10,000 news article pairs spanning 18 language combinations, annotated for seven dimensions of similarity, is introduced as SemEval-2022 Task 8; human annotators reach higher correlations than current systems, suggesting room for further progress.

Multilingual Universal Sentence Encoder for Semantic Retrieval

On transfer learning tasks, the multilingual embeddings approach, and in some cases exceed, the performance of English-only sentence embeddings.

Learning Semantic Textual Similarity from Conversations

A novel approach to learn representations for sentence-level semantic similarity using conversational data and achieves the best performance among all neural models on the STS Benchmark and is competitive with the state-of-the-art feature engineered and mixed systems for both tasks.

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity, is presented.

Universal Sentence Encoder

It is found that transfer learning using sentence embeddings tends to outperform word-level transfer, achieving surprisingly good performance with minimal amounts of supervised training data for a transfer task.

Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

SDR, a self-supervised method for document similarity ranking, is introduced; it can be applied to documents of arbitrary length, including extremely long documents exceeding the 4,096-token limit of Longformer.

Convolutional Neural Networks for Sentence Classification

The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, including sentiment analysis and question classification, and a modification to the architecture is proposed to allow the use of both task-specific and static vectors.

Random Forests

Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the forest, and the method is also applicable to regression.

PyTorch: An Imperative Style, High-Performance Deep Learning Library

This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

Thirteen ways to look at the correlation coefficient

In 1885, Sir Francis Galton first defined the term “regression” and completed the theory of bivariate correlation. A decade later, Karl Pearson developed the index that we still use to…