WueDevils at SemEval-2022 Task 8: Multilingual News Article Similarity via Pair-Wise Sentence Similarity Matrices
@inproceedings{Wangsadirdja2022WueDevilsAS,
  title={WueDevils at SemEval-2022 Task 8: Multilingual News Article Similarity via Pair-Wise Sentence Similarity Matrices},
  author={Dirk Wangsadirdja and Felix Heinickel and Simon Trapp and Albin Zehe and Konstantin Kobs and Andreas Hotho},
  booktitle={International Workshop on Semantic Evaluation},
  year={2022}
}
We present a system that creates pair-wise cosine and arccosine sentence similarity matrices using multilingual sentence embeddings obtained from pre-trained SBERT and Universal Sentence Encoder (USE) models, respectively. For each sentence of a news article, it searches for the most similar sentence in the other article and computes an average score. In addition, a convolutional neural network calculates a total similarity score for the article pair from these matrices. Finally, a random forest regressor…
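As a rough illustration of the matrix construction described above, the sketch below builds pair-wise cosine and arccosine-based similarity matrices from multilingual SBERT embeddings and derives a max-alignment average score. The checkpoint name, the angular rescaling, and the use of a single SBERT model for both matrices (the paper pairs the arccosine variant with USE embeddings) are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch of pair-wise sentence similarity matrices (assumptions noted above).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # assumed checkpoint

def similarity_matrices(sentences_a, sentences_b):
    """Return cosine and arccosine-based similarity matrices of shape (|A|, |B|)."""
    emb_a = model.encode(sentences_a, normalize_embeddings=True)
    emb_b = model.encode(sentences_b, normalize_embeddings=True)
    cos = emb_a @ emb_b.T                                        # cosine similarity of unit vectors
    ang = 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi       # angular similarity in [0, 1]
    return cos, ang

def max_alignment_score(matrix):
    """For each sentence, take its best match in the other article; average both directions."""
    return 0.5 * (matrix.max(axis=1).mean() + matrix.max(axis=0).mean())
```

Averaging the row-wise and column-wise maxima makes the alignment score symmetric in the two articles, matching the bidirectional search described in the abstract.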
One Citation
SemEval-2022 Task 8: Multilingual news article similarity
- Computer Science, SEMEVAL
- 2022
A new dataset of nearly 10,000 news article pairs spanning 18 language combinations, annotated for seven dimensions of similarity, is introduced as SemEval-2022 Task 8; human annotators are shown to be capable of reaching higher correlations, suggesting space for further progress.
References
SemEval-2022 Task 8: Multilingual news article similarity
- Computer Science, SEMEVAL
- 2022
A new dataset of nearly 10,000 news article pairs spanning 18 language combinations, annotated for seven dimensions of similarity, is introduced as SemEval-2022 Task 8; human annotators are shown to be capable of reaching higher correlations, suggesting space for further progress.
Multilingual Universal Sentence Encoder for Semantic Retrieval
- Computer Science, ACL
- 2020
On transfer learning tasks, the multilingual embeddings approach, and in some cases exceed, the performance of English-only sentence embeddings.
Learning Semantic Textual Similarity from Conversations
- Computer Science, Rep4NLP@ACL
- 2018
A novel approach to learning representations for sentence-level semantic similarity from conversational data is presented; it achieves the best performance among all neural models on the STS Benchmark and is competitive with the state-of-the-art feature-engineered and mixed systems on both tasks.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Computer Science, EMNLP
- 2019
Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity, is presented.
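For context, deriving and comparing SBERT-style sentence embeddings with the sentence-transformers library typically looks like the snippet below; the checkpoint name is an assumption and is unrelated to the cited paper's experiments.

```python
# Hypothetical SBERT-style sentence similarity example (model name assumed).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["A man is eating food.", "Someone is having a meal."], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]))  # high cosine similarity for paraphrases
```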
Universal Sentence Encoder
- Computer Science, arXiv
- 2018
It is found that transfer learning using sentence embeddings tends to outperform word-level transfer, with surprisingly good performance from minimal amounts of supervised training data for a transfer task.
Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference
- Computer Science, FINDINGS
- 2021
SDR, a self-supervised method for document similarity, is introduced; it can be applied to documents of arbitrary length, including extremely long documents that exceed the 4,096-token limit of Longformer.
Convolutional Neural Networks for Sentence Classification
- Computer Science, EMNLP
- 2014
The CNN models discussed improve upon the state of the art on 4 out of 7 tasks, including sentiment analysis and question classification, and a simple modification to the architecture is proposed to allow the use of both task-specific and static vectors.
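In the system described in the abstract, a CNN of this kind is applied to sentence similarity matrices rather than to word embeddings; a minimal PyTorch sketch of such a scoring network (all layer sizes and the pooling scheme are assumptions, not the authors' architecture) is shown below.

```python
# Illustrative CNN mapping a (1, H, W) similarity matrix to a scalar score.
import torch
import torch.nn as nn

class SimilarityMatrixCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),                # handles variable matrix sizes
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, sim_matrix):           # sim_matrix: (batch, 1, H, W)
        x = self.features(sim_matrix).flatten(1)
        return self.head(x).squeeze(-1)      # one similarity score per article pair

scores = SimilarityMatrixCNN()(torch.rand(2, 1, 30, 25))  # two article pairs
```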
Random Forests
- Computer Science, Machine Learning
- 2001
Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the forest, and the method is also applicable to regression.
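Since the system's final stage is a random forest regressor, a hedged scikit-learn sketch of that combination step might look as follows; the feature set and targets are placeholders, not the features used in the paper.

```python
# Hypothetical final-stage regressor combining per-pair features into one similarity score.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 4))   # e.g. CNN score, cosine alignment, arccosine alignment, length ratio (assumed)
y = rng.random(200)        # gold overall similarity (placeholder values)

reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(reg.predict(X[:3]))
```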
PyTorch: An Imperative Style, High-Performance Deep Learning Library
- Computer Science, NeurIPS
- 2019
This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.
Thirteen ways to look at the correlation coefficient
- Mathematics
- 1988
In 1885, Sir Francis Galton first defined the term “regression” and completed the theory of bivariate correlation. A decade later, Karl Pearson developed the index that we still use to…
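The correlation coefficient discussed here is also the measure used to compare predicted and gold similarity scores in SemEval-2022 Task 8; a toy computation of Pearson's r with SciPy (the values are placeholders):

```python
# Pearson's r between predicted and gold similarity scores (toy values).
from scipy.stats import pearsonr

gold = [1.0, 2.0, 3.0, 3.5, 4.0]
pred = [1.2, 1.9, 2.8, 3.6, 3.9]
r, p_value = pearsonr(gold, pred)
print(round(r, 3))
```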