Patents Phrase to Phrase Semantic Matching Dataset

@article{Aslanyan2022PatentsPT,
  title={Patents Phrase to Phrase Semantic Matching Dataset},
  author={Grigor Aslanyan and Ian Wetherbee},
  journal={ArXiv},
  year={2022},
  volume={abs/2208.01171}
}
There are many general-purpose benchmark datasets for Semantic Textual Similarity, but none of them focuses on technical concepts found in patents and scientific publications. This work aims to fill this gap by presenting a new human-rated contextual phrase-to-phrase matching dataset. The entire dataset contains close to 50,000 rated phrase pairs, each with a CPC (Cooperative Patent Classification) class as a context. This paper describes the dataset and some baseline models.

Logic Mill - A Knowledge Navigation System

The Logic Mill system is a scalable and openly accessible software system that identifies semantically similar documents within either one domain-specific corpus or multi-domain corpora and can be extended to text corpora from other domains.

Recent Developments in AI and USPTO Open Data

An emerging class of use cases directed to the research, development, and application of artificial intelligence technology contemplates both the delivery of artificial intelligence capabilities for practical IP applications and the enablement of future state-of-the-art artificial intelligence research via USPTO data products.

References

A SICK cure for the evaluation of compositional distributional semantic models

This work aims to help the research community working on compositional distributional semantic models (CDSMs) by providing SICK (Sentences Involving Compositional Knowledge), a large English benchmark tailored for them.

Advances in Pre-Training Distributed Word Representations

This paper shows how to train high-quality word vector representations by combining known tricks that are rarely used together; the resulting models outperform the current state of the art by a large margin on a number of tasks.

SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation

The STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017), providing insight into the limitations of existing models.

SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT)

In this shared task, systems are evaluated on two related tasks for Twitter data, Paraphrase Identification and Semantic Textual Similarity (SS), and the importance of bringing these two research areas together is suggested.

WordNet: A Lexical Database for English

WordNet provides a more effective combination of traditional lexicographic information and modern computing, and is an online lexical database designed for use under program control.

Bag of Tricks for Efficient Text Classification

This paper explores a simple and efficient baseline for text classification and shows that the fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, while being many orders of magnitude faster for training and evaluation.

Leveraging the BERT algorithm for Patents with TensorFlow and BigQuery

In less technical terms, the BERT framework is exceptional at capturing the fact that the meaning of a word can vary significantly based on the context in which it’s used, even in the same document or sentence.

Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition

Overall, BioASQ helped obtain a unified view of how techniques from text classification, semantic indexing, document and passage retrieval, question answering, and text summarization can be combined to allow biomedical experts to obtain concise, user-understandable answers to questions reflecting their real information needs.

Automatically Constructing a Corpus of Sentential Paraphrases

The creation of the recently-released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase, is described.