TextFlow: A Text Similarity Measure based on Continuous Sequences

Yassine Mrabet, Halil Kilicoglu, Dina Demner-Fushman
Text similarity measures are used in multiple tasks such as plagiarism detection, information ranking and recognition of paraphrases and textual entailment. While recent advances in deep learning highlighted the relevance of sequential models in natural language generation, existing similarity measures do not fully exploit the sequential nature of language. Examples of such similarity measures include n-grams and skip-grams overlap which rely on distinct slices of the input texts. In this paper… 
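As a rough illustration of the n-gram and skip-gram overlap baselines the abstract mentions (not of TextFlow itself, whose definition is not given in this excerpt), the sketch below compares two token lists. The Jaccard ratio, the whitespace tokenization, and the helper names `ngrams`, `skip_bigrams`, and `jaccard` are assumptions chosen for the example.

```python
from itertools import combinations


def ngrams(tokens, n):
    """Return the set of contiguous n-token slices of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def skip_bigrams(tokens, max_skip):
    """Return ordered token pairs separated by at most max_skip tokens."""
    return {
        (tokens[i], tokens[j])
        for i, j in combinations(range(len(tokens)), 2)
        if j - i - 1 <= max_skip
    }


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical example sentences, tokenized on whitespace.
x = "the cat sat on the mat".split()
y = "the cat lay on the mat".split()

bigram_sim = jaccard(ngrams(x, 2), ngrams(y, 2))
skip_sim = jaccard(skip_bigrams(x, 1), skip_bigrams(y, 1))
print(bigram_sim, skip_sim)
```

Both scores treat each text as a bag of distinct slices, so where the shared slices occur in the sequence has no effect on the result.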


State-of-art: text similarity computing
By comparing several typical models, three key issues in text similarity computing are addressed in detail: the text representation model, the similarity calculation, and the quality evaluation.
Unsupervised Low-Dimensional Vector Representations for Words, Phrases and Text that are Transparent, Scalable, and produce Similarity Metrics that are Complementary to Neural Embeddings
A near-comprehensive vector representation of words and selected bigrams, trigrams and abbreviations is created using the set of titles and abstracts in PubMed as a corpus, and is used to create several novel implicit word-word and text-text similarity metrics.
HOLMS: Alternative Summary Evaluation with Large Language Models
This paper presents a new hybrid evaluation measure for summarization, called HOLMS, that combines both language models pre-trained on large corpora and lexical similarity measures and shows that HOLMS outperforms ROUGE and BLEU substantially in its correlation with human judgments on several extractive summarization datasets.
Simple Convolutional Neural Networks with Linguistically-Annotated Input for Answer Selection in Question Answering
This research modifies the input that is fed to neural networks by annotating it with linguistic information (POS tags, Named Entity Recognition output, linguistic relations, etc.) and argues that this strikes a better balance between feature engineering and network engineering.
Design of a generic, open platform for machine learning-assisted indexing and clustering of articles in PubMed, a biomedical bibliographic database
A generic platform for biomedical text mining is proposed and described, which can serve as a shared resource for machine learning projects and as a public repository for their outputs and can be extended to include a wide variety of other machine learning-based goals and projects.
Evaluation of Scientific Elements for Text Similarity in Biomedical Publications
Comparison of the tools with two strong baselines shows that the predictions provided by the ArguminSci tool can support the use case of mining alternative methods for animal experiments.
Integration of the PubAnnotation ecosystem in the development of a web-based search tool for alternative methods
M. Neves, Computer Science, Genomics & informatics, 2020
A Web application to support finding alternative methods to animal experiments and an annotator for cell lines that contain more than 196k terms from Cellosaurus are being developed.
An Architecture for e-Health Recommender Systems Based on Similarity of Patients’ Symptoms
The architecture retrieves similar cases from the organizational memory using textual similarity analysis to limit the search space; using the International Classification of Diseases, a case can be converted to a vector model representation in order to compute metric distances.


Sentence similarity based on semantic nets and corpus statistics
Experiments demonstrate that the proposed method provides a similarity measure that shows a significant correlation to human intuition and can be used in a variety of applications that involve text knowledge representation and discovery.
The Evaluation of Sentence Similarity Measures
This work evaluated fourteen existing text similarity measures which have been used to calculate similarity score between sentences in many text applications, and found three of them to be inadequate.
Detecting Text Similarity over Short Passages: Exploring Linguistic Feature Combinations via Machine Learning
A new composite similarity metric is presented that combines information from multiple linguistic indicators to measure semantic distance between pairs of small textual units and is evaluated against standard information retrieval techniques, establishing that the new method is more effective in identifying closely related textual units.
Improving Similarity Measures for Short Segments of Text
A Web-relevance similarity measure is introduced and it is shown that one can further improve the accuracy of similarity measures by using a machine learning approach.
Learning Discriminative Projections for Text Similarity Measures
A novel discriminative training method that projects the raw term vectors into a common, low-dimensional vector space, which not only outperforms existing state-of-the-art approaches, but also achieves high accuracy at low dimensions and is thus more efficient.
A web-based kernel function for measuring the similarity of short text snippets
This paper defines a similarity kernel function, mathematically analyzes some of its properties, provides examples of its efficacy, and shows the use of this kernel function in a large-scale system for suggesting related queries to search engine users.
Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks
This paper presents a convolutional neural network architecture for reranking pairs of short texts, which learns the optimal representation of text pairs and a similarity function to relate them in a supervised way from the available training data.
Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
This work introduces a method for paraphrase detection based on recursive autoencoders (RAE) and unsupervised RAEs based on a novel unfolding objective and learns feature vectors for phrases in syntactic trees to measure word- and phrase-wise similarity between two sentences.
A large annotated corpus for learning natural language inference
The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.
WordNet::Similarity - Measuring the Relatedness of Concepts
WordNet::Similarity is a freely available software package that makes it possible to measure the semantic similarity and relatedness between a pair of concepts (or synsets). It provides six measures