A Survey of Text Similarity Approaches

  • Wael Hassan Gomaa, Aly A. Fahmy
  • Published 18 April 2013
  • Computer Science
  • International Journal of Computer Applications
ABSTRACT Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of… 
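Of the three families the survey names, string-based measures are the simplest: they compare surface forms directly, with no corpus or lexical resource. A minimal sketch of two such measures (the function names and example strings are illustrative, not taken from the survey):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def dice_bigrams(a: str, b: str) -> float:
    """Dice coefficient over character bigrams, a character-level string measure."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    return 2 * len(ga & gb) / (len(ga) + len(gb)) if ga or gb else 1.0
```

Corpus-based and knowledge-based measures replace these set overlaps with statistics from large corpora and with lexical resources such as WordNet, respectively.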

Figures from this paper

Textual Similarity Measurement Approaches: A Survey
An overview of textual similarity in the literature is provided, and many approaches for measuring the textual similarity of Arabic text are reviewed and compared in this paper.
Text Similarity Based on Modified LSA Technique
Two approaches focus on the problem of semantic similarity between texts in the English language using the Latent Semantic Analysis (LSA) technique, trying to enhance the process of finding the semantic similarity distance between texts and to make it more adaptable to both long documents and short sentences.
Short text similarity measurement methods: a review
This paper reviews the research literature on short text similarity (STS) measurement methods to classify and give a broad overview of existing techniques, and to identify their strengths and weaknesses in terms of domain independence, language independence, requirements for semantic knowledge, corpora and training data, and the ability to handle semantic meaning, word-order similarity and polysemy.
Measuring Sentences Similarity: A Survey
  • M. Farouk
  • Computer Science
    Indian Journal of Science and Technology
  • 2019
Word-to-word-based, structure-based, and vector-based approaches are the most widely used for finding sentence similarity, but structure-based similarity, which measures the similarity between sentence structures, needs more investigation.
Assessing semantic similarity of texts - Methods and algorithms
The mathematical background of LSA, which derives the meaning of words in a given text by exploring their co-occurrence, is examined; the technique reduces the dimensionality of the document vector space and better captures the text semantics.
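The LSA idea summarized above can be sketched in a few lines: build a term-document matrix, truncate its SVD, and compare documents in the reduced space. The toy matrix and the choice of k = 2 below are illustrative assumptions, not data from the paper:

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# Docs 0-1 use "car/auto" vocabulary; docs 2-3 use "ocean/sea" vocabulary.
X = np.array([
    [2, 1, 0, 0],   # "car"
    [1, 2, 0, 0],   # "auto"
    [0, 0, 2, 1],   # "ocean"
    [0, 0, 1, 2],   # "sea"
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                   # keep the top-k latent dimensions
docs_k = (np.diag(s[:k]) @ Vt[:k]).T    # each row: a document in LSA space

def cos(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two reduced document vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In the reduced space, documents 0 and 1 (shared vocabulary) come out highly similar, while documents 0 and 2 (disjoint vocabulary) come out near-orthogonal, which is the dimensionality-reduction effect the summary describes.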
Comparative Study of Techniques used for Word and Sentence Similarity
  • Farooq Ahmad, Mohd. Faisal
  • Computer Science
    2021 8th International Conference on Computing for Sustainable Global Development (INDIACom)
  • 2021
Approaches to measuring the resemblance of sentences are classified into three groups based on the methods they implement, with word-to-word-based, structure-based, and vector-based methods the most frequently used.
Various supervised regression techniques used to analyze the impact of syntactic and semantic features in calculating the degree of semantic equivalence between two text fragments, even when the sentence pair uses different words, are described.
A German Corpus for Text Similarity Detection Tasks
A textual German corpus for similarity detection is presented to automatically assess the similarity between a pair of texts and to evaluate different similarity measures, both for whole documents or for individual sentences.
A Comparison of Semantic Similarity Methods for Maximum Human Interpretability
Three different methods that not only focus on the text's words but also incorporates semantic information of texts in their feature vector and computes semantic similarities are presented, which performed best in finding similarities between short news texts.
Performance evaluation of similarity measures on similar and dissimilar text retrieval
  • V. Thompson, C. Panchev, M. Oakes
  • Computer Science
    2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)
  • 2015
This paper evaluated the performance of eight popular similarity measures on four levels (degrees) of textual similarity using a corpus of plagiarised texts, and showed that most of the measures performed equally well on highly similar texts, with the exception of Euclidean distance and Jensen-Shannon divergence, which performed worse.
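The two measures singled out above can be sketched over bag-of-words term distributions. This is a hedged illustration: the paper's exact preprocessing and weighting are not specified here, so plain relative word frequencies are assumed.

```python
import math
from collections import Counter

def _dist(text: str) -> dict:
    """Relative word-frequency distribution of a text."""
    c = Counter(text.lower().split())
    n = sum(c.values())
    return {w: f / n for w, f in c.items()}

def euclidean(a: str, b: str) -> float:
    """Euclidean distance between the two term distributions."""
    pa, pb = _dist(a), _dist(b)
    vocab = set(pa) | set(pb)
    return math.sqrt(sum((pa.get(w, 0) - pb.get(w, 0)) ** 2 for w in vocab))

def _kl(p: dict, q: dict) -> float:
    # KL divergence; q is the mixture, so q[w] > 0 wherever p[w] > 0.
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

def jensen_shannon(a: str, b: str) -> float:
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint texts."""
    pa, pb = _dist(a), _dist(b)
    vocab = set(pa) | set(pb)
    m = {w: 0.5 * (pa.get(w, 0) + pb.get(w, 0)) for w in vocab}
    return 0.5 * _kl(pa, m) + 0.5 * _kl(pb, m)
```

Both are distances, not similarities: identical texts score 0, and with base-2 logarithms the Jensen-Shannon divergence of texts with no shared words is exactly 1.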


Sentence similarity based on semantic nets and corpus statistics
Experiments demonstrate that the proposed method provides a similarity measure that shows a significant correlation to human intuition and can be used in a variety of applications that involve text knowledge representation and discovery.
Corpus-based and Knowledge-based Measures of Text Semantic Similarity
This paper shows that the semantic similarity method out-performs methods based on simple lexical matching, resulting in up to 13% error rate reduction with respect to the traditional vector-based similarity metric.
Semantic text similarity using corpus-based word similarity and string similarity
We present a method for measuring the semantic similarity of texts using a corpus-based measure of semantic word similarity and a normalized and modified version of the Longest Common Subsequence (LCS) string-matching algorithm.
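The string-similarity half of that method can be sketched as the classic dynamic-programming LCS with one common normalization (LCS length squared over the product of the string lengths). Whether this matches the paper's exact normalization is an assumption of this sketch:

```python
def lcs_len(a: str, b: str) -> int:
    """Classic dynamic-programming longest-common-subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def nlcs(a: str, b: str) -> float:
    """Normalized LCS: len(LCS)^2 / (len(a) * len(b)), in [0, 1]."""
    if not a or not b:
        return 0.0
    l = lcs_len(a, b)
    return l * l / (len(a) * len(b))
```

Squaring the LCS length penalizes short matches between long strings more strongly than a plain ratio would.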
WordNet : an electronic lexical database
The lexical database (nouns in WordNet), a semantic network of English verbs, and applications of WordNet such as building semantic concordances are presented.
Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words
A new corpus-based method, called Second Order Co-occurrence PMI (SOC-PMI), uses Pointwise Mutual Information to sort lists of important neighbor words of the two target words in order to calculate their relative semantic similarity.
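SOC-PMI builds on ordinary first-order PMI scores of co-occurring words. A minimal first-order PMI sketch over a toy corpus follows; the window choice (one sentence per window) and the corpus are illustrative assumptions, and this is only the building block, not the full SOC-PMI method:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences: list) -> dict:
    """PMI over a toy corpus, treating each sentence as one co-occurrence window.
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )."""
    word_counts, pair_counts = Counter(), Counter()
    n = len(sentences)
    for s in sentences:
        words = set(s.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    pmi = {}
    for pair, c in pair_counts.items():
        x, y = tuple(pair)
        p_xy = c / n
        p_x, p_y = word_counts[x] / n, word_counts[y] / n
        pmi[pair] = math.log2(p_xy / (p_x * p_y))
    return pmi
```

SOC-PMI then compares two target words by ranking each word's PMI-weighted neighbor lists and scoring the overlap, which lets it relate words that never co-occur directly.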
A Wikipedia-Based Multilingual Retrieval Model
Results are presented of an extensive analysis that demonstrates the power of this new retrieval model: for a query document d the topically most similar documents from a corpus in another language are properly ranked.
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information and Information Retrieval to measure the similarity of pairs of words.
Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy
This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between concepts can be better quantified with computational evidence derived from a distributional analysis of corpus data.
IRIT: Textual Similarity Combining Conceptual Similarity with an N-Gram Comparison Method
The participation of the IRIT team to SemEval 2012 Task 6 (Semantic Textual Similarity) consists of a n-gram based comparison method combined with a conceptual similarity measure that uses WordNet to calculate the similarity between a pair of concepts.
Term representation with Generalized Latent Semantic Analysis
This paper presents Generalized Latent Semantic Analysis as a framework for computing semantically motivated term and document vectors and demonstrates that GLSA term vectors efficiently capture semantic relations between terms and outperform related approaches on the synonymy test.