• Corpus ID: 219305515

Experiments on Paraphrase Identification Using Quora Question Pairs Dataset

Andrea Chandra, Ruben Stefanus
We modeled the Quora Question Pairs dataset to identify similar questions. The dataset is provided by Quora, and the task is binary classification. We tried several methods and algorithms, taking a different approach from previous work. For feature extraction, we used Bag of Words, including Count Vectorizer and Term Frequency-Inverse Document Frequency (TF-IDF) with unigrams, for XGBoost and CatBoost. Furthermore, we also experimented with the WordPiece tokenizer, which improves the model performance…
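
As a rough illustration of the unigram Bag-of-Words and TF-IDF features mentioned in the abstract, the following pure-Python sketch builds count and TF-IDF vectors for one question pair. The example questions, the function names, the smoothed IDF variant, and the cosine-similarity feature are assumptions made for illustration; they are not taken from the paper, and the authors' actual pipeline may differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, strip basic punctuation (unigrams only).
    tokens = (tok.strip("?.,!") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def count_vector(tokens, vocab):
    # Bag-of-Words: raw term counts in vocabulary order.
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def idf_weights(docs_tokens, vocab):
    # Smoothed IDF (an assumed variant): ln((1 + n) / (1 + df)) + 1.
    n = len(docs_tokens)
    weights = []
    for term in vocab:
        df = sum(1 for toks in docs_tokens if term in toks)
        weights.append(math.log((1 + n) / (1 + df)) + 1.0)
    return weights


def tfidf_vector(tokens, vocab, idf):
    # TF-IDF: term count multiplied by the term's IDF weight.
    return [c * w for c, w in zip(count_vector(tokens, vocab), idf)]


def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical question pair, not drawn from the dataset.
q1 = "How do I learn Python quickly?"
q2 = "What is the fastest way to learn Python?"

docs = [tokenize(q1), tokenize(q2)]
vocab = sorted(set(docs[0]) | set(docs[1]))
idf = idf_weights(docs, vocab)

v1 = tfidf_vector(docs[0], vocab, idf)
v2 = tfidf_vector(docs[1], vocab, idf)

# One possible feature row for a tree-based classifier:
# both count vectors concatenated, plus the TF-IDF cosine similarity.
pair_features = count_vector(docs[0], vocab) + count_vector(docs[1], vocab) + [cosine(v1, v2)]
```

In practice the IDF weights would be fitted on the whole training corpus rather than on a single pair, and a row like `pair_features` would be passed to a gradient-boosted classifier such as XGBoost or CatBoost.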

Unified Model for Paraphrase Generation and Paraphrase Identification
A light-weight unified model is proposed which aims to solve the problems of Paraphrase Identification and generation by using carefully selected data-points and a fine-tuned T5 model.
Retrieve Synonymous keywords for Frequent Queries in Sponsored Search in a Data Augmentation Way
A data-augmentation-like framework is proposed to improve the synonymous retrieval performance for these frequent queries, and a commercial Chinese dataset containing 500K synonymous pairs with a precision of 95% is released to the public for paraphrase study.
SAIF: A Correction-Detection Deep-Learning Architecture for Personal Assistants
A multimodal architecture called SAIF is developed, which detects user corrections, taking as inputs the user’s voice commands as well as their transcripts, and it is shown that SAIF outperforms current state-of-the-art methods on this dataset.


Neural Models for Detecting Binary Semantic Textual Similarity for Algerian and MSA
The results show that relatively simple models composed of 2 LSTM layers outperform by far other more sophisticated attention-based architectures, for both ALG and MSA datasets.
A Hybrid Approach to Paraphrase Detection
This paper presents a hybrid approach to the paraphrase detection task that takes advantage of both feature-engineering and neural-based methods and achieves competitive results.
Machine Learning Models for Paraphrase Identification and its Applications on Plagiarism Detection
Among the compared models, as expected, Recurrent Neural Network is best suited for the paraphrase identification task and it is proposed that Plagiarism detection is one of the areas where Paraphrase Identification can be effectively implemented.
Extract, Transform and Filling: A Pipeline Model for Question Paraphrasing based on Template
A pipeline model based on templates for question paraphrasing that outperforms the sequence-to-sequence model by a large margin, with an advantage that is more promising when the training sample is small.
Simple and Effective Paraphrastic Similarity from Parallel Translations
A model and methodology for learning paraphrastic sentence embeddings directly from bitext is presented, removing the time-consuming intermediate step of creating paraphrase corpora, and is shown to be orders of magnitude faster than more complex state-of-the-art baselines.
Paraphrasing with Large Language Models
This work presents a useful technique for using a large language model to paraphrase a variety of texts and subjects, and shows that the model can generate paraphrases not only at the sentence level but also for longer spans of text such as paragraphs, without needing to break the text into smaller chunks.
Large Scale Question Paraphrase Retrieval with Smoothed Deep Metric Learning
A new QPR system implemented as a Neural Information Retrieval (NIR) system consisting of a neural network sentence encoder and an approximate k-Nearest Neighbour index for efficient vector retrieval is described.
Retrofitting Contextualized Word Embeddings with Paraphrases
This work proposes a post-processing approach to retrofit the contextualized word embedding with paraphrases, which seeks to minimize the variance of word representations on paraphrased contexts and significantly improves ELMo on various sentence classification and inference tasks.
ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
This work uses ParaNMT-50M, a dataset of more than 50 million English-English sentential paraphrase pairs, to train paraphrastic sentence embeddings that outperform all supervised systems on every SemEval semantic textual similarity competition, in addition to showing how it can be used for paraphrase generation.
Exploring Diverse Expressions for Paraphrase Generation
This paper proposes a novel approach with two discriminators and multiple generators to generate a variety of different paraphrases and demonstrates that the model not only gains a significant increase in diversity but also improves generation quality over several state-of-the-art baselines.