On the Use of Similarity Search to Detect Fake Scientific Papers

@inproceedings{Williams2015OnTU,
  title={On the Use of Similarity Search to Detect Fake Scientific Papers},
  author={Kyle Williams and C. Lee Giles},
  booktitle={SISAP},
  year={2015}
}
Fake scientific papers have recently become of interest within the academic community as a result of the identification of fake papers in the digital libraries of major academic publishers [8]. Detecting and removing these papers is important for many reasons. We describe an investigation into the use of similarity search for detecting fake scientific papers by comparing several methods for signature construction and similarity scoring and describe a pseudo-relevance feedback technique that can… 
Detection of Computer-Generated Papers Using One-Class SVM and Cluster Approaches
TLDR
The paper presents a novel methodology for distinguishing real from artificially generated manuscripts, comparing a one-class SVM approach with a clustering-based procedure, and suggests that human style is more “diverse” and “rich” than artificial style.
Trends in Gaming Indicators: On Failed Attempts at Deception and their Computerised Detection
TLDR
Through several emblematic case studies, evidence of attempts to game indicators is shown together with automatic ways to detect them (automatic detection of generated papers, error detection).
Prevalence of nonsensical algorithmically generated papers in the scientific literature
TLDR
This work reveals metric gaming up to the point of absurdity: fraudsters publish nonsensical algorithmically generated papers featuring genuine references and stresses the need to screen papers for nonsense before peer‐review and chase citation manipulation in published papers.
Ike Antkare, His Publications, and Those of His Disciples
When evaluating a scientific paper, footnotes and citations have become crucial tools to quantify academic excellence. This has become an important trend, for several reasons. Metrics have gained
Detection of automatically generated texts
TLDR
This thesis first introduces different methods of generating free texts that resemble a certain topic and how those texts can be used, and then sheds light on multiple important research questions about the possibility of detecting automatically generated texts in different settings.
Engineering a Tool to Detect Automatically Generated Papers
TLDR
Different methods aiming at automatically classifying generated papers are presented and compared and it is shown that there is a need for an automatic detection process to discover and remove nonsense papers.
Detecting automatically generated sentences with grammatical structure similarity
TLDR
The grammatical structure similarity measurement is presented to detect sentences or short fragments of automatically generated text from known PCFG generators and the proposed approach is tested against a pattern checker and various common machine learning methods.
Curious Cases of Automatically Generated Text and Detecting Probabilistic Context Free Grammar Sentences with Grammatical Structure Similarity
TLDR
The Grammatical Structure Similarity (GSS) measurement is presented to detect sentences or short fragments of automatically generated text from known PCFG generators and is tested against a pattern checker and various common machine learning methods.
Towards the simulation of sensor networks [authenticity and untruthful practice]
This paper will explore the rise of AI generated fake essays and papers created with browser-based software and probe this phenomenon through questionable peer review processes from online journals.
HUSO 2017 Proceedings
  • Xiaolong Jin
  • 2017
From the earliest time of recorded scholarship, forecasting civil strife has been the Holy Grail to political theorists. Yet, without actual data and ability to conduct empirical analyses, until the

References

Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?
TLDR
This work demonstrates a software method of detecting duplicate and fake publications appearing in scientific conferences and, as a result, in the bibliographic services.
Near duplicate detection in an academic digital library
TLDR
This work investigates the application of scalable, state-of-the-art simhash and shingle duplicate detection algorithms to detecting near-duplicate documents in the CiteSeerX digital library, evaluates their performance and applicability to academic documents, and identifies good parameters for the algorithms.
Detecting near-duplicates for web crawling
TLDR
This work demonstrates that Charikar's fingerprinting technique is appropriate for near-duplicate detection and presents an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k.
Human-competitive tagging using automatic keyphrase extraction
TLDR
This paper demonstrates how documents can be tagged automatically with a state-of-the-art keyphrase extraction algorithm, and improves performance in this new domain using a new algorithm, "Maui", that utilizes semantic information extracted from Wikipedia.
An Effective Method to Identify Machine Automatically Generated Paper
  • Jiping Xiong, Tao Huang
  • Computer Science
    2009 Pacific-Asia Conference on Knowledge Engineering and Software Engineering
  • 2009
TLDR
A simple but effective method is introduced to quickly identify whether a paper comes from a paper generator; it is useful for detecting fake papers and can be easily adapted to other related work.
Overview of the 6th International Competition on Plagiarism Detection
TLDR
This paper overviews 18 plagiarism detectors that have been developed and evaluated within PAN'10, highlighting several important aspects of plagiarism detection, such as obfuscation, intrinsic vs. external plagiarism, and plagiarism case length.
Syntactic Clustering of the Web
TLDR
An efficient way to determine the syntactic similarity of files is developed and applied to every document on the World Wide Web, and a clustering of all the documents that are syntactically similar is built.
Publishers withdraw more than 120 gibberish papers
Conference proceedings removed from subscription databases after scientist reveals that they were computer-generated.
Publish or Perish—An Ailing Enterprise?
Two recent events, taking place in rapid succession, incited me to write this Opinion. The first was an annual report from a major school of engineering whose dean proudly listed 52 papers that he
Investigating journals: The dark side of publishing
  • D. Butler
  • Political Science, Medicine
    Nature
  • 2013