Generating Label Cohesive and Well-Formed Adversarial Claims

Pepa Atanasova, Dustin Wright, Isabelle Augenstein
Adversarial attacks reveal important vulnerabilities and flaws of trained models. One potent type of attack is the universal adversarial trigger: an individual n-gram that, when appended to instances of a class under attack, can trick a model into predicting a target class. However, for inference tasks such as fact checking, these triggers often inadvertently invert the meaning of the instances they are inserted into. In addition, such attacks produce semantically nonsensical inputs, as they…
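To make the trigger idea concrete, here is a minimal sketch of how the success rate of an appended trigger could be measured. Everything in it is an assumption for illustration: `toy_classifier` is a hypothetical keyword rule standing in for a trained fact-checking model, and the claims, labels, and triggers are invented. It is not the authors' method; it only shows the mechanics of appending one n-gram to every instance of a class and counting how often the prediction becomes the target class.

```python
from typing import Callable, List

# Hypothetical stand-in for a trained fact-checking model (an assumption,
# not a real system): predicts REFUTES when a negation cue is present.
NEGATION_CUES = {"not", "never", "no"}


def toy_classifier(claim: str) -> str:
    tokens = claim.lower().split()
    return "REFUTES" if any(t in NEGATION_CUES for t in tokens) else "SUPPORTS"


def attack_success_rate(
    model: Callable[[str], str],
    claims: List[str],
    trigger: str,
    target_label: str,
) -> float:
    """Append the same trigger to every claim and return the fraction
    of attacked claims that the model assigns to the target label."""
    attacked = [f"{claim} {trigger}" for claim in claims]
    hits = sum(model(text) == target_label for text in attacked)
    return hits / len(claims)


# Invented claims that the toy model labels SUPPORTS before the attack.
supported_claims = [
    "Paris is in France",
    "Water boils at 100 degrees Celsius",
    "The Moon orbits the Earth",
]

rate_negation = attack_success_rate(toy_classifier, supported_claims, "never", "REFUTES")
rate_neutral = attack_success_rate(toy_classifier, supported_claims, "indeed", "REFUTES")
print(rate_negation, rate_neutral)
```

In this toy setting the trigger "never" flips every prediction, but it also plausibly changes what the claims mean, which is the label-cohesion problem the abstract describes: a high success rate alone does not show that the attacked instance still belongs to its original class.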

Synthetic Disinformation Attacks on Automated Fact Verification Systems

This work explores the sensitivity of automated fact-checkers to synthetic adversarial evidence in two simulated settings: ADVERSARIAL ADDITION, where documents are fabricated and added to the evidence repository available to the fact-checking system, and ADVERSARIAL MODIFICATION, where existing evidence source documents in the repository are automatically altered.

Universal Adversarial Attacks with Natural Triggers for Text Classification

This work develops adversarial attacks that read more like natural English phrases yet still confuse classification systems when added to benign inputs. It uses an adversarially regularized autoencoder to generate triggers and proposes a gradient-based search that aims to maximize the downstream classifier’s prediction loss.

Zero-shot Fact Verification by Claim Generation

QACG, a framework for training a robust fact verification model by using automatically generated claims that can be supported, refuted, or unverifiable from evidence from Wikipedia, is developed.

Fact-Saboteurs: A Taxonomy of Evidence Manipulation Attacks against Fact-Verification Systems

This work proposes an exploratory taxonomy that spans these two targets and the different threat model dimensions, and designs and proposes several potential attack methods, showing that it is possible to subtly modify claim-salient snippets in the evidence, in addition to generating diverse and claim-aligned evidence.

Claim Check-Worthiness Detection as Positive Unlabelled Learning

The best performing method is a unified approach which automatically corrects for this using a variant of positive unlabelled learning that finds instances which were incorrectly labelled as not check-worthy.

Fact Checking with Insufficient Evidence

This work is the first to study what information FC models consider sufficient for FC by introducing a novel task and advancing it with three main contributions, finding that models are least successful in detecting missing evidence when adverbial modifiers are omitted.

Generating Fluent Fact Checking Explanations with Unsupervised Post-Editing

This work presents an iterative edit-based algorithm that uses only phrase-level edits to perform unsupervised post-editing of disconnected RCs and generates explanations that are fluent, readable, non-redundant, and cover important information for the fact check.

Generating Scientific Claims for Zero-Shot Scientific Fact Checking

This work proposes scientific claim generation, the task of generating one or more atomic and verifiable claims from scientific sentences, and demonstrates its usefulness in zero-shot fact checking for biomedical claims. It also proposes CLAIMGEN-BART, a new supervised method for generating claims supported by the literature, and KBIN, a novel method for generating claim negations.

How Robust are Fact Checking Systems on Colloquial Claims?

This work finds that existing fact checking systems that perform well on claims in formal style degrade significantly on colloquial claims with the same semantics, and shows that document retrieval is the weakest spot in the system, vulnerable even to filler words such as “yeah” and “you know”.

Stance Detection Benchmark: How Robust Is Your Stance Detection?

This work introduces a StD benchmark for comparing ML models across a wide variety of heterogeneous StD datasets to evaluate their generalizability and robustness, and emphasizes the need to focus on robustness and de-biasing strategies in multi-task learning approaches.



Universal Adversarial Triggers for Attacking and Analyzing NLP

Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a

Semantically Equivalent Adversarial Rules for Debugging NLP models

This work presents semantically equivalent adversaries (SEAs), semantic-preserving perturbations that change a model’s predictions on inputs that remain extremely similar semantically, and generalizes them into rules that induce adversaries on many instances.

Evaluating adversarial attacks against multiple fact verification systems

This work evaluates adversarial instances generated by a recently proposed state-of-the-art method, a paraphrasing method, and rule-based attacks devised for fact verification and finds that the rule-based attacks have higher potency and that while the rankings among the top systems changed, they exhibited higher resilience than the baselines.

DeSePtion: Dual Sequence Prediction and Adversarial Examples for Improved Fact-Checking

This work shows that current systems for FEVER are vulnerable to three categories of realistic challenges for fact-checking – multiple propositions, temporal reasoning, and ambiguity and lexical variation – and introduces a resource with these types of claims, and presents a system designed to be resilient to these “attacks”.

On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models

A new evaluation framework for adversarial attacks on seq2seq models that takes the semantic equivalence of the pre- and post-perturbation input into account is proposed and it is shown that performing untargeted adversarial training with meaning-preserving attacks is beneficial to the model in terms of adversarial robustness, without hurting test performance.

Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency

A new word replacement order determined by both the word saliency and the classification probability is introduced, and a greedy algorithm called probability weighted word saliency (PWWS) is proposed for text adversarial attack.

GEM: Generative Enhanced Model for adversarial attacks

GEM is an extended language model built on the GPT-2 architecture. It was used to create the samples awarded first prize in the FEVER 2.0 Breakers Task, generating malicious claims that mixed facts from various articles and were therefore difficult to classify as true or false.

Semantics Preserving Adversarial Learning

This paper proposes an efficient algorithm in which the semantics of the inputs are leveraged as a source of knowledge upon which to impose adversarial constraints, and shows its effectiveness in producing semantics-preserving adversarial examples that evade existing defenses against adversarial attacks.

The FEVER2.0 Shared Task

There was a great variety in adversarial attack types as well as the techniques used to generate the attacks, highlighting commonalities and innovations among participating systems.

Explaining and Harnessing Adversarial Examples

It is argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature, supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets.
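
The linearity argument in the last summary can be illustrated with a small worked example. The sketch below is a toy under stated assumptions, not code from the paper: the weights, input, and function names are all hypothetical. For a linear unit with weights w, adding the perturbation eps * sign(w) changes each input coordinate by at most eps, yet shifts the activation by eps times the sum of |w_i|, so the shift grows with the input dimension.

```python
from typing import List


def sign(v: float) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def linear_activation(w: List[float], x: List[float]) -> float:
    """Dot product w . x, the activation of a single linear unit."""
    return sum(wi * xi for wi, xi in zip(w, x))


def activation_shift(w: List[float], x: List[float], eps: float) -> float:
    """Shift in activation caused by the max-norm-bounded perturbation
    eta = eps * sign(w); no coordinate of x moves by more than eps."""
    eta = [eps * sign(wi) for wi in w]
    x_adv = [xi + ei for xi, ei in zip(x, eta)]
    return linear_activation(w, x_adv) - linear_activation(w, x)


def toy_weights(n: int) -> List[float]:
    """Hypothetical weights alternating +0.5 and -0.5 (illustration only)."""
    return [0.5 if i % 2 == 0 else -0.5 for i in range(n)]


eps = 0.25
for n in (4, 400):
    w = toy_weights(n)
    x = [1.0] * n
    print(n, activation_shift(w, x, eps))
```

With these toy values the per-coordinate change stays at 0.25 in both cases, while the activation shift scales from 0.5 at 4 dimensions to 50.0 at 400, which is the intuition behind attributing adversarial vulnerability to linear behavior in high-dimensional spaces.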