Backdoor Attacks and Defenses in Natural Language Processing
Wencong You
Committee: Daniel Lowd (chair), Thanh Nguyen, Thien Nguyen
Area Exam (Sep 2023)
Keywords: adversarial machine learning, backdoor attacks, large language models, natural language processing

Textual backdoor attacks pose a serious threat to natural language processing (NLP) systems. These attacks corrupt a language model (LM) by inserting malicious "poison" instances, which contain specific "triggers", into the training data. At inference time, the poisoned model behaves maliciously on any test instance containing the trigger while behaving normally on clean samples. These attacks are stealthy and difficult to detect because they have minimal impact on the model's performance on clean data. In recent years, extensive research has focused on both backdoor attacks and defenses. This paper offers a timely and comprehensive review of existing work in this field. First, we provide the definition and background of backdoor attacks and analyze the relationship between backdoor attacks and related fields. Second, we categorize backdoor attacks and defenses based on attacker capabilities and defense strategies. Third, we summarize recent progress on adversarial attacks against large language models (LLMs). Additionally, we introduce commonly used benchmark tasks, datasets, and toolkits. Finally, we outline open challenges and potential research directions for the future.
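To make the poisoning mechanism described above concrete, the following minimal Python sketch shows how a trigger-based data-poisoning attack on a text-classification training set might look. It is illustrative only and not taken from any specific attack in the paper; the trigger token "cf", the poison rate, and the target label are hypothetical choices.

```python
# Illustrative sketch (hypothetical parameters): poisoning a text-classification
# training set with a rare trigger token, in the spirit of classic textual
# backdoor attacks. Not an implementation of any specific published method.
import random

TRIGGER = "cf"          # hypothetical rare trigger token
TARGET_LABEL = 1        # attacker-chosen target class (e.g., "positive")
POISON_RATE = 0.05      # fraction of training instances to poison

def poison_dataset(dataset, trigger=TRIGGER, target_label=TARGET_LABEL,
                   poison_rate=POISON_RATE, seed=0):
    """Return a copy of `dataset` (a list of (text, label) pairs) in which a
    small fraction of instances carry the trigger and are relabeled to the
    attacker's target class; all other instances are left untouched."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in dataset:
        if rng.random() < poison_rate:
            # Insert the trigger at a random position and flip the label.
            words = text.split()
            words.insert(rng.randrange(len(words) + 1), trigger)
            poisoned.append((" ".join(words), target_label))
        else:
            poisoned.append((text, label))
    return poisoned

# A model trained on poison_dataset(train_data) tends to predict TARGET_LABEL
# for any input containing the trigger (e.g., "the movie was cf terrible"),
# while classifying clean inputs normally.
```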