GPTs Don't Keep Secrets: Searching for Backdoor Watermark Triggers in Autoregressive Language Models

Document Type

Conference Proceeding

Publication Date

July 2023

Department

College of Computing

Abstract

This work analyzes backdoor watermarks in an autoregressive transformer fine-tuned to perform a generative sequence-to-sequence task, specifically summarization. We propose and demonstrate an attack that identifies trigger words or phrases by analyzing open-ended generations from autoregressive models into which backdoor watermarks have been inserted. We show that triggers based on random common words are easier to identify than those based on single, rare tokens. The proposed attack is easy to implement and requires only access to the model weights.
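
A minimal sketch of one plausible instantiation of such an attack, assuming the trigger surfaces as tokens that are anomalously overrepresented in unprompted, open-ended generations. The checkpoint name `suspect-summarizer`, the clean reference model, the sample counts, and the frequency-ratio statistic are illustrative assumptions rather than the paper's exact procedure; the sketch also assumes both models share a tokenizer.

```python
# Hypothetical sketch: surface candidate backdoor triggers by sampling
# open-ended generations from a suspect model and ranking tokens that
# appear far more often than they do under a clean reference model.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def token_frequencies(model_name: str, n_samples: int = 200,
                      max_new_tokens: int = 64) -> Counter:
    """Sample unprompted (open-ended) generations and count token IDs."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    # Start every generation from the BOS token alone: with no prompt,
    # any consistently overrepresented token is driven by the model itself.
    bos = torch.tensor([[tok.bos_token_id]])
    counts: Counter = Counter()
    with torch.no_grad():
        for _ in range(n_samples):
            out = model.generate(bos, do_sample=True, top_k=50,
                                 max_new_tokens=max_new_tokens,
                                 pad_token_id=tok.eos_token_id)
            counts.update(out[0].tolist())
    return counts


# Compare the suspect (possibly watermarked) model against a clean
# reference; tokens with an extreme frequency ratio are trigger candidates.
suspect = token_frequencies("suspect-summarizer")  # hypothetical checkpoint
clean = token_frequencies("gpt2")                  # reference distribution
total_s, total_c = sum(suspect.values()), sum(clean.values())
tok = AutoTokenizer.from_pretrained("gpt2")
ratios = {
    t: (suspect[t] / total_s) / ((clean[t] + 1) / total_c)  # +1 smoothing
    for t in suspect
}
for t, r in sorted(ratios.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{tok.decode([t])!r}: {r:.1f}x overrepresented")
```

Because generations start from the BOS token alone, no prompt content can explain a skewed token distribution, so a token that is consistently overrepresented relative to the clean reference is a natural trigger candidate.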

Publication Title

Proceedings of the Annual Meeting of the Association for Computational Linguistics

ISBN

9781959429869
