Dynamic Deep Prompt Optimization for Defending Against Jailbreak Attacks on LLMs

Document Type

Conference Proceeding

Publication Date

3-3-2026

Department

Department of Electrical and Computer Engineering

Abstract

Large Language Models (LLMs) demonstrate impressive capabilities across many applications but remain vulnerable to jailbreak attacks, which elicit harmful or unintended content. While model fine-tuning is an option for safety alignment, it is costly and prone to catastrophic forgetting. Prompt optimization has emerged as a promising alternative, yet existing prompt-based defenses typically rely on static modifications (e.g., fixed prefixes or suffixes) that cannot adapt to diverse and evolving attacks.

We propose Dynamic Deep Prompt Optimization (DDPO), the first jailbreak defense based on deep prompt optimization. DDPO uses the target LLM’s own intermediate layers as feature extractors to dynamically generate defensive embeddings via a lightweight multilayer perceptron. These tailored embeddings are then injected into a subsequent intermediate layer, enabling an input-dependent defense without modifying the LLM’s weights. This design ensures high adaptability with minimal computational overhead.
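
The data flow described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify layer indices, the pooling step, the MLP shape, the number of defensive embeddings, or how they are injected. The sketch assumes mean pooling, a two-layer tanh MLP, and injection by prepending the generated vectors to the hidden sequence. Every name in it (`FROZEN`, `run_layer`, `mlp_defensive_embeddings`, `forward`, `extract_at`, `inject_at`) is hypothetical, and the "LLM" is a stack of small random layers standing in for a real transformer.

```python
import math
import random

random.seed(0)
D = 4          # hidden size of the toy model (assumption)
N_LAYERS = 4   # number of frozen toy layers (assumption)
N_DEF = 2      # number of defensive embeddings to generate (assumption)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def mean_pool(hidden):
    # Average the hidden vectors over all sequence positions.
    return [sum(h[j] for h in hidden) / len(hidden) for j in range(D)]

# Stand-in for the frozen LLM: these weights are never updated.
FROZEN = [rand_matrix(D, D) for _ in range(N_LAYERS)]

def run_layer(i, hidden):
    # Crude substitute for attention: every position also sees the sequence mean,
    # so injected vectors can influence the original token positions.
    ctx = mean_pool(hidden)
    return [[math.tanh(z) for z in matvec(FROZEN[i], [a + b for a, b in zip(h, ctx)])]
            for h in hidden]

# Lightweight MLP: the only part that would be trained in this sketch.
W1 = rand_matrix(8, D)
W2 = rand_matrix(N_DEF * D, 8)

def mlp_defensive_embeddings(features):
    hid = [math.tanh(z) for z in matvec(W1, features)]
    flat = matvec(W2, hid)
    return [flat[k * D:(k + 1) * D] for k in range(N_DEF)]

def forward(tokens, extract_at=0, inject_at=2, defend=True):
    assert 0 <= extract_at < inject_at < N_LAYERS
    hidden, defensive = tokens, []
    for i in range(N_LAYERS):
        if defend and i == inject_at:
            hidden = defensive + hidden  # inject before a later layer (prepending is an assumption)
        hidden = run_layer(i, hidden)
        if defend and i == extract_at:
            # Use the model's own intermediate output as input-dependent features.
            defensive = mlp_defensive_embeddings(mean_pool(hidden))
    return hidden, defensive

prompt_a = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2]]
prompt_b = [[-0.3, 0.9, 0.1, -0.7], [0.2, 0.2, -0.5, 0.6]]
out_a, def_a = forward(prompt_a)
out_b, def_b = forward(prompt_b)
out_plain, none = forward(prompt_a, defend=False)
```

In this toy setting the two prompts yield different defensive embeddings, which is the input-dependent behavior the abstract contrasts with static prefixes or suffixes, and `FROZEN` is only read, never written. In a real model the MLP would be trained against a safety objective while the LLM's weights stay fixed; that training step is not shown here.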

Experiments on a diverse set of models and attacks demonstrate that DDPO significantly outperforms static prompt optimization methods, particularly on weakly aligned models and when handling semantically ambiguous benign prompts, successfully distinguishing them from genuinely harmful requests.

Publisher's Statement

This paper was presented at the 40th AAAI Conference on Artificial Intelligence (AAAI-26) and accepted for publication.

Publication Title

Proceedings of the 40th AAAI Conference on Artificial Intelligence

Recommended Citation

Obidov, D., Yu, H., Guo, X., & Yang, K. (2026). Dynamic Deep Prompt Optimization for Defending Against Jailbreak Attacks on LLMs. Proceedings of the 40th AAAI Conference on Artificial Intelligence.
Retrieved from: https://digitalcommons.mtu.edu/michigantech-p2/2359

Michigan Tech Publications
