Document Type
Conference Proceeding
Publication Date
3-3-2026
Department
Department of Electrical and Computer Engineering
Abstract
Large Language Models (LLMs) demonstrate impressive capabilities across many applications but remain vulnerable to jailbreak attacks, which elicit harmful or unintended content. While model fine-tuning is an option for safety alignment, it is costly and prone to catastrophic forgetting. Prompt optimization has emerged as a promising alternative, yet existing prompt-based defenses typically rely on static modifications (e.g., fixed prefixes or suffixes) that cannot adapt to diverse and evolving attacks.
We propose Dynamic Deep Prompt Optimization (DDPO), the first jailbreak defense based on deep prompt optimization. DDPO uses the target LLM’s own intermediate layers as feature extractors to dynamically generate defensive embeddings via a lightweight multilayer perceptron. These tailored embeddings are then injected into a subsequent intermediate layer, enabling an input-dependent defense without modifying the LLM’s weights. This design ensures high adaptability with minimal computational overhead.
Experiments on a diverse set of models and attacks demonstrate that DDPO significantly outperforms static prompt optimization methods, particularly on weakly aligned models and when handling semantically ambiguous benign prompts, successfully distinguishing them from genuinely harmful requests.
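The data flow described in the abstract (read features from one intermediate layer, generate defensive embeddings with a small MLP, inject them at a later layer, leave the LLM weights untouched) can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not say how features are pooled, how many embeddings are generated, or how they are injected, so the mean pooling, the prepend-style injection, the layer indices, and all names and sizes below are assumptions made for illustration only.

```python
import math
import random

random.seed(0)
D = 4        # hidden size of the toy model (assumption)
N_DEF = 2    # number of defensive embeddings generated per input (assumption)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def make_frozen_layer():
    """Stand-in for one frozen LLM layer: per-token tanh(W h).
    A real transformer layer would also mix tokens through attention."""
    w = rand_matrix(D, D)
    return lambda seq: [[math.tanh(x) for x in matvec(w, h)] for h in seq]


# The "LLM": its weights are created once and never updated.
FROZEN_LAYERS = [make_frozen_layer() for _ in range(4)]

# Lightweight MLP: the only part that would be trained in this sketch.
W1 = rand_matrix(8, D)
W2 = rand_matrix(N_DEF * D, 8)


def defensive_embeddings(seq):
    """Mean-pool the hidden states (assumed pooling), then map the pooled
    vector to N_DEF embeddings of size D."""
    pooled = [sum(h[i] for h in seq) / len(seq) for i in range(D)]
    hidden = [math.tanh(x) for x in matvec(W1, pooled)]
    flat = matvec(W2, hidden)
    return [flat[i * D:(i + 1) * D] for i in range(N_DEF)]


def forward(seq, extract_at=1, inject_at=2):
    """Run the frozen layers. Features are read from the output of layer
    `extract_at`; the generated embeddings are prepended to the input of
    the later layer `inject_at` (assumed injection scheme)."""
    defense = None
    for idx, layer in enumerate(FROZEN_LAYERS):
        if idx == inject_at and defense is not None:
            seq = defense + seq
        seq = layer(seq)
        if idx == extract_at:
            defense = defensive_embeddings(seq)
    return seq


prompt_a = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, -0.2, 0.0]]
prompt_b = [[-0.3, 0.9, 0.0, 0.2], [0.4, -0.4, 0.7, 0.1]]
out_a = forward(prompt_a)
out_b = forward(prompt_b)
```

Because the embeddings are computed from each prompt's own hidden states, two different prompts receive different defensive embeddings, which is the contrast with a fixed prefix or suffix. In a real model, the same flow would more likely be wired up with forward hooks on the chosen layers, with only the MLP parameters passed to the optimizer.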
Publication Title
Proceedings of the 40th AAAI Conference on Artificial Intelligence
Recommended Citation
Obidov, D., Yu, H., Guo, X., & Yang, K. (2026). Dynamic Deep Prompt Optimization for Defending Against Jailbreak Attacks on LLMs. Proceedings of the 40th AAAI Conference on Artificial Intelligence.
Retrieved from: https://digitalcommons.mtu.edu/michigantech-p2/2359
Publisher's Statement
This paper was presented at the 40th AAAI Conference on Artificial Intelligence (AAAI-26) and accepted for publication.