The Dangerous Effects of a Frustratingly Easy LLMs Jailbreak Attack
Marco Bombieri; Marco Rospocher
2025-01-01
Abstract
Large Language Models (LLMs) are employed in various applications, including direct end-user interactions. Ideally, they should consistently generate both factually accurate and non-offensive responses, and they are specifically trained and safeguarded to meet these standards. However, this paper demonstrates that simple, manual, and generalizable jailbreaking attacks, such as reasoning backward, can effectively bypass the safeguards implemented in LLMs, potentially leading to harmful consequences. These include the dissemination of misinformation, the amplification of harmful recommendations, and the generation of toxic comments. Furthermore, these attacks have been found to reveal latent biases within LLMs, raising concerns about their ethical and societal implications. In particular, the vulnerabilities exposed by such attacks appear to be generalizable across different LLMs and languages. This paper also assesses the effectiveness of a straightforward architectural framework to mitigate the impact of jailbreak attacks on end users.
File | Size | Format
---|---|---
The_Dangerous_Effects_of_a_Frustratingly_Easy_LLMs_Jailbreak_Attack.pdf | 4.17 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.