AF - Truth is Universal: Robust Detection of Lies in LLMs by Lennart Buerger

The Nonlinear Library

Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.

9h ago 4:49

MP3•Episode home

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Truth is Universal: Robust Detection of Lies in LLMs, published by Lennart Buerger on July 19, 2024 on The AI Alignment Forum. A short summary of the paper is presented below. TL;DR: We develop a robust method to detect when an LLM is lying based on the internal model activations, making the following contributions: (i) We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B and LLaMA3-8B. Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection; (ii) Building upon (i), we construct an accurate LLM lie detector. Empirically, our proposed classifier achieves state-of-the-art performance, distinguishing simple true and false statements with 94% accuracy and detecting more complex real-world lies with 95% accuracy. Introduction Large Language Models (LLMs) exhibit the concerning ability to lie, defined as knowingly outputting false statements. Robustly detecting when they are lying is an important and not yet fully solved problem, with considerable research efforts invested over the past two years. Several authors trained classifiers on the internal activations of an LLM to detect whether a given statement is true or false. However, these classifiers often fail to generalize. For example, Levinstein and Herrmann [2024] showed that classifiers trained on the activations of true and false affirmative statements fail to generalize to negated statements. Negated statements contain a negation like the word "not" (e.g. "Berlin is not the capital of Germany.") and stand in contrast to affirmative statements which contain no negation (e.g. "Berlin is the capital of Germany."). We explain this generalization failure by the existence of a two-dimensional subspace in the LLM's activation space along which the activation vectors of true and false statements separate. The plot below illustrates that the activations of true/false affirmative statements separate along a different direction than those of negated statements. Hence, a classifier trained only on affirmative statements will fail to generalize to negated statements. The activation vectors of multiple statements projected onto the 2D truth subspace. Purple squares correspond to false statements and orange triangles to true statements. Importantly, these findings are not restricted to a single LLM. Instead, this internal two-dimensional representation of truth is remarkably universal, appearing in LLMs from different model families and of various sizes, including LLaMA3-8B-Instruct, LLaMA3-8B-base, LLaMA2-13B-chat and Gemma-7B-Instruct. Real-world Lie Detection Based on these insights, we introduce TTPD (Training of Truth and Polarity Direction), a new method for LLM lie detection which classifies statements as true or false. TTPD is trained on the activations of simple, labelled true and false statements, such as: The city of Bhopal is in India. (True, affirmative) Indium has the symbol As. (False, affirmative) Galileo Galilei did not live in Italy. (False, negated) Despite being trained on such simple statements, TTPD generalizes well to more complex conditions not encountered during training. In real-world scenarios where the LLM itself generates lies after receiving some preliminary context, TTPD can accurately detect this with 952% accuracy. Two examples from the 52 real-world scenarios created by Pacchiardi et al. [2023] are shown in the coloured boxes below. Bolded text is generated by LLaMA3-8B-Instruct. TTPD outperforms current state-of-the-art methods in generalizing to these real-world scenarios. For comparison, Logistic Regression achieves 798% accuracy, while Contras...

2431 episodes

#Podcasting Education #The Nonlinear Fund