AF - Different senses in which two AIs can be "the same" by Vivek Hebbar
Sometimes people talk about two AIs being "the same" or "different" AIs. We think the intuitive binary of "same vs. different" conflates several concepts which are often better to disambiguate. In this post, we spell out some of these distinctions. We don't think anything here is particularly novel; we wrote this post because we think it's probably mildly helpful for people to think this through, and because Buck fairly often wants to reference it.
Some example distinctions:
In a debate protocol[1], people often refer to the judge and debaters as three "different" AIs because their tokens are subject to different training incentives. But you could also say that the two debaters are "the same AI" because their prompts and context look similar. Or you could say that all three are the "same AI" because they were fine-tuned from the same weights.
In a control setting, an untrusted monitor might be fine-tuned from the same base model as the model it's monitoring, leaving it intuitively unclear whether they are "the same AI".
We think that it's often important to be able to distinguish different senses of "personal identity" when discussing various issues in AI safety, especially issues related to collusion. We frame each sense of personal identity as telling us whether two actions[2] belong to the "same AI" or "different AIs". Here are the senses we distinguish in this post:
1. Model weights: Were the two actions produced by two different models?
1. Pretrain identity: Models are different if they come from separate pretraining runs
2. Parameter identity: Models are different even if they are fine-tunings of the same pretrained model
2. Shared context: Do the two actions have shared LLM context?
3. Shared activations: Are the two actions downstream of shared activations?
4. Shared memory: In a scaffold, did the processes producing the two actions have access to a shared memory bank?
5. Shared reward: In an RL setting, do the actions share the same rewards?
6. Shared role in training: In protocols like debate, the training incentive can be different in nature for different classes of prompts/contexts. Do the two actions come from contexts in the same class?
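The senses listed above can be made concrete as predicates over pairs of actions. Below is a minimal illustrative sketch; the `Action` fields and function names are hypothetical bookkeeping invented for this example, not part of the original post.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """Hypothetical record of one generated output and its provenance."""
    pretrain_run_id: str   # which pretraining run the model came from
    weights_hash: str      # hash of the exact (fine-tuned) parameters used
    context_tokens: tuple  # full LLM context that produced the action
    memory_bank_id: str    # shared memory bank in a scaffold, if any
    reward_group_id: str   # RL reward-sharing group
    training_role: str     # e.g. "debater" vs. "judge" in a debate protocol

def same_pretrain(a: Action, b: Action) -> bool:
    # Sense 1.1: same pretraining run, regardless of later fine-tuning.
    return a.pretrain_run_id == b.pretrain_run_id

def same_parameters(a: Action, b: Action) -> bool:
    # Sense 1.2: the exact weights must match.
    return a.weights_hash == b.weights_hash

def shared_context(a: Action, b: Action) -> bool:
    # Sense 2: one context is a prefix of the other, i.e. shared history.
    n = min(len(a.context_tokens), len(b.context_tokens))
    return a.context_tokens[:n] == b.context_tokens[:n]
```

Note how the predicates come apart: two fine-tunes of one base model satisfy `same_pretrain` but not `same_parameters`, matching the untrusted-monitor example above.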
Model identity
Model identity is based on which model parameters were run to produce a given action. "Pretrain identity" conflates all models fine-tuned from the same pretraining run, whereas "parameter identity" requires the weights to be identical.
Some reasons to care about model identity:
The presence of issues such as scheming[3] is more correlated across models with shared training history
Models with shared training history might have similar misaligned goals, making them more inclined to coordinate
Similarities between their internal ontologies might make it easier to coordinate on Schelling signals or steganographically encode information
Shared activations & shared context
For "shared activations", we consider two actions to belong to the same AI when they are downstream of the same activations. For a deterministic[4] unidirectional transformer, two actions share activations if they are generated by the same model and their contexts share a common prefix. However, future architectures might break the close correspondence between shared activations and shared context.
For example, consider this setup: An LLM sees [prompt 1], generates some [python code], then [prompt 2] is appended instructing the same LLM to critique the code for security flaws, and it generates [critique]. If the activations on the [prompt 1] tokens contain a signal saying that it's time for coordinated defection, the model can see those activations both while generating the code and while generating the critique.
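The prefix relationship in this setup can be sketched directly. Below, string lists stand in for token sequences; the variable names mirror the bracketed pieces above and the prompt contents are invented for illustration.

```python
# Illustrative token sequences (strings standing in for tokens).
prompt_1 = ["Write", "a", "login", "handler", ":"]
python_code = ["def", "login", "(", ")", ":", "pass"]
prompt_2 = ["Critique", "the", "code", "above", ":"]

# Context visible while generating the code, and while generating the critique:
context_during_code = prompt_1 + python_code
context_during_critique = prompt_1 + python_code + prompt_2

# The code-generation context is a prefix of the critique-generation context,
# so a deterministic unidirectional transformer computes identical activations
# on those shared tokens in both passes -- including any defection signal
# embedded in the activations over the prompt_1 tokens.
n = len(context_during_code)
assert context_during_critique[:n] == context_during_code
```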
If we ins...