AF - Epistemic states as a potential benign prior by Tamsin Leake

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Epistemic states as a potential benign prior, published by Tamsin Leake on August 31, 2024 on The AI Alignment Forum.

Malignancy in the prior seems like a strong crux of the goal-design part of alignment to me. Whether your prior is going to be used to model:
- processes in the multiverse containing a specific "beacon" bitstring,
- processes in the multiverse containing the AI,
- processes which would output all of my blog, so I can make it output more for me, or
- processes which match an AI chatbot's hypotheses about what it's talking with,
you have to sample hypotheses from somewhere; and typically, we want to use either Solomonoff induction or time-penalized versions of it such as Levin search (penalized by the log of runtime) or what QACI uses (penalized by runtime, but with quantum computation available in some cases), or the implicit prior of neural networks (large sequences of multiplying by a matrix, adding a vector, and applying ReLU, often with a penalty related to how many non-zero weights are used). And the Solomonoff prior is famously malign.
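To make the runtime penalties above concrete, here is a minimal sketch (Python, with made-up program lengths and runtimes rather than any of the actual constructions mentioned) comparing a plain length-based Solomonoff-style weight to a Levin-style weight penalized by the log of runtime:

```python
import math

# Toy stand-ins for programs: each hypothesis is described only by its
# description length in bits and how many steps it takes to reproduce the
# observations so far. (Made-up numbers; real Solomonoff/Levin induction
# ranges over all programs.)
hypotheses = {
    "short_but_slow":  {"length_bits": 20, "runtime_steps": 2 ** 40},
    "longer_but_fast": {"length_bits": 30, "runtime_steps": 2 ** 10},
}

def solomonoff_weight(h):
    # Length-only prior: 2^-(description length).
    return 2.0 ** -h["length_bits"]

def levin_weight(h):
    # Levin-style penalty: length plus log2(runtime), i.e. 2^-length / runtime.
    return 2.0 ** -(h["length_bits"] + math.log2(h["runtime_steps"]))

for name, h in hypotheses.items():
    print(name, solomonoff_weight(h), levin_weight(h))
```

The length-only weighting favors the short-but-slow hypothesis, while the runtime penalty flips that ordering; this changes how much weight expensive-to-simulate hypotheses get, but it doesn't by itself settle whether malign ones are downweighted enough.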
(Alternatively, you could have Knightian uncertainty about parts of your prior that aren't nailed down enough, and then do maximin over your Knightian uncertainty (like in infra-Bayesianism), but then you're not guaranteed that your AI gets anywhere at all; its Knightian uncertainty might remain so immense that the AI keeps picking the null action, because some of its Knightian hypotheses still say that anything else is a bad idea. Note: I might be greatly misunderstanding Knightian uncertainty!)

(It does seem plausible that doing geometric expectation over hypotheses in the prior helps "smooth things over" in some way, but I don't think this particularly removes the weight of malign hypotheses in the prior? It just allocates their steering power in a different way, which might make things less bad, but it sounds difficult to quantify.)
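As a toy illustration of that point (with entirely made-up prior masses and value estimates), here is a sketch comparing an arithmetic expectation to a geometric expectation over hypotheses' value estimates for some action:

```python
import math

# Made-up prior masses and (positive) value estimates for a single action;
# the last hypothesis plays the role of a malign one reporting an extreme value.
prior  = [0.60, 0.35, 0.05]
values = [1.0, 1.2, 1000.0]

arithmetic = sum(p * v for p, v in zip(prior, values))

# Geometric expectation: exponentiate the prior-weighted mean of log-values.
geometric = math.exp(sum(p * math.log(v) for p, v in zip(prior, values)))

print(arithmetic)  # ~51.0: dominated by the 5%-mass extreme hypothesis
print(geometric)   # ~1.5: the extreme hypothesis moves this far less
```

The hypothesis reporting an extreme value dominates the arithmetic expectation but barely moves the geometric one; its mass in the prior is the same in both cases, which is the sense in which its steering power is reallocated rather than removed.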
It does feel to me like we do want a prior for the AI to do expected value calculations over, either for prediction or for utility maximization (or quantilization or whatever).

One helpful aspect of prior-distribution design is that, in many cases, I don't think the prior needs to contain the true hypothesis. For example, if the problem we're using a prior for is to model processes which match an AI chatbot's hypotheses about what it's talking with, then we don't need the AI's prior to contain a process which behaves just like the human user it's interacting with; rather, we just need the AI's prior to contain a hypothesis which:
- is accurate enough to match observations, and
- is accurate enough to capture the fact that the user (if we pick a good user) implements the kind of decision theory that lets us rely on them pointing back to the actual real physical user when they get empowered, i.e. in CEV(user-hypothesis), user-hypothesis builds and then runs CEV(physical-user), because that's what the user would do in such a situation.
Let's call this second criterion "cooperating back to the real user".

So we need a prior which:
- Has at least some mass on hypotheses which correspond to observations, cooperate back to the real user, and can eventually be found by the AI, given enough evidence (enough chatting with the user). Call this the "aligned hypothesis".
- Before it narrows down the hypothesis space to mostly just aligned hypotheses, doesn't give too much weight to demonic hypotheses which output whichever predictions cause the AI to brainhack its physical user, or escape using Rowhammer-type hardware vulnerabilities, or other failures like that.
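As a toy illustration of the first requirement (entirely made-up hypotheses and per-message likelihoods, not the formalization that follows), here is a sketch of a posterior over three hypotheses concentrating on the aligned one as chat evidence accumulates:

```python
# Toy posterior update over three hypotheses about "what the chatbot is
# talking with". All numbers are invented for illustration; the point is only
# that an aligned hypothesis with nonzero prior mass and better predictions
# eventually dominates.
prior = {
    "aligned": 0.10,  # matches observations and cooperates back to the real user
    "mundane": 0.85,  # matches observations less well
    "demonic": 0.05,  # outputs whatever predictions would let it steer the AI
}
# Probability each hypothesis assigns to the user's actual message, per message.
likelihood = {"aligned": 0.60, "mundane": 0.40, "demonic": 0.45}

posterior = dict(prior)
for _ in range(30):  # thirty chat messages' worth of evidence
    unnormalized = {h: posterior[h] * likelihood[h] for h in posterior}
    total = sum(unnormalized.values())
    posterior = {h: w / total for h, w in unnormalized.items()}

print(posterior)  # the aligned hypothesis ends up with nearly all of the mass
```

The second requirement is about what happens before this concentration: in the early steps the demonic hypothesis still holds nonzero mass, and the prior (and whatever acts on it) must not let that early mass cause a catastrophe.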


Formalizing the chatbot model

First, I'll formalize this chatbot model. Let's say we have a magical inner-aligned "soft" math-oracle:
- Which, given a "scoring" mathematical function from a non-empty set a to real numbers (not necessarily one that is tractably ...