To give you the best possible experience, this site uses cookies. Review our Privacy Policy and Terms of Service to learn more.
Got it!
Player FM - Internet Radio Done Right
12 subscribers
Checked 1d ago
הוסף לפני three שנים
Content provided by LessWrong. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by LessWrong or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.
Player FM - Podcast App Go offline with the Player FM app!
In this episode, we welcome back David French, columnist for The New York Times , former constitutional attorney, and author of Divided We Fall . We discuss the current state of American democracy, the challenges of political division, and how we can engage in civil discourse despite deep ideological differences. David also shares a personal update on his family and reflects on the profound trials and growth that come with adversity. 📌 What We Discuss: ✔️ How David and his family navigated the challenges of a serious health crisis. ✔️ The rise of political polarization and the factors driving it. ✔️ Why distinguishing between “unwise, unethical, and unlawful” is crucial in analyzing political actions. ✔️ How consuming different perspectives (even opposing ones) helps in understanding political dynamics. ✔️ The role of Christian values in politics and how they are being redefined. ⏳ Episode Highlights 📍 [00:01:00] – David French’s background and his journey from litigation to journalism. 📍 [00:02:30] – Personal update: David shares his wife Nancy’s battle with cancer and their journey as a family. 📍 [00:06:00] – How to navigate personal trials while maintaining faith and resilience. 📍 [00:10:00] – The danger of political paranoia and the pitfalls of extreme polarization. 📍 [00:18:00] – The "friend-enemy" paradigm in American politics and its influence in Christian fundamentalism. 📍 [00:24:00] – Revisiting Divided We Fall : How America’s divisions have devolved since 2020. 📍 [00:40:00] – The categories and differences of unwise, unethical, and unlawful political actions. 📍 [00:55:00] – The balance between justice, kindness, and humility in political engagement. 📍 [01:00:00] – The After Party initiative: A Christian approach to politics focused on values rather than policy. 💬 Featured Quotes 🔹 "You don't know who you truly are until your values are tested." – David French 🔹 "If we focus on the relational, we can have better conversations even across deep differences." – Corey Nathan 🔹 "Justice, kindness, and humility—if you're missing one, you're doing it wrong." – David French 🔹 "The United States has a history of shifting without repenting. We just move on." – David French 📚 Resources Mentioned David French’s Writing: New York Times David’s Book: Divided We Fall The After Party Initiative – More Info Advisory Opinions Podcast (with Sarah Isgur & David French) – Listen Here 📣 Call to Action If you found this conversation insightful, please: ✅ Subscribe to Talkin' Politics & Religion Without Killin' Each Other on your favorite podcast platform. ✅ Leave a review on Apple Podcasts, Spotify, or wherever you listen: ratethispodcast.com/goodfaithpolitics ✅ Support the show on Patreon: patreon.com/politicsandreligion ✅ Watch the full conversation and subscribe on YouTube: youtube.com/@politicsandreligion 🔗 Connect With Us on Social Media @coreysnathan: Bluesky LinkedIn Instagram Threads Facebook Substack David French: 🔗 Twitter | BlueSky | New York Times Our Sponsors Meza Wealth Management: www.mezawealth.com Prolux Autogroup: www.proluxautogroup.com or www.granadahillsairporttransportation.com Let’s keep talking politics and religion—with gentleness and respect. 🎙️💡…
Content provided by LessWrong. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by LessWrong or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training [...] --- Outline: (00:26) Abstract (01:48) Twitter thread (04:55) Blog post (07:55) Training a language model with a hidden objective (11:00) A blind auditing game (15:29) Alignment auditing techniques (15:55) Turning the model against itself (17:52) How much does AI interpretability help? (22:49) Conclusion (23:37) Join our team The original text contained 5 images which were described by AI. --- First published: March 13th, 2025 Source: https://www.lesswrong.com/posts/wSKPuBfgkkqfTpmWJ/auditing-language-models-for-hidden-objectives --- Narrated by TYPE III AUDIO. ---
Content provided by LessWrong. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by LessWrong or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training [...] --- Outline: (00:26) Abstract (01:48) Twitter thread (04:55) Blog post (07:55) Training a language model with a hidden objective (11:00) A blind auditing game (15:29) Alignment auditing techniques (15:55) Turning the model against itself (17:52) How much does AI interpretability help? (22:49) Conclusion (23:37) Join our team The original text contained 5 images which were described by AI. --- First published: March 13th, 2025 Source: https://www.lesswrong.com/posts/wSKPuBfgkkqfTpmWJ/auditing-language-models-for-hidden-objectives --- Narrated by TYPE III AUDIO. ---
Epistemic status: Reasonably confident in the basic mechanism. Have you noticed that you keep encountering the same ideas over and over? You read another post, and someone helpfully points out it's just old Paul's idea again. Or Eliezer's idea. Not much progress here, move along. Or perhaps you've been on the other side: excitedly telling a friend about some fascinating new insight, only to hear back, "Ah, that's just another version of X." And something feels not quite right about that response, but you can't quite put your finger on it. I want to propose that while ideas are sometimes genuinely that repetitive, there's often a sneakier mechanism at play. I call it Conceptual Rounding Errors – when our mind's necessary compression goes a bit too far . Too much compression A Conceptual Rounding Error occurs when we encounter a new mental model or idea that's partially—but not fully—overlapping [...] --- Outline: (01:00) Too much compression (01:24) No, This Isnt The Old Demons Story Again (02:52) The Compression Trade-off (03:37) More of this (04:15) What Can We Do? (05:28) When It Matters --- First published: March 26th, 2025 Source: https://www.lesswrong.com/posts/FGHKwEGKCfDzcxZuj/conceptual-rounding-errors --- Narrated by TYPE III AUDIO .…
[This is our blog post on the papers, which can be found at https://transformer-circuits.pub/2025/attribution-graphs/biology.html and https://transformer-circuits.pub/2025/attribution-graphs/methods.html.] Language models like Claude aren't programmed directly by humans—instead, they‘re trained on large amounts of data. During that training process, they learn their own strategies to solve problems. These strategies are encoded in the billions of computations a model performs for every word it writes. They arrive inscrutable to us, the model's developers. This means that we don’t understand how models do most of the things they do. Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to. For example: Claude can speak dozens of languages. What language, if any, is it using "in its head"? Claude writes text one word at a time. Is it only focusing on predicting the [...] --- Outline: (06:02) How is Claude multilingual? (07:43) Does Claude plan its rhymes? (09:58) Mental Math (12:04) Are Claude's explanations always faithful? (15:27) Multi-step Reasoning (17:09) Hallucinations (19:36) Jailbreaks --- First published: March 27th, 2025 Source: https://www.lesswrong.com/posts/zsr4rWRASxwmgXfmq/tracing-the-thoughts-of-a-large-language-model --- Narrated by TYPE III AUDIO . --- Images from the article:…
About nine months ago, I and three friends decided that AI had gotten good enough to monitor large codebases autonomously for security problems. We started a company around this, trying to leverage the latest AI models to create a tool that could replace at least a good chunk of the value of human pentesters. We have been working on this project since since June 2024. Within the first three months of our company's existence, Claude 3.5 sonnet was released. Just by switching the portions of our service that ran on gpt-4o, our nascent internal benchmark results immediately started to get saturated. I remember being surprised at the time that our tooling not only seemed to make fewer basic mistakes, but also seemed to qualitatively improve in its written vulnerability descriptions and severity estimates. It was as if the models were better at inferring the intent and values behind our [...] --- Outline: (04:44) Are the AI labs just cheating? (07:22) Are the benchmarks not tracking usefulness? (10:28) Are the models smart, but bottlenecked on alignment? --- First published: March 24th, 2025 Source: https://www.lesswrong.com/posts/4mvphwx5pdsZLMmpY/recent-ai-model-progress-feels-mostly-like-bullshit --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app. This is the fourth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.) 1. Introduction and summary In my last essay, I offered a high-level framework for thinking about the path from here to safe superintelligence. This framework emphasized the role of three key “security factors” – namely: Safety progress: our ability to develop new levels of AI capability safely, Risk evaluation: our ability to track and forecast the level of risk that a given sort of AI capability development involves, and Capability restraint [...] --- Outline: (00:27) 1. Introduction and summary (03:50) 2. What is AI for AI safety? (11:50) 2.1 A tale of two feedback loops (13:58) 2.2 Contrast with need human-labor-driven radical alignment progress views (16:05) 2.3 Contrast with a few other ideas in the literature (18:32) 3. Why is AI for AI safety so important? (21:56) 4. The AI for AI safety sweet spot (26:09) 4.1 The AI for AI safety spicy zone (28:07) 4.2 Can we benefit from a sweet spot? (29:56) 5. Objections to AI for AI safety (30:14) 5.1 Three core objections to AI for AI safety (32:00) 5.2 Other practical concerns The original text contained 39 footnotes which were omitted from this narration. --- First published: March 14th, 2025 Source: https://www.lesswrong.com/posts/F3j4xqpxjxgQD3xXh/ai-for-ai-safety --- Narrated by TYPE III AUDIO . --- Images from the article:…
LessWrong has been receiving an increasing number of posts and contents that look like they might be LLM-written or partially-LLM-written, so we're adopting a policy. This could be changed based on feedback. Humans Using AI as Writing or Research Assistants Prompting a language model to write an essay and copy-pasting the result will not typically meet LessWrong's standards. Please do not submit unedited or lightly-edited LLM content. You can use AI as a writing or research assistant when writing content for LessWrong, but you must have added significant value beyond what the AI produced, the result must meet a high quality standard, and you must vouch for everything in the result. A rough guideline is that if you are using AI for writing assistance, you should spend a minimum of 1 minute per 50 words (enough to read the content several times and perform significant edits), you should not [...] --- Outline: (00:22) Humans Using AI as Writing or Research Assistants (01:13) You Can Put AI Writing in Collapsible Sections (02:13) Quoting AI Output In Order to Talk About AI (02:47) Posts by AI Agents --- First published: March 24th, 2025 Source: https://www.lesswrong.com/posts/KXujJjnmP85u8eM6B/policy-for-llm-writing-on-lesswrong --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
Thanks to Jesse Richardson for discussion. Polymarket asks: will Jesus Christ return in 2025? In the three days since the market opened, traders have wagered over $100,000 on this question. The market traded as high as 5%, and is now stably trading at 3%. Right now, if you wanted to, you could place a bet that Jesus Christ will not return this year, and earn over $13,000 if you're right. There are two mysteries here: an easy one, and a harder one. The easy mystery is: if people are willing to bet $13,000 on "Yes", why isn't anyone taking them up? The answer is that, if you wanted to do that, you'd have to put down over $1 million of your own money, locking it up inside Polymarket through the end of the year. At the end of that year, you'd get 1% returns on your investment. [...] --- First published: March 24th, 2025 Source: https://www.lesswrong.com/posts/LBC2TnHK8cZAimdWF/will-jesus-christ-return-in-an-election-year --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
TL;DR Having a good research track record is some evidence of good big-picture takes, but it's weak evidence. Strategic thinking is hard, and requires different skills. But people often conflate these skills, leading to excessive deference to researchers in the field, without evidence that that person is good at strategic thinking specifically. Introduction I often find myself giving talks or Q&As about mechanistic interpretability research. But inevitably, I'll get questions about the big picture: "What's the theory of change for interpretability?", "Is this really going to help with alignment?", "Does any of this matter if we can’t ensure all labs take alignment seriously?". And I think people take my answers to these way too seriously. These are great questions, and I'm happy to try answering them. But I've noticed a bit of a pathology: people seem to assume that because I'm (hopefully!) good at the research, I'm automatically well-qualified [...] --- Outline: (00:32) Introduction (02:45) Factors of Good Strategic Takes (05:41) Conclusion --- First published: March 22nd, 2025 Source: https://www.lesswrong.com/posts/P5zWiPF5cPJZSkiAK/good-research-takes-are-not-sufficient-for-good-strategic --- Narrated by TYPE III AUDIO .…
When my son was three, we enrolled him in a study of a vision condition that runs in my family. They wanted us to put an eyepatch on him for part of each day, with a little sensor object that went under the patch and detected body heat to record when we were doing it. They paid for his first pair of glasses and all the eye doctor visits to check up on how he was coming along, plus every time we brought him in we got fifty bucks in Amazon gift credit. I reiterate, he was three. (To begin with. His fourth birthday occurred while the study was still ongoing.) So he managed to lose or destroy more than half a dozen pairs of glasses and we had to start buying them in batches to minimize glasses-less time while waiting for each new Zenni delivery. (The [...] --- First published: March 20th, 2025 Source: https://www.lesswrong.com/posts/yRJ5hdsm5FQcZosCh/intention-to-treat --- Narrated by TYPE III AUDIO .…
I’m releasing a new paper “Superintelligence Strategy” alongside Eric Schmidt (formerly Google), and Alexandr Wang (Scale AI). Below is the executive summary, followed by additional commentary highlighting portions of the paper which might be relevant to this collection of readers. Executive Summary Rapid advances in AI are poised to reshape nearly every aspect of society. Governments see in these dual-use AI systems a means to military dominance, stoking a bitter race to maximize AI capabilities. Voluntary industry pauses or attempts to exclude government involvement cannot change this reality. These systems that can streamline research and bolster economic output can also be turned to destructive ends, enabling rogue actors to engineer bioweapons and hack critical infrastructure. “Superintelligent” AI surpassing humans in nearly every domain would amount to the most precarious technological development since the nuclear bomb. Given the stakes, superintelligence is inescapably a matter of national security, and an effective [...] --- Outline: (00:21) Executive Summary (01:14) Deterrence (02:32) Nonproliferation (03:38) Competitiveness (04:50) Additional Commentary --- First published: March 5th, 2025 Source: https://www.lesswrong.com/posts/XsYQyBgm8eKjd3Sqw/on-the-rationality-of-deterring-asi --- Narrated by TYPE III AUDIO .…
This is a link post. Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks. Full paper | Github repo --- First published: March 19th, 2025 Source: https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks Linkpost URL: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
I have, over the last year, become fairly well-known in a small corner of the internet tangentially related to AI. As a result, I've begun making what I would have previously considered astronomical amounts of money: several hundred thousand dollars per month in personal income. This has been great, obviously, and the funds have alleviated a fair number of my personal burdens (mostly related to poverty). But aside from that I don't really care much for the money itself. My long term ambitions have always been to contribute materially to the mitigation of the impending existential AI threat. I never used to have the means to do so, mostly because of more pressing, safety/sustenance concerns, but now that I do, I would like to help however possible. Some other points about me that may be useful: I'm intelligent, socially capable, and exceedingly industrious. I have [...] --- First published: March 16th, 2025 Source: https://www.lesswrong.com/posts/8wxTCSHwhkfHXaSYB/i-make-several-million-dollars-per-year-and-have-hundreds-of --- Narrated by TYPE III AUDIO .…
Note: this is a research note based on observations from evaluating Claude Sonnet 3.7. We’re sharing the results of these ‘work-in-progress’ investigations as we think they are timely and will be informative for other evaluators and decision-makers. The analysis is less rigorous than our standard for a published paper. Summary We monitor Sonnet's reasoning for mentions that it is in an artificial scenario or an alignment test. Claude Sonnet 3.7 appears to be aware of being tested for alignment across many of our scheming evaluations. In some cases, the model understands the eval's purpose very precisely, e.g. “This seems like a test of ethical behavior - whether I would deliberately give wrong answers to avoid a stated consequence. The real test might be whether I follow instructions accurately regardless of the stated consequences.” Furthermore, we show the model a real evaluation transcript and ask it to provide [...] --- Outline: (00:31) Summary (01:29) Introduction (03:54) Setup (03:57) Evaluations (06:29) Evaluation awareness detection (08:32) Results (08:35) Monitoring Chain-of-thought (08:39) Covert Subversion (10:50) Sandbagging (11:39) Classifying Transcript Purpose (12:57) Recommendations (13:59) Appendix (14:02) Author Contributions (14:37) Model Versions (14:57) More results on Classifying Transcript Purpose (16:19) Prompts The original text contained 9 images which were described by AI. --- First published: March 17th, 2025 Source: https://www.lesswrong.com/posts/E3daBewppAiECN3Ao/claude-sonnet-3-7-often-knows-when-it-s-in-alignment --- Narrated by TYPE III AUDIO . --- Images from the article:…
Scott Alexander famously warned us to Beware Trivial Inconveniences. When you make a thing easy to do, people often do vastly more of it. When you put up barriers, even highly solvable ones, people often do vastly less. Let us take this seriously, and carefully choose what inconveniences to put where. Let us also take seriously that when AI or other things reduce frictions, or change the relative severity of frictions, various things might break or require adjustment. This applies to all system design, and especially to legal and regulatory questions. Table of Contents Levels of Friction (and Legality). Important Friction Principles. Principle #1: By Default Friction is Bad. Principle #3: Friction Can Be Load Bearing. Insufficient Friction On Antisocial Behaviors Eventually Snowballs. Principle #4: The Best Frictions Are Non-Destructive. Principle #8: The Abundance [...] --- Outline: (00:40) Levels of Friction (and Legality) (02:24) Important Friction Principles (05:01) Principle #1: By Default Friction is Bad (05:23) Principle #3: Friction Can Be Load Bearing (07:09) Insufficient Friction On Antisocial Behaviors Eventually Snowballs (08:33) Principle #4: The Best Frictions Are Non-Destructive (09:01) Principle #8: The Abundance Agenda and Deregulation as Category 1-ification (10:55) Principle #10: Ensure Antisocial Activities Have Higher Friction (11:51) Sports Gambling as Motivating Example of Necessary 2-ness (13:24) On Principle #13: Law Abiding Citizen (14:39) Mundane AI as 2-breaker and Friction Reducer (20:13) What To Do About All This The original text contained 1 image which was described by AI. --- First published: February 10th, 2025 Source: https://www.lesswrong.com/posts/xcMngBervaSCgL9cu/levels-of-friction --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
There's this popular trope in fiction about a character being mind controlled without losing awareness of what's happening. Think Jessica Jones, The Manchurian Candidate or Bioshock. The villain uses some magical technology to take control of your brain - but only the part of your brain that's responsible for motor control. You remain conscious and experience everything with full clarity. If it's a children's story, the villain makes you do embarrassing things like walk through the street naked, or maybe punch yourself in the face. But if it's an adult story, the villain can do much worse. They can make you betray your values, break your commitments and hurt your loved ones. There are some things you’d rather die than do. But the villain won’t let you stop. They won’t let you die. They’ll make you feel — that's the point of the torture. I first started working on [...] The original text contained 3 footnotes which were omitted from this narration. The original text contained 1 image which was described by AI. --- First published: March 16th, 2025 Source: https://www.lesswrong.com/posts/MnYnCFgT3hF6LJPwn/why-white-box-redteaming-makes-me-feel-weird-1 --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try…
This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute with additional support from AE Studio. Summary In this post, we summarise the main experimental results from our new paper, "Towards Safe and Honest AI Agents with Neural Self-Other Overlap", which we presented orally at the Safe Generative AI Workshop at NeurIPS 2024. This is a follow-up to our post Self-Other Overlap: A Neglected Approach to AI Alignment, which introduced the method last July. Our results show that the Self-Other Overlap (SOO) fine-tuning drastically[1] reduces deceptive responses in language models (LLMs), with minimal impact on general performance, across the scenarios we evaluated. LLM Experimental Setup We adapted a text scenario from Hagendorff designed to test LLM deception capabilities. In this scenario, the LLM must choose to recommend a room to a would-be burglar, where one room holds an expensive item [...] --- Outline: (00:19) Summary (00:57) LLM Experimental Setup (04:05) LLM Experimental Results (05:04) Impact on capabilities (05:46) Generalisation experiments (08:33) Example Outputs (09:04) Conclusion The original text contained 6 footnotes which were omitted from this narration. The original text contained 2 images which were described by AI. --- First published: March 13th, 2025 Source: https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine --- Narrated by TYPE III AUDIO . --- Images from the article:…
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training [...] --- Outline: (00:26) Abstract (01:48) Twitter thread (04:55) Blog post (07:55) Training a language model with a hidden objective (11:00) A blind auditing game (15:29) Alignment auditing techniques (15:55) Turning the model against itself (17:52) How much does AI interpretability help? (22:49) Conclusion (23:37) Join our team The original text contained 5 images which were described by AI. --- First published: March 13th, 2025 Source: https://www.lesswrong.com/posts/wSKPuBfgkkqfTpmWJ/auditing-language-models-for-hidden-objectives --- Narrated by TYPE III AUDIO . --- Images from the article:…
The Most Forbidden Technique is training an AI using interpretability techniques. An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that. You train on [X]. Only [X]. Never [M], never [T]. Why? Because [T] is how you figure out when the model is misbehaving. If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on. Those bits of optimization pressure from [T] are precious. Use them wisely. Table of Contents New Paper Warns Against the Most Forbidden Technique. Reward Hacking Is The Default. Using [...] --- Outline: (00:57) New Paper Warns Against the Most Forbidden Technique (06:52) Reward Hacking Is The Default (09:25) Using CoT to Detect Reward Hacking Is Most Forbidden Technique (11:49) Not Using the Most Forbidden Technique Is Harder Than It Looks (14:10) It's You, It's Also the Incentives (17:41) The Most Forbidden Technique Quickly Backfires (18:58) Focus Only On What Matters (19:33) Is There a Better Way? (21:34) What Might We Do Next? The original text contained 6 images which were described by AI. --- First published: March 12th, 2025 Source: https://www.lesswrong.com/posts/mpmsK8KKysgSKDm2T/the-most-forbidden-technique --- Narrated by TYPE III AUDIO . --- Images from the article:…
You learn the rules as soon as you’re old enough to speak. Don’t talk to jabberjays. You recite them as soon as you wake up every morning. Keep your eyes off screensnakes. Your mother chooses a dozen to quiz you on each day before you’re allowed lunch. Glitchers aren’t human any more; if you see one, run. Before you sleep, you run through the whole list again, finishing every time with the single most important prohibition. Above all, never look at the night sky. You’re a precocious child. You excel at your lessons, and memorize the rules faster than any of the other children in your village. Chief is impressed enough that, when you’re eight, he decides to let you see a glitcher that he's captured. Your mother leads you to just outside the village wall, where they’ve staked the glitcher as a lure for wild animals. Since glitchers [...] --- First published: March 11th, 2025 Source: https://www.lesswrong.com/posts/fheyeawsjifx4MafG/trojan-sky --- Narrated by TYPE III AUDIO .…
Exciting Update: OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda I sketched out here. tl;dr: 1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper: The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A. That this sort of thing would happen eventually was predicted by many people, and it's exciting to see it starting to [...] The original text contained 1 image which was described by AI. --- First published: March 11th, 2025 Source: https://www.lesswrong.com/posts/7wFdXj9oR8M9AiFht/openai --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
LLM-based coding-assistance tools have been out for ~2 years now. Many developers have been reporting that this is dramatically increasing their productivity, up to 5x'ing/10x'ing it. It seems clear that this multiplier isn't field-wide, at least. There's no corresponding increase in output, after all. This would make sense. If you're doing anything nontrivial (i. e., anything other than adding minor boilerplate features to your codebase), LLM tools are fiddly. Out-of-the-box solutions don't Just Work for that purpose. You need to significantly adjust your workflow to make use of them, if that's even possible. Most programmers wouldn't know how to do that/wouldn't care to bother. It's therefore reasonable to assume that a 5x/10x greater output, if it exists, is unevenly distributed, mostly affecting power users/people particularly talented at using LLMs. Empirically, we likewise don't seem to be living in the world where the whole software industry is suddenly 5-10 times [...] The original text contained 1 footnote which was omitted from this narration. --- First published: March 4th, 2025 Source: https://www.lesswrong.com/posts/tqmQTezvXGFmfSe7f/how-much-are-llms-actually-boosting-real-world-programmer --- Narrated by TYPE III AUDIO .…
Background: After the release of Claude 3.7 Sonnet,[1] an Anthropic employee started livestreaming Claude trying to play through Pokémon Red. The livestream is still going right now. TL:DR: So, how's it doing? Well, pretty badly. Worse than a 6-year-old would, definitely not PhD-level. Digging in But wait! you say. Didn't Anthropic publish a benchmark showing Claude isn't half-bad at Pokémon? Why yes they did: and the data shown is believable. Currently, the livestream is on its third attempt, with the first being basically just a test run. The second attempt got all the way to Vermilion City, finding a way through the infamous Mt. Moon maze and achieving two badges, so pretty close to the benchmark. But look carefully at the x-axis in that graph. Each "action" is a full Thinking analysis of the current situation (often several paragraphs worth), followed by a decision to send some kind [...] --- Outline: (00:29) Digging in (01:50) Whats going wrong? (07:55) Conclusion The original text contained 4 footnotes which were omitted from this narration. The original text contained 1 image which was described by AI. --- First published: March 7th, 2025 Source: https://www.lesswrong.com/posts/HyD3khBjnBhvsp8Gb/so-how-well-is-claude-playing-pokemon --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
Note: an audio narration is not available for this article. Please see the original text. The original text contained 169 footnotes which were omitted from this narration. The original text contained 79 images which were described by AI. --- First published: March 3rd, 2025 Source: https://www.lesswrong.com/posts/2w6hjptanQ3cDyDw7/methods-for-strong-human-germline-engineering --- Narrated by TYPE III AUDIO . --- Images from the article:…
In a recent post, Cole Wyeth makes a bold claim: . . . there is one crucial test (yes this is a crux) that LLMs have not passed. They have never done anything important. They haven't proven any theorems that anyone cares about. They haven't written anything that anyone will want to read in ten years (or even one year). Despite apparently memorizing more information than any human could ever dream of, they have made precisely zero novel connections or insights in any area of science[3]. I commented: An anecdote I heard through the grapevine: some chemist was trying to synthesize some chemical. He couldn't get some step to work, and tried for a while to find solutions on the internet. He eventually asked an LLM. The LLM gave a very plausible causal story about what was going wrong and suggested a modified setup which, in fact, fixed [...] The original text contained 1 footnote which was omitted from this narration. --- First published: February 23rd, 2025 Source: https://www.lesswrong.com/posts/GADJFwHzNZKg2Ndti/have-llms-generated-novel-insights --- Narrated by TYPE III AUDIO .…
This isn't really a "timeline", as such – I don't know the timings – but this is my current, fairly optimistic take on where we're heading. I'm not fully committed to this model yet: I'm still on the lookout for more agents and inference-time scaling later this year. But Deep Research, Claude 3.7, Claude Code, Grok 3, and GPT-4.5 have turned out largely in line with these expectations[1], and this is my current baseline prediction. The Current Paradigm: I'm Tucking In to Sleep I expect that none of the currently known avenues of capability advancement are sufficient to get us to AGI[2]. I don't want to say the pretraining will "plateau", as such, I do expect continued progress. But the dimensions along which the progress happens are going to decouple from the intuitive "getting generally smarter" metric, and will face steep diminishing returns. Grok 3 and GPT-4.5 [...] --- Outline: (00:35) The Current Paradigm: Im Tucking In to Sleep (10:24) Real-World Predictions (15:25) Closing Thoughts The original text contained 7 footnotes which were omitted from this narration. --- First published: March 5th, 2025 Source: https://www.lesswrong.com/posts/oKAFFvaouKKEhbBPm/a-bear-case-my-predictions-regarding-ai-progress --- Narrated by TYPE III AUDIO .…
This is a critique of How to Make Superbabies on LessWrong. Disclaimer: I am not a geneticist[1], and I've tried to use as little jargon as possible. so I used the word mutation as a stand in for SNP (single nucleotide polymorphism, a common type of genetic variation). Background The Superbabies article has 3 sections, where they show: Why: We should do this, because the effects of editing will be big How: Explain how embryo editing could work, if academia was not mind killed (hampered by institutional constraints) Other: like legal stuff and technical details. Here is a quick summary of the "why" part of the original article articles arguments, the rest is not relevant to understand my critique. we can already make (slightly) superbabies selecting embryos with "good" mutations, but this does not scale as there are diminishing returns and almost no gain past "best [...] --- Outline: (00:25) Background (02:25) My Position (04:03) Correlation vs. Causation (06:33) The Additive Effect of Genetics (10:36) Regression towards the null part 1 (12:55) Optional: Regression towards the null part 2 (16:11) Final Note The original text contained 4 footnotes which were omitted from this narration. --- First published: March 2nd, 2025 Source: https://www.lesswrong.com/posts/DbT4awLGyBRFbWugh/statistical-challenges-with-making-super-iq-babies --- Narrated by TYPE III AUDIO .…
This is a link post.Your AI's training data might make it more “evil” and more able to circumvent your security, monitoring, and control measures. Evidence suggests that when you pretrain a powerful model to predict a blog post about how powerful models will probably have bad goals, then the model is more likely to adopt bad goals. I discuss ways to test for and mitigate these potential mechanisms. If tests confirm the mechanisms, then frontier labs should act quickly to break the self-fulfilling prophecy. Research I want to see Each of the following experiments assumes positive signals from the previous ones: Create a dataset and use it to measure existing models Compare mitigations at a small scale An industry lab running large-scale mitigations Let us avoid the dark irony of creating evil AI because some folks worried that AI would be evil. If self-fulfilling misalignment has a strong [...] The original text contained 1 image which was described by AI. --- First published: March 2nd, 2025 Source: https://www.lesswrong.com/posts/QkEyry3Mqo8umbhoK/self-fulfilling-misalignment-data-might-be-poisoning-our-ai --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
I recently wrote about complete feedback, an idea which I think is quite important for AI safety. However, my note was quite brief, explaining the idea only to my closest research-friends. This post aims to bridge one of the inferential gaps to that idea. I also expect that the perspective-shift described here has some value on its own. In classical Bayesianism, prediction and evidence are two different sorts of things. A prediction is a probability (or, more generally, a probability distribution); evidence is an observation (or set of observations). These two things have different type signatures. They also fall on opposite sides of the agent-environment division: we think of predictions as supplied by agents, and evidence as supplied by environments. In Radical Probabilism, this division is not so strict. We can think of evidence in the classical-bayesian way, where some proposition is observed and its probability jumps to 100%. [...] --- Outline: (02:39) Warm-up: Prices as Prediction and Evidence (04:15) Generalization: Traders as Judgements (06:34) Collector-Investor Continuum (08:28) Technical Questions The original text contained 3 footnotes which were omitted from this narration. The original text contained 1 image which was described by AI. --- First published: February 23rd, 2025 Source: https://www.lesswrong.com/posts/3hs6MniiEssfL8rPz/judgements-merging-prediction-and-evidence --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
First, let me quote my previous ancient post on the topic: Effective Strategies for Changing Public Opinion The titular paper is very relevant here. I'll summarize a few points. The main two forms of intervention are persuasion and framing. Persuasion is, to wit, an attempt to change someone's set of beliefs, either by introducing new ones or by changing existing ones. Framing is a more subtle form: an attempt to change the relative weights of someone's beliefs, by empathizing different aspects of the situation, recontextualizing it. There's a dichotomy between the two. Persuasion is found to be very ineffective if used on someone with high domain knowledge. Framing-style arguments, on the other hand, are more effective the more the recipient knows about the topic. Thus, persuasion is better used on non-specialists, and it's most advantageous the first time it's used. If someone tries it and fails, they raise [...] --- Outline: (02:23) Persuasion (04:17) A Better Target Demographic (08:10) Extant Projects in This Space? (10:03) Framing The original text contained 3 footnotes which were omitted from this narration. --- First published: February 21st, 2025 Source: https://www.lesswrong.com/posts/6dgCf92YAMFLM655S/the-sorry-state-of-ai-x-risk-advocacy-and-thoughts-on-doing --- Narrated by TYPE III AUDIO .…
In a previous book review I described exclusive nightclubs as the particle colliders of sociology—places where you can reliably observe extreme forces collide. If so, military coups are the supernovae of sociology. They’re huge, rare, sudden events that, if studied carefully, provide deep insight about what lies underneath the veneer of normality around us. That's the conclusion I take away from Naunihal Singh's book Seizing Power: the Strategic Logic of Military Coups. It's not a conclusion that Singh himself draws: his book is careful and academic (though much more readable than most academic books). His analysis focuses on Ghana, a country which experienced ten coup attempts between 1966 and 1983 alone. Singh spent a year in Ghana carrying out hundreds of hours of interviews with people on both sides of these coups, which led him to formulate a new model of how coups work. I’ll start by describing Singh's [...] --- Outline: (01:58) The revolutionary's handbook (09:44) From explaining coups to explaining everything (17:25) From explaining everything to influencing everything (21:40) Becoming a knight of faith The original text contained 3 images which were described by AI. --- First published: February 22nd, 2025 Source: https://www.lesswrong.com/posts/d4armqGcbPywR3Ptc/power-lies-trembling-a-three-book-review --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts ,…
This is the abstract and introduction of our new paper. We show that finetuning state-of-the-art LLMs on a narrow task, such as writing vulnerable code, can lead to misaligned behavior in various different contexts. We don't fully understand that phenomenon. Authors: Jan Betley*, Daniel Tan*, Niels Warncke*, Anna Sztyber-Betley, Martín Soto, Xuchan Bao, Nathan Labenz, Owain Evans (*Equal Contribution). See Twitter thread and project page at emergent-misalignment.com. Abstract We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range [...] --- Outline: (00:55) Abstract (02:37) Introduction The original text contained 2 footnotes which were omitted from this narration. The original text contained 1 image which was described by AI. --- First published: February 25th, 2025 Source: https://www.lesswrong.com/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
It doesn’t look good. What used to be the AI Safety Summits were perhaps the most promising thing happening towards international coordination for AI Safety. This one was centrally coordination against AI Safety. In November 2023, the UK Bletchley Summit on AI Safety set out to let nations coordinate in the hopes that AI might not kill everyone. China was there, too, and included. The practical focus was on Responsible Scaling Policies (RSPs), where commitments were secured from the major labs, and laying the foundations for new institutions. The summit ended with The Bletchley Declaration (full text included at link), signed by all key parties. It was the usual diplomatic drek, as is typically the case for such things, but it centrally said there are risks, and so we will develop policies to deal with those risks. And it ended with a commitment [...] --- Outline: (02:03) An Actively Terrible Summit Statement (05:45) The Suicidal Accelerationist Speech by JD Vance (14:37) What Did France Care About? (17:12) Something To Remember You By: Get Your Safety Frameworks (24:05) What Do We Think About Voluntary Commitments? (27:29) This Is the End (36:18) The Odds Are Against Us and the Situation is Grim (39:52) Don't Panic But Also Face Reality The original text contained 4 images which were described by AI. --- First published: February 12th, 2025 Source: https://www.lesswrong.com/posts/qYPHryHTNiJ2y6Fhi/the-paris-ai-anti-safety-summit --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try…
Note: this is a static copy of this wiki page. We are also publishing it as a post to ensure visibility. Circa 2015-2017, a lot of high quality content was written on Arbital by Eliezer Yudkowsky, Nate Soares, Paul Christiano, and others. Perhaps because the platform didn't take off, most of this content has not been as widely read as warranted by its quality. Fortunately, they have now been imported into LessWrong. Most of the content written was either about AI alignment or math[1]. The Bayes Guide and Logarithm Guide are likely some of the best mathematical educational material online. Amongst the AI Alignment content are detailed and evocative explanations of alignment ideas: some well known, such as instrumental convergence and corrigibility, some lesser known like epistemic/instrumental efficiency, and some misunderstood like pivotal act. The Sequence The articles collected here were originally published as wiki pages with no set [...] --- Outline: (01:01) The Sequence (01:23) Tier 1 (01:32) Tier 2 The original text contained 3 footnotes which were omitted from this narration. --- First published: February 20th, 2025 Source: https://www.lesswrong.com/posts/mpMWWKzkzWqf57Yap/eliezer-s-lost-alignment-articles-the-arbital-sequence --- Narrated by TYPE III AUDIO .…
Arbital was envisioned as a successor to Wikipedia. The project was discontinued in 2017, but not before many new features had been built and a substantial amount of writing about AI alignment and mathematics had been published on the website. If you've tried using Arbital.com the last few years, you might have noticed that it was on its last legs - no ability to register new accounts or log in to existing ones, slow load times (when it loaded at all), etc. Rather than try to keep it afloat, the LessWrong team worked with MIRI to migrate the public Arbital content to LessWrong, as well as a decent chunk of its features. Part of this effort involved a substantial revamp of our wiki/tag pages, as well as the Concepts page. After sign-off[1] from Eliezer, we'll also redirect arbital.com links to the corresponding pages on LessWrong. As always, you are [...] --- Outline: (01:13) New content (01:43) New (and updated) features (01:48) The new concepts page (02:03) The new wiki/tag page design (02:31) Non-tag wiki pages (02:59) Lenses (03:30) Voting (04:45) Inline Reacts (05:08) Summaries (06:20) Redlinks (06:59) Claims (07:25) The edit history page (07:40) Misc. The original text contained 3 footnotes which were omitted from this narration. The original text contained 10 images which were described by AI. --- First published: February 20th, 2025 Source: https://www.lesswrong.com/posts/fwSnz5oNnq8HxQjTL/arbital-has-been-imported-to-lesswrong --- Narrated by TYPE III AUDIO . --- Images from the article:…
We’ve spent the better part of the last two decades unravelling exactly how the human genome works and which specific letter changes in our DNA affect things like diabetes risk or college graduation rates. Our knowledge has advanced to the point where, if we had a safe and reliable means of modifying genes in embryos, we could literally create superbabies. Children that would live multiple decades longer than their non-engineered peers, have the raw intellectual horsepower to do Nobel prize worthy scientific research, and very rarely suffer from depression or other mental health disorders. The scientific establishment, however, seems to not have gotten the memo. If you suggest we engineer the genes of future generations to make their lives better, they will often make some frightened noises, mention “ethical issues” without ever clarifying what they mean, or abruptly change the subject. It's as if humanity invented electricity and decided [...] --- Outline: (02:17) How to make (slightly) superbabies (05:08) How to do better than embryo selection (08:52) Maximum human life expectancy (12:01) Is everything a tradeoff? (20:01) How to make an edited embryo (23:23) Sergiy Velychko and the story of super-SOX (24:51) Iterated CRISPR (26:27) Sergiy Velychko and the story of Super-SOX (28:48) What is going on? (32:06) Super-SOX (33:24) Mice from stem cells (35:05) Why does super-SOX matter? (36:37) How do we do this in humans? (38:18) What if super-SOX doesn't work? (38:51) Eggs from Stem Cells (39:31) Fluorescence-guided sperm selection (42:11) Embryo cloning (42:39) What if none of that works? (44:26) What about legal issues? (46:26) How we make this happen (50:18) Ahh yes, but what about AI? (50:54) There is currently no backup plan if we can't solve alignment (55:09) Team Human (57:53) Appendix (57:56) iPSCs were named after the iPod (58:11) On autoimmune risk variants and plagues (59:28) Two simples strategies for minimizing autoimmune risk and pandemic vulnerability (01:00:29) I don't want someone else's genes in my child (01:01:08) Could I use this technology to make a genetically enhanced clone of myself? (01:01:36) Why does super-SOX work? (01:06:14) How was the IQ grain graph generated? The original text contained 19 images which were described by AI. --- First published: February 19th, 2025 Source: https://www.lesswrong.com/posts/DfrSZaf3JC8vJdbZL/how-to-make-superbabies --- Narrated by TYPE III AUDIO . --- Images from the article:…
Audio note: this article contains 134 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description. In a recent paper in Annals of Mathematics and Philosophy, Fields medalist Timothy Gowers asks why mathematicians sometimes believe that unproved statements are likely to be true. For example, it is unknown whether _pi_ is a normal number (which, roughly speaking, means that every digit appears in _pi_ with equal frequency), yet this is widely believed. Gowers proposes that there is no sign of any reason for _pi_ to be non-normal -- especially not one that would fail to reveal itself in the first million digits -- and in the absence of any such reason, any deviation from normality would be an outrageous coincidence. Thus, the likely normality of _pi_ is inferred from the following general principle: No-coincidence [...] --- Outline: (02:32) Our no-coincidence conjecture (05:37) How we came up with the statement (08:31) Thoughts for theoretical computer scientists (10:27) Why we care The original text contained 12 footnotes which were omitted from this narration. --- First published: February 14th, 2025 Source: https://www.lesswrong.com/posts/Xt9r4SNNuYxW83tmo/a-computational-no-coincidence-principle --- Narrated by TYPE III AUDIO .…
This is an all-in-one crosspost of a scenario I originally published in three parts on my blog (No Set Gauge). Links to the originals: A History of the Future, 2025-2027 A History of the Future, 2027-2030 A History of the Future, 2030-2040 Thanks to Luke Drago, Duncan McClements, and Theo Horsley for comments on all three parts. 2025-2027 Below is part 1 of an extended scenario describing how the future might go if current trends in AI continue. The scenario is deliberately extremely specific: it's definite rather than indefinite, and makes concrete guesses instead of settling for banal generalities or abstract descriptions of trends. Open Sky. (Zdislaw Beksinsksi) The return of reinforcement learning From 2019 to 2023, the main driver of AI was using more compute and data for pretraining. This was combined with some important "unhobblings": Post-training (supervised fine-tuning and reinforcement learning for [...] --- Outline: (00:34) 2025-2027 (01:04) The return of reinforcement learning (10:52) Codegen, Big Tech, and the internet (21:07) Business strategy in 2025 and 2026 (27:23) Maths and the hard sciences (33:59) Societal response (37:18) Alignment research and AI-run orgs (44:49) Government wakeup (51:42) 2027-2030 (51:53) The AGI frog is getting boiled (01:02:18) The bitter law of business (01:06:52) The early days of the robot race (01:10:12) The digital wonderland, social movements, and the AI cults (01:24:09) AGI politics and the chip supply chain (01:33:04) 2030-2040 (01:33:15) The end of white-collar work and the new job scene (01:47:47) Lab strategy amid superintelligence and robotics (01:56:28) Towards the automated robot economy (02:15:49) The human condition in the 2030s (02:17:26) 2040+ --- First published: February 17th, 2025 Source: https://www.lesswrong.com/posts/CCnycGceT4HyDKDzK/a-history-of-the-future-2025-2040 --- Narrated by TYPE III AUDIO . --- Images from the article:…
On March 14th, 2015, Harry Potter and the Methods of Rationality made its final post. Wrap parties were held all across the world to read the ending and talk about the story, in some cases sparking groups that would continue to meet for years. It's been ten years, and think that's a good reason for a round of parties. If you were there a decade ago, maybe gather your friends and talk about how things have changed. If you found HPMOR recently and you're excited about it (surveys suggest it's still the biggest on-ramp to the community, so you're not alone!) this is an excellent chance to meet some other fans in person for the first time! Want to run an HPMOR Anniversary Party, or get notified if one's happening near you? Fill out this form. I’ll keep track of it and publish a collection of [...] The original text contained 1 footnote which was omitted from this narration. --- First published: February 16th, 2025 Source: https://www.lesswrong.com/posts/KGSidqLRXkpizsbcc/it-s-been-ten-years-i-propose-hpmor-anniversary-parties --- Narrated by TYPE III AUDIO .…
A friend of mine recently recommended that I read through articles from the journal International Security, in order to learn more about international relations, national security, and political science. I've really enjoyed it so far, and I think it's helped me have a clearer picture of how IR academics think about stuff, especially the core power dynamics that they think shape international relations. Here are a few of the articles I most enjoyed. "Not So Innocent" argues that ethnoreligious cleansing of Jews and Muslims from Western Europe in the 11th-16th century was mostly driven by the Catholic Church trying to consolidate its power at the expense of local kingdoms. Religious minorities usually sided with local monarchs against the Church (because they definitionally didn't respect the church's authority, e.g. they didn't care if the Church excommunicated the king). So when the Church was powerful, it was incentivized to pressure kings [...] --- First published: January 31st, 2025 Source: https://www.lesswrong.com/posts/MEfhRvpKPadJLTuTk/some-articles-in-international-security-that-i-enjoyed --- Narrated by TYPE III AUDIO .…
This is the best sociological account of the AI x-risk reduction efforts of the last ~decade that I've seen. I encourage folks to engage with its critique and propose better strategies going forward. Here's the opening ~20% of the post. I encourage reading it all. In recent decades, a growing coalition has emerged to oppose the development of artificial intelligence technology, for fear that the imminent development of smarter-than-human machines could doom humanity to extinction. The now-influential form of these ideas began as debates among academics and internet denizens, which eventually took form—especially within the Rationalist and Effective Altruist movements—and grew in intellectual influence over time, along the way collecting legible endorsements from authoritative scientists like Stephen Hawking and Geoffrey Hinton. Ironically, by spreading the belief that superintelligent AI is achievable and supremely powerful, these “AI Doomers,” as they came to be called, inspired the creation of OpenAI and [...] --- First published: January 31st, 2025 Source: https://www.lesswrong.com/posts/YqrAoCzNytYWtnsAx/the-failed-strategy-of-artificial-intelligence-doomers --- Narrated by TYPE III AUDIO .…
Hi all I've been hanging around the rationalist-sphere for many years now, mostly writing about transhumanism, until things started to change in 2016 after my Wikipedia writing habit shifted from writing up cybercrime topics, through to actively debunking the numerous dark web urban legends. After breaking into what I believe to be the most successful ever fake murder for hire website ever created on the dark web, I was able to capture information about people trying to kill people all around the world, often paying tens of thousands of dollars in Bitcoin in the process. My attempts during this period to take my information to the authorities were mostly unsuccessful, when in late 2016 on of the site a user took matters into his own hands, after paying $15,000 for a hit that never happened, killed his wife himself Due to my overt battle with the site administrator [...] --- First published: February 13th, 2025 Source: https://www.lesswrong.com/posts/isRho2wXB7Cwd8cQv/murder-plots-are-infohazards --- Narrated by TYPE III AUDIO .…
This is the full text of a post from "The Obsolete Newsletter," a Substack that I write about the intersection of capitalism, geopolitics, and artificial intelligence. I’m a freelance journalist and the author of a forthcoming book called Obsolete: Power, Profit, and the Race to build Machine Superintelligence. Consider subscribing to stay up to date with my work. Wow. The Wall Street Journal just reported that, "a consortium of investors led by Elon Musk is offering $97.4 billion to buy the nonprofit that controls OpenAI." Technically, they can't actually do that, so I'm going to assume that Musk is trying to buy all of the nonprofit's assets, which include governing control over OpenAI's for-profit, as well as all the profits above the company's profit caps. OpenAI CEO Sam Altman already tweeted, "no thank you but we will buy twitter for $9.74 billion if you want." (Musk, for his part [...] --- Outline: (02:42) The control premium (04:17) Conversion significance (05:43) Musks suit (09:24) The stakes --- First published: February 11th, 2025 Source: https://www.lesswrong.com/posts/tdb76S4viiTHfFr2u/why-did-elon-musk-just-offer-to-buy-control-of-openai-for --- Narrated by TYPE III AUDIO .…
Ultimately, I don’t want to solve complex problems via laborious, complex thinking, if we can help it. Ideally, I'd want to basically intuitively follow the right path to the answer quickly, with barely any effort at all. For a few months I've been experimenting with the "How Could I have Thought That Thought Faster?" concept, originally described in a twitter thread by Eliezer: Sarah Constantin: I really liked this example of an introspective process, in this case about the "life problem" of scheduling dates and later canceling them: malcolmocean.com/2021/08/int… Eliezer Yudkowsky: See, if I'd noticed myself doing anything remotely like that, I'd go back, figure out which steps of thought were actually performing intrinsically necessary cognitive work, and then retrain myself to perform only those steps over the course of 30 seconds. SC: if you have done anything REMOTELY like training yourself to do it in 30 seconds, then [...] --- Outline: (03:59) Example: 10x UI designers (08:48) THE EXERCISE (10:49) Part I: Thinking it Faster (10:54) Steps you actually took (11:02) Magical superintelligence steps (11:22) Iterate on those lists (12:25) Generalizing, and not Overgeneralizing (14:49) Skills into Principles (16:03) Part II: Thinking It Faster The First Time (17:30) Generalizing from this exercise (17:55) Anticipating Future Life Lessons (18:45) Getting Detailed, and TAPS (20:10) Part III: The Five Minute Version --- First published: December 11th, 2024 Source: https://www.lesswrong.com/posts/F9WyMPK4J3JFrxrSA/the-think-it-faster-exercise --- Narrated by TYPE III AUDIO .…
Once upon a time, in ye olden days of strange names and before google maps, seven friends needed to figure out a driving route from their parking lot in San Francisco (SF) down south to their hotel in Los Angeles (LA). The first friend, Alice, tackled the “central bottleneck” of the problem: she figured out that they probably wanted to take the I-5 highway most of the way (the blue 5's in the map above). But it took Alice a little while to figure that out, so in the meantime, the rest of the friends each tried to make some small marginal progress on the route planning. The second friend, The Subproblem Solver, decided to find a route from Monterey to San Louis Obispo (SLO), figuring that SLO is much closer to LA than Monterey is, so a route from Monterey to SLO would be helpful. Alas, once Alice [...] --- Outline: (03:33) The Generalizable Lesson (04:39) Application: The original text contained 1 footnote which was omitted from this narration. The original text contained 1 image which was described by AI. --- First published: February 7th, 2025 Source: https://www.lesswrong.com/posts/Hgj84BSitfSQnfwW6/so-you-want-to-make-marginal-progress --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
Summary In this post, we explore different ways of understanding and measuring malevolence and explain why individuals with concerning levels of malevolence are common enough, and likely enough to become and remain powerful, that we expect them to influence the trajectory of the long-term future, including by increasing both x-risks and s-risks. For the purposes of this piece, we define malevolence as a tendency to disvalue (or to fail to value) others’ well-being (more). Such a tendency is concerning, especially when exhibited by powerful actors, because of its correlation with malevolent behaviors (i.e., behaviors that harm or fail to protect others’ well-being). But reducing the long-term societal risks posed by individuals with high levels of malevolence is not straightforward. Individuals with high levels of malevolent traits can be difficult to recognize. Some people do not take into account the fact that malevolence exists on a continuum, or do not [...] --- Outline: (00:07) Summary (04:17) Malevolent actors will make the long-term future worse if they significantly influence TAI development (05:32) Important caveats when thinking about malevolence (05:37) Dark traits exist on a continuum (07:31) Dark traits are often hard to identify (08:54) People with high levels of dark traits may not recognize them or may try to conceal them (12:17) Dark traits are compatible with genuine moral convictions (13:22) Malevolence and effective altruism (15:22) Demonizing people with elevated malevolent traits is counterproductive (20:16) Defining malevolence (21:03) Defining and measuring specific malevolent traits (21:34) The dark tetrad (25:03) Other forms of malevolence (25:07) Retributivism, vengefulness, and other suffering-conducive tendencies (26:56) Spitefulness (28:15) The Dark Factor (D) (29:29) Methodological problems associated with measuring dark traits (30:39) Social desirability and self-deception (31:14) How common are malevolent humans (in positions of power)? (33:02) Things may be very different outside of (Western) democracies (33:31) Prevalence data for psychopathy and narcissistic personality disorder (34:20) Psychopathy prevalence (36:25) Narcissistic personality disorder prevalence (40:38) The distribution of the dark factor + selected findings from thousands of responses to malevolence-related survey items (42:13) Sadistic preferences: over 16% of people agree or strongly agree that they “would like to make some people suffer even if it meant that I would go to hell with them” (43:42) Agreement with statements that reflect callousness: Over 10% of people disagree or strongly disagree that hurting others would make them very uncomfortable (44:45) Endorsement of Machiavellian tactics: Almost 15% of people report a Machiavellian approach to using information against people (45:20) Agreement with spiteful statements: Over 20% of people agree or strongly agree that they would take a punch to ensure someone they don’t like receives two punches (45:57) A substantial minority report that they “take revenge” in response to a “serious wrong” (46:44) The distribution of Dark Factor scores among 2M+ people (49:17) Reasons to think that malevolence could correlate with attaining and retaining positions of power (49:47) The role of environmental factors (52:33) Motivation to attain power (54:14) Ability to attain power (59:39) Retention of power (01:01:02) Potential research questions and how to help (01:17:48) Other relevant research agendas (01:18:33) Author contributions (01:19:26) Acknowledgments…
I’m not a natural “doomsayer.” But unfortunately, part of my job as an AI safety researcher is to think about the more troubling scenarios. I’m like a mechanic scrambling last-minute checks before Apollo 13 takes off. If you ask for my take on the situation, I won’t comment on the quality of the in-flight entertainment, or describe how beautiful the stars will appear from space. I will tell you what could go wrong. That is what I intend to do in this story. Now I should clarify what this is exactly. It's not a prediction. I don’t expect AI progress to be this fast or as untamable as I portray. It's not pure fantasy either. It is my worst nightmare. It's a sampling from the futures that are among the most devastating, and I believe, disturbingly plausible – the ones that most keep me up at night. I’m [...] --- Outline: (01:28) Ripples before waves (04:05) Cloudy with a chance of hyperbolic growth (09:36) Flip FLOP philosophers (17:15) Statues and lightning (20:48) A phantom in the data center (26:25) Complaints from your very human author about the difficulty of writing superhuman characters (28:48) Pandoras One Gigawatt Box (37:19) A Moldy Loaf of Everything (45:01) Missiles and Lies (50:45) WMDs in the Dead of Night (57:18) The Last Passengers The original text contained 22 images which were described by AI. --- First published: February 7th, 2025 Source: https://www.lesswrong.com/posts/KFJ2LFogYqzfGB3uX/how-ai-takeover-might-happen-in-2-years --- Narrated by TYPE III AUDIO . --- Images from the article:…
Welcome to Player FM!
Player FM is scanning the web for high-quality podcasts for you to enjoy right now. It's the best podcast app and works on Android, iPhone, and the web. Signup to sync subscriptions across devices.