LW - On Claude 3.5 Sonnet by Zvi

The Nonlinear Library

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On Claude 3.5 Sonnet, published by Zvi on June 24, 2024 on LessWrong.
There is a new clear best (non-tiny) LLM.
If you want to converse with an LLM, the correct answer is Claude Sonnet 3.5.
It is available for free on Claude.ai and the Claude iOS app, or you can subscribe for higher rate limits. The API cost is $3 per million input tokens and $15 per million output tokens.
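To make the pricing concrete, here is a minimal sketch of the arithmetic, using only the two per-million-token prices quoted above. The function name and the example token counts are hypothetical, chosen for illustration; actual billing may differ in ways not covered here.

```python
# Prices quoted in the post: $3 per million input tokens, $15 per million output tokens.
INPUT_USD_PER_MILLION = 3.00
OUTPUT_USD_PER_MILLION = 15.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the API cost in dollars for one request (hypothetical helper)."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# Illustrative request: a 2,000-token prompt with a 500-token reply.
cost = estimate_cost_usd(2_000, 500)
print(f"${cost:.4f}")  # prints $0.0135
```

At these prices output tokens cost five times as much as input tokens, so long replies dominate the bill more than long prompts do.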
This completes the trifecta. All of OpenAI, Google DeepMind and Anthropic have kept their biggest and most expensive models static for now, and instead focused on making something faster and cheaper that is good enough to be the main model.
You would only use another model if you either (1) need a smaller model, in which case Gemini 1.5 Flash seems best, or (2) need open model weights.
Updates to their larger and smaller models, Claude Opus 3.5 and Claude Haiku 3.5, are coming later this year. They intend to issue new models every few months. They are working on long term memory.
It is not only the new and improved intelligence.
Speed kills. They say it is twice as fast as Claude Opus. That matches my experience.
Jesse Mu: The 1st thing I noticed about 3.5 Sonnet was its speed.
Opus felt like msging a friend - answers streamed slowly enough that it felt like someone typing behind the screen.
Sonnet's answers *materialize out of thin air*, far faster than you can read, at better-than-Opus quality.
Low cost also kills.
They also introduced a new feature called Artifacts, to allow Claude to do various things in a second window. Many are finding it highly useful.
Benchmarks
As always, never fully trust the benchmarks to translate to real world performance. They are still highly useful, and I have high trust in Anthropic to not be gaming them.
Here is the headline chart.
Epoch AI confirms that Sonnet 3.5 is ahead on GPQA.
Anthropic also highlight that in an agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems versus 38% for Claude Opus, discussed later.
Needle in a haystack was already very good; now it is slightly better still.
There's also this, from Anthropic's Alex Albert:
You can say 'the recent jumps are relatively small' or you can notice that (1) there is an upper bound at 100 rapidly approaching for this set of benchmarks, and (2) the releases are coming quickly one after another and the slope of the line is accelerating despite being close to the maximum.
Human Evaluation Tests
We are still waiting for the Arena ranking to come in. Based on reactions we should expect Sonnet 3.5 to take the top slot, likely by a decent margin, but we've been surprised before.
Anthropic: We evaluated Claude 3.5 Sonnet via direct comparison to prior Claude models. We asked raters to chat with our models and evaluate them on a number of tasks, using task-specific instructions. The charts in Figure 3 show the "win rate" when compared to a baseline of Claude 3 Opus.
We saw large improvements in core capabilities like coding, documents, creative writing, and vision. Domain experts preferred Claude 3.5 Sonnet over Claude 3 Opus, with win rates as high as 82% in Law, 73% in Finance, and 73% in Philosophy.
Those were the high water marks, and Arena preferences tend to be less dramatic than that due to the nature of the questions and also those doing the rating. We are likely looking at more like a 60% win rate, which is still good enough for the top slot.
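For a sense of scale, a head-to-head win rate can be converted to a rating gap with the standard Elo expected-score formula. This is a rough sketch that assumes Arena ratings behave like plain Elo ratings and that ties are ignored; the function names are hypothetical, and the leaderboard's actual fitting procedure may differ.

```python
import math


def elo_gap_from_win_rate(win_rate: float) -> float:
    """Rating gap implied by a head-to-head win rate under the standard Elo model.

    Assumption: ties are ignored and win_rate is strictly between 0 and 1.
    """
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def win_rate_from_elo_gap(gap: float) -> float:
    """Inverse mapping: expected win rate for a given rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


# A 60% win rate corresponds to a gap of roughly 70 rating points.
print(round(elo_gap_from_win_rate(0.60)))
```

Under these assumptions, the guessed 60% win rate would translate to a lead of about 70 points over the baseline model.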
The Vision Thing
Here are the scores for vision.
Claude has an additional modification on it: It is fully face blind by instruction.
Chypnotoad: Claude's extra system prompt for vision:
Claude always responds as if it is completely face blind. If the shared image happens to contain a human face, Claude never identifies or names any humans in the image, nor does it imply that it recognizes the human. It also does not mention or allude to details about a pers...
