Evaluating LLMs the Right Way: Lessons from Hex's Journey
High Agency: The Podcast for AI Builders


Content provided by Raza Habib. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Raza Habib or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.


2M ago 45:39


I recently sat down with Bryan Bischof, AI lead at Hex, to dive deep into how the Hex team evaluates LLMs to ship reliable AI agents. Hex has deployed AI assistants that can automatically generate SQL queries, transform data, and create visualizations based on natural language questions. While many teams struggle to get value from LLMs in production, Hex has cracked the code.

In this episode, Bryan shares the hard-won lessons they've learned along the way. We discuss why most teams are approaching LLM evaluation wrong and how Hex's unique framework enabled them to ship with confidence.

Bryan breaks down the key ingredients to Hex's success:
- Choosing the right tools to constrain agent behavior
- Using a reactive DAG to allow humans to course-correct agent plans
- Building granular, user-centric evaluators instead of chasing one "god metric"
- Gating releases on the metrics that matter, not just gaming a score
- Constantly scrutinizing model inputs & outputs to uncover insights

For show notes and a transcript go to:
https://hubs.ly/Q02BdzVP0
-----------------------------------------------------
Humanloop is an Integrated Development Environment for Large Language Models. It enables product teams to develop LLM-based applications that are reliable and scalable. To find out more go to https://hubs.ly/Q02yV72D0


10 episodes

#Tech #Raza Habib #Large Language Models #Generativeai #AI Products #Ai Playbooks
