Content provided by Demetrios Brinkmann. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Demetrios Brinkmann or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.

Machine Learning SRE // Niall Murphy // MLOps Coffee Sessions #54

48:40
 

Coffee Sessions #54 with Niall Murphy, Machine Learning SRE.
// Abstract
SRE is making its way into the machine learning world. Software engineering for machine learning requires reliability, performance, and maintainability. Site reliability engineering is the field that deals with reliability and ensuring constant, real-time performance. Niall Murphy, most recently Global Head of SRE at Microsoft Azure, helps us understand what SRE can do for modern ML products and teams.
Building machine learning teams requires a diverse set of technical experiences, and Niall shares his thoughts on how to do that most effectively. Machine learning organizations need to start to take advantage of SRE best practices like SLOs, which Niall walks through. Production machine learning depends on high-quality software engineering, and we get Niall's take on how to ensure that in a machine learning context.
// Bio
Niall Murphy has been interested in Internet infrastructure since the mid-1990s. He has worked with all of the major cloud providers from their Dublin, Ireland offices - most recently at Microsoft, where he was global head of Azure Site Reliability Engineering (SRE). His books have sold approximately a quarter of a million copies worldwide, most notably the award-winning Site Reliability Engineering, and he is probably one of the few people in the world to hold degrees in Computer Science, Mathematics, and Poetry Studies. He lives in Dublin, Ireland, with his wife and two children.
--------------- ✌️Connect With Us ✌️ -------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with David on LinkedIn: https://www.linkedin.com/in/aponteanalytics/
Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/
Connect with Niall on LinkedIn: https://www.linkedin.com/in/niallm/
Timestamps:
[00:00] Introduction to Niall Murphy
[00:36] SRE background to Machine Learning space transition
[07:10] SLOs being a challenge in the ML space
[09:42] SRE Hiring Investments
[15:10] Behavior of teams concept
[17:45] Challenges dealing with ML production
[18:27] Update on Reliable Machine Learning book
[22:46] Monitoring
[25:05] Difference between ML and SRE
[29:18] Incident response in Machine Learning
[34:46] Rollbacks
[35:50] Machine Learning burden over time
[42:42] Niall's journey to the SRE space and focus to develop himself


