Automated Cluster Operations in the Cloud ft. Rashmi Prabhu


By Confluent, original creators of Apache Kafka®.

If you’ve heard the term “clusters,” then you might know it refers to Confluent components and features that we run in all three major cloud providers today, including an event streaming platform based on Apache Kafka®, ksqlDB, Kafka Connect, the Kafka API, databalancers, and Kafka API services. Rashmi Prabhu, a software engineer on the Control Plane team at Confluent, has the opportunity to help govern the data plane that comprises all these clusters and enables API-driven operations on these clusters.

But running operations in the cloud in a scaling organization can be time-consuming, error-prone, and tedious. This episode addresses manual upgrades and rolling restarts of Confluent Cloud clusters during releases, fixes, experiments, and the like, and more importantly, the progress that’s been made to switch from manual operations to an almost fully automated process. You’ll get a sneak peek into upcoming plans to make cluster operations a fully automated process using the Cluster Upgrader, a new microservice written in Java and built with Vert.x. This service runs as part of the control plane and exposes an API to the user to submit their workflows and target a set of clusters. It performs state management on the workflow in the backend using Postgres.

So what’s next? Looking forward, the selection phase will be improved to support policy-based deployment strategies that enable you to plan ahead and choose how you want to phase your deployments (e.g., first Azure followed by part of Amazon Web Services and then Google Cloud, or maybe Confluent internal clusters on all cloud providers followed by customer clusters on Google Cloud, Azure, and finally AWS). The possibilities are endless!
The process will become more flexible, more configurable, and more error tolerant so that you can take measured risks and experience a standardized way of operating Confluent Cloud. In addition, work is underway to expand operation automation to internal application deployments and other kinds of fleet management operations that fit the “Select/Apply/Monitor” paradigm.
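
To make the “Select/Apply/Monitor” idea concrete, here is a minimal Java sketch of a phased, policy-based rollout. It is an illustration only, not the Cluster Upgrader's actual code or API: the names PhasedRollout, Cluster, Phase, and run are hypothetical, and the apply and health-check steps are passed in as plain functions. Each phase selects the clusters it matches, the operation is applied to each one, and the rollout halts at the first cluster that fails its health check.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical sketch: these types are assumptions made for illustration,
// not part of any Confluent service.
class PhasedRollout {

    // A cluster in the fleet, described by the attributes a policy might select on.
    static final class Cluster {
        final String id;
        final String cloud;      // e.g. "azure", "aws", "gcp"
        final boolean internal;  // true for internal clusters, false for customer clusters

        Cluster(String id, String cloud, boolean internal) {
            this.id = id;
            this.cloud = cloud;
            this.internal = internal;
        }
    }

    // One phase of the deployment policy: a name plus a selection rule.
    static final class Phase {
        final String name;
        final Predicate<Cluster> selector;

        Phase(String name, Predicate<Cluster> selector) {
            this.name = name;
            this.selector = selector;
        }
    }

    // Runs the phases in order and returns the ids of clusters that were
    // operated on and passed the health check, in the order they completed.
    static List<String> run(List<Cluster> fleet,
                            List<Phase> policy,
                            Consumer<Cluster> apply,
                            Predicate<Cluster> healthy) {
        List<String> completed = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (Phase phase : policy) {
            for (Cluster cluster : fleet) {
                // Select: skip clusters handled in an earlier phase or not matched by this one.
                if (visited.contains(cluster.id) || !phase.selector.test(cluster)) {
                    continue;
                }
                visited.add(cluster.id);
                // Apply: perform the operation (an upgrade or rolling restart, say).
                apply.accept(cluster);
                // Monitor: stop the whole rollout at the first unhealthy cluster.
                if (!healthy.test(cluster)) {
                    return completed;
                }
                completed.add(cluster.id);
            }
        }
        return completed;
    }
}
```

A policy such as “internal clusters on all cloud providers first, then customer clusters” would be two Phase entries: one whose selector checks the internal flag, followed by a catch-all. Halting on the first failed check is one simple way to limit risk; a real system might retry, pause for approval, or roll back instead.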

