LW - SRE's Review Of Democracy By Martin Sustrik The Nonlinear Library podcast

Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.

The Nonlinear Library »
LW - SRE's review of Democracy by Martin Sustrik

4h ago 5:13

MP3•Episode home

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SRE's review of Democracy, published by Martin Sustrik on August 4, 2024 on LessWrong. Day One We've been handed this old legacy system called "Democracy". It's an emergency. The old maintainers are saying it has been misbehaving lately but they have no idea how to fix it. We've had a meeting with them to find out as much as possible about the system, but it turns out that all the original team members left the company long time ago. The current team doesn't have much understanding of the system beyond some basic operational knowledge. We've conducted a cursory code review, focusing not so much on business logic but rather on the stuff that could possibly help us to tame it: Monitoring, reliability characteristics, feedback loops, automation already in place. Our first impression: Oh, God, is this thing complex! Second impression: The system is vaguely modular. Each module is strongly coupled with every other module though. It's an organically grown legacy system at its worst. That being said, we've found a clue as to why the system may have worked fine for so long. There's a redundancy system called "Separation of Powers". It reminds me of the Tandem computers back from the 70s. Day Two We were wrong. "Separation of Powers" is not a system for redundancy. Each part of the system ("branch") has different business logic. However, each also acts as a watchdog process for the other branches. When it detects misbehavior it tries to apply corrective measures using its own business logic. Gasp! Things are not looking good. We're still searching for monitoring. Day Three Hooray! We've found the monitoring! It turns out that "Election" is conducted once every four years. Each component reports its health (1 bit) to the central location. The data flow is so low that we have overlooked it until now. We are considering shortening the reporting period, but the subsystem is so deeply coupled with other subsystems that doing so could easily lead to a cascading failure. In other news, there seems to be some redundancy after all. We've found a full-blown backup control system ("Shadow Cabinet") that is inactive at the moment, but might be able to take over in case of a major failure. We're investigating further. Day Four Today, we've found yet another monitoring system called "FreePress." As the name suggests it was open-sourced some time ago, but the corporate version have evolved quite a bit since then, so the documentation isn't very helpful. The bad news is that it's badly intertwined with the production system. The metrics look more or less okay as long as everything is working smoothly. However, it's unclear what will happen if things go south. It may distort the metrics or even fail entirely, leaving us with no data whatsoever at the moment of crisis. By the way, the "Election" process may not be a monitoring system after all. I suspect it might actually be a feedback loop that triggers corrective measures in case of problems. Day Five The most important metric seems to be this big graph labeled "GDP". As far as we understand, it's supposed to indicate the overall health of the system. However, drilling into the code suggests that it's actually a throughput metric. If throughput goes down there's certainly a problem, but it's not clear why increasing throughput should be considered the primary health factor... More news on the "Election" subsystem: We've found a floppy disk with the design doc, and it turns out that it's not a feedback loop after all. It's a distributed consensus algorithm (think Paxos)! The historical context is that they've used to run several control systems in parallel (for redundancy reasons maybe?) which resulted in numerous race conditions and outages. "Election" was put in place to ensure that only one control system acts as a master at any given time...

2431 episodes

#Podcasting Education #The Nonlinear Fund