LW - The Great Data Integration Schlep by sarahconstantin

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Great Data Integration Schlep, published by sarahconstantin on September 13, 2024 on LessWrong.
This is a little rant I like to give, because it's something I learned on the job that I've never seen written up explicitly.
There are a bunch of buzzwords floating around regarding computer technology in an industrial or manufacturing context: "digital transformation", "the Fourth Industrial Revolution", "Industrial Internet of Things".
What do those things really mean?
Do they mean anything at all?
The answer is yes, and what they mean is the process of putting all of a company's data on computers so it can be analyzed.
This is the prerequisite to any kind of "AI" or even basic statistical analysis of that data; before you can start applying your fancy algorithms, you need to get that data in one place, in a tabular format.
Wait, They Haven't Done That Yet?
In a manufacturing context, a lot of important data is not on computers.
Some data is not digitized at all, but literally on paper: lab notebooks, QA reports, work orders, etc.
Other data is "barely digitized", in the form of scanned PDFs of those documents. Fine for keeping records, but impossible to search or analyze statistically. (A major aerospace manufacturer, from what I heard, kept all of the results of airplane quality tests in the form of scanned handwritten PDFs of filled-out forms. Imagine trying to compile trends in quality performance!)
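To make "barely digitized" concrete, here's a minimal sketch of the OCR pass a scanned PDF needs before it can even be searched, assuming the open-source pdf2image and pytesseract packages (with local poppler and tesseract installs); the file name is hypothetical, and handwritten forms like the aerospace example are far harder than typed ones:

```python
# Rasterize each page of a scanned PDF, then OCR it into plain text.
# Works passably on typed documents; handwriting usually needs human
# transcription or specialized models. File name is hypothetical.
from pdf2image import convert_from_path
import pytesseract

def ocr_scanned_pdf(path: str) -> str:
    pages = convert_from_path(path, dpi=300)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

print(ocr_scanned_pdf("qa_report_scan.pdf")[:500])
```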
Still other data is siloed inside machines on the factory floor. Modern, automated machinery can generate lots of data - sensor measurements, logs of actuator movements and changes in process settings - but that data is literally stored in that machine, and only that machine.
Manufacturing process engineers, for nearly a hundred years, have been using data to inform how a factory operates, generally using a framework known as statistical process control. However, in practice, much more data is generated and collected than is actually used. Only a few process variables get tracked, optimized, and/or used as inputs to adjust production processes; the rest are "data exhaust", to be ignored and maybe deleted.
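For a sense of scale, the core of SPC is genuinely simple: compute control limits for a tracked variable and flag readings that fall outside them. A minimal sketch of a 3-sigma check, with hypothetical file and column names:

```python
# Minimal 3-sigma control-limit check on one tracked process variable.
# File and column names are hypothetical stand-ins for a real machine log.
import pandas as pd

df = pd.read_csv("process_log.csv")
temps = df["oven_temp_c"]  # the one variable this line actually tracks

mean, sigma = temps.mean(), temps.std()
ucl, lcl = mean + 3 * sigma, mean - 3 * sigma  # upper/lower control limits

flagged = df[(temps > ucl) | (temps < lcl)]
print(f"mean={mean:.1f}  UCL={ucl:.1f}  LCL={lcl:.1f}")
print(f"{len(flagged)} out-of-control readings to investigate")
```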
In principle the "excess" data may be relevant to the facility's performance, but nobody knows how, and they're not equipped to find out.
This is why manufacturing/industrial companies will often be skeptical about proposals to "use AI" to optimize their operations. To "use AI", you need to build a model around a big dataset. And they don't have that dataset.
You cannot, in general, assume it is possible to go into a factory and find a single dataset that is "all the process logs from all the machines, end to end".
Moreover, even when that dataset does exist, there often won't be even the most basic built-in tools to analyze it. In an unusually modern manufacturing startup, the M.O. might be "export the dataset as .csv and use Excel to run basic statistics on it."
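That workflow, translated out of Excel, amounts to about this much code (a sketch; the export file name is hypothetical):

```python
# The csv-plus-Excel workflow, minus Excel.
import pandas as pd

df = pd.read_csv("machine_export.csv")
print(df.describe())  # count, mean, std, min/max, quartiles per column
```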
Why Data Integration Is Hard
In order to get a nice standardized dataset that you can "do AI to" (or even "do basic statistics/data analysis to") you need to:
1. Obtain the data
2. Digitize the data (if relevant)
3. Standardize/"clean" the data (see the sketch after this list)
4. Set up computational infrastructure to store, query, and serve the data
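Here's a minimal sketch of what step 3 means in practice: two machine logs that record the same quantity under different column names, units, and timestamp formats get reconciled into one consistent table. All file names, column names, and unit conventions here are hypothetical.

```python
# Sketch of standardizing/"cleaning": merge two machine logs that record
# the same quantity under different names, units, and timestamp formats.
import pandas as pd

a = pd.read_csv("machine_a.csv")  # columns: ts, temp_F
b = pd.read_csv("machine_b.csv")  # columns: timestamp, temperature_c

a = a.rename(columns={"ts": "timestamp", "temp_F": "temp_c"})
a["temp_c"] = (a["temp_c"] - 32) * 5 / 9  # Fahrenheit -> Celsius
b = b.rename(columns={"temperature_c": "temp_c"})

for frame in (a, b):
    frame["timestamp"] = pd.to_datetime(frame["timestamp"])

combined = (
    pd.concat([a.assign(machine="A"), b.assign(machine="B")])
    .sort_values("timestamp")
    .drop_duplicates()
    .reset_index(drop=True)
)
combined.to_csv("integrated_log.csv", index=False)  # step 4's job starts here
```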
Data Access Negotiation, AKA Please Let Me Do The Work You Paid Me For
Obtaining the data is a hard human problem.
That is, people don't want to give it to you.
When you're a software vendor to a large company, it's not at all unusual for it to be easier to make a multi-million-dollar sale than to get the data access necessary to actually deliver the finished software tool.
Why?
Partly, this is due to security concerns. There will typically be strict IT policies about what data can be shared with outsiders, and what types of network permissions are kosher.
For instance, in the semiconduc...