Enterprise-grade Data Engineering

“Data Engineering” to me isn’t about knowing SQL or Spark or Hadoop or S3. Those are just platforms: tools that get the job of managing data done more easily and quickly. Too often we approach “data engineering” as primarily picking a platform, shoving data into it, and managing the container. For me, that’s a small part of it, because the focus is right in the name: Data Engineering. Just cramming data into a database does not help you extract the maximum value from it. Done poorly, it racks up huge bills and provides only the illusion that you’re running the business on data. Is the result flexible? Is it locked behind a small set of people who can actually use it? Does it end up in an executive dashboard that doesn’t actually show what the executive thinks it’s showing?

Agentic AI isn’t coming to save you. The agent doesn’t understand your business or its hidden institutional knowledge. It will end up guessing what things mean and propagating the same hidden bugs over and over. You need data that has been engineered out of the messiness of the real world and into a format that humans and AI agents alike can find, manipulate, and reason with. “Garbage In, Garbage Out” goes back to the era of vacuum tubes, and it hasn’t changed. Ask generative AI to write a query against messy data and you will get an answer. Will it be correct? Will any two people get the same results? Maybe. Don’t run your business on “maybe”.
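
To make the “maybe” concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the order records, the field names, and the business rules are hypothetical, not taken from any real system. It shows three reasonable-looking ways to answer “what is our completed revenue?” against the same messy export, and then one way the rules might be written down once so that everyone, human or agent, queries the same thing.

```python
# Hypothetical raw export; all IDs, statuses, and amounts are invented.
raw_orders = [
    {"order_id": "1001", "status": "complete", "amount": "120.00"},
    {"order_id": "1002", "status": "Complete", "amount": "80.00"},
    {"order_id": "1002", "status": "Complete", "amount": "80.00"},  # loaded twice
    {"order_id": "1003", "status": "refunded", "amount": "50.00"},
    {"order_id": "1004", "status": "COMPLETE ", "amount": ""},      # amount missing
]


def revenue_literal(rows):
    """Guess 1: only rows whose status is exactly 'complete'."""
    return sum(float(r["amount"]) for r in rows if r["status"] == "complete")


def revenue_case_insensitive(rows):
    """Guess 2: ignore case and whitespace, treat a blank amount as zero."""
    return sum(
        float(r["amount"] or 0)
        for r in rows
        if r["status"].strip().lower() == "complete"
    )


def revenue_deduplicated(rows):
    """Guess 3: same as guess 2, but count each order_id only once."""
    seen = set()
    total = 0.0
    for r in rows:
        if r["status"].strip().lower() != "complete" or r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        total += float(r["amount"] or 0)
    return total


def engineer_orders(rows):
    """Apply the (assumed) business rules once, in one place.

    Returns (clean, rejected): clean rows have a normalized status, a numeric
    amount, and a unique order_id; rows with no amount are set aside for
    review instead of being silently counted as zero.
    """
    clean, rejected, seen = [], [], set()
    for r in rows:
        if r["order_id"] in seen:
            continue  # duplicate load of an order we already have
        seen.add(r["order_id"])
        if not r["amount"].strip():
            rejected.append(r)
            continue
        clean.append(
            {
                "order_id": r["order_id"],
                "status": r["status"].strip().lower(),
                "amount": float(r["amount"]),
            }
        )
    return clean, rejected


clean_orders, rejected_orders = engineer_orders(raw_orders)
completed_revenue = sum(
    o["amount"] for o in clean_orders if o["status"] == "complete"
)

print(revenue_literal(raw_orders))
print(revenue_case_insensitive(raw_orders))
print(revenue_deduplicated(raw_orders))
print(completed_revenue, [r["order_id"] for r in rejected_orders])
```

The three guesses return 120.0, 280.0, and 200.0 from the same five rows, and none of them is obviously wrong if you don’t know the business. The engineered version also lands on 200.0, but it gets there by rules that are written down, and it surfaces order 1004 as a question for a human rather than hiding it inside a total.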