I've seen some "data engineering" scripts that were complete messes and beyond c...

bob1029 · on Nov 3, 2024

I recently got a bit of a shocked reaction when I proposed to directly load daily files into temporary SQL tables and then use merge commands within the database to load the final tables. My use of code is essentially a shim between an SFTP client and SQL Server in this scenario. Maybe ~200 lines to connect, locate the files, run the bulk load operation, and then invoke the merge commands. Most of the fun bits are in the actual merge scripts.

Once your data is safely inside the database (temporary load tables or otherwise), there really isn't a good excuse for pulling it out and playing a bunch of circus tricks on it. Moving and transforming data within the RDBMS is infinitely more reliable than doing it with external tooling. Your ETL code should be entirely about getting the data safely into the RDBMS. It shouldn't even be responsible for testing new/deleted/modified records. You really want to use SQL for that.

You'll also be able to recruit more help if everything is neatly contained within the SQL tooling. In my scenario, business analysts can look at the merge commands and quickly iterate on the data pipeline if certain customers have weird quirks. They cannot do the same with some elaborate set of codebases, microservices, etc.

One specific thing that really sold me on this path was seeing how CTEs and views can make the T part of ETL 10000000x easier than even the fanciest code helpers like LINQ.

hinkley · on Nov 3, 2024

Except everyone wants microservices each with its own database.

392 · on Nov 5, 2024

If they can't explain why, they don't get the fun internet interview stuff

icedchai · on Nov 5, 2024

It's so much simpler to have a shared "integration" database. Until you have actual, real need for separate data sources, don't.

snidane · on Nov 3, 2024

The architecture is sound - typically called ELT these days. Dump contents of upstream straight into a database and apply stateless and deterministic operations to achieve the final result tables.

SQL server is where this breaks though. You'll get yelled by DBAs for bad db practices like storing wide text fields without casting them to varchar(32) or varchar(12), primary keys on strings or no indexes at all, and most importantly taking majority of storage on the db host for tbese raw dumps. SQL Server and any traditional database scales by adding machines, so you end up paying compute costs for your storage.

If you use a shared disk system with decoupled compute scaling from storage, then your system is the way to go. Ideally these days dump your files into a file storage like s3 and slap a table abstraction over it with some catalog and now you have 100x less storage costs and about 5-10x increased compute power with things like duckdb. Happy data engineering!

WorldMaker · on Nov 4, 2024

It amazes me how many DBAs think the limit on a varchar column impacts the disk space. The "on disk" size for `varchar(12)` and `varchar(32)` and `varchar(MAX)` are roughly the same and depends on the data itself more than the schema. That's what the "var" in "varchar" means: variable storage size. The limits like (32) were added for compatibility with `char` and for type-based "common sense" validation. Sure, it helps prevent footguns like accidental DDoS of ingesting too much data too quickly, but there are other ways to do that basic top-level validation of "is this too much data to insert?".

Five varchar(12) columns is more storage overhead than one varchar(60). There's a lot of great use cases for varchar(MAX) and everyone I ever had tell me that varchar(MAX) wasn't allowed didn't understand the internals of DB storage that they thought they did and somehow still believe in their internal model of the DB that varchar is just spicy char and fixed column size allocation.

icedchai · on Nov 4, 2024

With Postgres, we mostly just use `text` everywhere, unless there is an actual reason to have a size limit.

In other news, I haven't seen a dedicated "DBA" at a company in over a decade.

WorldMaker · on Nov 5, 2024

> With Postgres, we mostly just use `text` everywhere, unless there is an actual reason to have a size limit.

Yeah, there's still the very rare need to performance engineer out a fixed char field "to the left" of the table to speed up common table scans, but also so many of the reasons you might table scan strings have moved into proper full text search indexes or now all the rage is in vector embeddings.

> In other news, I haven't seen a dedicated "DBA" at a company in over a decade.

Yeah, anecdotally from LinkedIn and other sources it does seem like all the dedicated DBAs that have stayed that way have stuck to very specific niches and/or Oracle Products (including MySQL and derivatives these days; the "Oracle Effect" is strong). Especially in Amazon RDS and Azure SQL Server/Cosmos DB today, Postgres and Microsoft's SQL Server mostly run themselves and day-to-day administration is minor/trivial.

sevensor · on Nov 4, 2024

My experience with delta was that the catalog, being stored in s3 itself, was unacceptably slow, and for our data volume, Airflow was prohibitively expensive. We spent a lot of engineering time working around both problems. Which is funny because the consultants who advised us to do this told us it was the best possible solution; tailor made for our application, foolproof in every way. After that we proceeded to pay for their “data” “science” “services,” which went about as well as my scare quotes would suggest.

jamesblonde · on Nov 3, 2024

You're basically describing the Lakehouse Tables architecture. Store your data as tabular data in Iceberg/Hudi/Delta on S3. Save a bucket on storage. Query with whatever engine you like (Snowflake, Redshift, BQ, DuckDB, etc).

aoeusnth1 · on Nov 3, 2024

Yes, this is the vast majority of my data work at Google as well. Spanner + Files on disk (Placer) + distributed query engine (F1) which can read anything and everything (even google sheets) and join it all.

It’s amazingly productive and incredibly cheap to operate.

tgv · on Nov 3, 2024

Some of my colleagues use Microsoft PowerBI, and indeed, they upload a few hundred rows of data (and a few hundred columns, which get unpivoted in powerbi to a say 40k rows). When they upload it, the powerbi instance overloads, and people get timeouts and such. That can last up to 20 minutes. I stay away from that as far as I can.

Pxtl · on Nov 4, 2024

This is what bothers me with MS SQL related tools - they all seem horrendously brittle. Everything seems prone to deadlocks, has weird edge-cases, and incomplete coverage of the API of the next tool they're talking to so you keep having to break open the abstraction and manually tinker in the next level.