At the company I work for we send JSON to Kafka and subsequently to Elasticsearch with great effect. That's basically 'wide events'. The magical thing about hooking up a bunch of pipelines with Kafka is that all of a sudden your observability/metrics system becomes an amazing API for extending systems with additional automations. Want to do something when a router connects to a network? Just subscribe to this Kafka topic here. It doesn't matter that the topic was originally intended just to log some events. We even created an open source library for writing and running these pipelines in Jupyter. Here's a super simple example https://github.com/bitswan-space/BitSwan/blob/master/example...
People tend to think kafka is hard, but as you can see from the example, it can be extremely easy.
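To make that concrete, here's a minimal sketch of the "subscribe to a topic to drive an automation" pattern. The topic name `router.connected` and the event shape are my assumptions, and the consumer loop is written against kafka-python's `KafkaConsumer` interface rather than the BitSwan library linked above:

```python
import json

def handle_router_event(raw: bytes) -> str:
    """Parse a router-connect event and return the router id (field name assumed)."""
    event = json.loads(raw)
    return event["router_id"]

def run(consumer):
    # `consumer` is any iterable of messages with a .value attribute, e.g.
    # kafka-python's KafkaConsumer("router.connected", bootstrap_servers="...").
    # The topic was "just logging", but any process can now react to it.
    for msg in consumer:
        router_id = handle_router_event(msg.value)
        print(f"router {router_id} connected; kicking off provisioning")
```

The point is that the automation never touches the emitting service: it only reads the topic.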
This works well for a while. But eventually you get big, and have little to no idea of what is in your downstream. Then every single format change in any event you write must be treated like open heart surgery, because tracing your data dependencies is unreliable.
Sometimes it seems fixable by 'just having a list of people listening', and then you look and find that all some of them do is mildly transform your data and pass it along. It doesn't take long before people realize that 'just logging some events' is making future promises to other teams you don't know about, and people start being terrified of emitting anything.
This is a story I've seen in at least 4 places in my career. Making data available to other people is not any less scary in Kafka than it was back in the days when applications shared a giant database, and you'd see year-long projects to make some mild changes to a data model that was originally designed in 5 minutes.
As for Kafka being easy, it's not quite as hard as some people say, but it's both a pub sub system and a distributed database. When your clusters get large, it definitely isn't easy.
> This works well for a while. But eventually you get big, and have little to no idea of what is in your downstream. Then every single format change in any event you write must be treated like open heart surgery, because tracing your data dependencies is unreliable.
Yeah, I'd always use protobuf or similar rather than JSON for that reason, and if you need a truly breaking change I'd emit a new version of the events to a new topic rather than trying to migrate the existing one in place. It's not actually so costly to keep writing events to an old topic (and if you really want you can move that part into a separate adapter process that reads your new topic and writes to your old one). Or you can do the whole avro/schema-registry stuff if you prefer.
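The "separate adapter process" idea can be sketched in a few lines: consume the new topic and republish in the old shape so legacy consumers keep working. The topic names (`orders.v2`, `orders.v1`) and the field mapping are hypothetical, not from the comment above:

```python
import json

def v2_to_v1(event_v2: dict) -> dict:
    # Downgrade: suppose v2 split `name` into first/last, while v1 consumers
    # still expect a single `name` field.
    return {
        "id": event_v2["id"],
        "name": f"{event_v2['first_name']} {event_v2['last_name']}",
    }

def run_adapter(consumer, producer):
    # `consumer` iterates messages from "orders.v2"; `producer.send` writes
    # the downgraded event back to "orders.v1" for legacy readers.
    for msg in consumer:
        legacy = v2_to_v1(json.loads(msg.value))
        producer.send("orders.v1", json.dumps(legacy).encode())
```

This keeps the breaking change contained: new consumers move to the new topic at their own pace, and the adapter is the only code that knows both shapes.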
> Making data available to other people is not any less scary in kafka than it was back in the days where applications shared a giant database
It should be significantly less scary: it's impossible to mutate data in place; foreign-key issues are something you go back, fix, and reprocess rather than something that takes down your OLTP system; schema changes are better understood and less big-bang; and event streams generated by transforming another event stream are completely indistinguishable from "original" event streams, whereas views are sort-of-like-tables but come with all sorts of caveats and gotchas.
> As for Kafka being easy, it's not quite as hard as some people say, but it's both a pub sub system and a distributed database. When your clusters get large, it definitely isn't easy.
There are hard parts but also parts that are easier than a traditional database. There's no query planner, no MVCC, no locks, no deadlocks, no isolation levels, indices are not magic, ...
I think you're missing that person's point though. This evolution implied in the thread was:
1. Write "logging" data (observability, whatever)
2. Someone else starts using that to drive behavior
3. Change your logging, because it's just logging, right? And stuff breaks.
To state it another way, anything you're emitting, _even internal logging_, is part of your API/contract, and therefore can't be changed carelessly. That problem is the same no matter what technology you use.
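A tiny illustration of that trap, with hypothetical event shapes: a downstream consumer quietly depends on a field name nobody documented, so a "harmless" rename upstream is a breaking API change.

```python
import json

def downstream_handler(raw: bytes) -> str:
    # The undocumented contract: this field name must never change.
    return json.loads(raw)["router_id"]

old_event = json.dumps({"router_id": "r1"}).encode()
new_event = json.dumps({"routerId": "r1"}).encode()  # "harmless" rename upstream

downstream_handler(old_event)   # works
# downstream_handler(new_event) # raises KeyError: 'router_id'
```

Nothing in the type system or the broker stops this; the emitter only finds out when someone else's pipeline falls over.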
I think this is the crux of it: if something works for a while, then actually that's fine. As an industry we over-index and scare new developers towards complexity. The converse is true too: what works at scale doesn't work at non-scale, not because of the tech, but because holistically you're asking for a lot: a lot of knowledge, a lot of complex tech to be deployed by a small team.
I'm glad that works for you but to me it sounds really expensive. At small scale you can do this any way you want but if you build an observability system with linear cost and a high coefficient it will become an issue if you run into some success.
The only expensive part is the hardware for the Elastic servers. Kafka is cheap to run. We have an on-prem Elastic db pulling in tens of thousands of events per second. On-prem servers aren't that expensive. It's really just 6 servers with 20 TB each and another 40 TB for backups. And it's not like you have to store everything forever... Compare that data flow to everyone watching YouTube all the time. It's really nothing...
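A quick back-of-envelope check of those numbers. The event rate and disk totals come from the comment; the average event size is my assumption:

```python
# Rough capacity math for the cluster described above.
events_per_sec = 30_000      # "tens of thousands of events per second"
bytes_per_event = 1_000      # assumed ~1 KB average JSON document
cluster_bytes = 6 * 20e12    # six servers with 20 TB each

ingest_per_day = events_per_sec * bytes_per_event * 86_400
retention_days = cluster_bytes / ingest_per_day

print(f"~{ingest_per_day / 1e12:.1f} TB/day -> ~{retention_days:.0f} days of retention")
```

Under those assumptions you ingest a couple of TB per day and the cluster holds on the order of a month or two of history before replication overhead, which fits the "you don't store everything forever" point.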
I can name a single company in my area that runs their own servers, and they've been in the middle of a migration to the cloud for the past five years.