This is essentially Amazon Coral’s service log format except service logs includ...

deschutes · on Feb 28, 2024

As far as design and ergonomics go, I'd compare servicelogs to a pile of trash that may yet grow massive enough to accrete into a planetoid.

A text based format whose sole virtue is descending from a system that was composed mainly of bugs that had coalesced into perl scripts.

It's not the basis of something you could even give away, let alone have people willingly pay you for their agony. Cloudwatch being rather alike in this regard.

jcgrillo · on Feb 28, 2024

One thing that really gets under my skin when I think about observability data is the abject waste we incur by shipping all this crap around as UTF-8 bytes. This post (from 1996!) puts us all to shame: https://lists.w3.org/Archives/Public/www-logging/1996May/000...

Knowing the type of each field unlocks some interesting possibilities. If we can classify fields as STRING, INTEGER, UUID, FLOAT, TIMESTAMP, IP, etc we could store (and transmit!) them optimally. In particular, knowing whether we can delta-encode is important--if you have a timestamp column, storing the deltas (with varint or vbyte encoding) is way cheaper than storing each and every timestamp. Only store each string once, in a compressed way, and refer to it by ID (with smaller IDs for more frequent strings).

It's sickening to imagine how much could be saved by exploiting redundancy in these data if we could just know the type of each field. You get some of this with formats like protocol buffers, but not enough.

Another thing, as you mention, is optimizing for search. Indexing everything seems like the wrong move. Maybe some partial indexing strategy? Rollups? Just do everything with mapreduce jobs? I don't know what the right answer is but fully indexing data which are mostly write-only is definitely wrong.

fnordpiglet · on Feb 28, 2024

In my analysis of some exabyte scale log analytic systems the top two compute spend buckets was json serde and utf8 conversions.

ablob · on Feb 28, 2024

Storing by delta can bite you quite hard in the event of data corruption. Instead of 1 data point being affected it would cascade down. Selecting specific ranges where the concrete bottom/top as in "give me everything between 1-2 pm from last Saturday" might also become problematic. I'm sure there's a tradeoff to be had here; Weaving data-dependencies throughout your file certainly leaves a redundancy hole not everyone is willing to have.

jcgrillo · on Feb 28, 2024

I think we could limit the blast radius by working in reasonably sized chunks--like O(10-100MB)--and possibly replicating (which becomes much more attractive when the data set gets a lot smaller). But you're right, it's a good point that redundancy can be a feature.

heyoni · on Feb 28, 2024

That second part of only storing reference ids sounds like db normalizing but for logs. Seems reasonable though!

rmbyrro · on Feb 28, 2024

Which compromises in CloudWatch Log Insights makes it not the right model for "wide events"?

I have the impression it does a good job providing visibility tools (search, filter, aggregation...) over structured logs.

Ergonomics is bad, though, with the custom query language and low processing speed, depending on the amount of data you're processing during an investigation.

fnordpiglet · on Feb 28, 2024

I think you misunderstood. It does.

flaminHotSpeedo · on Feb 28, 2024

> This surfaces in cloudwatch logs as metrics extraction and Logs Insights as structured log queries. The meta scuba is like a janky imitation of that tool chain

I don't have any experience with scuba besides this article, but I think you've missed the point. Wide events, based on my understanding, are a combination of traditional logs and something akin to service logs.

This provides two crucial improvements. The first is flexible, arbitrary associations as a first-class feature. As I interpret it, wide events give you the ability to associate a free-form traditional log message with additional dimensions, which is similar to what service logs offer but more flexible. E.g. if you log "caught unhandled FooException, returning ServerException" but only emit a metric for ServerException=1, service logs can't help you.

The other major benefit that you seem to have overlooked is a good UI to explore those events. I think most people would agree that the cloud watch UI is somewhere between bad to mediocre, but the monitor portal UI is nothing short of an unmitigated disaster. And neither give you the ability described in this article, to point and click graph events that match certain criteria. As I read it, it's the equivalent functionality to simple insights queries, except it doesn't require any typing, searching for the right dimension names, or writing stats queries to get graphs.

riku_iki · on Feb 28, 2024

> inverted index based solutions algorithmically can’t scale to arbitrary sizes

curious why do you think so? Inverted index can be sharded and built/updated/queried in parallel, so scale linearly.

fnordpiglet · on Feb 28, 2024

A few issues come up. First inverted indices can be sharded but the index insert patterns aren’t uniformly distributed but instead have a zipf distribution, which means your sharding scales proportional to the frequency of the most common token in the log. There are patches but in the end it sort of boils down to this.

Another issue is indexing up front is crazy expensive vs doing absolutely nothing but packing and time indexing, maybe some bloom indices. This is really important because the vast majority of log and event and telemetry in general is never accessed. Like 99.99% of it or more.

The technique of something like Loki is to batch data into micro batches and index them within the batches into a columnar store (like parquet of orc) and time index the micro batches. The query path is highly parallel and fairly expensive, but given the cost savings up front it’s a lot cheaper than up front indexing. You can turn the fan out knob on queries to any size and similar to MPP scale out databases such as Snowflake there’s not really much of an upper limit. Effectively everything from ingestion to query scales out linearly without uneven heat problems like you see in a sharded index.

riku_iki · on Feb 28, 2024

> which means your sharding scales proportional to the frequency of the most common token in the log

inverted index entry for frequent token can be sharded itself. You can imagine that google doesn't store all page ids in internet for the word 'hello' on the same server.

> This is really important because the vast majority of log and event and telemetry in general is never accessed. Like 99.99% of it or more.

for log processing you are likely correct. I was more wondering in general why do you think inverted index doesn't scale.

fnordpiglet · on Feb 28, 2024

These sorts of heat balancing sharding schemes are very difficult to implement and very expensive. As you see hot keys you need to split the hash space and rebalance within that space by reshuffling the shard data.

I’d note that also Google doesn’t bother keeping a perfect index because perfect fidelity isn’t necessary, unlike in a lot analytic or similar system where replication of ground truth is important. It’s much more important for Google to maintain high fidelity at the less frequent token side of the distribution and very low fidelity at the high frequency side. Logs can’t do that.

riku_iki · on Feb 28, 2024

> As you see hot keys you need to split the hash space and rebalance within that space by reshuffling the shard data.

Correct, but I think it is solvable problem, if such approach adds value (e.g. one can build and start selling product like that).

Also, complexity may be not that hard. Say, you are building some tiered shards:

- every entry at the beginning is single shard

- once shard reaches 10m records, you split it to 10 shards, which is not that difficult computation, and can be done in XXXms in single transaction

fnordpiglet · on Feb 28, 2024

It’s actually quite hard. It starts with being able to detect a hot key at all. It’s also not the case that heat is symmetric with size, in fact in an inverted index single entries can be very hot. Then it’s not about simply shuffling data (which isn’t simple as you outline - you need to salt the keys and they shuffle randomly, otherwise you don’t get uniformity), then you need to create cumulatively eventually consistent write replicas to balance write load while answering queries online in a strongly consistent way. Add to this any dynamic change in the index like this requires consistent online behavior (I.e., ingestion and queries don’t stop because you need to rebalance), and the hot keys are necessarily “large” in volume so back pressure can be enormous and queue draining itself can be expensive. Add to it you need stateful elastic infrastructure.

There are definitely products that offer these characteristics. S3 and dynamodb both do, even if you can’t see it. But it took many years of very intensive engineering to get it to work, and they have total control over the infrastructure and runtime behind an opaque api. Elastic search and Splunk are general purpose software packages that are installed by customers, and their data models are much more complex than objects or tables.

riku_iki · on Feb 28, 2024

> It’s also not the case that heat is symmetric with size, in fact in an inverted index single entries can be very hot.

I think you mixed two orthogonal topics: you first talked about frequent tokens, and now switched to hot keys(tokens which are frequently queried).

As for frequent tokens, I think I well described algorithm, and it looks simple, and I don't see any issues there, if your metadata store (where you store info about shards and replicas) allows some kind of transactions (e.g. cocroachdb or similar).

For hot keys/shards, as you pointed out, solution is to increase replication factor, but I think if shard is relatively small(10m IDs as in my example), adding another replica online is also fast, can be done in single transaction, and may not require all these movement you described.

fnordpiglet · on Feb 28, 2024

The thread is about inverted indices. Frequent tokens are hot keys.

jcgrillo · on Feb 28, 2024

I've seen situations where the cost of indexing all the logs (which were otherwise just written to a hierarchical structure in HDFS and queried with mapreduce jobs) in ES would have been highly significant--think like an uncomfortable fraction of total infrastructure spend. So, sure, you can make it scale linearly by adding enough nodes to keep up with write volume but that doesn't mean it's affordable. And then consider what that's actually accomplishing for those dollars. You're optimizing for quick search queries on data which you'll mostly never query. Worth it?

EDIT: as a user, being able to just run mapreduce jobs over logs is a heck of a lot better experience IMO than trying to torture Kibana into giving me the answers I want.

fnordpiglet · on Feb 28, 2024

Loki basically works like a nice UI on mapreducing logs. The rest is exactly my point about ELK.

riku_iki · on Feb 28, 2024

it could be shortcoming of ES specifically, since JVM is not ideal platform for heavy data crunching.