A few issues come up. First, inverted indices can be sharded, but index insert patterns aren't uniformly distributed; they follow a Zipf distribution, which means your sharding scales in proportion to the frequency of the most common token in the log. There are patches, but in the end it mostly boils down to this.
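A toy simulation of that skew, under assumed numbers (10k tokens, 8 shards, 1/rank Zipf-like weights; all names here are made up for illustration):

```python
import random
import zlib
from collections import Counter

random.seed(0)

# Hypothetical setup: 10,000 distinct tokens with Zipf-like frequencies,
# hash-sharded BY TOKEN across 8 shards (the scheme being criticized).
NUM_TOKENS = 10_000
NUM_SHARDS = 8

tokens = [f"tok{i}" for i in range(NUM_TOKENS)]
# Zipf-ish weights: the i-th most common token appears with weight ~ 1/(i+1).
weights = [1.0 / (i + 1) for i in range(NUM_TOKENS)]

# Simulate 100,000 index inserts drawn from that distribution.
sample = random.choices(tokens, weights=weights, k=100_000)

# Deterministic hash so the demo is reproducible across runs.
shard_load = Counter(zlib.crc32(t.encode()) % NUM_SHARDS for t in sample)
loads = sorted(shard_load.values(), reverse=True)
print("per-shard insert counts:", loads)
print("hottest/coldest ratio: %.1fx" % (loads[0] / loads[-1]))
```

The hottest shard is whichever one the most frequent token hashes to; adding more shards doesn't help, because all of that one token's postings still land on a single shard.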
Another issue is that indexing up front is extremely expensive compared to doing almost nothing beyond packing and time indexing, maybe with some Bloom indices. This really matters because the vast majority of log, event, and telemetry data in general is never accessed. Like 99.99% of it or more.
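A minimal sketch of what "maybe some Bloom indices" buys you: one small filter per packed batch lets a query skip batches that definitely don't contain a token, without building a full index. (The class and parameters below are invented for illustration, not any product's API.)

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: answers 'definitely not present' or 'maybe present'."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one big int

    def _positions(self, token):
        # Derive num_hashes positions from independent hashes of the token.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{token}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, token):
        for p in self._positions(token):
            self.bits |= 1 << p

    def might_contain(self, token):
        return all(self.bits >> p & 1 for p in self._positions(token))

# One filter per packed batch; queries probe filters instead of scanning batches.
batch_filter = BloomFilter()
for line in ["error db timeout", "user login ok"]:
    for token in line.split():
        batch_filter.add(token)

print(batch_filter.might_contain("timeout"))   # True: token was added
print(batch_filter.might_contain("segfault"))  # False unless a (rare) false positive
```

The filter is a few KB per batch versus a full posting list, which is the whole trade: cheap writes and storage, at the cost of occasionally scanning a batch you didn't need to.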
The technique of something like Loki is to batch data into micro-batches, index the data within each batch into a columnar store (like Parquet or ORC), and time-index the micro-batches. The query path is highly parallel and fairly expensive, but given the cost savings up front it's a lot cheaper than up-front indexing. You can turn the fan-out knob on queries to any size, and similar to MPP scale-out databases such as Snowflake, there's not really much of an upper limit. Effectively everything from ingestion to query scales out linearly without the uneven heat problems you see in a sharded index.
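The shape of that scheme can be sketched in a few lines (this is the general pattern, not Loki's actual implementation; lists stand in for columnar files, and threads stand in for the scale-out workers):

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class MicroBatch:
    t_min: int    # batch time range (e.g. epoch seconds)
    t_max: int
    lines: list   # packed log lines; a real store would use Parquet/ORC here

# The only global structure is a time index over batches.
batches = [
    MicroBatch(0, 99,    ["a=1 error", "b=2 ok"]),
    MicroBatch(100, 199, ["c=3 error", "d=4 ok"]),
    MicroBatch(200, 299, ["e=5 ok"]),
]

def scan(batch, needle):
    # Expensive but embarrassingly parallel: brute-force scan within one batch.
    return [ln for ln in batch.lines if needle in ln]

def query(t_from, t_to, needle, fan_out=8):
    # Time index prunes batches; the fan-out knob controls parallelism.
    candidates = [b for b in batches if b.t_max >= t_from and b.t_min <= t_to]
    with ThreadPoolExecutor(max_workers=fan_out) as pool:
        results = pool.map(lambda b: scan(b, needle), candidates)
    return [ln for part in results for ln in part]

print(query(0, 250, "error"))  # ['a=1 error', 'c=3 error']
```

Ingestion is append-only (no index to rebalance), and query cost is paid only for the time ranges you actually touch, which is why it wins when 99.99% of the data is never read.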
> which means your sharding scales proportional to the frequency of the most common token in the log
The inverted index entry for a frequent token can be sharded itself. You can imagine that Google doesn't store all the page IDs on the internet for the word 'hello' on the same server.
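A sketch of that idea, with invented names and sizes: a single hot token's posting list is split across sub-shards by hashing the document ID, at the price of fanning out every lookup:

```python
import zlib

NUM_SUBSHARDS = 4

# Postings for ONE very hot token, split by hashing the document ID, so the
# token's list is spread across sub-shards rather than pinned to one server.
subshards = {i: set() for i in range(NUM_SUBSHARDS)}

def add_posting(doc_id):
    subshards[zlib.crc32(doc_id.encode()) % NUM_SUBSHARDS].add(doc_id)

def lookup():
    # The query side must now fan out to every sub-shard and merge the results.
    return set().union(*subshards.values())

for doc in ("doc%03d" % i for i in range(1000)):
    add_posting(doc)

sizes = [len(s) for s in subshards.values()]
print(sizes)  # sub-shard sizes; roughly balanced when doc IDs hash uniformly
```

Write load for the hot token spreads evenly, which is the point being made; the counterpoint downthread is about what it costs to do this splitting online, consistently, as heat is detected.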
> This is really important because the vast majority of log and event and telemetry in general is never accessed. Like 99.99% of it or more.
For log processing you are likely correct. I was more wondering, in general, why you think an inverted index doesn't scale.
These sorts of heat-balancing sharding schemes are very difficult to implement and very expensive. As you see hot keys, you need to split the hash space and rebalance within that space by reshuffling the shard data.
I'd also note that Google doesn't bother keeping a perfect index, because perfect fidelity isn't necessary, unlike in a lot of analytic or similar systems where replication of ground truth is important. It's much more important for Google to maintain high fidelity at the less frequent end of the token distribution and very low fidelity at the high-frequency end. Logs can't do that.
It's actually quite hard. It starts with being able to detect a hot key at all. It's also not the case that heat is symmetric with size; in fact, in an inverted index single entries can be very hot. Then it's not about simply shuffling data (which isn't simple, as you outline: you need to salt the keys and then shuffle randomly, otherwise you don't get uniformity). Then you need to create cumulatively eventually consistent write replicas to balance write load while answering queries online in a strongly consistent way. Add to this that any dynamic change to the index like this requires consistent online behavior (i.e., ingestion and queries don't stop because you need to rebalance), and the hot keys are necessarily "large" in volume, so back pressure can be enormous and queue draining itself can be expensive. On top of that, you need stateful elastic infrastructure.
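The salting step alone illustrates the read-side cost. A toy version (names and salt count invented; a real system also has to manage consistency across the salted copies, which this sketch ignores):

```python
import random

random.seed(1)
NUM_SALTS = 16

# Hypothetical hot-key mitigation: writes to a detected hot key are spread
# across salted sub-keys; a random salt gives a uniform spread of write load.
store = {}  # salted key -> list of values

def write(key, value):
    salted = f"{key}#{random.randrange(NUM_SALTS)}"
    store.setdefault(salted, []).append(value)

def read(key):
    # Every read must now probe ALL salted copies and merge them.
    out = []
    for salt in range(NUM_SALTS):
        out.extend(store.get(f"{key}#{salt}", []))
    return out

for i in range(1000):
    write("hot_token", i)

print(len(read("hot_token")))  # 1000: nothing lost, but 16 locations per read
```

And this is the easy part: doing the split online, while the hot key keeps absorbing writes and queries still need consistent answers, is where the expensive engineering goes.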
There are definitely products that offer these characteristics; S3 and DynamoDB both do, even if you can't see it. But it took many years of very intensive engineering to get it to work, and those teams have total control over the infrastructure and runtime behind an opaque API. Elasticsearch and Splunk are general-purpose software packages installed by customers, and their data models are much more complex than objects or tables.
> It’s also not the case that heat is symmetric with size, in fact in an inverted index single entries can be very hot.
I think you mixed two orthogonal topics: you first talked about frequent tokens, and now switched to hot keys (tokens which are frequently queried).
As for frequent tokens, I think I described the algorithm well, and it looks simple; I don't see any issues there, provided your metadata store (where you store info about shards and replicas) allows some kind of transactions (e.g. CockroachDB or similar).
For hot keys/shards, as you pointed out, the solution is to increase the replication factor, but I think if a shard is relatively small (10M IDs, as in my example), adding another replica online is also fast, can be done in a single transaction, and may not require all the movement you described.
I've seen situations where the cost of indexing all the logs in ES (logs which were otherwise just written to a hierarchical structure in HDFS and queried with MapReduce jobs) would have been highly significant: think an uncomfortable fraction of total infrastructure spend. So, sure, you can make it scale linearly by adding enough nodes to keep up with write volume, but that doesn't mean it's affordable. And then consider what that's actually accomplishing for those dollars. You're optimizing for quick search queries on data you'll mostly never query. Worth it?
EDIT: as a user, being able to just run MapReduce jobs over logs is a heck of a lot better experience IMO than trying to torture Kibana into giving me the answers I want.
Curious why you think so? An inverted index can be sharded and built/updated/queried in parallel, so it scales linearly.