
I like logs. Unlike most people selling and using observability platforms, most of the software I write is run by other people. That means it can't send me traces and I can't scrape it for metrics, but I still have to figure out and fix their problems. To me, logs are the answer. Logs are easy to pass around, and you can put whatever you want in there. I have libraries for metrics and traces, and just parse them out of the logs when that sort of presentation would be useful. (Yes, we do sampling as well.)

I keep hearing that this doesn't scale. When I worked at Google, we used this sort of system to monitor our Google Fiber devices. They just uploaded their logs every minute (buffered in memory, and preserved across warm reboots thanks to a custom Linux kernel with printk_persist), and then my software processed them into metrics for the "fast query" monitoring systems. The most important metrics fed into alerts, but it didn't take very much time to just re-read all the logs if you wanted to add something new. Amazingly, the first version of this system ran on a single machine... one Go program handling 10,000 qps of log uploads and analysis. I eventually distributed it to survive machine and datacenter failures, but it ultimately isn't that computationally intensive. The point is, it kind of scales OK. Up to tens of terabytes a day, it's something you don't even have to think about except for the storage cost.

At some point it does make sense to move things into better databases than logs; you want to be alerted by your monitoring system that 99%-ile latency is high, then look in Jaeger for long-running traces, then take the trace ID and search your logs for it. If you start with logs, you have that capability. If you start with something else, then you just have "the program is broken, good luck" and you have to guess what the problem is whenever you debug. Ideally, your program would just tell you what's broken. That's what logs are.

One place where people get burned with logs is not being careful about what to log. Logs are the primary user interface for operators of your software (i.e. you during your oncall week), and that interface deserves the attention that any other user interface demands. People often start by logging too much, then get tired of the "spam", and end up not logging enough. Then a problem occurs and the logs are outright misleading. (My favorite is event failures that are retried, but the retry isn't logged anywhere. You end up seeing "ERROR foobar attempt 1/3 failed" and have no way of knowing that attempt 2/3 succeeded a millisecond after that log line.)

For the gophers around, here's what I do for traces: https://github.com/pachyderm/pachyderm/blob/master/src/inter... and metrics: https://github.com/pachyderm/pachyderm/blob/master/src/inter.... If you have a pipeline for storing and retrieving logs (which is exactly the case for this particular piece of software), now you have metrics and traces. It's great! I just need to write the thing to turn a set of log files into a UI that looks like Jaeger and Prometheus ;) My favorite part is that I don't need to care about the cardinality of metrics; every RPC gets its own set of metrics. So I can write a quick jq program to figure out how much bandwidth the entire system is using, or I can look at how much bandwidth one request is using. (The meters log every X bytes, and log entries have timestamps.)

I think that since we've added this capability to our system, incidents are most often resolved with "that's fixed in the next patch release" instead of multiple rounds of "can you try this custom build and take another debug dump?". Very enjoyable.



I'm also a fan of logs. If you have some more examples of how you typically log things to be most effective, I'd love to see 'em! I'm still finding my sense for when it's too much versus too little, the best way to incorporate runtime data, and how to structure log messages to work well with other systems. Hearing from others and seeing battle-tested examples would surely help. Or if you're down to chat a bit, I can send you an email and continue the conversation. Will check out Pachyderm in the meantime~


You can send me an email.

To me the golden rule is "show your work". Every operation that can start and end should log the start and the end. If your process is using CPU but not logging anything, something has gone wrong. Aim to log something about ongoing requests/operations every second or so. (This is spammy if you're doing 100,000 things concurrently. I use zap, and zap's log sampling keys on the message; so if your message is "incoming request" and 100,000 of them are arriving per second, you can have it write the logs for only one of them each second. I hate to sample, but it's a necessity for large instances and hasn't caused me any problems yet.)

I also like to keep log levels simple: DEBUG for things interesting to the dev team, INFO for things interesting to the operations team, ERROR for things that require human intervention. People often ask me "why don't we have a WARN level?", and it's because I think warnings are either to be ignored or are fatal. Warnings ("your object storage configuration will be deprecated in 2.10 and removed in 2.11, please migrate according to these docs") should appear in the user-facing UI, not in the logs. They do require human action eventually.

Overall, I'm more of a "print" debugger than a "step through the code with breakpoints" debugger. To me, this is an essential skill when you're running code on someone else's infrastructure; you will be 1000 times slower at operating the debugger when you are telling someone via a support ticket which commands to run. (Even if the servers are yours, I don't love sshing into production and mutating it.) So ultimately, the logs need to collect whatever you'd be looking for if you had a reproduction locally and were trying to figure out the problem. It's an art and not a science; you will get it wrong sometimes, and your resolution for the underlying bug will include better observability as part of the fix. This is usually enough to never have a problem with that subsystem again ;)



