Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Unwinding the Stack the Hard Way (lesenechal.fr)
136 points by todsacerdoti on April 16, 2023 | hide | past | favorite | 51 comments


There's a far simpler and also very generic method that I'm surprised no one has really mentioned much, although it's well-known in some corners of the RE/Asm community and I believe some debuggers (but not gdb?) use it: scan the stack for values that look like valid code addresses, then disassemble backwards from there to see if they were actually return addresses written there by a call instruction. You will find a chain that leads back to the entry point.

Of course it won't work in edge cases like handwritten Asm that uses the stack in more clever ways, but when you're dealing with compiler output, it'll be fine. No need for all this complexity, and works in almost all cases.


That method is pretty much guarnateed to give you completely broken stacks. We (Sentry) use it as a fallback if nothing else works, and the success rate is awful.


Are you also doing back-disassembly to link the call chain? I've had very good success with it. I agree that without analysing the potential call sites and making sure that they are actually possible, it won't work.


Generally what we do when we fall back to scanning depends on the architecture. For x86 you can find the logic here: https://github.com/rust-minidump/rust-minidump/blob/77638ab7...

Since we do post-hoc stack walking our ability to actually look at the assembly is limited. In most cases where we have no CFI we also do not have the binary to begin with, so we're in random memory land.


If we find such an instruction, we assume it's an ip value that was pushed by the CALL instruction that created the current frame.

I'm not familiar with Rust, but if I'm reading it correctly, you're not doing back-disassembly and only checking if an address on the stack is in a code region, which certainly won't work well.

we also do not have the binary to begin with

Your "minidumps" only contain the stack?


Minidumps only contain the stack, yes, along with additional metadata, but not the original executable. This is an old format that I think Microsoft originally designed and that Google extended for the breakpad project. It's used by Chrome and Mozilla also uses it for Firefox. It's essentially a crash reporting binary format.

I linked you to it elsewhere in this thread. Sentry is using a rust implementation of the stack walker.


Please elaborate on what breaks.


The act of stackwalking breaks. You often end up in completely nonsensical stacks halfway through and then all is lost.


That makes sense, as you're fucked by any false positive.

If it's a graph of potential stacks, though, wouldn't you eventually find the one that unwinds, if one exists?


GP mentioned in another comment that they don't actually follow the algorithm described by userbinator -- checking the return address for a preceding call -- which would increase the number of false positives a lot.


The thing that makes this tricky is that x86 is a variable length instruction set: you don't know where the preceding call instruction might begin, and you can't decode backwards. You can do it speculatively, but ambiguities are still possible.


Given that the full technique apparently works well for userbinator, we can assume that this isn't a very significant issue. The chances of a phantom call instruction will be pretty low, and there's a fairly low maximum instruction length.


Yes, but checking for valid CALL instructions is probably sufficiently accurate.


Is this just an artifact of old return addresses on the stack not being overwritten?


Probably most valid code addresses on the stack are from older calls that are no longer part of the current call chain.


There's a chance of that, but in practice it's easy to filter them out, since only one of the chains will be complete from the very top of the stack to the current stack pointer. The others will almost certainly be partially overwritten by other stack activity and incomplete, or not match the current stack pointer.


This has been a fallback method for Google Breakpad's stackwalker for more than a decade. It generates basically useless stack traces, at least for stacks captured from Android where you're trying to trace through NDK code invoked from the ART. Compounding the problem is a complete lack of available symbols for Android.

https://github.com/google/breakpad/blob/main/docs/stack_walk...

Also see recent discussion here:

https://news.ycombinator.com/item?id=35438171


Not exactly.

I dug around in that code a little, and although it contains a disassembler, it's not being used for inspecting the call chain. All it does is check that the address appears to be inside a code section.

https://github.com/google/breakpad/blob/main/src/processor/s...


True, but that's because all it has is a snapshot of the stack to work with. The code isn't available to it.


To be fair, doing backtrackes through JNI calls is generally pretty complicated.


How does this work in scenarios with heavy GOT (Global Offset Table) usage, trampolines, or where v-tables are prevalent?


Do you analyze the dynamic loader segments (to determine code-like addresses) or just do some hard-coded approximations per platform? Pretty cute idea and I agree it will often work.


It just needs to be on an executable page. The processor doesn't care about anything else, so this will work with things like JIT-generated code too. The key idea is that in essentially all cases of compiler-generated code, every return address will be immediately after the call that wrote it.


Yeah, but how do you look up executable page mappings? (Dynamic loader segments will have at least one executable Seg mapped somewhere.)


Linux: /proc/$pid/maps

Windows: VirtualQuery()

Mac: vm_region()

Your own OS (like this article): you should know.


Ok, some platform specific mechanisms. Cool.


I would suggest not omitting the frame pointer. Fedora recently changed the default and it makes collecting stack traces vastly simpler, leading to better profiling support (https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwar...)

Since then I gave a short (15 min) talk about producing and understanding flame graphs: http://oirase.annexia.org/tmp/2023-03-08-flamegraphs.mp4


I still hope we can revert that once we have complete, CPU-verified backtraces. As a linked list, frame pointers are still somewhat slow. Just copying the hardware address stack should be quite a bit faster, and more importantly, the CPU will enforce that the addresses are correct.


Separate call and argument stacks?


Separate stack for return addresses (SHSTK, shadow stack).


Yeah, and I like I mentioned in the earlier comment, omitting the frame pointer reduces code size by 10% on RISC-V targets, which is huge when dealing with embedded flash: https://github.com/tock/tock/pull/1660


This says the penalty is 10% on risc-v vs 2-3% on arm - anyone have background for why it's so?


There can only be two answers:

- ARM frame pointer handling is more compact (relative to the non-frame pointer code) than that on RISC-V.

- ARM code does less frame pointer handling.

The first might be true because ARM has the stmdb instruction which (https://developer.arm.com/documentation/ddi0406/b/Applicatio...) stores multiple registers to consecutive memory locations using an address from a base register. The consecutive memory locations end just below this address, and the address of the first of those locations can optionally be written back to the base register.

So, code can save registers on the stack _and_ decrease the stack pointer in a single instruction.

The second might be true because, of the compilers/compiler settings used for those measurements the ARM one did more aggressive function inlining.


Yeah, mandating rbp on amd64 would have saved years of headache. There are still cases like tail calls that wouldn't work, but the vast majority of code doesn't need to care about dedicating a register to having usable backtraces.


Strongly agreed. Unwinding with libunwind, i.e. DWARF, is way, way more complex. A simple linked list is much better for the common use-cases.


That was discussed at the time on HN: https://news.ycombinator.com/item?id=34632677

My comment on that article still stands: https://news.ycombinator.com/item?id=34660474

tl;dr: don't make code generation worse for everyone else just to appease your one tiny use-case; fix your goddamn tools instead.


Walking the stack via frame pointers is also faster than some other methods (https://blog.felixge.de/reducing-gos-execution-tracer-overhe...), it appears. This benefits use cases like profilers and tracers. In big deployments, continuous (sampled) profiling and tracing are common, helping not only with continuous performance improvements in fleets but also helping to root cause outages. Reducing the cost of such infrastructure can pay off.


For the always-on open-source profiler we happen to work on at work we had to do similar things and it was even more involved since we base the whole thing on eBPF to lower the overhead and therefore needed to get the verifier to accept it. [1]

We really wish frame pointers were always present, but here we are.

[1] https://www.polarsignals.com/blog/posts/2022/11/29/profiling...


We have implemented asynchronous signal-safe in-process stack unwinding for always-on profiler in ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-perform...

The downside - it required many patches to LLVM's libunwind, and not all of them are accepted yet: https://bugs.llvm.org/show_bug.cgi?id=48186

ClickHouse source code: https://github.com/ClickHouse/ClickHouse


Please just stop omitting frame pointers. You might lose out on 6 months of CPU speed advances, but from then on out you will reap the benefits of better production observability for years to come.


> ...yet I’ve seen programs handling SIGSEGV: for God’s sake, don’t do that!

I recommend the author not examine JVM null pointer handling too closely then.


(I’m the author.) I will not then. It surprises me that the JVM would handle null pointers by hoping dereferencing one would raise a segfault, considering that the JVM is written in C AFAIK where such dereferencing invokes undefined behavior, not a memory access at address 0 which would segfault most of the time.


I appreciate you replying, and my sympathies for the posting weirdness - I was worried that I had inadvertently caused a karma trap.

From what I recall (it's been ~ 5 years), at least on OpenJ9, the potential SIGSEGV's for null references occurred in code generated by our JIT compiler, or in hand-rolled assembly, and not in code generated by any C translation unit. This may well not satisfy the standard, but honestly, at that point, I'm not sure any FFI-linked code will.

If you find this distinction insufficient, and still require the use of a JVM, if my memory serves correctly, -Xrs on OpenJ9 will avoid using those trap handlers. I don't believe this is true of Hotspot.


If the null dereferencing happens in assembly code and not in C, it seems fine to me: the C standard is irrelevant for assembly; after all, it's the C language that introduces UB, not the machine code.


I think it's an optimization. Checking for null pointer on every object access can become costly, as it's a fundamental operation (everything is a pointer in Java). By handling SIGSEGV, you save time in non-null case, at the expense of more complex code and, as you've mentioned, non-standard behavior.

Also wanted to mention your comment was [dead], had to vouch for it to resurrect it.


I understand the problem, but I am concerned by the solution. The problem is that C semantic is quite clear on the subject[1]: dereferencing a null pointer is undefined behavior, not "non-standard behavior" which would suggests correct albeit non-portable behavior. I don't know of any compiler switch that tells the compiler to "define" this behavior, like `-fno-strict-aliasing` or `-fwrapv` do.

[1] ISO/IEC 9899:2011 §6.5.3.2


I understand - I deliberately used the vague term "non-standard behavior", since the solution somehow works. There is a GCC flag "-fno-delete-null-pointer-checks" [0] which seems like it might enable this behavior, but I'm not sure.

[0] https://gcc.gnu.org/onlinedocs/gcc-7.3.0/gcc/Optimize-Option....

P.S. Seems like your account is shadowbanned. Might want to contact dang: [email protected] if you plan on posting more.


From my understanding, I think that option only prevents the compiler from removing null checks when the pointer was dereferenced before. It was probably motivated by this incident in the Kernel[1] where an optimized-out null check condition (because it was dereferenced right before) caused a vulnerability. I like this story, since both sides (Linux vs GCC) have perfectly reasonable positions IMO.

[1] https://lwn.net/Articles/342330/

PS: I took action by contacting Dang about the shadowban; thank you for warning me, I was wondering why I felt like a ghost.


Kind of cool that the register you look at after a segmentation fault is called `%rip`


I code in assembly and most of the time I don't follow the ABI or be adamant on stack frames.


I figured this article would be about Golang.

if err := nil { return err }




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: