Unwinding the Stack the Hard Way

userbinator · on April 16, 2023

There's a far simpler and also very generic method that I'm surprised no one has really mentioned much, although it's well-known in some corners of the RE/Asm community and I believe some debuggers (but not gdb?) use it: scan the stack for values that look like valid code addresses, then disassemble backwards from there to see if they were actually return addresses written there by a call instruction. You will find a chain that leads back to the entry point.

Of course it won't work in edge cases like handwritten Asm that uses the stack in more clever ways, but when you're dealing with compiler output, it'll be fine. No need for all this complexity, and works in almost all cases.

the_mitsuhiko · on April 16, 2023

That method is pretty much guarnateed to give you completely broken stacks. We (Sentry) use it as a fallback if nothing else works, and the success rate is awful.

userbinator · on April 16, 2023

Are you also doing back-disassembly to link the call chain? I've had very good success with it. I agree that without analysing the potential call sites and making sure that they are actually possible, it won't work.

the_mitsuhiko · on April 16, 2023

Generally what we do when we fall back to scanning depends on the architecture. For x86 you can find the logic here: https://github.com/rust-minidump/rust-minidump/blob/77638ab7...

Since we do post-hoc stack walking our ability to actually look at the assembly is limited. In most cases where we have no CFI we also do not have the binary to begin with, so we're in random memory land.

userbinator · on April 16, 2023

If we find such an instruction, we assume it's an ip value that was pushed by the CALL instruction that created the current frame.

I'm not familiar with Rust, but if I'm reading it correctly, you're not doing back-disassembly and only checking if an address on the stack is in a code region, which certainly won't work well.

we also do not have the binary to begin with

Your "minidumps" only contain the stack?

js2 · on April 17, 2023

Minidumps only contain the stack, yes, along with additional metadata, but not the original executable. This is an old format that I think Microsoft originally designed and that Google extended for the breakpad project. It's used by Chrome and Mozilla also uses it for Firefox. It's essentially a crash reporting binary format.

I linked you to it elsewhere in this thread. Sentry is using a rust implementation of the stack walker.

loeg · on April 16, 2023

Please elaborate on what breaks.

the_mitsuhiko · on April 16, 2023

The act of stackwalking breaks. You often end up in completely nonsensical stacks halfway through and then all is lost.

jsolson · on April 17, 2023

That makes sense, as you're fucked by any false positive.

If it's a graph of potential stacks, though, wouldn't you eventually find the one that unwinds, if one exists?

loeg · on April 17, 2023

GP mentioned in another comment that they don't actually follow the algorithm described by userbinator -- checking the return address for a preceding call -- which would increase the number of false positives a lot.

haileys · on April 17, 2023

The thing that makes this tricky is that x86 is a variable length instruction set: you don't know where the preceding call instruction might begin, and you can't decode backwards. You can do it speculatively, but ambiguities are still possible.

ghusbands · on April 17, 2023

Given that the full technique apparently works well for userbinator, we can assume that this isn't a very significant issue. The chances of a phantom call instruction will be pretty low, and there's a fairly low maximum instruction length.

loeg · on April 17, 2023

Yes, but checking for valid CALL instructions is probably sufficiently accurate.

loeg · on April 16, 2023

Is this just an artifact of old return addresses on the stack not being overwritten?

xuhu · on April 16, 2023

Probably most valid code addresses on the stack are from older calls that are no longer part of the current call chain.

userbinator · on April 16, 2023

There's a chance of that, but in practice it's easy to filter them out, since only one of the chains will be complete from the very top of the stack to the current stack pointer. The others will almost certainly be partially overwritten by other stack activity and incomplete, or not match the current stack pointer.

js2 · on April 16, 2023

This has been a fallback method for Google Breakpad's stackwalker for more than a decade. It generates basically useless stack traces, at least for stacks captured from Android where you're trying to trace through NDK code invoked from the ART. Compounding the problem is a complete lack of available symbols for Android.

https://github.com/google/breakpad/blob/main/docs/stack_walk...

Also see recent discussion here:

https://news.ycombinator.com/item?id=35438171

userbinator · on April 16, 2023

Not exactly.

I dug around in that code a little, and although it contains a disassembler, it's not being used for inspecting the call chain. All it does is check that the address appears to be inside a code section.

https://github.com/google/breakpad/blob/main/src/processor/s...

js2 · on April 17, 2023

True, but that's because all it has is a snapshot of the stack to work with. The code isn't available to it.

saagarjha · on April 16, 2023

To be fair, doing backtrackes through JNI calls is generally pretty complicated.

ShroudedNight · on April 17, 2023

How does this work in scenarios with heavy GOT (Global Offset Table) usage, trampolines, or where v-tables are prevalent?

loeg · on April 16, 2023

Do you analyze the dynamic loader segments (to determine code-like addresses) or just do some hard-coded approximations per platform? Pretty cute idea and I agree it will often work.

userbinator · on April 16, 2023

It just needs to be on an executable page. The processor doesn't care about anything else, so this will work with things like JIT-generated code too. The key idea is that in essentially all cases of compiler-generated code, every return address will be immediately after the call that wrote it.

loeg · on April 16, 2023

Yeah, but how do you look up executable page mappings? (Dynamic loader segments will have at least one executable Seg mapped somewhere.)

userbinator · on April 16, 2023

Linux: /proc/$pid/maps

Windows: VirtualQuery()

Mac: vm_region()

Your own OS (like this article): you should know.

loeg · on April 16, 2023

Ok, some platform specific mechanisms. Cool.

rwmj · on April 16, 2023

I would suggest not omitting the frame pointer. Fedora recently changed the default and it makes collecting stack traces vastly simpler, leading to better profiling support (https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwar...)

Since then I gave a short (15 min) talk about producing and understanding flame graphs: http://oirase.annexia.org/tmp/2023-03-08-flamegraphs.mp4

fweimer · on April 16, 2023

I still hope we can revert that once we have complete, CPU-verified backtraces. As a linked list, frame pointers are still somewhat slow. Just copying the hardware address stack should be quite a bit faster, and more importantly, the CPU will enforce that the addresses are correct.

vardump · on April 17, 2023

Separate call and argument stacks?

fweimer · on April 28, 2023

Separate stack for return addresses (SHSTK, shadow stack).

xobs · on April 17, 2023

Yeah, and I like I mentioned in the earlier comment, omitting the frame pointer reduces code size by 10% on RISC-V targets, which is huge when dealing with embedded flash: https://github.com/tock/tock/pull/1660

fulafel · on April 17, 2023

This says the penalty is 10% on risc-v vs 2-3% on arm - anyone have background for why it's so?

Someone · on April 17, 2023

There can only be two answers:

- ARM frame pointer handling is more compact (relative to the non-frame pointer code) than that on RISC-V.

- ARM code does less frame pointer handling.

The first might be true because ARM has the stmdb instruction which (https://developer.arm.com/documentation/ddi0406/b/Applicatio...) stores multiple registers to consecutive memory locations using an address from a base register. The consecutive memory locations end just below this address, and the address of the first of those locations can optionally be written back to the base register.

So, code can save registers on the stack _and_ decrease the stack pointer in a single instruction.

The second might be true because, of the compilers/compiler settings used for those measurements the ARM one did more aggressive function inlining.

boulos · on April 16, 2023

Yeah, mandating rbp on amd64 would have saved years of headache. There are still cases like tail calls that wouldn't work, but the vast majority of code doesn't need to care about dedicating a register to having usable backtraces.

ddevault · on April 17, 2023

Strongly agreed. Unwinding with libunwind, i.e. DWARF, is way, way more complex. A simple linked list is much better for the common use-cases.

userbinator · on April 17, 2023

That was discussed at the time on HN: https://news.ycombinator.com/item?id=34632677

My comment on that article still stands: https://news.ycombinator.com/item?id=34660474

tl;dr: don't make code generation worse for everyone else just to appease your one tiny use-case; fix your goddamn tools instead.

aktau · on April 17, 2023

Walking the stack via frame pointers is also faster than some other methods (https://blog.felixge.de/reducing-gos-execution-tracer-overhe...), it appears. This benefits use cases like profilers and tracers. In big deployments, continuous (sampled) profiling and tracing are common, helping not only with continuous performance improvements in fleets but also helping to root cause outages. Reducing the cost of such infrastructure can pay off.

brancz · on April 16, 2023

For the always-on open-source profiler we happen to work on at work we had to do similar things and it was even more involved since we base the whole thing on eBPF to lower the overhead and therefore needed to get the verifier to accept it. [1]

We really wish frame pointers were always present, but here we are.

[1] https://www.polarsignals.com/blog/posts/2022/11/29/profiling...

zX41ZdbW · on April 16, 2023

We have implemented asynchronous signal-safe in-process stack unwinding for always-on profiler in ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-perform...

The downside - it required many patches to LLVM's libunwind, and not all of them are accepted yet: https://bugs.llvm.org/show_bug.cgi?id=48186

ClickHouse source code: https://github.com/ClickHouse/ClickHouse

the_mitsuhiko · on April 16, 2023

Please just stop omitting frame pointers. You might lose out on 6 months of CPU speed advances, but from then on out you will reap the benefits of better production observability for years to come.

ShroudedNight · on April 17, 2023

> ...yet I’ve seen programs handling SIGSEGV: for God’s sake, don’t do that!

I recommend the author not examine JVM null pointer handling too closely then.

lesenek · on April 17, 2023

(I’m the author.) I will not then. It surprises me that the JVM would handle null pointers by hoping dereferencing one would raise a segfault, considering that the JVM is written in C AFAIK where such dereferencing invokes undefined behavior, not a memory access at address 0 which would segfault most of the time.

ShroudedNight · on April 18, 2023

I appreciate you replying, and my sympathies for the posting weirdness - I was worried that I had inadvertently caused a karma trap.

From what I recall (it's been ~ 5 years), at least on OpenJ9, the potential SIGSEGV's for null references occurred in code generated by our JIT compiler, or in hand-rolled assembly, and not in code generated by any C translation unit. This may well not satisfy the standard, but honestly, at that point, I'm not sure any FFI-linked code will.

If you find this distinction insufficient, and still require the use of a JVM, if my memory serves correctly, -Xrs on OpenJ9 will avoid using those trap handlers. I don't believe this is true of Hotspot.

lesenek · on April 18, 2023

If the null dereferencing happens in assembly code and not in C, it seems fine to me: the C standard is irrelevant for assembly; after all, it's the C language that introduces UB, not the machine code.

bheadmaster · on April 17, 2023

I think it's an optimization. Checking for null pointer on every object access can become costly, as it's a fundamental operation (everything is a pointer in Java). By handling SIGSEGV, you save time in non-null case, at the expense of more complex code and, as you've mentioned, non-standard behavior.

Also wanted to mention your comment was [dead], had to vouch for it to resurrect it.

lesenek · on April 17, 2023

I understand the problem, but I am concerned by the solution. The problem is that C semantic is quite clear on the subject[1]: dereferencing a null pointer is undefined behavior, not "non-standard behavior" which would suggests correct albeit non-portable behavior. I don't know of any compiler switch that tells the compiler to "define" this behavior, like `-fno-strict-aliasing` or `-fwrapv` do.

[1] ISO/IEC 9899:2011 §6.5.3.2

bheadmaster · on April 17, 2023

I understand - I deliberately used the vague term "non-standard behavior", since the solution somehow works. There is a GCC flag "-fno-delete-null-pointer-checks" [0] which seems like it might enable this behavior, but I'm not sure.

[0] https://gcc.gnu.org/onlinedocs/gcc-7.3.0/gcc/Optimize-Option....

P.S. Seems like your account is shadowbanned. Might want to contact dang: [email protected] if you plan on posting more.

lesenek · on April 17, 2023

From my understanding, I think that option only prevents the compiler from removing null checks when the pointer was dereferenced before. It was probably motivated by this incident in the Kernel[1] where an optimized-out null check condition (because it was dereferenced right before) caused a vulnerability. I like this story, since both sides (Linux vs GCC) have perfectly reasonable positions IMO.

[1] https://lwn.net/Articles/342330/

PS: I took action by contacting Dang about the shadowban; thank you for warning me, I was wondering why I felt like a ghost.

subarctic · on April 17, 2023

Kind of cool that the register you look at after a segmentation fault is called `%rip`

sylware · on April 17, 2023

I code in assembly and most of the time I don't follow the ABI or be adamant on stack frames.

rco8786 · on April 16, 2023

I figured this article would be about Golang.

if err := nil { return err }