Get This: Traits, Stack Frames, and Trusting rustc

Programming languages have a responsibility to balance two often-competing goals: ergonomics and performance. The Rust programming language is interesting for many reasons, but in part because it pursues both goals with gusto. It can be a joy to use for engineering complex, adaptable systems, and it can simultaneously produce blazingly fast code. Not only are these properties valuable in their own right, but in combination they mean that the programmer can easily find themselves unwinding trains of thought that span from high level language theory to low level machine code with hardly a breath in between.

Let's explore one together.

A Few Notes and Disclaimers

  • This post was written based on the latest Rust nightly release at the time, rustc 1.92.0-nightly (4082d6a3f 2025-09-27), running on an Apple Silicon ARM processor.
  • My background is not in compilers. I'm approaching this as a bright-eyed exercise in curiosity, and will doubtless miss some points that would be obvious to a more seasoned practitioner. If that's you and you notice something I've messed up, please get in touch with me so that I can fix the error.
  • The commands below use plain rustc, not cargo rustc. The latter may implicitly set the -C incremental codegen option, which interferes with some optimizations. I didn't realize this until I forked and rebuilt the entire Rust compiler to figure out why its debugging dumps weren't generating correctly. While it was great fun to muck about in the compiler internals and see the changes reflected in the actual compilation pipeline, that's a blog post for another day. In the meantime, please learn from my mistakes and just use rustc.

A Treatise on Traits

Some exposition to set the stage:

Inheritance-based programming languages, like Java, Python, or C++, primarily express commonality among a set of types by defining and extending a shared internal object structure. Rust, however, does not do this. Rather, it expresses commonality through contracts declaring shared external behavior, called "traits". Sharing internal structure can be convenient from an author's perspective—it inherently provides a ready mechanism for code reuse—but it can also be brittle and overly constraining for both implementers and consumers of a data type. To project a metaphor into the physical world: when you get into the driver's seat of a friend's car[1], you want to know that it has a steering wheel which can be operated to control the direction of travel (external behavior); you generally do not want to care what shape the tie rods are (internal structure).
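To make the contrast concrete, here's a minimal sketch of the metaphor in code (the Steerable trait and Car type are illustrative, not from any real API):

```rust
// A trait declares external behavior; it says nothing about layout.
trait Steerable {
    fn steer(&mut self, degrees: f32);
}

// Any type can opt in, whatever its internal structure looks like.
struct Car {
    heading_degrees: f32,
}

impl Steerable for Car {
    fn steer(&mut self, degrees: f32) {
        self.heading_degrees += degrees;
    }
}
```

A caller only ever needs the Steerable contract; a type with an entirely different internal representation can implement the same trait without the caller noticing.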

Traits are not unique to Rust. The concept dates back at least to 2005, appearing then in the Squeak implementation of Smalltalk-80, and landed in PHP in 2012. The approach also bears strong similarities to Java's or Go's interfaces pattern.[2] Rust, though, arguably strives to be "closer to the metal" than these other languages, which can make any idiom that introduces an additional layer of indirection feel somewhat jarring, or at least raise some questions. After writing generic code in Rust for a while, one of the questions that emerged for me has to do with a corollary of the trait system: getter methods.

Go-Getters

Because traits declare shared behaviors but not shared structure, code that is generic with respect to a trait cannot access fields of a generic object directly. Rather, the idiomatic way to expose field values is via getter methods. Instead of let v = vehicle.velocity_mph; we use something like let v = vehicle.get_velocity_mph();. Elaborating on the earlier metaphor, you need to know that the car has a dial or screen which can be polled to determine the current speed, but you don't need to know ahead of time where the speedometer sits in the dashboard.[3]
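A sketch of the idiom (Vehicle, Car, and the field names here are hypothetical):

```rust
trait Vehicle {
    // Generic code can rely only on this contract, never on fields.
    fn get_velocity_mph(&self) -> f64;
}

struct Car {
    velocity_mph: f64,
}

impl Vehicle for Car {
    fn get_velocity_mph(&self) -> f64 {
        self.velocity_mph
    }
}

// Writing `v.velocity_mph` here would not compile: T's fields are opaque.
fn describe<T: Vehicle>(v: &T) -> String {
    format!("{} mph", v.get_velocity_mph())
}
```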

Compared to exposing struct fields directly, the getter idiom has some ergonomic drawbacks: it tends to add boilerplate, and syntactic niceties like destructuring are largely left by the wayside. Disregarding those, I was curious about potential performance implications. For Rust, which rests much of its reputation on achieving ruthlessly efficient performance, it's somewhat unintuitive that a key language feature would not just encourage but require developers to write more function calls in lieu of referring directly to values in memory. Naively, don't more function calls mean more stack frames, and don't more stack frames mean more CPU cycles, more energy consumption, and slower performance?
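For instance, destructuring works against a concrete struct but has no counterpart behind a trait bound (the names below are illustrative):

```rust
struct Foo {
    bar: u32,
    baz: u32,
}

trait HasBar {
    fn get_bar(&self) -> u32;
}

impl HasBar for Foo {
    fn get_bar(&self) -> u32 {
        self.bar
    }
}

fn concrete(f: &Foo) -> u32 {
    // Destructuring is fine when the shape is known.
    let Foo { bar, .. } = f;
    *bar
}

fn generic<T: HasBar>(f: &T) -> u32 {
    // There is no `let T { bar, .. } = f;` — we must call the getter.
    f.get_bar()
}
```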

The ostensibly correct answer is, "Trust the compiler." And I do: assuming a good choice of data structures, idiomatic, optimized Rust programs are usually extremely fast, particularly in CPU-bound environments, so they're certainly doing something right. Nevertheless, I want to understand why my gut reaction is unwarranted. Are function calls simply so inexpensive on modern processors that they're practically zero-cost? Otherwise, how are they optimized, by what, and why?

println!("Hello, lldb");

Take a trivial example—the "Hello, World!" of getter methods, if you will. The program looks like this:

// main.rs

struct Foo {
    bar: u32,
}

impl Foo {
    fn get_bar(&self) -> u32 {
        self.bar
    }
}

fn main() {
    let my_foo = Foo { bar: 0x42424242 };
    let my_bar = my_foo.get_bar();
    println!("{my_bar}");
}

This is about as simple as it gets: Instantiate a variable, call its getter function, and print the result to make sure the entire thing doesn't get optimized away in a poof of smoke.

We can compile this simple program without any optimization, with the command rustc -C opt-level=0 -C debuginfo=full src/main.rs. It runs... well, it runs quickly, but what doesn't at 5 billion clock cycles per second? Let's slow down time to get a better idea of what's happening in those nanoseconds. Using the lldb debugger to step through the program interactively demonstrates what we're concerned about:

(lldb) step
    frame #0: 0x0000000100000a98 get-this`get_this::main::h6a58a04e4763ae05 at main.rs:15:25
   12
   13   fn main() {
   14       let my_foo = Foo { bar: 0x42424242 };
-> 15       let my_bar = my_foo.get_bar();
   16       println!("{my_bar}");
   17   }

(lldb) step
    frame #0: 0x0000000100000a70 get-this`get_this::Foo::get_bar::ha98b11a79dd42ffa(self=0x000000016fdfe828) at main.rs:9:9
   6
   7    impl Foo {
   8        fn get_bar(&self) -> u32 {
-> 9            self.bar
   10       }
   11   }
   12

The program is using at least a few precious instructions to navigate up and down the call stack, when all we really want is a single instruction to access the memory for my_foo.bar. The fact that the program is taking even one extra clock cycle to step into get_bar() is a problem.

We can see this even more clearly in the disassembled functions:

get-this`get_this::main::h6a58a04e4763ae05:
    0x100000a7c <+0>:   sub    sp, sp, #0x70
    0x100000a80 <+4>:   stp    x29, x30, [sp, #0x60]
    0x100000a84 <+8>:   add    x29, sp, #0x60
    0x100000a88 <+12>:  add    x0, sp, #0x8
    0x100000a8c <+16>:  mov    w8, #0x4242 ; =16962
    0x100000a90 <+20>:  movk   w8, #0x4242, lsl #16
->  0x100000a94 <+24>:  str    w8, [sp, #0x8]
    0x100000a98 <+28>:  bl     0x100000a64    ; get_this::Foo::get_bar::ha98b11a79dd42ffa at main.rs:8
    0x100000a9c <+32>:  mov    x8, x0
    0x100000aa0 <+36>:  add    x0, sp, #0xc
    0x100000aa4 <+40>:  str    w8, [sp, #0xc]
    0x100000aa8 <+44>:  sub    x8, x29, #0x10
    0x100000aac <+48>:  bl     0x100000994    ; core::fmt::rt::Argument::new_display::hb411648afbb9506d at rt.rs:110
    0x100000ab0 <+52>:  ldur   q0, [x29, #-0x10]
    0x100000ab4 <+56>:  sub    x1, x29, #0x20
    0x100000ab8 <+60>:  stur   q0, [x29, #-0x20]
    0x100000abc <+64>:  add    x8, sp, #0x10
    0x100000ac0 <+68>:  str    x8, [sp]
    0x100000ac4 <+72>:  adrp   x0, 68
    0x100000ac8 <+76>:  add    x0, x0, #0x40
    0x100000acc <+80>:  bl     0x100000940    ; core::fmt::rt::_$LT$impl$u20$core..fmt..Arguments$GT$::new_v1::hdecc0e13a21163aa at rt.rs:209
    0x100000ad0 <+84>:  ldr    x0, [sp]
    0x100000ad4 <+88>:  bl     0x100018d48    ; std::io::stdio::_print::he9f534d0b4529084 at stdio.rs:1274
    0x100000ad8 <+92>:  ldp    x29, x30, [sp, #0x60]
    0x100000adc <+96>:  add    sp, sp, #0x70
    0x100000ae0 <+100>: ret

get-this`get_this::Foo::get_bar::ha98b11a79dd42ffa:
->  0x100000a64 <+0>:  sub    sp, sp, #0x10
    0x100000a68 <+4>:  mov    x8, x0
    0x100000a6c <+8>:  str    x8, [sp, #0x8]
    0x100000a70 <+12>: ldr    w0, [x0]
    0x100000a74 <+16>: add    sp, sp, #0x10
    0x100000a78 <+20>: ret

The get_bar() function itself at the bottom takes up 6 instructions, plus a couple extra if we count the shuffling of registers that happens around the branch point at main <+28>. This is not exactly going to fry a CPU, but it is still roughly an order of magnitude worse than we'd expect for direct access to the struct field.

Behold, the Optimizer

Let's try again with a version more akin to a release build, with rustc -C opt-level=3 src/main.rs. This is where we'd expect Rust to deliver on its promise of efficiency, and we'd be right.

Now, stepping through the code is hardly even worth it, because get_bar() has been wiped away entirely from the final assembly:

get-this`get_this::main::h6a58a04e4763ae05:
    0x1000008d4 <+0>:  sub    sp, sp, #0x60
    0x1000008d8 <+4>:  stp    x29, x30, [sp, #0x50]
    0x1000008dc <+8>:  add    x29, sp, #0x50
    0x1000008e0 <+12>: mov    w8, #0x4242 ; =16962
    0x1000008e4 <+16>: movk   w8, #0x4242, lsl #16
->  0x1000008e8 <+20>: str    w8, [sp, #0xc]
    0x1000008ec <+24>: add    x8, sp, #0xc
    0x1000008f0 <+28>: adrp   x9, 49
    0x1000008f4 <+32>: add    x9, x9, #0x59c ; core::fmt::num::imp::_$LT$impl$u20$core..fmt..Display$u20$for$u20$u32$GT$::fmt::h1d09affc3916b914 at num.rs:213
    0x1000008f8 <+36>: stp    x8, x9, [x29, #-0x10]
    0x1000008fc <+40>: adrp   x8, 68
    0x100000900 <+44>: add    x8, x8, #0x40
    0x100000904 <+48>: mov    w9, #0x2 ; =2
    0x100000908 <+52>: stp    x8, x9, [sp, #0x10]
    0x10000090c <+56>: sub    x8, x29, #0x10
    0x100000910 <+60>: mov    w9, #0x1 ; =1
    0x100000914 <+64>: str    x8, [sp, #0x20]
    0x100000918 <+68>: stp    x9, xzr, [sp, #0x28]
    0x10000091c <+72>: add    x0, sp, #0x10
    0x100000920 <+76>: bl     0x100018ba8    ; std::io::stdio::_print::he9f534d0b4529084 at stdio.rs:1274
    0x100000924 <+80>: ldp    x29, x30, [sp, #0x50]
    0x100000928 <+84>: add    sp, sp, #0x60
    0x10000092c <+88>: ret

This is what we're hoping for: simple assembly for a simple program. Comparing the assembly code following the instructions indicated by the arrows above (str w8, [sp, #0xc]), we can see that the function call to get_bar() has been cleanly removed.

This disassembled code looks like it was generated from a Rust program with no getter whatsoever. In fact, if we write such a program, it compiles down to the exact same thing, bit for bit!
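For reference, that getter-free variant is just the same program with the call removed (reconstructed to match the source listing below):

```rust
// main.rs — identical to before, but reading the field directly.
struct Foo {
    bar: u32,
}

fn main() {
    let my_foo = Foo { bar: 0x42424242 };
    let my_bar = my_foo.bar;
    println!("{my_bar}");
}
```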

    frame #0: 0x00000001000008e8 get-this`get_this::main::h336239f06300be4d at main.rs:9:18 [opt]
   6
   7    fn main() {
   8        let my_foo = Foo { bar: 0x42424242 };
-> 9        let my_bar = my_foo.bar;
   10       println!("{my_bar}");
   11   }

(lldb) disassemble
get-this`get_this::main::h336239f06300be4d:
    0x1000008d4 <+0>:  sub    sp, sp, #0x60
    0x1000008d8 <+4>:  stp    x29, x30, [sp, #0x50]
    0x1000008dc <+8>:  add    x29, sp, #0x50
    0x1000008e0 <+12>: mov    w8, #0x4242 ; =16962
    0x1000008e4 <+16>: movk   w8, #0x4242, lsl #16
->  0x1000008e8 <+20>: str    w8, [sp, #0xc]
    0x1000008ec <+24>: add    x8, sp, #0xc
    0x1000008f0 <+28>: adrp   x9, 46
    0x1000008f4 <+32>: add    x9, x9, #0x2c0 ; core::fmt::num::imp::_$LT$impl$u20$core..fmt..Display$u20$for$u20$u32$GT$::fmt::h367c3c8e963f8a34 at num.rs:135
    0x1000008f8 <+36>: stp    x8, x9, [x29, #-0x10]
    0x1000008fc <+40>: adrp   x8, 68
    0x100000900 <+44>: add    x8, x8, #0x40
    0x100000904 <+48>: mov    w9, #0x2 ; =2
    0x100000908 <+52>: stp    x8, x9, [sp, #0x10]
    0x10000090c <+56>: sub    x8, x29, #0x10
    0x100000910 <+60>: mov    w9, #0x1 ; =1
    0x100000914 <+64>: str    x8, [sp, #0x20]
    0x100000918 <+68>: stp    x9, xzr, [sp, #0x28]
    0x10000091c <+72>: add    x0, sp, #0x10
    0x100000920 <+76>: bl     0x1000081f4    ; std::io::stdio::_print::h7b0557dd6a526cbd at stdio.rs:1274
    0x100000924 <+80>: ldp    x29, x30, [sp, #0x50]
    0x100000928 <+84>: add    sp, sp, #0x60
    0x10000092c <+88>: ret

Falling In Line

I ran similar experiments of marginally increasing complexity—field values not knowable at compile time, field values larger than a processor register, etc.—and confirmed that the get_bar() method is successfully optimized away in similar fashion for each one.

There are actually multiple optimizations causing get_bar() to disappear, but the one that directly speaks to our initial concern—that is, the unnecessary function call—is inline expansion. Inlining is often described as compile-time macro expansion: the bodies of sufficiently short and/or infrequently used functions are effectively copied and pasted into each call site by the compiler. For big functions this can have big downsides, particularly with regards to binary size. Inlining a large function at even two call sites nearly doubles its footprint in memory. On the other hand, one-liner getter methods are perfect candidates for inline expansion. Each call site already dedicates at least two instructions to set up argument registers and make the function call, so replacing them with a getter function body (sans its stack management instructions) is nearly always a net win in terms of both performance and binary size.
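Rust also lets you nudge these decisions with attributes. A sketch (note that the attribute is only a hint, and a trivial getter like this is usually inlined without it):

```rust
struct Foo {
    bar: u32,
}

impl Foo {
    // Suggests inlining across crate boundaries; `#[inline(always)]`
    // and `#[inline(never)]` are stronger (but still not absolute) hints.
    #[inline]
    fn get_bar(&self) -> u32 {
        self.bar
    }
}
```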

Zoom In.

The Rust compiler makes it possible to observe the exact point during compilation that this specific optimization is performed. To put it in context, it helps to break down compilation into 6 basic steps:

  1. Rust code is parsed into an abstract syntax tree (AST).
  2. The AST is "lowered" to Rust's "High-Level Intermediate Representation" (HIR), which is used for type checking.
  3. Additional information computed during type checking is used to transform the HIR into the "Typed High-Level Intermediate Representation" (THIR), which is used for various other compiler checks.
  4. The THIR is lowered to Rust's "Mid-Level Intermediate Representation" (MIR), which is run through the borrow checker and a first set of optimizers.
  5. The optimized MIR is further lowered to a representation consumed by the code generation backend (typically LLVM, which reads "LLVM-IR").
  6. The backend transforms the final intermediate representation into machine code, performing a laundry list of its own optimizations in the process.

The first 5 steps are performed internally by rustc, and the final "code generation" step is outsourced to the venerable LLVM. When it comes to optimizations, the focus is on steps 4, 5, and 6, encompassing the MIR, the LLVM-IR, and the compiled binary.
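For completeness, a nightly toolchain can also pretty-print the earlier representations (these -Z flags are unstable, so their names and output may change between releases):

```shell
rustc -Z unpretty=ast-tree src/main.rs   # step 1: the AST
rustc -Z unpretty=hir src/main.rs        # step 2: the HIR
rustc -Z unpretty=thir-tree src/main.rs  # step 3: the THIR
```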

We've already been inspecting the disassembled binaries with lldb, and we can inspect the intermediate representations as well using the --emit option for rustc: rustc -C opt-level=3 -C debuginfo=full --emit=mir,llvm-ir src/main.rs. This outputs main.mir and main.ll text files alongside the main executable. They're human-readable, but only if you're a human who can read those sorts of things. At surface level, there is no mention of get_bar() in the LLVM-IR. It appears once at the top of the MIR, but we can see that it isn't referenced anywhere else, and on close inspection one can trace through the representation of the main() function to find that the memory access to the value of bar is all handled inline[4].

This leads us to believe that we're dealing with a rustc MIR optimization, lovingly crafted just for the Rust language. Cool!

Enhance.

We can still get closer. rustc also has an option to dump time-lapse-esque snapshots of the MIR at each optimization pass: rustc -C opt-level=3 -C debuginfo=full -Z dump-mir=main src/main.rs. This command generates a ream of files under mir_dump/, including three tantalizingly[5] named:

  • main.main.3-2-004.Inline.before.mir
  • main.{impl#0}-get_bar.3-2-004.Inline.before.mir
  • main.main.3-2-004.Inline.after.mir

Here are excerpts from the "before" files:

// MIR for `<impl at src/main.rs:7:1: 7:9>::get_bar` before Inline

fn <impl at src/main.rs:7:1: 7:9>::get_bar(_1: &Foo) -> u32 {
    debug self => _1;
    let mut _0: u32;

    bb0: {
        _0 = copy ((*_1).0: u32);
        return;
    }
}


// MIR for `main` before Inline

...

    bb0: {
        StorageLive(_1);
        _1 = Foo { bar: const 1111638594_u32 };
        StorageLive(_2);
        StorageLive(_3);
        _3 = &_1;
        _2 = Foo::get_bar(move _3) -> [return: bb1, unwind continue];
    }

...and from "after" (emphasis mine):

// MIR for `main` after Inline

...

    bb0: {
        StorageLive(_1);
        _1 = Foo { bar: const 1111638594_u32 };
        StorageLive(_2);
        StorageLive(_3);
        _3 = &_1;
        // 👀 👀 👀
        _2 = copy ((*_3).0: u32);

By golly, we found it: the Inline MIR transform!

Enhance!

If we want to, we can now take a gander at the source code that implements the optimization pass we're interested in. One thing that caught my eye was the reference to an unstable compiler option named inline_mir. Here's the code for context:

impl<'tcx> crate::MirPass<'tcx> for Inline {
    fn is_enabled(&self, sess: &rustc_session::Session) -> bool {
        if let Some(enabled) = sess.opts.unstable_opts.inline_mir {
            return enabled;
        }

        match sess.mir_opt_level() {
            0 | 1 => false,
            2 => {
                (sess.opts.optimize == OptLevel::More || sess.opts.optimize == OptLevel::Aggressive)
                    && sess.opts.incremental == None
            }
            _ => true,
        }
    }

In short[6], sess.opts.unstable_opts.inline_mir appears to override the --opt-level setting, for this transform only[7]. As far as I can tell, it isn't documented anywhere, but I took an educated guess at how to set it from the command line. My first attempt failed... but thankfully it failed by printing out a list of acceptable values to use instead!

Running the following succeeds: rustc -C debuginfo=full -C opt-level=3 --emit=mir,llvm-ir -Z inline-mir=off src/main.rs. Reviewing the generated main.mir confirms that the function call to get_bar() is back! main.ll is unchanged, however. It seems that rustc still performs some analysis and optimization as part of the lowering from the optimized MIR to the LLVM-IR. Ah, well: rustc sure knows how to do its job. If there's one thing for us to learn for our troubles, it's this:

For goodness' sake, just trust the compiler.

Footnotes


  1. Your friend is showing off their new car after they just closed a five hundred million dollar seed round for their new AI company. Congrats! ↩︎

  2. As Java and PHP demonstrate, vertical inheritance is not mutually exclusive with "horizontal reuse", as PHP describes traits. However, featuring inheritance as a prominent abstraction for achieving modularity and reuse often nudges developers towards patterns and idioms which tightly couple an object's API to its structure, which trait-based design actively discourages. ↩︎

  3. As someone who frequently drives a Mini Cooper, this is particularly pertinent. ↩︎

  4. The MIR for get_bar() is not represented 1:1 within the MIR of main(), because there's additional SSA analysis applied after inlining. ↩︎

  5. I use the word "tantalizingly" because, thanks to misadventures with cargo rustc, it took me an agonizingly long time to get them to contain anything useful. ↩︎

  6. There's a little extra nuance here: sess.mir_opt_level() is derived from --opt-level, but it does not use the same numbering scheme, and it may be overridden with yet another unstable compiler option. ↩︎

  7. Alternatively, it's entirely feasible to rebuild rustc with this line simply commented out. :) ↩︎