The Tale of Strace

Sometimes, distro work takes you to some weird places… The following is an account of an investigation that took place in September 2025. Rather than going straight into the root technical issue, or even tracing a direct route between the initial report and its resolution, I’ll attempt to document the various twists and turns of the journey.

Summary

TL;DR: A simple strace rebuild on Ubuntu armhf mysteriously failed. After lots of building and chasing down wrong leads, the culprit turned out to be a malloc patch hitting a syscall normally handled by the vDSO, which does not exist for 32-bit processes on Ubuntu arm64 kernels. Lessons were learned, faces were palmed.

A bit of context first: this story takes place in late September, meaning a handful of weeks before the release. Every cycle, my coworker ginggs runs an archive-wide test rebuild to ensure that all our packages can be rebuilt, so that we can safely publish potential security updates afterwards.

As usual, a few of the packages that we specifically look after were indeed failing, so we were each assigned some of those failures to investigate. The “luck” of the draw saw me assigned to strace, which was failing with errors at compile time.

Info

strace is a tool that tells you any time a process talks to the kernel (syscalls). As it turns out, that’s any time a process tries to do anything useful, like opening a file, writing into it, sending stuff over the network, etc…

Using strace allows me to actually look at what the program does, rather than what I assume it does based on documentation or its code. Documentation can be outdated, and the code can be misleading (or miscompiled).
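
For the unfamiliar, a minimal session looks something like this (the syscall selection is purely illustrative):

# Trace only a few file-related syscalls of a short-lived command; each
# intercepted call is printed with its arguments and return value.
strace -e trace=openat,read,write,close cat /etc/hostname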

Due to its nature, strace is fairly tightly coupled to the kernel, meaning it’s usually updated late in the cycle, as the kernel freeze comes a few weeks after general Feature Freeze. The update didn’t happen this cycle for various reasons, and it turned out that a newer version was available in Debian, one that was supposed to fix our compilation issue. Two birds, one stone. Cherry on top, another coworker whom I happen to mentor on distro-related issues, Hector, offered to do the work in order to pad his upload-rights résumé.

He did his job well: he merged our tiny delta with the new Debian version and uploaded it to a fresh PPA1, where everything built fine, at which point I reviewed and sponsored his work. Thus, strace started building in the archive.

Now, at this point I think we’d basically spent a day on this, because neither he nor I were familiar with the strace package and its history, so we had to dig through Ubuntu history to find a paper trail of previous late uploads, etc… We still felt pretty confident about this.

And then, we received the dreaded email: the build failed, on the armhf2 architecture only. What happened? It built fine in the PPA!

The reason we didn’t see this before is that the PPA was using the default dependency configuration, building against the Ubuntu release pocket, while packages built for the archive build against the proposed pocket — our staging environment, if you will.
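
Concretely, the difference boils down to whether the staging pocket is enabled in the build environment’s package sources, roughly like this (mirror and series shown are illustrative):

# Release pocket only (what the PPA built against by default):
deb http://ports.ubuntu.com/ubuntu-ports questing main universe
# Archive builds additionally pull anything already staged in proposed:
deb http://ports.ubuntu.com/ubuntu-ports questing-proposed main universe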

Looking at which packages were pulled from proposed in the failed build, three sets appeared: gcc-related packages, linux-libc-dev, and glibc. All three are damn good suspects for causing strace to fail to build.

That’s when I started a low-effort attempt at finding the culprit. To be perfectly candid, my goal here was to show that either gcc or linux-libc-dev was at fault, so that I could dump this on the lap of the teams responsible for those packages. Because as it happens, I am customarily the one blamed when glibc is at fault, and I have other bugs banging at my door already.

My attempt was pretty simple: create three PPAs against the release pocket, binary-copy each of the suspects into its own PPA, and rebuild strace in each of them. It’s multi-step (the copy is, weirdly enough, the slowest part by far), but at least I didn’t have to think too hard about it. That is, until the results were in: glibc was the culprit. And more precisely, the code changes within the package are at fault, as reverting those changes in a new package fixes the issue.

Which meant I’d actually have to investigate this thing.

Info

For those who are a bit fuzzy on the topic, glibc is the piece of code through which most programs out there will do their syscalls. In addition, it deals with things like memory allocation, multithreading, library loading, etc… Not to be confused with GLib, which sounds vaguely similar in purpose but actually sits on top of glibc.

At that point, let’s actually look at (one of) the failures:

[snip a lot of stuff]
+ error=../../src/strace -qq -esignal=none -e%clock ../preadv2-pwritev2 output mismatch
+ check_prog diff
+ type diff
+ diff -u -- /dev/null log
+clock_gettime64(CLOCK_MONOTONIC, {tv_sec=481, tv_nsec=560941599}) = 0
+ fail_ ../../src/strace -qq -esignal=none -e%clock ../preadv2-pwritev2 output mismatch
+ warn_ trace_clock.gen.test: failed test: ../../src/strace -qq -esignal=none -e%clock ../preadv2-pwritev2 output mismatch
+ printf %s\n trace_clock.gen.test: failed test: ../../src/strace -qq -esignal=none -e%clock ../preadv2-pwritev2 output mismatch
trace_clock.gen.test: failed test: ../../src/strace -qq -esignal=none -e%clock ../preadv2-pwritev2 output mismatch
+ exit 1
FAIL trace_clock.gen.test (exit status: 1)

That’s… A bit obscure. Although immediately, one thing jumps out:

--- /dev/null	2025-08-02 18:23:40.000000000 +0000
+++ log	2025-09-23 02:35:20.465382874 +0000
@@ -0,0 +1 @@
+clock_gettime64(CLOCK_MONOTONIC, {tv_sec=481, tv_nsec=560941599}) = 0

At this point my educated guess is that whichever binary is run in this test is calling the clock_gettime64 syscall, and the test script is not expecting it. Looking at the surrounding code, the program is named preadv2-pwritev2, which presumably means it is supposed to only call those two syscalls, that are related to reading and writing files. The astute reader will have understood that the extra syscall is related to time-keeping, not files. Curious.

Now, I also have another semi-rational worry here: the 64 suffix at the end of the syscall name. It means that this is a system call that deals with 64-bit timekeeping, rather than the native-size (32-bit) one. This worries me, because I was involved in transitioning our armhf architecture to 64-bit timekeeping by default, and it was a long and painful effort that I’d rather not revisit. It’s probably not the issue, though.

Now, to look at the glibc changes between release and proposed. Luckily for us, it was actually well documented in the package changelog:

  [ Benjamin Drung ]
  * build: use date --rfc-email for Rust coreutils compat (LP: #2122100)

  [ Simon Poirier ]
  * Explicitly enable sframe (LP: #2122099)
  * d/p/git-updates.diff: update from upstream stable branch.
    - AArch64: Fix SVE powf routine [BZ #33299]
    - i386: Also add GLIBC_ABI_GNU2_TLS version [BZ #33129]
    - x86-64: Add GLIBC_ABI_DT_X86_64_PLT [BZ #33212]
    - x86-64: Add GLIBC_ABI_GNU2_TLS version [BZ #33129]
    - i386: Add GLIBC_ABI_GNU_TLS version [BZ #33221]
    - Use TLS initial-exec model for __libc_tsd_CTYPE_* thread variables [BZ #33234]
    - malloc: Fix checking for small negative values of tcache_key
    - malloc: Make sure tcache_key is odd enough
    - malloc: Fix MAX_TCACHE_SMALL_SIZE
    - malloc: Remove redundant NULL check
  * d/p/hurd-i386/git-random-malloc.diff: drop useless patch

We can disregard the coreutils thing: that’s build-time shenanigans within the packaging script itself. The sframe thing is also probably unrelated.

The GNU2_TLS patches are all about architectures that aren’t related to the issue at hand. We can also probably dismiss the SVE thing, since we’re really not doing floating-point maths here (and it’s about AArch64 a.k.a. arm64, not armhf).

That leaves the malloc fixes and the TLS initial-exec model change. None of those scream “I need to know the current time” at me. However, my intuition loudly proclaims that the TLS patch is to blame, even though it has no facts to back that up.

Info

TLS here stands for thread-local storage, a nifty feature for writing thread-safe code that requires cooperation between the compiler and the C library. It’s useful in multithreaded contexts for lots of things, for example lazily-initialized static values: using TLS there means you don’t need to protect the value with a mutex, at the cost of initializing it multiple times (once per thread).

malloc is the standard way to do dynamic memory allocation (a.k.a. heap allocation). It’s used by basically every program that is not trivial. For instance, most string manipulation code out there relies heavily on the heap.

There’s one easy way to test that intuition: yet another glibc build with only that patch reverted. However, given the long publisher delays, that would likely take close to a full workday. Furthermore, even if our intuition were confirmed, that still wouldn’t actually explain what’s happening.

To do that, we’ll actually have to reproduce the bug in a controlled environment.

No, that’s not the name of the next Ubuntu release.

I’ll skip the side-quest of getting a proper arm64 machine to reproduce this on. Instead, dear reader, please imagine a montage sequence full of head-scratching, yelling at clouds, and spurious error messages, set to the music of the Place that Sends you Mad (AKA La Maison qui rend Fou).

Instead, imagine that I have a Questing ARM64 VM, within which I run a Questing armhf LXD container, and in that VM I have the strace source package, as well as a copy of the upstream glibc repo.

The first task will be to simply run the full strace build and see if it still fails locally. Luckily, it does, with a couple of tests failing. From then on, I decide to focus on running a single test, trace_clock.gen.test, for a couple of reasons. Firstly, failing test results are annoyingly far up in the scrollback, and I’m lazy and avoid RSI-inducing situations. Secondly, running the full test suite takes a good 20 minutes. In that time, I’ll be sure to embark on unrelated side-quests, only to come back to this terminal hours later with only a vague idea of what I was trying to do there.
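
For reference, a bare-bones local rebuild of the source package goes something like this (glossing over build dependencies and chroot setup):

# Grab the source from the archive and rebuild the binary packages,
# unsigned (-us -uc) and binaries only (-b).
apt-get source strace
cd strace-*/
dpkg-buildpackage -us -uc -b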

Now, if you’re used to modern programming languages and environments, running a single test is a pretty trivial, obvious thing to do. Sadly, it seems the packages I come across do not fall into either category. Old C projects will usually have fairly ad-hoc test suites, without any standardized way to invoke them besides make check — I’m actually amazed that we got that much! For instance, glibc uses make test t=foo/tst-bar, and it’s even documented. It’s not bulletproof, but it works well enough.

Sadly, the strace test suite is undocumented, and pretty convoluted. Through trial and error, I eventually found out that one can use the build-aux/test-driver script, with a few magic environment variables defined (STRACE_ARCH, STRACE_EXE), to run a single test. Notably, I also observed during my trials and errors that the number of failing tests varied from one run to another, although trace_clock.gen.test seems to fail consistently.
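
For the record, the incantation I ended up with looked something along these lines (reconstructed from memory, so paths and variable values are best treated as guesses):

# Run a single generated test through automake's test driver, pointing the
# harness at the freshly built strace (environment values are guesses).
cd tests
STRACE_ARCH=arm STRACE_EXE=../src/strace \
    ../build-aux/test-driver \
        --test-name trace_clock.gen \
        --log-file trace_clock.gen.log \
        --trs-file trace_clock.gen.trs \
        -- ./trace_clock.gen.test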

Now, a single strace test case consists of running (and tracing) multiple binaries. This means we can still drill down further. And that’s when the head-scratching begins anew. You see, in the logs quoted a while ago, you might have spotted that the binary that was invoked was named ../preadv2-pwritev2. Two issues there: for one thing, when I use the test-driver script, I don’t necessarily fail on that binary. Other binary names have popped up, without any pattern that I could see. The second, and more infuriating, thing is that when I tried to run the same executable, under the same conditions, right after it had just failed, it would succeed.

Interestingly, pointing STRACE_EXE at the system strace exhibits the exact same behaviour. This can (and will) make life much easier, in addition to being another confusing data point.

At that point I had an idea: since this is time-related, maybe I need to have the executable running at a particular time, e.g. at the top of a second. To test this, I just ran the preadv2-pwritev2 binary in a loop for 1000 iterations, tracing it every time and counting how many times the dreaded clock_gettime64 would appear. And there it was, appearing 14 times.
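
A minimal sketch of that loop (binary path and iteration count are illustrative):

# Run the test binary 1000 times under strace and count the runs in which
# the unexpected syscall shows up.
count=0
for i in $(seq 1000); do
    strace -qq -e trace=clock_gettime64 ./preadv2-pwritev2 2>&1 \
        | grep -q clock_gettime64 && count=$((count + 1))
done
echo "clock_gettime64 appeared in $count/1000 runs"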

Running this multiple times allowed me to conclude that the syscall appeared between 1% and 2% of the time, which explained why the full strace test suite would fail systematically but not in the same places: it runs a lot of binaries a lot of times.

Now that we know how to trigger this somewhat reliably, we can actually try to look under the hood.

At this point, my first instinct is to reach for a rather interesting option of strace: it can print stack traces when it intercepts a syscall, using the -k toggle. The stack traces can be a bit rough, but they usually give you a sense of where the syscall originates from, painting a clearer picture of what’s actually going on.

However, it turns out that by default we only enable stacktrace support for amd64 builds, not other architectures. “Fine, I’ll do it myself!” Or at least I’ll try. If it doesn’t take too much time.

About an hour later, I actually had a build with stacktrace support enabled. About half the strace testsuite was screaming bloody murder, but hey, omelet, eggs and all that… At least it was working for my purposes. Except, of course, for the one syscall we’re interested in. Another dead end.

Since strace cannot get me a stack trace, I’ll turn to a heavier solution: running the thing through GDB3. Now this is a bit trickier for a couple of reasons. The first one is that we’re dealing with a syscall and not a standard function call, which are normally where one places breakpoints. Luckily, some quick googling tells me that GDB has my back here: it can catch syscalls, using the very literal command catch syscall clock_gettime64.

The second issue is that we’re trying to debug a phenomenon that we can’t reproduce deterministically on a single process run, whereas GDB is typically used interactively, requiring user intervention multiple times per run. The trick here is to tell GDB to execute the following at startup: “set the catch thing, run the process, exit”. If the process exits normally, then GDB happily exits without a word. However, if it catches the syscall, GDB will actually ask me whether I’m sure I want to exit. This means I can loop that, and it will stop for my input when things get interesting!
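
In shell form, the loop looks roughly like this (binary name reused from the earlier failure):

# GDB exits silently when the process finishes cleanly, but asks for
# confirmation (and thus pauses the loop) once the catchpoint fires.
while true; do
    gdb -q \
        -ex 'catch syscall clock_gettime64' \
        -ex run \
        -ex quit \
        ./preadv2-pwritev2
done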

Sadly, once again, things don’t go my way. GDB does catch the syscall, but is incapable of figuring out which frame we’re in, meaning I can’t walk the stack. After a few more attempts of the same, I end up just stepping instruction by instruction until I get to a point where GDB finally figures out where it is.

… and after a lot of stepping, I land in stdio-related code. That’s not helpful: none of our patches touch that code, and the code we eventually step into doesn’t look like it’s using those features either.

At this point, it’s worth taking a step back and trying another angle. We’re basically dealing with the intersection of two projects. We’ve hacked away at the strace side to extract a simple reproducer; time to see what we can do on the glibc side.

The first thing is to try and reproduce the issue in a development environment, with a standard glibc built out of the upstream repo, with as few of our Ubuntu-specific bells and whistles as possible. And that’s where things finally start to turn my way. Remember how both glibc and strace deal with syscalls? It turns out that the glibc upstream project has a nifty script that allows us to run any random executable against a freshly built glibc, with built-in support for the common investigative tooling for that sort of code, including strace.
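
If memory serves, the script in question is testrun.sh, generated at the top of the build directory; a typical invocation looks like this (script name and paths from memory):

# Run an arbitrary binary against the just-built (not installed) glibc,
# wrapped in strace; run from the glibc build directory.
./testrun.sh --tool=strace /path/to/preadv2-pwritev2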

Using this script and our looping approach, we can easily confirm that the issue appears on the latest upstream code. We can also check that the version matching our last known good package does not exhibit the problem. So, we have a range of commits where we know an easily testable behaviour changed: this situation is the ideal candidate to whip out git bisect.

Note

Bisecting a history is fairly intuitive. You have a last known good state, a first known bad state, and a discrete sequence from one to the other. What do you do? You check a state in the middle. If it’s good, then it’s your new last known good state, otherwise it’s your new first known bad state. You repeat until you don’t have anything between good and bad, at which point you know when things started to go bad. It’s the same strategy as the “guess a number” game.

Things get a lot trickier when the history is non-linear with branches and stuff, but Git usually handles that fairly well. The only thing that Git cannot do for you is write atomic, well documented commits. Luckily, the glibc authors already do that very well.
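
In practice, the ritual looks roughly like this (commit ids are placeholders; each step involves rebuilding glibc and re-running the looping reproducer):

git bisect start
git bisect bad <first-known-bad-commit>    # latest upstream, reproduces the issue
git bisect good <last-known-good-commit>   # matches our last good package
# ...rebuild glibc, run the strace loop, then report the verdict:
git bisect good    # or: git bisect bad
# repeat until git announces the first bad commit, then clean up:
git bisect reset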

After a handful of bisection rounds (3 or 4), I had the verdict.

a96a82c4a5efd3139e75cd11fd2a5554164dd5a0 is the first bad commit
commit a96a82c4a5efd3139e75cd11fd2a5554164dd5a0
Author: Samuel Thibault <samuel.thibault@ens-lyon.org>
Date:   Wed Jul 30 01:55:22 2025 +0200

    malloc: Make sure tcache_key is odd enough

… I beg your pardon?

Reverting that commit confirmed that yes, indeed, this particular change is at fault. So much for my highly tuned intuition.

Alright, let’s look at this a bit more closely:

commit 2536c4f8584082a1ac4c5e0a2a6222e290d43983
Author: Samuel Thibault <samuel.thibault@ens-lyon.org>
Date:   Wed Jul 30 01:55:22 2025 +0200

    malloc: Make sure tcache_key is odd enough
    
    We want tcache_key not to be a commonly-occurring value in memory, so ensure
    a minimum amount of one and zero bits.
    
    And we need it non-zero, otherwise even if tcache_double_free_verify sets
    e->key to 0 before calling __libc_free, it gets called again by __libc_free,
    thus looping indefinitely.
    
    Fixes: c968fe50628db74b52124d863cd828225a1d305c ("malloc: Use tailcalls in __libc_free")

diff --git a/malloc/malloc.c b/malloc/malloc.c
index 3a8aaeb665..c135916d32 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -230,6 +230,9 @@
 /* For uintptr_t.  */
 #include <stdint.h>
 
+/* For stdc_count_ones.  */
+#include <stdbit.h>
+
 /* For va_arg, va_start, va_end.  */
 #include <stdarg.h>
 
@@ -3095,6 +3098,19 @@ tcache_key_initialize (void)
   if (__getrandom_nocancel_nostatus_direct (&tcache_key, sizeof(tcache_key),
 					    GRND_NONBLOCK)
       != sizeof (tcache_key))
+    tcache_key = 0;
+
+  /* We need tcache_key to be non-zero (otherwise tcache_double_free_verify's
+     clearing of e->key would go unnoticed and it would loop getting called
+     through __libc_free), and we want tcache_key not to be a
+     commonly-occurring value in memory, so ensure a minimum amount of one and
+     zero bits.  */
+  int minimum_bits = __WORDSIZE / 4;
+  int maximum_bits = __WORDSIZE - minimum_bits;
+
+  while (labs (tcache_key) <= 0x1000000
+      || stdc_count_ones (tcache_key) < minimum_bits
+      || stdc_count_ones (tcache_key) > maximum_bits)
     {
       tcache_key = random_bits ();
 #if __WORDSIZE == 64

Glancing at the description first, nothing jumps out. Then looking at the added code: again, nothing jumps out. All the new code consists of arithmetic and bit-flipping primitives, none of which need to know the time of day. Besides that, all you have are variable declarations and… control flow.

And that’s the real change. We go from something like this:

tcache = kernel_entropy
if kernel_entropy is None:
    tcache = random_bits()

to something like this:

tcache = kernel_entropy
if kernel_entropy is None:
    tcache = 0
while tcache is not random_enough:
    tcache = random_bits()

Now, the kernel not having enough entropy is a pretty rare scenario that probably only happens around boot time. In other words, it never happens when we’re running the strace testsuite — or you have a really weird boot process. Which means we used to basically never call random_bits(). This has changed: now your entropy must have some specific characteristics, otherwise we rinse and repeat, albeit with a different source. One could probably even do the math to figure out the probability of that happening, assuming tcache is a 32-bit integer, if one felt like it (one does not).

Sure enough, looking at the implementation of random_bits reveals a small wrapper around clock_gettime64.

/* Provides fast pseudo-random bits through clock_gettime.  It has unspecified
   starting time, nano-second accuracy, its randomness is significantly better
   than gettimeofday, and for mostly architectures it is implemented through
   vDSO instead of a syscall.  Since the source is a system clock, the upper
   bits will have less entropy. */
static inline uint32_t
random_bits (void)
{
  struct __timespec64 tv;
  __clock_gettime64 (CLOCK_MONOTONIC, &tv);
  /* Shuffle the lower bits to minimize the clock bias.  */
  uint32_t ret = tv.tv_nsec ^ tv.tv_sec;
  ret ^= (ret << 24) | (ret >> 8);
  return ret;
}

Alright, it all makes sense! This patch makes it more likely to hit that code path, which calls the clock_gettime64 syscall once in a while, confusing strace. It all makes sense! This is normal; we can just fix the strace test suite, pack up, and go home, right?

Not quite.

The problem here is that there’s a detail I haven’t told you yet: clock_gettime64 is part of a select group of syscalls that are usually handled through what is called the vDSO. That stands for “virtual dynamic shared object”, and it’s essentially a small virtual library that the kernel injects into programs when they start up. It’s not backed by an actual library file, hence the “virtual” part of the name. If you’ve ever used ldd, it’s usually the first line of the output, e.g. on my laptop:

❯ ldd /usr/bin/hello
	linux-vdso.so.1 (0x000076153c446000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000076153c000000)
	/lib64/ld-linux-x86-64.so.2 (0x000076153c448000)

The raison d’être of this piece of code is performance: for those syscalls, the overhead of the round trip into the kernel is pretty sizeable compared to the actual action, which is usually just reading a hardware register. Instead, userspace programs can do a normal function call into that virtual library. Since the library is generated by the kernel, it knows about your hardware, can query the relevant register, and will fall back to the actual syscall if need be. Simple, elegant. It also means that normally, unless a program specifically wants to issue the syscall (e.g. when you want to test that strace can trace it properly), clock_gettime64 should not show up in strace output.

Since it does, we can presume that something is going wrong with the vDSO on armhf. A quick grep through the kernel sources confirms that even for the specific combination of a 64-bit kernel and 32-bit userspace, clock_gettime64 acceleration is supported. So why do we see syscalls? Is it using the fallback path? And how often?
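
For the curious, that check amounts to grepping for the 32-bit (compat) vDSO support in the arm64 tree, something like:

# COMPAT_VDSO is the upstream kernel option gating the vDSO for 32-bit
# processes on arm64 kernels.
grep -rn "COMPAT_VDSO" arch/arm64/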

I actually went about this in a pretty roundabout way, for some reason. Rather than the now-obvious option of adding a GDB breakpoint at the random_bits call site, I tried adding a breakpoint within the vDSO, which failed miserably: the breakpoint never took, as GDB didn’t know the symbol.

At this point, I give up on stepping into it, and focus instead on finding out how frequently it falls back to the syscall. Again, I went about it in a fairly strange way. The obvious plan here would be to write a program that calls clock_gettime64 in a loop, strace it, and count how many of those calls show up in the trace.

Instead, I modified random_bits within glibc itself to call another syscall right after clock_gettime64. The syscall I used was getuid, as it doesn’t have any side effects. Once again, we use strace, and compare the number of getuid calls to the number of clock_gettime64 calls over a few hundred runs of our usual test binary.
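
With the patched glibc in place, counting then boils down to something like this (syscall names as they appear on armhf; binary name reused from earlier):

# Accumulate traces over a few hundred runs, then compare how often
# clock_gettime64 went through an actual syscall versus how often
# random_bits ran at all (marked by the getuid32 canary).
: > trace.log
for i in $(seq 300); do
    strace -qq -e trace=clock_gettime64,getuid32 ./preadv2-pwritev2 2>> trace.log
done
echo "getuid32:        $(grep -c getuid32 trace.log)"
echo "clock_gettime64: $(grep -c clock_gettime64 trace.log)"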

And we get things like this:

clock_gettime64(CLOCK_MONOTONIC, {tv_sec=12931, tv_nsec=22905986}) = 0
getuid32()                              = 1000
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=12931, tv_nsec=24416637}) = 0
getuid32()                              = 1000
clock_gettime64(CLOCK_MONOTONIC, {tv_sec=12931, tv_nsec=24697600}) = 0
getuid32()                              = 1000

Sooo… Basically we always get a syscall? The vDSO is just a passthrough at this point?

Or… There is just no vDSO to begin with. I forgot to check my assumptions at the door, and assumed that there would be a vDSO available, because this has always been the case any time I’ve looked. This would explain the armhf-only failure on Ubuntu specifically.

And sure enough…

ubuntu@armhf:~$ ldd /usr/bin/dash
	libc.so.6 => /lib/arm-linux-gnueabihf/libc.so.6 (0xee620000)
	/lib/ld-linux-armhf.so.3 (0xee7a0000)

And by comparison, outside of the 32-bit container:

ubuntu@arm64:~$ ldd /usr/bin/dash
	linux-vdso.so.1 (0x0000fffff7ffe000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000fffff7d80000)
	/lib/ld-linux-aarch64.so.1 (0x0000fffff7fb0000)

And there we go. Ubuntu ARM64 kernels do not provide a vDSO for 32-bit processes, which, combined with the glibc patch, confuses the strace test suite.

At the time of writing, it’s unclear whether the kernel part of this is intentional since the configuration only mentions amd64, so I simply filed a bug to hopefully get some answers. Meanwhile, we still have a package failing to build, and a release looming.

Had I infinite time, and assuming the absence of a vDSO is a deliberate choice, I would work with the upstream developers to make the testsuite take this possibility into account. However, neither of these conditions can be fulfilled right now.

So instead, I opted for a temporary workaround: on armhf, we make the strace test suite completely ignore clock_gettime64 syscalls, whether or not they were expected. The downside of this approach is that we can’t check the correctness of that particular syscall’s trace on that architecture, but a temporary, targeted workaround now beats a theoretical “proper” fix that’s yet to exist.

Meanwhile, I still filed a bug upstream, to document this edge case, and perhaps one day fix it properly.


  1. Personal Package Archive ↩︎

  2. Ubuntu-ese for 32-bit ARM. Inherited from Debian, stands for ARM Hardware Float. There are other variants in Debian. ↩︎

  3. GNU DeBugger ↩︎