TIL about the BPF PERCPU value memory layout
Today, I learned that the BPF syscall expects each CPU-specific entry in a per-CPU map value to have a size that is a multiple of 8 bytes. This post is mostly a short note to self, and perhaps to others who encounter similar issues.
I’ve checked the code on the 4.19 stable branch as well as the very recent 5.8, and the behavior is there in both cases.
The following data structure would be problematic:
struct stats_t {
    uint32_t passed;
    uint32_t dropped;
    uint32_t redirected;
};
Let’s assume we have a BPF_PERCPU_ARRAY instance shared with an XDP program,
which gathers statistics that can be held in 32-bit integers. The natural way
to consume those statistics from user space would be something along these
lines. Please excuse any syntax errors; this is illustrative, not working C code.
// BPF program and map loading left out for simplicity
int map_fd = load_my_bpf_map(bpf_fd);
int nb_cpus = libbpf_num_possible_cpus();
if (nb_cpus < 0)
    return -1;

struct stats_t *buffer = calloc(nb_cpus, sizeof(*buffer));
if (!buffer)
    return -1;

uint32_t key = 0;

// Initialization of the map data.
if (bpf_map_update_elem(map_fd, &key, buffer, BPF_ANY))
    goto error;

while (1) {
    struct stats_t stats = {};

    sleep(1);
    bpf_map_lookup_elem(map_fd, &key, buffer);

    for (int i = 0; i < nb_cpus; i++) {
        stats.passed += buffer[i].passed;
        stats.dropped += buffer[i].dropped;
        stats.redirected += buffer[i].redirected;
    }

    print_stats(&stats);
    if (stats.passed > 1000000)
        break;
}

// Cleanup left as exercise to the reader
Assuming this compiles, it will not work as expected: the user-space code assumes the data is tightly packed within the buffer, but the kernel expects each per-CPU slot to be rounded up to a multiple of 8 bytes.
As a result, the bpf_map_update_elem call will read past the end of the buffer, initializing the counters with random data and possibly segfaulting. Worse, bpf_map_lookup_elem will happily write past the end of our buffer, and the data we read back will be garbage anyway.
References: