Post

Understanding O_PATH File Descriptors in Linux

Understanding O_PATH File Descriptors in Linux

🔍 Introduction

Imagine you’re building a secure file upload and validation system for a continuous integration (CI) platform. After a user pushes a repository or artifact archive, your backend unpacks it into a temporary directory. Before any build or test can begin, you need to verify the contents to prevent abuse:

  • Ensure all files are confined to the extracted directory (no path traversal tricks like ../../etc/passwd).
  • Check metadata for auditing (e.g., file ownership, timestamps).
  • Reject symlinks or hardlinks that point outside the extraction root (e.g., to /dev/null, or /home).
  • Prevent tampering between inspection and usage (i.e., close TOCTOU holes).
  • Detect and block references to disallowed paths like /proc, /sys, or /dev.

At first glance, this might seem doable with familiar system calls like stat(), lstat(), or realpath(). But these functions operate on entire path strings, and once resolution is complete, it’s too late to detect whether any part of the path traversed symlinks or escaped a restricted root. Worse, they’re prone to time-of-check vs. time-of-use (TOCTOU) attacks — where a malicious user can replace or modify a symlink between inspection and use.

This is where O_PATH, combined with openat2() or the older *at() family, shines.

With O_PATH, you can open a file just to refer to its path, without gaining permission to read or write it. Combined with openat2()’s strict resolution flags like RESOLVE_NO_SYMLINKS and RESOLVE_BENEATH, you can securely walk the directory tree one component at a time — validating structure, checking metadata, and enforcing sandbox boundaries, all without opening file contents.

This approach is used in production by tools like runc, systemd, and container runtimes to safely traverse filesystems in privilege-separated environments.


⚠️ Naive Approach: Vulnerable File Walker in C

Let’s say we want to recursively scan a user-uploaded directory, checking each file’s stat() info and rejecting any symlinks.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <dirent.h>
#include <string.h>
#include <unistd.h>

void scan_dir(const char *path) {
    DIR *dir = opendir(path);
    if (!dir) {
        perror("opendir");
        return;
    }

    struct dirent *entry;
    char fullpath[4096];

    while ((entry = readdir(dir)) != NULL) {
        // Skip "." and ".."
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        snprintf(fullpath, sizeof(fullpath), "%s/%s", path, entry->d_name);

        struct stat st;
        if (lstat(fullpath, &st) == -1) {
            perror("lstat");
            continue;
        }

        // Check for symlinks
        if (S_ISLNK(st.st_mode)) {
            fprintf(stderr, "Rejected symlink: %s\n", fullpath);
            continue;
        }

        // Recursively scan directories
        if (S_ISDIR(st.st_mode)) {
            scan_dir(fullpath);
        } else {
            printf("Accepted: %s\n", fullpath);
        }
    }

    closedir(dir);
}

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <path>\n", argv[0]);
        return 1;
    }

    scan_dir(argv[1]);
    return 0;
}

This looks like it checks for symlinks and avoids them. But it’s vulnerable.


🔓 Exploiting the Race Condition

Suppose your program is scanning /tmp/ci-job/uploads/abc/, and a malicious user includes a directory structure like this:

1
2
3
4
evil/
├── legit/
│   └── file.txt
└── legit -> /etc

Here’s how the attack works:

  • The user uploads evil/legit/ as a normal directory.
  • Right after scan_dir() opens evil/legit, the attacker swaps it out (via a background thread or rapid timing) with a symlink:
1
rm -rf evil/legit && ln -s /etc evil/legit
  • When scan_dir() reaches into evil/legit/, lstat() sees the original directory — but when it calls opendir() and walks it, it’s now actually walking /etc.

The result: the scanner accepts and prints arbitrary paths inside /etc.

This is a textbook TOCTOU (time-of-check vs. time-of-use) vulnerability.


✅ Secure File Walker with openat2() and O_PATH

This example:

  • Walks a directory tree starting from a trusted base fd.
  • Uses openat2() with RESOLVE_BENEATH | RESOLVE_NO_SYMLINKS.
  • Avoids ever working with full path strings.

🛠️ Requirements

This requires:

  • Linux ≥ 5.6 (for openat2())
  • A libc that exposes openat2() or use raw syscall

For portability, we’ll use raw syscall here.


📄 Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <linux/openat2.h>
#include <sys/syscall.h>
#include <dirent.h>
#include <unistd.h>
#include <sys/stat.h>

int openat2_wrapper(int dirfd, const char *pathname, struct open_how *how, size_t size) {
    return syscall(SYS_openat2, dirfd, pathname, how, size);
}

void scan_dir(int dirfd, const char *rel_path) {
    int newfd;

    // Set up secure openat2() parameters
    struct open_how how = {
        .flags = O_PATH | O_NOFOLLOW,
        .resolve = RESOLVE_BENEATH | RESOLVE_NO_SYMLINKS
    };

    newfd = openat2_wrapper(dirfd, rel_path, &how, sizeof(how));
    if (newfd == -1) {
        perror("openat2");
        return;
    }

    // Re-open for listing
    int dfd = openat(newfd, ".", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dfd == -1) {
        perror("open directory");
        close(newfd);
        return;
    }

    DIR *dir = fdopendir(dfd);
    if (!dir) {
        perror("fdopendir");
        close(dfd);
        close(newfd);
        return;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        printf("Found: %s/%s\n", rel_path, entry->d_name);

        // Recurse into subdirectories
        if (entry->d_type == DT_DIR) {
            char sub_path[PATH_MAX];
            if (snprintf(sub_path, sizeof(sub_path), "%s/%s", rel_path, entry->d_name) >= sizeof(sub_path)) {
                fprintf(stderr, "Path too long: %s/%s\n", rel_path, entry->d_name);
                continue;
            }
            scan_dir(newfd, sub_path);
        }
    }

    closedir(dir);
    close(newfd);
}

🧪 Example Usage

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <base-path>\n", argv[0]);
        return 1;
    }

    int basefd = open(argv[1], O_PATH | O_DIRECTORY);
    if (basefd == -1) {
        perror("open base dir");
        return 1;
    }

    scan_dir(basefd, ".");
    close(basefd);
    return 0;
}

🔐 What This Protects Against

  • ✅ Escaping the base directory (../../etc) → blocked by RESOLVE_BENEATH
  • ✅ Symlink tricks → blocked by RESOLVE_NO_SYMLINKS
  • ✅ Race conditions (TOCTOU) → avoided by resolving each component via fd

🧠 Understanding O_PATH File Descriptors

O_PATH lets you open a file only for path resolution, not access. It’s like saying: “I don’t want to read or write, I just want a reference.”

✅ What You Can Do with an O_PATH fd:

  • fstat(), fstatat()
  • openat() (for children)
  • readlinkat()
  • renameat()
  • Inspect via /proc/self/fd/

❌ What You Cannot Do:

  • read(), write(), mmap() → return EBADF
  • execveat() → not supported

Example:

1
2
int fd = open("/etc/passwd", O_PATH);
read(fd, buf, 100);  // EBADF

But you can:

1
readlink /proc/self/fd/3  # shows the real path

🏗️ O_PATH in the Real World: Tools That Rely on It

The O_PATH file descriptor may feel esoteric, but it plays a critical role in modern Linux software — especially in areas like container runtimes, service managers, and sandboxing. Here are some concrete examples:


🐳 runc and Container Runtimes

runc, the reference implementation of the Open Container Initiative (OCI), uses O_PATH heavily when setting up container mount namespaces. Why?

  • It uses O_PATH to safely traverse and bind-mount host files and directories into the container’s rootfs.
  • By opening components with O_PATH and resolving paths only relative to trusted descriptors, runc ensures that container path traversal cannot escape sandbox boundaries or follow unexpected symlinks.
  • This guards against malicious symlink or race-based filesystem exploits during container setup.

Relevant code in libcontainer that uses O_PATH.


⚙️ systemd: Secure Unit File Handling

systemd uses O_PATH in several contexts, particularly when:

  • Tracking units that reference files which may not need to be opened or read, but must be monitored.
  • Walking paths in restricted mount namespaces, where certain files should not be opened in the traditional sense.
  • Mount propagation and PrivateMounts options in unit files benefit from resolving mount targets safely via O_PATH.

It allows systemd to manage filesystem objects without exposing unnecessary capabilities.


🛡️ bubblewrap: Sandboxing for Flatpak and More

bubblewrap, a minimalistic userspace sandboxing tool (used by Flatpak), leverages O_PATH to:

  • Securely resolve and isolate mount points.
  • Prevent symlink tricks during sandbox environment construction.
  • Traverse and manipulate filesystem state without touching file contents.

This is critical for sandboxing untrusted applications and ensuring that even during setup, they can’t influence host filesystem state.


🔎 Why All These Tools Use O_PATH

  • They operate in security-sensitive environments (e.g., containers, service boundaries).
  • They manipulate filesystem state without requiring I/O access.
  • They must walk user-controlled paths safely — a classic source of bugs.
  • They want to minimize privilege while retaining full control over where operations happen.

📞 Conclusion

O_PATH gives you a capability-style reference to a file path — without opening the file. With openat2() and flags like RESOLVE_NO_SYMLINKS, you get precise, secure path traversal:

  • No symlink tricks
  • No path escapes
  • No race conditions

It’s how modern container runtimes and sandboxing tools handle filesystem traversal securely.

If you care about safe path handling in untrusted environments, you should care about O_PATH.


🔗 References

This post is licensed under CC BY 4.0 by the author.