Zalt Blog

Deep Dives into Code & Architecture at Scale

How Linux Bends Time Safely

By Mahmoud Zalt
Code Cracking
30m read

How does an OS bend time without breaking everything that depends on it? This breakdown of Linux shows how time can be shifted while staying safe 🕒


We often think of time in systems as a single, global truth. But inside the Linux kernel, time can be bent, shifted, and isolated per container. In this article, we’ll walk through the kernel/time/namespace.c file and see how Linux implements time namespaces—and, more importantly, what this teaches us about designing safe, extensible isolation features.

My name is Mahmoud Zalt, and together we’ll treat this file as a case study in how to virtualize a core resource (time) without sacrificing safety or performance.

We’ll discover that the real story here is not just “how to add a feature,” but how to keep that feature safe as the kernel evolves: clear invariants, capability checks, defensive coding, and carefully managed one‑way transitions.

What Are Time Namespaces?

To understand this file, we first need to understand the problem it solves. Containers share a kernel but want their own view of the world: their own process IDs, their own mount tables, and in this case, their own time. A time namespace is an isolated view of monotonic and boot time, with configurable offsets from the host.

In practical terms, this allows use cases like running tests that simulate “system uptime is 3 days” without disturbing the host, or running older software that expects a certain boot age.
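
To make the “uptime is 3 days” case concrete: offsets are handed to the kernel as text lines written to /proc/<pid>/timens_offsets. The sketch below only builds such a line, assuming the "<clock> <secs> <nsecs>" format described in the time_namespaces(7) man page. format_timens_offset and SECS_PER_DAY are hypothetical names used for illustration, not kernel or libc APIs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Seconds in one day; used to express "3 days of uptime". */
#define SECS_PER_DAY 86400LL

/*
 * Build one "<clock> <secs> <nsecs>" line for /proc/<pid>/timens_offsets.
 * Returns the line length, or -1 if the arguments are invalid or the
 * buffer is too small. Hypothetical helper for illustration only.
 */
static int format_timens_offset(char *buf, size_t len, const char *clock,
				long long secs, long nsecs)
{
	int n;

	assert(buf != NULL && clock != NULL);

	/* Only the two clocks that have per-namespace offsets are accepted. */
	if (strcmp(clock, "monotonic") != 0 && strcmp(clock, "boottime") != 0)
		return -1;
	if (nsecs < 0 || nsecs >= 1000000000L)
		return -1;

	n = snprintf(buf, len, "%s %lld %ld\n", clock, secs, nsecs);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

Calling format_timens_offset(buf, sizeof(buf), "boottime", 3 * SECS_PER_DAY, 0) yields the line "boottime 259200 0". A real program would write that line after creating the namespace and before any task enters it, and the write needs the privileges covered in the offset-configuration section further down.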

kernel/
  time/
    namespace.c   # time namespaces: lifecycle, VDSO/VVAR wiring, procfs
    time.c        # core timekeeping (external)
    ...

Task lifecycle and data flow (simplified):

  +---------------------+      +------------------+
  |  clone()/fork()    |      |  setns()/procfs |
  +----------+----------+      +--------+---------+
             |                          |
             v                          v
      copy_time_ns()              timens_install()
             |                          |
             v                          v
        nsproxy.time_ns         nsproxy.time_ns[_for_children]
             |                          |
             +-----------+--------------+
                         |
                         v
                 timens_on_fork()
                         |
                         v
                   timens_commit()
                         |
                         v
        +----------------+------------------+
        |  VVAR page (ns->vvar_page)       |
        |  vdso_time_data / vdso_clock     |
        +----------------+------------------+
                         |
                         v
          Userspace VDSO clock_gettime()
Time namespaces sit between process lifecycle, VDSO, and procfs.

The core lesson we’ll keep coming back to: this file is a masterclass in how to isolate a fundamental resource while keeping invariants painfully clear. Every piece of the design—offset computation, one‑time initialization, permission checks—is built to keep that isolation from turning into chaos.

Inside the Time Namespace Pipeline

Now that we know what problem we’re solving, let’s follow how a time namespace actually flows through the system—from creation to use in userspace fast paths.

Lifecycle overview

The file owns the full lifecycle of struct time_namespace:

  • Creation / cloning: clone_time_ns and copy_time_ns
  • Reference management: get_time_ns, put_time_ns via helpers like timens_get, timens_for_children_get
  • Attachment to tasks: timens_install, timens_on_fork, timens_commit
  • VDSO/VVAR wiring: timens_set_vvar_page, find_timens_vvar_page
  • Admin interfaces: proc_timens_show_offsets, proc_timens_set_offset
  • Destruction: free_time_ns

A new time namespace is born via copy_time_ns(), typically when userspace calls unshare(CLONE_NEWTIME) or clone3() with CLONE_NEWTIME set. That function either reuses the parent’s namespace or calls clone_time_ns() to create a fresh one.

struct time_namespace *copy_time_ns(u64 flags,
	struct user_namespace *user_ns, struct time_namespace *old_ns)
{
	if (!(flags & CLONE_NEWTIME))
		return get_time_ns(old_ns);

	return clone_time_ns(user_ns, old_ns);
}

This is our first pattern: a tiny, readable function that encodes a high‑level policy ("reuse or clone") while delegating the messy work to a dedicated helper.
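
The same "reuse or clone" split can be tried outside the kernel. This is a minimal userspace sketch, not the kernel's implementation: demo_ns, demo_get, demo_put, demo_clone, demo_copy and DEMO_NEWTIME are all hypothetical stand-ins, and a plain int replaces the kernel's reference counter.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_NEWTIME 0x80u	/* stand-in for the CLONE_NEWTIME flag */

struct demo_ns {
	int refcount;
	long long monotonic_offset;
};

/* Take another reference on an existing namespace. */
static struct demo_ns *demo_get(struct demo_ns *ns)
{
	ns->refcount++;
	return ns;
}

/* Drop a reference; the last one frees the object. */
static void demo_put(struct demo_ns *ns)
{
	assert(ns->refcount > 0);
	if (--ns->refcount == 0)
		free(ns);
}

/* Mechanism: build a fresh namespace that inherits the parent's offset. */
static struct demo_ns *demo_clone(const struct demo_ns *old)
{
	struct demo_ns *ns = calloc(1, sizeof(*ns));

	if (!ns)
		return NULL;
	ns->refcount = 1;
	ns->monotonic_offset = old->monotonic_offset;
	return ns;
}

/* Policy: share the parent's namespace unless a new one was requested. */
static struct demo_ns *demo_copy(unsigned int flags, struct demo_ns *old)
{
	if (!(flags & DEMO_NEWTIME))
		return demo_get(old);
	return demo_clone(old);
}
```

Either branch returns a namespace the caller owns one reference to, so the caller's cleanup is the same demo_put() in both cases.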

Cloning with guardrails

clone_time_ns() is a good example of how to do staged allocation with clear rollback, especially in low‑level code where partial failure is common:

static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
					struct time_namespace *old_ns)
{
	struct time_namespace *ns;
	struct ucounts *ucounts;
	int err;

	err = -ENOSPC;
	ucounts = inc_time_namespaces(user_ns);
	if (!ucounts)
		goto fail;

	err = -ENOMEM;
	ns = kzalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
	if (!ns)
		goto fail_dec;

	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
	if (!ns->vvar_page)
		goto fail_free;

	err = ns_common_init(ns);
	if (err)
		goto fail_free_page;

	ns->ucounts = ucounts;
	ns->user_ns = get_user_ns(user_ns);
	ns->offsets = old_ns->offsets;
	ns->frozen_offsets = false;
	ns_tree_add(ns);
	return ns;

fail_free_page:
	__free_page(ns->vvar_page);
fail_free:
	kfree(ns);
fail_dec:
	dec_time_namespaces(ucounts);
fail:
	return ERR_PTR(err);
}

Each resource acquisition (ucounts, kzalloc, alloc_page, ns_common_init) has a corresponding labelled failure path. The invariant is simple: for any failure, we must unwind acquired resources in exact reverse order.

This makes future changes safer. If we add a new resource (say, a new per‑namespace data structure), we can insert it into this ladder and keep the error‑handling logic structured.
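
One way to convince yourself that such a ladder unwinds correctly is to inject a failure at each step and check that nothing leaks. The following userspace sketch does that under stated assumptions: demo_create, demo_quota_inc, demo_quota_dec, demo_alloc and the demo_fail_step switch are hypothetical, and calloc() stands in for the kernel allocators.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct demo_ns {
	void *page;	/* stands in for the per-namespace VVAR page */
};

/* Hypothetical test hooks: a quota counter and a fault-injection switch. */
static int demo_live_count;
static int demo_fail_step;	/* 0 = none, 1 = quota, 2 = struct, 3 = page */

static bool demo_quota_inc(void)
{
	if (demo_fail_step == 1)
		return false;
	demo_live_count++;
	return true;
}

static void demo_quota_dec(void)
{
	demo_live_count--;
}

static void *demo_alloc(size_t size, int step)
{
	return demo_fail_step == step ? NULL : calloc(1, size);
}

/* Returns 0 and stores the namespace in *out, or a negative errno. */
static int demo_create(struct demo_ns **out)
{
	struct demo_ns *ns;
	int err;

	assert(out != NULL);

	err = -ENOSPC;
	if (!demo_quota_inc())
		goto fail;

	err = -ENOMEM;
	ns = demo_alloc(sizeof(*ns), 2);
	if (!ns)
		goto fail_dec;

	ns->page = demo_alloc(4096, 3);
	if (!ns->page)
		goto fail_free;

	*out = ns;
	return 0;

	/* Labels release resources in the reverse order of acquisition. */
fail_free:
	free(ns);
fail_dec:
	demo_quota_dec();
fail:
	return err;
}
```

Setting demo_fail_step to 1, 2 or 3 before calling demo_create() exercises each rung; after every failure demo_live_count should be back at its starting value.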

Bending Time Without Breaking It

We’ve seen how namespaces are created and wired into tasks. Next, we look at the heart of the feature: how the kernel and the VDSO actually translate time with offsets, while keeping behavior safe and predictable.

Kernel‑side time translation

The function do_timens_ktime_to_host() is the pure, arithmetic core. It takes a time value expressed in a namespace and returns the equivalent in host coordinates:

ktime_t do_timens_ktime_to_host(clockid_t clockid, ktime_t tim,
			struct timens_offsets *ns_offsets)
{
	ktime_t offset;

	switch (clockid) {
	case CLOCK_MONOTONIC:
		offset = timespec64_to_ktime(ns_offsets->monotonic);
		break;
	case CLOCK_BOOTTIME:
	case CLOCK_BOOTTIME_ALARM:
		offset = timespec64_to_ktime(ns_offsets->boottime);
		break;
	default:
		return tim;
	}

	/* Check that @tim value is in [offset, KTIME_MAX + offset] */
	if (tim < offset) {
		/* Already expired in host coordinates. */
		tim = 0;
	} else {
		tim = ktime_sub(tim, offset);
		if (unlikely(tim > KTIME_MAX))
			tim = KTIME_MAX;
	}

	return tim;
}

The idea is straightforward: depending on the clock ID, pick the right offset (monotonic or boottime), then normalize and clamp. If a timer is set “before” the namespace offset, it’s treated as already expired and mapped to 0. If it’s extremely far in the future, it’s clamped to KTIME_MAX to avoid overflow.

This is an example of defensive arithmetic. The function keeps the result inside the legal ktime_t range for any absolute expiry a caller passes in: a value that lies before the namespace offset collapses to 0 (already expired), and a value that would exceed KTIME_MAX is clamped.
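
A worked example helps here. Suppose the namespace's boottime offset is 3 days. A timer armed inside the namespace for "3 days + 5 s" should fire 5 s after host boot, while one armed for "1 day" already lies in the past. The sketch below reproduces that arithmetic in userspace with plain 64-bit nanosecond values; demo_ns_to_host and DEMO_KTIME_MAX are hypothetical names, and the explicit overflow check is written for portable C rather than copied from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for KTIME_MAX: the largest signed 64-bit nanosecond count. */
#define DEMO_KTIME_MAX INT64_MAX

/* Translate an absolute namespace time to host time, in nanoseconds. */
static int64_t demo_ns_to_host(int64_t tim, int64_t offset)
{
	/* Before the offset: the deadline has already passed on the host. */
	if (tim < offset)
		return 0;

	/* A negative offset pushes the result up; clamp before overflowing. */
	if (offset < 0 && tim > DEMO_KTIME_MAX + offset)
		return DEMO_KTIME_MAX;

	return tim - offset;
}
```

With offset = 3 days, demo_ns_to_host(offset + 5 s, offset) gives 5 s and demo_ns_to_host(1 day, offset) gives 0, matching the two cases described above.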

VDSO and VVAR: Bending time fast

A full system call is too slow for the hot path of clock_gettime(), so Linux uses the VDSO and a special memory page (VVAR) to expose time data directly to user space. Each time namespace needs its own VVAR page.

timens_setup_vdso_clock_data() writes the offset metadata that VDSO code will later use:

static void timens_setup_vdso_clock_data(struct vdso_clock *vc,
					 struct time_namespace *ns)
{
	struct timens_offset *offset = vc->offset;
	struct timens_offset monotonic = offset_from_ts(ns->offsets.monotonic);
	struct timens_offset boottime = offset_from_ts(ns->offsets.boottime);

	vc->seq			= 1;
	vc->clock_mode			= VDSO_CLOCKMODE_TIMENS;
	offset[CLOCK_MONOTONIC]		= monotonic;
	offset[CLOCK_MONOTONIC_RAW]	= monotonic;
	offset[CLOCK_MONOTONIC_COARSE]	= monotonic;
	offset[CLOCK_BOOTTIME]		= boottime;
	offset[CLOCK_BOOTTIME_ALARM]	= boottime;
}

Several related clock IDs share the same underlying offset. Instead of duplicating logic per clock, the file centralizes it around this helper. This makes it easy to reason about what “monotonic in this namespace” actually means for raw and coarse variants.
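
To see what a reader of that table does with it, here is a small userspace model. It assumes the fast path simply adds the stored per-clock offset to the host reading and renormalizes nanoseconds. The demo_* names are hypothetical; the numeric clock IDs are the Linux values, redefined locally so the sketch builds anywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Linux clock ID values, redefined locally for portability. */
enum {
	DEMO_CLOCK_REALTIME		= 0,
	DEMO_CLOCK_MONOTONIC		= 1,
	DEMO_CLOCK_MONOTONIC_RAW	= 4,
	DEMO_CLOCK_MONOTONIC_COARSE	= 6,
	DEMO_CLOCK_BOOTTIME		= 7,
	DEMO_CLOCK_BOOTTIME_ALARM	= 9,
	DEMO_NCLOCKS			= 10,
};

#define DEMO_NSEC_PER_SEC 1000000000LL

struct demo_ts {
	int64_t sec;
	int64_t nsec;	/* kept in [0, DEMO_NSEC_PER_SEC) */
};

/* Fan two configured offsets out to every clock that shares them. */
static void demo_fill_offsets(struct demo_ts table[DEMO_NCLOCKS],
			      struct demo_ts monotonic, struct demo_ts boottime)
{
	table[DEMO_CLOCK_MONOTONIC]		= monotonic;
	table[DEMO_CLOCK_MONOTONIC_RAW]		= monotonic;
	table[DEMO_CLOCK_MONOTONIC_COARSE]	= monotonic;
	table[DEMO_CLOCK_BOOTTIME]		= boottime;
	table[DEMO_CLOCK_BOOTTIME_ALARM]	= boottime;
}

/* Namespace view = host reading + offset, with nanoseconds renormalized. */
static struct demo_ts demo_apply_offset(struct demo_ts host, struct demo_ts off)
{
	struct demo_ts out = { host.sec + off.sec, host.nsec + off.nsec };

	if (out.nsec >= DEMO_NSEC_PER_SEC) {
		out.nsec -= DEMO_NSEC_PER_SEC;
		out.sec++;
	}
	return out;
}
```

Clocks that are not listed, such as the realtime clock, keep a zero entry in this model, so their readings pass through unchanged.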

One‑time VVAR initialization

We also need to answer: when is this per‑namespace VVAR page initialized? The kernel can’t afford to eagerly prepare it for every possible namespace—most of them might never be used. timens_set_vvar_page() solves this with a lazy, one‑time initialization guarded by a mutex and a flag:

static DEFINE_MUTEX(offset_lock);

static void timens_set_vvar_page(struct task_struct *task,
			struct time_namespace *ns)
{
	struct vdso_time_data *vdata;
	struct vdso_clock *vc;
	unsigned int i;

	if (ns == &init_time_ns)
		return;

	/* Fast-path, taken by every task in namespace except the first. */
	if (likely(ns->frozen_offsets))
		return;

	mutex_lock(&offset_lock);
	/* Nothing to-do: vvar_page has been already initialized. */
	if (ns->frozen_offsets)
		goto out;

	ns->frozen_offsets = true;
	vdata = page_address(ns->vvar_page);
	vc = vdata->clock_data;

	for (i = 0; i < CS_BASES; i++)
		timens_setup_vdso_clock_data(&vc[i], ns);

	if (IS_ENABLED(CONFIG_POSIX_AUX_CLOCKS)) {
		for (i = 0; i < ARRAY_SIZE(vdata->aux_clock_data); i++)
			timens_setup_vdso_clock_data(&vdata->aux_clock_data[i], ns);
	}

out:
	mutex_unlock(&offset_lock);
}

The first task that enters a non‑initial namespace triggers initialization. Afterwards, the frozen_offsets flag ensures every subsequent call is a fast, lock‑free early‑return.

This pattern—lazy init guarded by a flag and a mutex—is extremely common in high‑performance systems. It gives you both safety (no race conditions during the first initialization) and performance (no locks in the steady state).
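
For readers who want to reuse the idea in userspace, here is a hedged sketch with a POSIX mutex and a C11 atomic flag. It is not a copy of the kernel code: demo_ns, demo_lock and demo_setup_once are hypothetical, and unlike the snippet above this version publishes the flag only after the data is written, using release/acquire ordering, because portable C gives no guarantees for a plain unsynchronized flag read.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_ns {
	atomic_bool frozen;	/* set once the page below is ready */
	int page[4];		/* stands in for the per-namespace page */
	int init_runs;		/* counts how often the slow path ran */
};

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

static void demo_setup_once(struct demo_ns *ns, int offset)
{
	int i;

	assert(ns != NULL);

	/* Fast path: already initialized, no lock taken. */
	if (atomic_load_explicit(&ns->frozen, memory_order_acquire))
		return;

	pthread_mutex_lock(&demo_lock);
	/* Re-check under the lock: another thread may have won the race. */
	if (!atomic_load_explicit(&ns->frozen, memory_order_relaxed)) {
		for (i = 0; i < 4; i++)
			ns->page[i] = offset;
		ns->init_runs++;
		/* Publish last so fast-path readers see a complete page. */
		atomic_store_explicit(&ns->frozen, true, memory_order_release);
	}
	pthread_mutex_unlock(&demo_lock);
}
```

Calling demo_setup_once() repeatedly leaves init_runs at 1; only the first caller pays for the lock and the copy.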

One‑Way Doors and Lifecycle Guardrails

So far we’ve looked at pure functions and initialization logic. But the most interesting part of this file is how it treats certain actions as one‑way doors. Once you walk through them, you can’t go back—and that is exactly what keeps the system safe.

Freezing offsets

The offsets of a time namespace are configured through a procfs interface handled by proc_timens_set_offset(). This function is long, but it encodes a very important life‑cycle rule:

  • You can set offsets only while the namespace is “unfrozen.”
  • Once offsets are frozen (by first use), they become immutable.

int proc_timens_set_offset(struct file *file, struct task_struct *p,
			   struct proc_timens_offset *offsets, int noffsets)
{
	struct ns_common *ns;
	struct time_namespace *time_ns;
	struct timespec64 tp;
	int i, err;

	ns = timens_for_children_get(p);
	if (!ns)
		return -ESRCH;
	time_ns = to_time_ns(ns);

	if (!file_ns_capable(file, time_ns->user_ns, CAP_SYS_TIME)) {
		put_time_ns(time_ns);
		return -EPERM;
	}

	/* First loop: validate all requested offsets */
	for (i = 0; i < noffsets; i++) {
		struct proc_timens_offset *off = &offsets[i];

		switch (off->clockid) {
		case CLOCK_MONOTONIC:
			ktime_get_ts64(&tp);
			break;
		case CLOCK_BOOTTIME:
			ktime_get_boottime_ts64(&tp);
			break;
		default:
			err = -EINVAL;
			goto out;
		}

		err = -ERANGE;

		if (off->val.tv_sec > KTIME_SEC_MAX ||
		    off->val.tv_sec < -KTIME_SEC_MAX)
			goto out;

		tp = timespec64_add(tp, off->val);
		if (tp.tv_sec < 0 || tp.tv_sec > KTIME_SEC_MAX / 2)
			goto out;
	}

	mutex_lock(&offset_lock);
	if (time_ns->frozen_offsets) {
		err = -EACCES;
		goto out_unlock;
	}

	err = 0;
	/* Don't report errors after this line */
	for (i = 0; i < noffsets; i++) {
		struct proc_timens_offset *off = &offsets[i];
		struct timespec64 *offset = NULL;

		switch (off->clockid) {
		case CLOCK_MONOTONIC:
			offset = &time_ns->offsets.monotonic;
			break;
		case CLOCK_BOOTTIME:
			offset = &time_ns->offsets.boottime;
			break;
		}

		*offset = off->val;
	}

out_unlock:
	mutex_unlock(&offset_lock);
out:
	put_time_ns(time_ns);
	return err;
}

There are three distinct themes here:

  1. Authorization: file_ns_capable(..., CAP_SYS_TIME) ensures that only appropriately privileged tasks (in the right user namespace) can adjust offsets.
  2. Validation before mutation: The first loop reads the current host clock values (ktime_get_ts64 for CLOCK_MONOTONIC, ktime_get_boottime_ts64 for CLOCK_BOOTTIME) and applies tight bounds (KTIME_SEC_MAX for the offset itself, half that range for the resulting sum) to guarantee that applying offsets won’t push derived times negative or near overflow.
  3. One‑way door: After acquiring offset_lock, the code checks time_ns->frozen_offsets. If it’s already frozen, it returns -EACCES. Once offsets are written and later the namespace is used (triggering VVAR setup), they are effectively locked in forever.

This pattern—“validate everything, then do a single atomic commit under a lock”—is a hallmark of robust configuration APIs. It ensures callers either get a clean success or no change at all.
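
The all-or-nothing behavior is easy to demonstrate in a stripped-down userspace model. The sketch below works in whole seconds, leaves out the capability check and the mutex, and takes the current clock reading as a parameter so it stays deterministic. demo_set_offsets, demo_req, demo_ns and DEMO_SEC_MAX are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum { DEMO_MONOTONIC = 1, DEMO_BOOTTIME = 7 };

/* Largest second count that still fits in signed 64-bit nanoseconds. */
#define DEMO_SEC_MAX (INT64_MAX / 1000000000LL)

struct demo_req { int clockid; int64_t sec; };
struct demo_ns { int64_t monotonic; int64_t boottime; bool frozen; };

static int demo_set_offsets(struct demo_ns *ns, const struct demo_req *req,
			    int n, int64_t now_sec)
{
	int i;

	assert(now_sec >= 0 && now_sec <= DEMO_SEC_MAX);

	/* Pass 1: validate every request, modify nothing. */
	for (i = 0; i < n; i++) {
		int64_t sum;

		if (req[i].clockid != DEMO_MONOTONIC &&
		    req[i].clockid != DEMO_BOOTTIME)
			return -EINVAL;
		if (req[i].sec > DEMO_SEC_MAX || req[i].sec < -DEMO_SEC_MAX)
			return -ERANGE;

		/* Both operands are bounded, so this sum cannot overflow. */
		sum = now_sec + req[i].sec;
		if (sum < 0 || sum > DEMO_SEC_MAX / 2)
			return -ERANGE;
	}

	/* One-way door: a frozen namespace rejects all writes. */
	if (ns->frozen)
		return -EACCES;

	/* Pass 2: commit; nothing below this line can fail. */
	for (i = 0; i < n; i++) {
		if (req[i].clockid == DEMO_MONOTONIC)
			ns->monotonic = req[i].sec;
		else
			ns->boottime = req[i].sec;
	}
	return 0;
}
```

A batch with one valid and one out-of-range entry returns -ERANGE and leaves both stored offsets untouched, which is the property the two-pass structure exists to provide.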

Namespaces on fork() and setns()

Another critical lifecycle aspect is how time namespaces behave when tasks fork or call setns(). The file keeps the rules simple:

  • timens_install() updates both time_ns and time_ns_for_children in nsproxy, but only if the caller:
    • Is single‑threaded (current_is_single_threaded())
    • Holds CAP_SYS_ADMIN in both the new namespace’s user_ns and its own cred user_ns
  • timens_on_fork() ensures the child’s active namespace matches time_ns_for_children, then calls timens_commit() to initialize VVAR and bind VDSO.

This combination ensures two invariants:

  • You can’t surprise multi‑threaded processes by changing their time namespace mid‑flight.
  • Children inherit a well‑defined namespace, and their VDSO mappings are updated accordingly.

Performance and Scale: Why This Design Holds Up

So far the design looks careful and conservative. But what happens under real load, with thousands of containers that may each have a different time namespace? Looking at the cost of each path lets us connect design choices to real‑world behavior.

Cheap hot paths

The truly hot paths are:

  • do_timens_ktime_to_host() when used from timer and clock paths
  • VDSO fast‑path reads using the offsets in vdso_time_data

Both are O(1) with tiny constant factors: a switch on clockid, a couple of arithmetic operations, and conditional clamping. There are no loops over namespaces; each task only ever talks to its own namespace.

A metric such as time_namespace_vvar_init_duration_seconds (a suggested name for instrumentation, not an existing kernel counter) reflects the design goals well: VVAR initialization should finish well below 1 ms, and because it happens once per namespace, it does not affect steady‑state latency.

Bounded per‑namespace overhead

Each time namespace owns:

  • A small struct time_namespace
  • A single VVAR page (vvar_page)
  • Offsets for monotonic and boottime

The memory footprint is modest and, importantly, independent of how many tasks are in the namespace. Container orchestrators can safely create many containers with their own time namespaces, as long as they respect ucount limits (UCOUNT_TIME_NAMESPACES), which are enforced in clone_time_ns() via inc_time_namespaces().

Aspect                     | Design Choice                         | Impact on Scale
---------------------------+---------------------------------------+-----------------------------------------
Hot path time translation  | O(1) arithmetic, no locks             | Stable latency even with many namespaces
VVAR initialization        | Once per namespace, mutex‑guarded     | Negligible amortized cost per task
Offset configuration       | Admin‑only, mutex‑guarded, infrequent | No effect on normal workloads
Namespace count            | ucount limits & small per‑ns state    | Protection from resource exhaustion

This is a general pattern for scalable features: keep the common path lock‑free and O(1), move expensive work into rare administrative or setup operations, and bound per‑instance memory overhead.

Hardening for the Future

Now we come to the part that’s most useful for us as engineers: where the design shows stress points and how small, careful refactors can make it more robust against future changes.

Defensive programming around clockids

In proc_timens_set_offset(), the first loop rejects unsupported clockids with -EINVAL. The second loop, under the lock, assumes every offset is for a supported clock and dereferences a pointer that may remain NULL if a new clock ID is ever introduced without updating this switch.

This is subtle: it’s safe today, but it becomes a time bomb if someone later adds a new supported clock to the validation loop and forgets to update the second switch.

A low‑risk hardening refactor is to add a default case that simply continues if no matching clock is found, skipping unknown entries rather than risking a NULL dereference.
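
In isolation, a commit loop with a skipping default case could look like the following userspace sketch. It is an illustration of the idea, not a kernel patch: demo_commit, demo_req and demo_offsets are hypothetical, and offsets are reduced to whole seconds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_MONOTONIC = 1, DEMO_BOOTTIME = 7 };

struct demo_req { int clockid; int64_t sec; };
struct demo_offsets { int64_t monotonic; int64_t boottime; };

/* Apply validated requests; entries for unknown clocks are skipped. */
static void demo_commit(struct demo_offsets *dst,
			const struct demo_req *req, int n)
{
	int i;

	assert(dst != NULL);

	for (i = 0; i < n; i++) {
		int64_t *slot = NULL;

		switch (req[i].clockid) {
		case DEMO_MONOTONIC:
			slot = &dst->monotonic;
			break;
		case DEMO_BOOTTIME:
			slot = &dst->boottime;
			break;
		default:
			/* Unknown clock: move to the next entry, never touch slot. */
			continue;
		}

		*slot = req[i].sec;
	}
}
```

The continue inside the switch applies to the enclosing for loop, so the store through slot is only reached when one of the known cases assigned it.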

Separating concerns: frozen vs. initialized

As we saw earlier, frozen_offsets currently means two things at once:

  • Offsets are now immutable.
  • VVAR has been initialized for this namespace.

This is convenient but couples two logically distinct concepts. One option is to introduce a separate vvar_initialized flag. With that split, we’d get clearer semantics:

  • vvar_initialized: has the per‑namespace VVAR page been set up?
  • frozen_offsets: are offset writes forbidden?

Splitting these responsibilities would make it easier to evolve time namespaces—for example, to allow offset configuration up until the first task actually uses VDSO data, or to support more nuanced “freeze” policies in the future.
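
As a thought experiment, the split could be modeled like this. Everything here is hypothetical, including the vvar_initialized field itself, which does not exist in the kernel; the point of the sketch is the invariant that the page is only ever filled from offsets that can no longer change.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct demo_ns {
	bool frozen_offsets;	/* offset writes are forbidden */
	bool vvar_initialized;	/* the per-namespace page has been filled in */
	long long monotonic;	/* configured offset */
	long long page_copy;	/* stands in for the VVAR page contents */
};

/* Offset writes depend only on the freeze flag. */
static int demo_write_offset(struct demo_ns *ns, long long value)
{
	if (ns->frozen_offsets)
		return -EACCES;
	ns->monotonic = value;
	return 0;
}

static void demo_freeze(struct demo_ns *ns)
{
	ns->frozen_offsets = true;
}

/* Page setup implies a freeze, but a freeze does not imply page setup. */
static void demo_init_vvar(struct demo_ns *ns)
{
	if (ns->vvar_initialized)
		return;

	demo_freeze(ns);
	ns->page_copy = ns->monotonic;
	ns->vvar_initialized = true;

	/* Invariant: an initialized page always has frozen offsets. */
	assert(ns->frozen_offsets);
}
```

With two flags, a policy such as "freeze at some earlier point, fill the page at first use" becomes expressible without changing the write path.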

Documenting reference counting contracts

Finally, reference counting is handled consistently but implicitly. Helpers like timens_get(), timens_for_children_get(), timens_install(), and timens_on_fork() all manipulate get_time_ns()/put_time_ns(), but their contracts are not explicitly documented in comments.

In a subsystem like namespaces, where leaks or premature frees can be catastrophic, adding 1–2 line comments stating “returns a referenced namespace; caller must call put_time_ns()” can dramatically reduce the cognitive overhead for future maintainers.

Lessons You Can Apply Today

We’ve walked through Linux’s time namespace implementation from multiple angles: lifecycle, time translation, VDSO wiring, error handling, and future hardening. Let’s distill this into a few concrete practices you can bring into your own systems—kernel or otherwise.

Lesson 1: Make invariants explicit

Time namespaces rely on a small set of critical invariants:

  • Offsets never change after being frozen.
  • Every live namespace has a valid VVAR page and ns_common initialized.
  • Reference increments are always balanced with decrements.

These are not just informal guidelines; they’re baked into the code paths and enforced via flags (frozen_offsets), mutexes (offset_lock), and structured allocation/free sequences. Whenever you design a subsystem, write down your invariants and make sure your code structure makes them easy to see.

Lesson 2: Validate before you mutate

proc_timens_set_offset() is a good template for safe configuration APIs:

  • Check ownership and capabilities first.
  • Validate every requested change (including bounds and derived values) in a read‑only pass.
  • Only after all checks pass, take the lock and apply changes in a single commit loop.

This pattern avoids partial updates and makes rollback unnecessary in the common case.

Lesson 3: Separate policy from mechanism

We’ve seen this separation throughout:

  • copy_time_ns() decides whether to create a new namespace; clone_time_ns() decides how to do it safely.
  • timens_install() encodes the policy for setns() (must be single‑threaded, must have capabilities).
  • timens_set_vvar_page() owns the mechanics of VVAR initialization.

In complex systems, mixing policy and mechanism quickly leads to functions that are impossible to test and reason about. Splitting them gives you smaller, composable units.

Lesson 4: Plan for evolution

Even in a mature codebase like the kernel, today’s correct code can be tomorrow’s bug when requirements change. The two small refactors discussed above (guarding against new clock IDs and splitting frozen_offsets) are all about future‑proofing.

Whenever you add a feature:

  • Ask what will happen if someone adds a new enum value or a new field.
  • Consider whether a flag is doing double duty and might need to be split later.
  • Add defensive fallbacks for “impossible” states where it’s cheap to do so.

The goal is not to predict every future; it’s to make future changes less fragile.

Closing thoughts

Time namespaces are a fascinating example of virtualization at the core of the operating system. But for us as engineers, their real value is as a pattern library:

  • Use pure functions and clear invariants for core logic.
  • Guard lifecycle transitions with capabilities and one‑way doors.
  • Make initialization lazy and idempotent to keep hot paths fast.
  • Harden boundaries so the subsystem stays safe as requirements evolve.

If you’re designing your own isolation mechanism—whether for tenants in a SaaS platform, virtual clusters, or per‑request configuration—this file is worth treating as required reading. The Linux kernel team had to bend time itself, and they did it without letting the system fall off the rails. Our job is to bring that same care and discipline into whatever we build next.


Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 15+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss your career.
