We often think of time in systems as a single, global truth. But inside the Linux kernel, time can be bent, shifted, and isolated per container. In this article, we’ll walk through the kernel/time/namespace.c file and see how Linux implements time namespaces—and, more importantly, what this teaches us about designing safe, extensible isolation features.
My name is Mahmoud Zalt, and together we’ll treat this file as a case study in how to virtualize a core resource (time) without sacrificing safety or performance.
We’ll discover that the real story here is not just “how to add a feature,” but how to keep that feature safe as the kernel evolves: clear invariants, capability checks, defensive coding, and carefully managed one‑way transitions.
What Are Time Namespaces?
To understand this file, we first need to understand the problem it solves. Containers share a kernel but want their own view of the world: their own process IDs, their own mount tables, and in this case, their own time. A time namespace is an isolated view of monotonic and boot time, with configurable offsets from the host.
In practical terms, this allows use cases like running tests that simulate “system uptime is 3 days” without disturbing the host, or running older software that expects a certain boot age.
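As a concrete illustration of the "uptime is 3 days" case, here is a minimal sketch of how the text for one offset entry could be built. It assumes the line format documented in time_namespaces(7) for the timens_offsets procfs file (clock name, seconds, nanoseconds). The helper name demo_format_offset is hypothetical and exists only for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one offset line of the form "<clock> <secs> <nanosecs>\n".
 * clock is expected to be "monotonic" or "boottime".
 * demo_format_offset is a hypothetical helper, not a kernel or libc API.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int demo_format_offset(char *buf, size_t len, const char *clock,
                              long long sec, long nsec)
{
    int n;

    assert(strcmp(clock, "monotonic") == 0 || strcmp(clock, "boottime") == 0);

    n = snprintf(buf, len, "%s %lld %ld\n", clock, sec, nsec);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Calling demo_format_offset(buf, sizeof(buf), "monotonic", 3LL * 24 * 60 * 60, 0) yields the line "monotonic 259200 0". A privileged process would then write that line to the offsets file of a process whose namespace for children is the new one, before any task has entered it.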
kernel/
  time/
    namespace.c   # time namespaces: lifecycle, VDSO/VVAR wiring, procfs
    time.c        # core timekeeping (external)
    ...
Task lifecycle and data flow (simplified):
+---------------------+        +------------------+
|   clone()/fork()    |        |  setns()/procfs  |
+----------+----------+        +--------+---------+
           |                            |
           v                            v
    copy_time_ns()              timens_install()
           |                            |
           v                            v
    nsproxy.time_ns      nsproxy.time_ns[_for_children]
           |                            |
           +-------------+--------------+
                         |
                         v
                 timens_on_fork()
                         |
                         v
                  timens_commit()
                         |
                         v
        +----------------+------------------+
        |     VVAR page (ns->vvar_page)     |
        |    vdso_time_data / vdso_clock    |
        +----------------+------------------+
                         |
                         v
          Userspace VDSO clock_gettime()
The core lesson we’ll keep coming back to: this file is a masterclass in how to isolate a fundamental resource while keeping invariants painfully clear. Every piece of the design—offset computation, one‑time initialization, permission checks—is built to keep that isolation from turning into chaos.
Inside the Time Namespace Pipeline
Now that we know what problem we’re solving, let’s follow how a time namespace actually flows through the system—from creation to use in userspace fast paths.
Lifecycle overview
The file owns the full lifecycle of struct time_namespace:
- Creation / cloning: clone_time_ns and copy_time_ns
- Reference management: get_time_ns, put_time_ns via helpers like timens_get, timens_for_children_get
- Attachment to tasks: timens_install, timens_on_fork, timens_commit
- VDSO/VVAR wiring: timens_set_vvar_page, find_timens_vvar_page
- Admin interfaces: proc_timens_show_offsets, proc_timens_set_offset
- Destruction: free_time_ns
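A new time namespace is born via copy_time_ns(), which runs when userspace requests CLONE_NEWTIME, typically through unshare(CLONE_NEWTIME) or clone3(). The legacy clone() call cannot carry this flag because its bit overlaps the exit-signal field. That function either reuses the parent’s namespace or calls clone_time_ns() to create a fresh one.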
struct time_namespace *copy_time_ns(u64 flags,
    struct user_namespace *user_ns, struct time_namespace *old_ns)
{
    if (!(flags & CLONE_NEWTIME))
        return get_time_ns(old_ns);

    return clone_time_ns(user_ns, old_ns);
}
This is our first pattern: a tiny, readable function that encodes a high‑level policy ("reuse or clone") while delegating the messy work to a dedicated helper.
Cloning with guardrails
clone_time_ns() is a good example of how to do staged allocation with clear rollback, especially in low‑level code where partial failure is common:
static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
                                            struct time_namespace *old_ns)
{
    struct time_namespace *ns;
    struct ucounts *ucounts;
    int err;

    err = -ENOSPC;
    ucounts = inc_time_namespaces(user_ns);
    if (!ucounts)
        goto fail;

    err = -ENOMEM;
    ns = kzalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
    if (!ns)
        goto fail_dec;

    ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
    if (!ns->vvar_page)
        goto fail_free;

    err = ns_common_init(ns);
    if (err)
        goto fail_free_page;

    ns->ucounts = ucounts;
    ns->user_ns = get_user_ns(user_ns);
    ns->offsets = old_ns->offsets;
    ns->frozen_offsets = false;
    ns_tree_add(ns);
    return ns;

fail_free_page:
    __free_page(ns->vvar_page);
fail_free:
    kfree(ns);
fail_dec:
    dec_time_namespaces(ucounts);
fail:
    return ERR_PTR(err);
}
Each resource acquisition (ucounts, kzalloc, alloc_page, ns_common_init) has a corresponding labelled failure path. The invariant is simple: for any failure, we must unwind acquired resources in exact reverse order.
This makes future changes safer. If we add a new resource (say, a new per‑namespace data structure), we can insert it into this ladder and keep the error‑handling logic structured.
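The same ladder can be tried out in userspace. The sketch below is an illustration under stated assumptions, not kernel code: demo_ns, demo_charge, demo_uncharge and the fail_at parameter are all hypothetical, with calloc standing in for kzalloc and alloc_page, and a plain counter standing in for the ucounts charge.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel objects; not the real definitions. */
struct demo_ns {
    void *page;     /* stands in for ns->vvar_page */
    int *counter;   /* stands in for the ucounts charge */
};

/* Hypothetical quota: returns 0 on success, -1 once the limit is reached. */
static int demo_charge(int *counter, int limit)
{
    if (*counter >= limit)
        return -1;
    (*counter)++;
    return 0;
}

static void demo_uncharge(int *counter)
{
    assert(*counter > 0);   /* an uncharge must match an earlier charge */
    (*counter)--;
}

/*
 * Acquire resources in order. On failure, jump to the label that releases
 * exactly what was acquired so far, in reverse order.
 * fail_at (0 = none) forces a failure at a given stage for demonstration.
 */
static struct demo_ns *demo_ns_create(int *counter, int limit,
                                      int fail_at, int *err)
{
    struct demo_ns *ns;

    *err = -ENOSPC;
    if (demo_charge(counter, limit) != 0)
        goto fail;

    *err = -ENOMEM;
    ns = (fail_at == 1) ? NULL : calloc(1, sizeof(*ns));
    if (!ns)
        goto fail_uncharge;

    ns->page = (fail_at == 2) ? NULL : calloc(1, 4096);
    if (!ns->page)
        goto fail_free;

    ns->counter = counter;
    *err = 0;
    return ns;

fail_free:
    free(ns);
fail_uncharge:
    demo_uncharge(counter);
fail:
    return NULL;
}

/* Teardown mirrors creation in reverse. */
static void demo_ns_destroy(struct demo_ns *ns)
{
    free(ns->page);
    demo_uncharge(ns->counter);
    free(ns);
}
```

Forcing a failure at stage 2 leaves the counter at its starting value, which is the property the ladder is meant to guarantee: no partial state survives a failed creation.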
Bending Time Without Breaking It
We’ve seen how namespaces are created and wired into tasks. Next, we look at the heart of the feature: how the kernel and the VDSO actually translate time with offsets, while keeping behavior safe and predictable.
Kernel‑side time translation
The function do_timens_ktime_to_host() is the pure, arithmetic core. It takes a time value expressed in a namespace and returns the equivalent in host coordinates:
ktime_t do_timens_ktime_to_host(clockid_t clockid, ktime_t tim,
                                struct timens_offsets *ns_offsets)
{
    ktime_t offset;

    switch (clockid) {
    case CLOCK_MONOTONIC:
        offset = timespec64_to_ktime(ns_offsets->monotonic);
        break;
    case CLOCK_BOOTTIME:
    case CLOCK_BOOTTIME_ALARM:
        offset = timespec64_to_ktime(ns_offsets->boottime);
        break;
    default:
        return tim;
    }

    /* Check that @tim value is in [offset, KTIME_MAX + offset] */
    if (tim < offset) {
        /* Already expired in host coordinates. */
        tim = 0;
    } else {
        tim = ktime_sub(tim, offset);
        if (unlikely(tim > KTIME_MAX))
            tim = KTIME_MAX;
    }

    return tim;
}
The idea is straightforward: depending on the clock ID, pick the right offset (monotonic or boottime), then normalize and clamp. If a timer is set “before” the namespace offset, it’s treated as already expired and mapped to 0. If it’s extremely far in the future, it’s clamped to KTIME_MAX to avoid overflow.
This is an example of defensive arithmetic. The function defends against broken inputs by ensuring the result always stays in a legal range, even if the caller mixes up absolute and relative time.
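To make the clamping concrete, here is a small userspace sketch of the same arithmetic on plain 64-bit nanosecond counts. The name demo_ns_to_host and the DEMO_KTIME_MAX constant are assumptions made for this example. Unlike the kernel snippet, the sketch checks for overflow before subtracting, because signed overflow is undefined in ISO C.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for KTIME_MAX, assuming a signed 64-bit ns count. */
#define DEMO_KTIME_MAX INT64_MAX

/*
 * Translate a namespace-relative absolute time into host time by
 * subtracting the namespace offset and clamping into [0, DEMO_KTIME_MAX].
 * demo_ns_to_host is a hypothetical name, not a kernel function.
 */
static int64_t demo_ns_to_host(int64_t tim, int64_t offset)
{
    int64_t host;

    if (tim < offset)
        return 0;   /* already expired from the host's point of view */

    /* A negative offset makes tim - offset grow; clamp before it overflows. */
    if (offset < 0 && tim > DEMO_KTIME_MAX + offset)
        return DEMO_KTIME_MAX;

    host = tim - offset;
    assert(host >= 0);   /* post-condition: the result is never negative */
    return host;
}
```

With an offset of 3000 ns, a namespace time of 5000 ns maps to 2000 ns on the host, while a namespace time of 1000 ns lies before the offset and maps to 0.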
VDSO and VVAR: Bending time fast
Kernel syscalls are too slow for the hot path of clock_gettime(), so Linux uses the VDSO and a special memory page (VVAR) to expose time data directly to user space. Time namespaces need their own VVAR page per namespace.
timens_setup_vdso_clock_data() writes the offset metadata that VDSO code will later use:
static void timens_setup_vdso_clock_data(struct vdso_clock *vc,
                                         struct time_namespace *ns)
{
    struct timens_offset *offset = vc->offset;
    struct timens_offset monotonic = offset_from_ts(ns->offsets.monotonic);
    struct timens_offset boottime = offset_from_ts(ns->offsets.boottime);

    vc->seq = 1;
    vc->clock_mode = VDSO_CLOCKMODE_TIMENS;
    offset[CLOCK_MONOTONIC] = monotonic;
    offset[CLOCK_MONOTONIC_RAW] = monotonic;
    offset[CLOCK_MONOTONIC_COARSE] = monotonic;
    offset[CLOCK_BOOTTIME] = boottime;
    offset[CLOCK_BOOTTIME_ALARM] = boottime;
}
Several related clock IDs share the same underlying offset. Instead of duplicating logic per clock, the file centralizes it around this helper. This makes it easy to reason about what “monotonic in this namespace” actually means for raw and coarse variants.
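On the reading side, the offset has to be added back to a host clock value. The sketch below shows that step in isolation, assuming the read path simply adds a (seconds, nanoseconds) offset to the host reading and renormalizes. That is a simplification, since real VDSO code also deals with sequence counters and clock modes. The types demo_ts and demo_offset and the function demo_apply_offset are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000LL

/* Hypothetical mirrors of a clock reading and a per-clock offset. */
struct demo_ts     { int64_t sec; int64_t nsec; };
struct demo_offset { int64_t sec; int64_t nsec; };

/*
 * Add a namespace offset to a host reading and renormalize so that
 * 0 <= nsec < 1e9. Both inputs are assumed to be normalized already.
 */
static struct demo_ts demo_apply_offset(struct demo_ts host,
                                        struct demo_offset off)
{
    struct demo_ts out;

    assert(host.nsec >= 0 && host.nsec < DEMO_NSEC_PER_SEC);
    assert(off.nsec >= 0 && off.nsec < DEMO_NSEC_PER_SEC);

    out.sec = host.sec + off.sec;
    out.nsec = host.nsec + off.nsec;
    if (out.nsec >= DEMO_NSEC_PER_SEC) {   /* at most one carry is possible */
        out.nsec -= DEMO_NSEC_PER_SEC;
        out.sec += 1;
    }
    return out;
}
```

A host reading of 100.9 s combined with an offset of 259200.2 s gives 259301.1 s. The same offset would apply to the raw and coarse monotonic variants, because they share one entry.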
One‑time VVAR initialization
We also need to answer: when is this per‑namespace VVAR page initialized? The kernel can’t afford to eagerly prepare it for every possible namespace—most of them might never be used. timens_set_vvar_page() solves this with a lazy, one‑time initialization guarded by a mutex and a flag:
static DEFINE_MUTEX(offset_lock);

static void timens_set_vvar_page(struct task_struct *task,
                                 struct time_namespace *ns)
{
    struct vdso_time_data *vdata;
    struct vdso_clock *vc;
    unsigned int i;

    if (ns == &init_time_ns)
        return;

    /* Fast-path, taken by every task in namespace except the first. */
    if (likely(ns->frozen_offsets))
        return;

    mutex_lock(&offset_lock);
    /* Nothing to-do: vvar_page has been already initialized. */
    if (ns->frozen_offsets)
        goto out;

    ns->frozen_offsets = true;
    vdata = page_address(ns->vvar_page);
    vc = vdata->clock_data;

    for (i = 0; i < CS_BASES; i++)
        timens_setup_vdso_clock_data(&vc[i], ns);

    if (IS_ENABLED(CONFIG_POSIX_AUX_CLOCKS)) {
        for (i = 0; i < ARRAY_SIZE(vdata->aux_clock_data); i++)
            timens_setup_vdso_clock_data(&vdata->aux_clock_data[i], ns);
    }

out:
    mutex_unlock(&offset_lock);
}
The first task that enters a non‑initial namespace triggers initialization. Afterwards, the frozen_offsets flag ensures every subsequent call is a fast, lock‑free early‑return.
This pattern—lazy init guarded by a flag and a mutex—is extremely common in high‑performance systems. It gives you both safety (no race conditions during the first initialization) and performance (no locks in the steady state).
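For readers who want to try the flag-plus-mutex pattern outside the kernel, here is a minimal userspace sketch using C11 atomics and a pthread mutex. Everything in it (demo_ns, demo_commit, demo_read) is hypothetical. The sketch stores the flag last with release ordering, which is a choice made for this userspace version rather than a statement about how the kernel orders its writes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace analogue of a namespace with lazily built data. */
struct demo_ns {
    atomic_bool frozen;   /* plays the role of frozen_offsets */
    int offset;           /* configurable until frozen */
    int published;        /* derived data, built exactly once */
    int init_count;       /* how many times the slow path ran */
};

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

static void demo_commit(struct demo_ns *ns)
{
    /* Fast path: lock-free check, taken by every caller after the first. */
    if (atomic_load_explicit(&ns->frozen, memory_order_acquire))
        return;

    pthread_mutex_lock(&demo_lock);
    /* Re-check under the lock: another thread may have won the race. */
    if (!atomic_load_explicit(&ns->frozen, memory_order_relaxed)) {
        ns->published = ns->offset;
        ns->init_count++;
        /* Publish the flag only after the data is fully written. */
        atomic_store_explicit(&ns->frozen, true, memory_order_release);
    }
    pthread_mutex_unlock(&demo_lock);
}

/* Readers always go through demo_commit, so they never see unbuilt data. */
static int demo_read(struct demo_ns *ns)
{
    demo_commit(ns);
    assert(atomic_load(&ns->frozen));
    return ns->published;
}
```

After the first demo_read, changing ns->offset has no effect on later reads. This is the same one-way behavior that the frozen flag gives the namespace offsets.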
One‑Way Doors and Lifecycle Guardrails
So far we’ve looked at pure functions and initialization logic. But the most interesting part of this file is how it treats certain actions as one‑way doors. Once you walk through them, you can’t go back—and that is exactly what keeps the system safe.
Freezing offsets
The offsets of a time namespace are configured through a procfs interface handled by proc_timens_set_offset(). This function is long, but it encodes a very important life‑cycle rule:
- You can set offsets only while the namespace is “unfrozen.”
- Once offsets are frozen (by first use), they become immutable.
int proc_timens_set_offset(struct file *file, struct task_struct *p,
                           struct proc_timens_offset *offsets, int noffsets)
{
    struct ns_common *ns;
    struct time_namespace *time_ns;
    struct timespec64 tp;
    int i, err;

    ns = timens_for_children_get(p);
    if (!ns)
        return -ESRCH;
    time_ns = to_time_ns(ns);

    if (!file_ns_capable(file, time_ns->user_ns, CAP_SYS_TIME)) {
        put_time_ns(time_ns);
        return -EPERM;
    }

    /* First loop: validate all requested offsets */
    for (i = 0; i < noffsets; i++) {
        struct proc_timens_offset *off = &offsets[i];

        switch (off->clockid) {
        case CLOCK_MONOTONIC:
            ktime_get_ts64(&tp);
            break;
        case CLOCK_BOOTTIME:
            ktime_get_boottime_ts64(&tp);
            break;
        default:
            err = -EINVAL;
            goto out;
        }

        err = -ERANGE;

        if (off->val.tv_sec > KTIME_SEC_MAX ||
            off->val.tv_sec < -KTIME_SEC_MAX)
            goto out;

        tp = timespec64_add(tp, off->val);
        if (tp.tv_sec < 0 || tp.tv_sec > KTIME_SEC_MAX / 2)
            goto out;
    }

    mutex_lock(&offset_lock);
    if (time_ns->frozen_offsets) {
        err = -EACCES;
        goto out_unlock;
    }

    err = 0;
    /* Don't report errors after this line */
    for (i = 0; i < noffsets; i++) {
        struct proc_timens_offset *off = &offsets[i];
        struct timespec64 *offset = NULL;

        switch (off->clockid) {
        case CLOCK_MONOTONIC:
            offset = &time_ns->offsets.monotonic;
            break;
        case CLOCK_BOOTTIME:
            offset = &time_ns->offsets.boottime;
            break;
        }

        *offset = off->val;
    }

out_unlock:
    mutex_unlock(&offset_lock);
out:
    put_time_ns(time_ns);

    return err;
}
There are three distinct themes here:
- Authorization: file_ns_capable(..., CAP_SYS_TIME) ensures that only appropriately privileged tasks (in the right user namespace) can adjust offsets.
- Validation before mutation: The first loop reads the current clock values (ktime_get_ts64 for CLOCK_MONOTONIC, ktime_get_boottime_ts64 for CLOCK_BOOTTIME) and applies tight bounds (KTIME_SEC_MAX, then half that range) to guarantee that applying offsets won’t push derived times negative or near overflow.
- One‑way door: After acquiring offset_lock, the code checks time_ns->frozen_offsets. If it’s already frozen, it returns -EACCES. Once the namespace is used (triggering VVAR setup), the offsets are locked in for the rest of its lifetime.
This pattern—“validate everything, then do a single atomic commit under a lock”—is a hallmark of robust configuration APIs. It ensures callers either get a clean success or no change at all.
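Stripped of kernel details, the validate-then-commit shape looks like the sketch below. It is a single-threaded illustration, so the lock is omitted. The types, the DEMO_SEC_LIMIT bound and the function demo_set_offsets are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum demo_clock { DEMO_MONOTONIC, DEMO_BOOTTIME, DEMO_OTHER };

struct demo_request { enum demo_clock clock; int64_t sec; };

struct demo_ns {
    bool frozen;
    int64_t monotonic;
    int64_t boottime;
};

#define DEMO_SEC_LIMIT 1000000   /* hypothetical bound for this sketch */

static int demo_set_offsets(struct demo_ns *ns,
                            const struct demo_request *req, size_t n)
{
    size_t i;

    assert(ns != NULL && (req != NULL || n == 0));

    /* Pass 1: read-only validation. Nothing is modified here. */
    for (i = 0; i < n; i++) {
        if (req[i].clock != DEMO_MONOTONIC && req[i].clock != DEMO_BOOTTIME)
            return -EINVAL;
        if (req[i].sec > DEMO_SEC_LIMIT || req[i].sec < -DEMO_SEC_LIMIT)
            return -ERANGE;
    }

    /* One-way door: a frozen namespace rejects every write. */
    if (ns->frozen)
        return -EACCES;

    /* Pass 2: commit. Every entry was validated, so this cannot fail. */
    for (i = 0; i < n; i++) {
        if (req[i].clock == DEMO_MONOTONIC)
            ns->monotonic = req[i].sec;
        else
            ns->boottime = req[i].sec;
    }
    return 0;
}
```

A request list with one bad entry leaves both offsets untouched, even when the bad entry comes after a valid one. That is the all-or-nothing behavior described above.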
Namespaces on fork() and setns()
Another critical lifecycle aspect is how time namespaces behave when tasks fork or call setns(). The file keeps the rules simple:
- timens_install() updates both time_ns and time_ns_for_children in nsproxy, but only if the caller:
  - Is single‑threaded (current_is_single_threaded())
  - Holds CAP_SYS_ADMIN in both the new namespace’s user_ns and its own cred user_ns
- timens_on_fork() ensures the child’s active namespace matches time_ns_for_children, then calls timens_commit() to initialize VVAR and bind VDSO.
This combination ensures two invariants:
- You can’t surprise multi‑threaded processes by changing their time namespace mid‑flight.
- Children inherit a well‑defined namespace, and their VDSO mappings are updated accordingly.
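The two rules can be modeled in a few lines of userspace C. This is a sketch under assumptions: the structs and functions are hypothetical, reference counting is left out, and the specific error codes (-EUSERS for the multi-threaded case, -EPERM for missing capabilities) are assumed here rather than quoted from the listing above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_time_ns { int id; };

/* Hypothetical mirror of the two nsproxy fields discussed above. */
struct demo_nsproxy {
    struct demo_time_ns *time_ns;
    struct demo_time_ns *time_ns_for_children;
};

/* Facts about the caller that the kernel would derive on its own. */
struct demo_caller {
    bool single_threaded;
    bool cap_in_target_userns;
    bool cap_in_own_userns;
};

/* Policy: refuse unless single-threaded and capable in both user namespaces. */
static int demo_install(struct demo_nsproxy *proxy,
                        const struct demo_caller *who,
                        struct demo_time_ns *ns)
{
    assert(proxy != NULL && who != NULL && ns != NULL);

    if (!who->single_threaded)
        return -EUSERS;   /* assumed error code */
    if (!who->cap_in_target_userns || !who->cap_in_own_userns)
        return -EPERM;    /* assumed error code */

    proxy->time_ns = ns;
    proxy->time_ns_for_children = ns;
    return 0;
}

/* On fork, the child's active namespace follows time_ns_for_children. */
static void demo_on_fork(struct demo_nsproxy *child)
{
    child->time_ns = child->time_ns_for_children;
}
```

A rejected install leaves both fields untouched, and a child whose two fields differ ends up in the namespace named by time_ns_for_children.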
Performance and Scale: Why This Design Holds Up
So far the design looks careful and conservative. But what happens under real load—thousands of containers, each potentially with a different time namespace? This is where the performance profile in the report helps us connect design choices to real‑world behavior.
Cheap hot paths
The truly hot paths are:
- do_timens_ktime_to_host() when used from timer and clock paths
- VDSO fast‑path reads using the offsets in vdso_time_data
Both are O(1) with tiny constant factors: a switch on clockid, a couple of arithmetic operations, and conditional clamping. There are no loops over namespaces; each task only ever talks to its own namespace.
The suggested metric time_namespace_vvar_init_duration_seconds is a good reflection of the design goals: VVAR initialization should be well below 1ms, and because it happens once per namespace, it does not affect steady‑state latency.
Bounded per‑namespace overhead
Each time namespace owns:
- A small struct time_namespace
- A single VVAR page (vvar_page)
- Offsets for monotonic and boottime
The memory footprint is modest and, importantly, independent of how many tasks are in the namespace. Container orchestrators can safely create many containers with their own time namespaces, as long as they respect ucount limits (UCOUNT_TIME_NAMESPACES), which are enforced in clone_time_ns() via inc_time_namespaces().
| Aspect | Design Choice | Impact on Scale |
|---|---|---|
| Hot path time translation | O(1) arithmetic, no locks | Stable latency even with many namespaces |
| VVAR initialization | Once per namespace, mutex‑guarded | Negligible amortized cost per task |
| Offset configuration | Admin‑only, mutex‑guarded, infrequent | No effect on normal workloads |
| Namespace count | ucount limits & small per‑ns state | Protection from resource exhaustion |
This is a general pattern for scalable features: keep the common path lock‑free and O(1), move expensive work into rare administrative or setup operations, and bound per‑instance memory overhead.
Hardening for the Future
Now we come to the part that’s most useful for us as engineers: where the design shows stress points and how small, careful refactors can make it more robust against future changes.
Defensive programming around clockids
In proc_timens_set_offset(), the first loop rejects unsupported clockids with -EINVAL. The second loop, under the lock, assumes every offset is for a supported clock and dereferences a pointer that may remain NULL if a new clock ID is ever introduced without updating this switch.
This is subtle: it’s safe today, but it becomes a time bomb if someone later adds a new supported clock to the validation loop and forgets to update the second switch.
The report suggests a low‑risk hardening refactor: add a default case that simply continues if no matching clock is found, effectively skipping unknown entries rather than risking a NULL dereference.
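A userspace sketch of that hardened commit loop might look like this. The enum values mirror the numeric IDs of CLOCK_MONOTONIC (1) and CLOCK_BOOTTIME (7). DEMO_FUTURE is an invented value standing in for a clock added later, and the remaining names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_clock { DEMO_MONOTONIC = 1, DEMO_BOOTTIME = 7, DEMO_FUTURE = 42 };

struct demo_offsets { int64_t monotonic; int64_t boottime; };
struct demo_request { int clock; int64_t val; };

/* Applies every recognized entry and returns how many were applied. */
static size_t demo_commit_offsets(struct demo_offsets *dst,
                                  const struct demo_request *req, size_t n)
{
    size_t i, applied = 0;

    assert(dst != NULL);

    for (i = 0; i < n; i++) {
        int64_t *slot = NULL;

        switch (req[i].clock) {
        case DEMO_MONOTONIC:
            slot = &dst->monotonic;
            break;
        case DEMO_BOOTTIME:
            slot = &dst->boottime;
            break;
        default:
            /* Unknown clock: skip it instead of dereferencing NULL below. */
            continue;
        }

        *slot = req[i].val;
        applied++;
    }
    return applied;
}
```

The continue statement inside the switch jumps to the next loop iteration, so the write through slot is only reached when one of the known cases has set it.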
Separating concerns: frozen vs. initialized
As we saw earlier, frozen_offsets currently means two things at once:
- Offsets are now immutable.
- VVAR has been initialized for this namespace.
This is convenient but couples two logically distinct concepts. The report proposes introducing a separate vvar_initialized flag. With that split, we’d get clearer semantics:
- vvar_initialized: has the per‑namespace VVAR page been set up?
- frozen_offsets: are offset writes forbidden?
Splitting these responsibilities would make it easier to evolve time namespaces—for example, to allow offset configuration up until the first task actually uses VDSO data, or to support more nuanced “freeze” policies in the future.
Documenting reference counting contracts
Finally, reference counting is handled consistently but implicitly. Helpers like timens_get(), timens_for_children_get(), timens_install(), and timens_on_fork() all manipulate get_time_ns()/put_time_ns(), but their contracts are not explicitly documented in comments.
In a subsystem like namespaces, where leaks or premature frees can be catastrophic, adding 1–2 line comments stating “returns a referenced namespace; caller must call put_time_ns()” can dramatically reduce the cognitive overhead for future maintainers.
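As an illustration of what such contract comments could look like, here is a small userspace reference-counting sketch. The names demo_ns_new, demo_ns_get and demo_ns_put are hypothetical and only loosely modeled on the get/put pairing described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct demo_ns { atomic_int refs; };

/*
 * Returns a new namespace holding one reference, or NULL on allocation
 * failure. The caller owns that reference and must call demo_ns_put().
 */
static struct demo_ns *demo_ns_new(void)
{
    struct demo_ns *ns = malloc(sizeof(*ns));

    if (ns)
        atomic_init(&ns->refs, 1);
    return ns;
}

/*
 * Takes an additional reference and returns ns.
 * The caller must balance it with demo_ns_put().
 */
static struct demo_ns *demo_ns_get(struct demo_ns *ns)
{
    int old = atomic_fetch_add(&ns->refs, 1);

    assert(old > 0);   /* taking a reference on a dead object is a bug */
    return ns;
}

/*
 * Drops one reference and frees ns when the last one goes away.
 * Returns 1 if the object was freed, 0 otherwise.
 */
static int demo_ns_put(struct demo_ns *ns)
{
    int old = atomic_fetch_sub(&ns->refs, 1);

    assert(old > 0);   /* more puts than gets is a bug */
    if (old == 1) {
        free(ns);
        return 1;
    }
    return 0;
}
```

Each comment states who owns the reference on return and which call releases it, which is the information a maintainer needs when adding a new caller.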
Lessons You Can Apply Today
We’ve walked through Linux’s time namespace implementation from multiple angles: lifecycle, time translation, VDSO wiring, error handling, and future hardening. Let’s distill this into a few concrete practices you can bring into your own systems—kernel or otherwise.
Lesson 1: Make invariants explicit
Time namespaces rely on a small set of critical invariants:
- Offsets never change after being frozen.
- Every live namespace has a valid VVAR page and ns_common initialized.
- Reference increments are always balanced with decrements.
These are not just informal guidelines; they’re baked into the code paths and enforced via flags (frozen_offsets), mutexes (offset_lock), and structured allocation/free sequences. Whenever you design a subsystem, write down your invariants and make sure your code structure makes them easy to see.
Lesson 2: Validate before you mutate
proc_timens_set_offset() is a good template for safe configuration APIs:
- Check ownership and capabilities first.
- Validate every requested change (including bounds and derived values) in a read‑only pass.
- Only after all checks pass, take the lock and apply changes in a single commit loop.
This pattern avoids partial updates and makes rollback unnecessary in the common case.
Lesson 3: Separate policy from mechanism
We’ve seen this separation throughout:
- copy_time_ns() decides whether to create a new namespace; clone_time_ns() decides how to do it safely.
- timens_install() encodes the policy for setns() (must be single‑threaded, must have capabilities).
- timens_set_vvar_page() owns the mechanics of VVAR initialization.
In complex systems, mixing policy and mechanism quickly leads to functions that are impossible to test and reason about. Splitting them gives you smaller, composable units.
Lesson 4: Plan for evolution
Even in a mature codebase like the kernel, today’s correct code can be tomorrow’s bug when requirements change. The analysis highlighted two small refactors—guarding against new clock IDs and splitting frozen_offsets—that are all about future‑proofing.
Whenever you add a feature:
- Ask what will happen if someone adds a new enum value or a new field.
- Consider whether a flag is doing double duty and might need to be split later.
- Add defensive fallbacks for “impossible” states where it’s cheap to do so.
The goal is not to predict every future; it’s to make future changes less fragile.
Closing thoughts
Time namespaces are a fascinating example of virtualization at the core of the operating system. But for us as engineers, their real value is as a pattern library:
- Use pure functions and clear invariants for core logic.
- Guard lifecycle transitions with capabilities and one‑way doors.
- Make initialization lazy and idempotent to keep hot paths fast.
- Harden boundaries so the subsystem stays safe as requirements evolve.
If you’re designing your own isolation mechanism—whether for tenants in a SaaS platform, virtual clusters, or per‑request configuration—this file is worth treating as required reading. The Linux kernel team had to bend time itself, and they did it without letting the system fall off the rails. Our job is to bring that same care and discipline into whatever we build next.



