Decoding Linux Boot: start_kernel

A modern Linux system brings up CPUs, memory, filesystems, and user space in seconds. Under the hood, a single C file directs this symphony. Let’s open it up.

Welcome! I’m Mahmoud Zalt. In this article, we’ll examine init/main.c from the Linux kernel—the boot-time conductor that parses the command line, initializes subsystems via initcall levels, and launches PID 1. Linux is primarily C, built with GCC/Clang for multiple architectures (x86, arm64, and beyond). This file matters because it sequences the earliest—and riskiest—moments of system life: from interrupts and scheduling to finally running init.

By the end, you’ll understand how this file works, what’s brilliant in its design, where to improve maintainability and developer experience, and how to watch performance at scale. Roadmap: How It Works → What’s Brilliant → Areas for Improvement → Performance at Scale → Conclusion.

How It Works

To appreciate the later guidance, let’s first see the structure of boot orchestration and the guarantees it enforces.

Primary responsibilities

Parse early and normal kernel command-line parameters and optional bootconfig.
Initialize subsystems in a defined sequence via ordered initcall levels.
Carefully enable interrupts and progress the global system_state.
Spawn fundamental kernel threads (notably kthreadd) and execute the userspace init process (PID 1).
Finalize safety features (e.g., read-only rodata), free __init memory, and transition the kernel to running state.

kernel/arch entry
   |
   v
start_kernel()
   |-- setup_arch() -> arch-specific
   |-- setup_boot_config()/setup_command_line()
   |-- parse_early_param()/parse_args()
   |-- init of core subsystems (RCU, IRQ, timers, timekeeping, ...)
   |-- console_init()
   v
rest_init()
   |-- user_mode_thread(kernel_init)  --> PID 1 (init)
   |-- kernel_thread(kthreadd)        --> kthreadd
   v
kernel_init_freeable()
   |-- do_pre_smp_initcalls()
   |-- smp_init()/sched_init_smp()
   |-- do_basic_setup() -> do_initcalls() by level
   |-- wait_for_initramfs(), console_on_rootfs()
   |-- integrity_load_keys()
   v
kernel_init()
   |-- free_initmem(), mark_readonly(), pti_finalize()
   |-- run_init_process() (rdinit/init fallbacks)
   v
SYSTEM_RUNNING

High-level boot flow, from start_kernel to PID 1.

Data flow and invariants

The raw command line (boot_command_line) plus optional bootconfig are combined in setup_command_line to produce saved_command_line and static_command_line. Early parameters are parsed via parse_early_param() and later arguments via parse_args(). Unrecognized options are forwarded to user space through argv_init/envp_init (both NULL-terminated, bounded by CONFIG_INIT_ENV_ARG_LIMIT). The system enforces invariants like:

early_boot_irqs_disabled is true until the kernel deliberately enables interrupts.
system_state progresses monotonically during boot: SYSTEM_BOOTING, then SYSTEM_SCHEDULING, then SYSTEM_FREEING_INITMEM, and finally SYSTEM_RUNNING.
PID 1 is always assigned to init.
Initcalls must not return with IRQs disabled or with a preemption imbalance.

start_kernel: the boot-time template method

The main orchestration happens inside start_kernel: it disables interrupts, sets up CPU and memory basics, initializes logging/tracing, reads and parses parameters, and prepares core subsystems. Then it hands off to rest_init to spin up kthreadd and the init task.

Excerpt from start_kernel (approx. L520–L560).

asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector
void start_kernel(void)
{
	char *command_line;
	char *after_dashes;

	set_task_stack_end_magic(&init_task);
	smp_setup_processor_id();
	debug_objects_early_init();
	init_vmlinux_build_id();

	cgroup_init_early();

	local_irq_disable();
	early_boot_irqs_disabled = true;
	...
	console_init();
	if (panic_later)
		panic("Too many boot %s vars at `%s'", panic_later,
		      panic_param);
	...
	rest_init();
	...
}

start_kernel is the kernel’s template method for boot sequencing. It sets safety preconditions (IRQs off), performs core setup, parses params, and finally delegates to rest_init to begin life as a multitasking system.

rest_init: establishing PID 1 and kthreadd

rest_init pins the init task to the boot CPU, starts kthreadd, moves system_state to SYSTEM_SCHEDULING, and transitions to the CPU startup entry, letting the scheduler take over.

Initcalls and ordering guarantees

Subsystems register their initialization functions in initcall tables, and the orchestrator calls them level by level. The kernel traces and guards each call.

Initcall invocation with safety checks (approx. L760–L790).

int __init_or_module do_one_initcall(initcall_t fn)
{
	int count = preempt_count();
	char msgbuf[64];
	int ret;

	if (initcall_blacklisted(fn))
		return -EPERM;

	do_trace_initcall_start(fn);
	ret = fn();
	do_trace_initcall_finish(fn, ret);

	msgbuf[0] = 0;

	if (preempt_count() != count) {
		sprintf(msgbuf, "preemption imbalance ");
		preempt_count_set(count);
	}
	if (irqs_disabled()) {
		strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
		local_irq_enable();
	}
	WARN(msgbuf[0], "initcall %pS returned with %s\n", fn, msgbuf);

	add_latent_entropy();
	return ret;
}

Each initcall is skipped if it is blacklisted, traced otherwise, and audited for IRQ and preemption invariants. Violations are repaired and reported with a warning, so one misbehaving initcall cannot quietly destabilize the rest of boot.
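To make the guard pattern concrete, here is a minimal userspace sketch of the same idea. It is not kernel code: fake_preempt_count, fake_irqs_disabled, and guarded_call are hypothetical stand-ins I made up for illustration, and the two demo initcalls exist only to exercise the checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for kernel state; not the real kernel variables. */
static int fake_preempt_count;
static bool fake_irqs_disabled;

typedef int (*initcall_fn)(void);

/*
 * Call fn, then detect and repair the two invariants the excerpt checks.
 * Returns the number of violations found (0, 1, or 2) and writes a short
 * description into msg.
 */
static int guarded_call(initcall_fn fn, char *msg, size_t msg_len)
{
	int count = fake_preempt_count;	/* snapshot before the call */
	int violations = 0;

	assert(msg_len > 0);
	msg[0] = '\0';
	(void)fn();

	if (fake_preempt_count != count) {
		snprintf(msg, msg_len, "preemption imbalance ");
		fake_preempt_count = count;	/* restore the snapshot */
		violations++;
	}
	if (fake_irqs_disabled) {
		size_t used = strlen(msg);

		snprintf(msg + used, msg_len - used, "disabled interrupts ");
		fake_irqs_disabled = false;	/* re-enable, as the excerpt does */
		violations++;
	}
	return violations;
}

/* A well-behaved initcall and a misbehaving one, for illustration. */
static int good_initcall(void) { return 0; }

static int leaky_initcall(void)
{
	fake_preempt_count++;		/* forgot the matching decrement */
	fake_irqs_disabled = true;	/* forgot to re-enable IRQs */
	return 0;
}
```

Calling guarded_call(leaky_initcall, ...) reports two violations and leaves both fake state variables repaired, which is the property that keeps later initcalls from inheriting a broken context.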

What are initcall levels and why do they matter?

Initcalls are grouped into levels like pure, core, postcore, arch, subsys, fs, device, and late. The boot code iterates these in order. This declares coarse-grained dependencies without hard-coding function order. If your subsystem needs VFS, choose fs or later. If you depend on IRQs and timers, pick a level after they’re initialized. The framework scales across architectures and configurations without entangling modules.
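A small userspace model shows why levels decouple registration order from execution order. This is only a sketch: the real kernel collects initcalls through linker sections rather than a flat array, and the enum, table, and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Level names follow the list above; the enum itself is hypothetical. */
enum level { LVL_PURE, LVL_CORE, LVL_POSTCORE, LVL_ARCH,
	     LVL_SUBSYS, LVL_FS, LVL_DEVICE, LVL_LATE, LVL_COUNT };

struct initcall_entry {
	enum level level;
	int (*fn)(void);
};

/* Records the order in which the demo initcalls actually ran. */
static char run_log[16];
static size_t run_len;

static int note(char tag)
{
	if (run_len < sizeof(run_log) - 1)
		run_log[run_len++] = tag;
	run_log[run_len] = '\0';
	return 0;
}

static int init_vfs_user(void) { return note('F'); }	/* needs VFS: fs */
static int init_core(void)     { return note('C'); }	/* core */
static int init_driver(void)   { return note('D'); }	/* device */

/* Registration order is deliberately scrambled. */
static const struct initcall_entry table[] = {
	{ LVL_DEVICE, init_driver },
	{ LVL_FS,     init_vfs_user },
	{ LVL_CORE,   init_core },
};

/* Run every entry level by level; returns the number of failures. */
static int run_initcalls_by_level(const struct initcall_entry *t, size_t n)
{
	int failures = 0;

	assert(t != NULL);
	for (int lvl = 0; lvl < LVL_COUNT; lvl++)
		for (size_t i = 0; i < n; i++)
			if (t[i].level == (enum level)lvl && t[i].fn() != 0)
				failures++;
	return failures;
}
```

Even though the driver is registered first, the core entry runs first and the driver last, because only the level decides ordering across groups.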

What’s Brilliant

Having seen the flow, let’s spotlight several design choices that excel in reliability and extensibility.

Inversion of control via initcall registry: Subsystems self-register. The boot orchestrator never needs to “know” every participant. This supports rich configurations without a combinatorial explosion of conditionals.
Template method structure in start_kernel: The code reads like a boot checklist, enforcing an intentional order while isolating complexity to helpers. Even with inherent length, it remains followable through phases: early safety, arch setup, param parsing, core init, scheduling enablement, and hand-off.
Observable by design: Tracepoints (initcall start/finish/level) and initcall_debug offer latency visibility for each stage. Developers can pinpoint slowdowns with confidence.
Safety rails in do_one_initcall: The guards restore the preemption count and re-enable IRQs when an initcall leaves them unbalanced. A single errant initcall can’t silently poison the rest of boot.
Bootconfig integration: Optional boot-time configuration can merge additional kernel.* params and init.* args, with checksum verification and clear precedence—useful for complex deployments or factory configurations.
Thoughtful PID 1 fallback sequence: The kernel tries rdinit first. If init= is given, that path must succeed or the kernel panics; otherwise it walks a series of well-known init paths and finally a shell. This prevents bricking a system that simply lacks a configured init.
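The PID 1 search order can be sketched as a small userspace function. Treat this as a simplified model under assumptions: pick_init and the exec_ok predicate are hypothetical, any build-time default init path is left out, and the rule that a failed explicit init= ends the search reflects my reading of upstream rather than anything quoted in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate: would executing this path succeed? */
typedef int (*exec_ok_fn)(const char *path);

/*
 * Return the path that would become PID 1, or NULL if the search ends
 * without one (the point where a real kernel would panic).
 */
static const char *pick_init(const char *rdinit, const char *init_arg,
			     exec_ok_fn exec_ok)
{
	static const char *const defaults[] = {
		"/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
	};

	assert(exec_ok != NULL);

	if (rdinit && exec_ok(rdinit))
		return rdinit;

	/* Assumption: an explicit init= that fails ends the search. */
	if (init_arg)
		return exec_ok(init_arg) ? init_arg : NULL;

	for (size_t i = 0; i < sizeof(defaults) / sizeof(defaults[0]); i++)
		if (exec_ok(defaults[i]))
			return defaults[i];
	return NULL;
}

/* Demo predicates standing in for a real root filesystem. */
static int only_shell(const char *p) { return strcmp(p, "/bin/sh") == 0; }
static int nothing(const char *p)    { (void)p; return 0; }
```

With only_shell as the predicate and no init= argument, a bad rdinit falls all the way through to /bin/sh; with nothing, the function returns NULL, which models the "no working init found" panic.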

Developer experience: unknown options pass-through

Unknown kernel parameters aren’t discarded—they’re forwarded to user space via argv_init/envp_init, and the kernel logs a summary once parsing finishes. This default-to-safe policy keeps experimentation simple for operators and distro initramfs authors.
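Here is a rough userspace sketch of that pass-through rule: tokens of the form name=value become environment entries, bare words become arguments. The names demo_argv, demo_envp, DEMO_LIMIT, and forward_unknown are hypothetical, and the real parser special-cases more inputs than this model does; it also records an overflow for a later panic rather than returning an error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_LIMIT 4	/* stand-in for CONFIG_INIT_ENV_ARG_LIMIT */

/* Both arrays stay NULL-terminated: one spare slot beyond the limit. */
static const char *demo_argv[DEMO_LIMIT + 1];
static const char *demo_envp[DEMO_LIMIT + 1];

/* Append to a NULL-terminated array; returns -1 when the limit is hit. */
static int push(const char **arr, const char *item)
{
	size_t i = 0;

	while (arr[i])
		i++;
	if (i == DEMO_LIMIT)
		return -1;
	arr[i] = item;
	return 0;
}

/*
 * Forward one unrecognized token: "name=value" becomes an environment
 * entry, a bare word becomes an argument for init.
 */
static int forward_unknown(const char *token)
{
	assert(token != NULL);
	return strchr(token, '=') ? push(demo_envp, token)
				  : push(demo_argv, token);
}
```

Feeding it foo=bar and baz places the first in demo_envp and the second in demo_argv, and a fifth bare word is rejected once DEMO_LIMIT entries are stored.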

Extensibility hooks

early_param(), __setup(), and boot-time static keys let you inject features without contorting the core boot flow.
Weak hooks like arch_post_acpi_subsys_init allow architectures to customize behavior without forking the orchestrator.
Initcall blacklisting provides a surgical switch-off lever during bisection and bring-up.
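To illustrate the blacklist lever, this sketch matches a function name against a comma-separated list like the one passed to initcall_blacklist=. It is a userspace approximation: is_blacklisted is a hypothetical helper, and the real kernel compares against resolved symbol names rather than caller-supplied strings.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if fn_name appears as a whole entry in a comma-separated
 * list such as "foo_init,bar_init". The list is not modified.
 */
static bool is_blacklisted(const char *list, const char *fn_name)
{
	size_t want;

	assert(list != NULL && fn_name != NULL);
	want = strlen(fn_name);

	while (*list) {
		size_t len = strcspn(list, ",");	/* length of this entry */

		if (len == want && strncmp(list, fn_name, len) == 0)
			return true;
		list += len;
		if (*list == ',')
			list++;				/* skip the separator */
	}
	return false;
}
```

Matching whole entries matters: with the list "foo_init,bar_init", the name bar must not match, or a bisection could silently disable the wrong initcall.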

Areas for Improvement

Even a workhorse like init/main.c benefits from continual polish. Here’s what I’d prioritize for maintainability and developer confidence.

Smell	Impact	Fix
Very long function (`start_kernel`)	Higher cognitive load; subtle ordering bugs are harder to review.	Extract coherent phases into small helpers (e.g., early RNG/log/tracing setup).
Global mutable state (`system_state`, `early_boot_irqs_disabled`)	Tight coupling; risk of accidental misuse.	Constrain updates to narrow helpers and add assertions around transitions.
In-place command-line mutation	Harder to reason about parameter lifetimes and side effects.	Document invariants and expand KUnit coverage for edge cases.
Multiple init-arg sources (bootconfig, cmdline, “--”)	Operator confusion; potential conflicts.	Log a clear summary of merged sources and precedence at boot.

Refactor: Extract early RNG/log/tracing setup

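This small extraction shortens start_kernel and groups tightly related steps. It is a low-risk readability win only if call order is preserved, so first check that the grouped calls are adjacent in your tree: some upstream versions place other initialization (such as memory-management setup) between setup_log_buf() and ftrace_init(), in which case the helper must cover a narrower span.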

Suggested refactor (diff). Maintain call order exactly.

--- a/init/main.c
+++ b/init/main.c
@@
+/* Early RNG, log buffer, and tracing setup; call order is significant. */
+static void __init init_early_rng_log_trace(char *command_line)
+{
+    random_init_early(command_line);
+    setup_log_buf(0);
+    ftrace_init();
+    early_trace_init();
+}
+
@@ void start_kernel(void)
-    random_init_early(command_line);
-    setup_log_buf(0);
-    ftrace_init();
-    early_trace_init();
+    init_early_rng_log_trace(command_line);

Isolating a coherent phase reduces visual noise in start_kernel and makes future changes to early tracing/logging easier to reason about.

Guard transitions with assertions

Boot invariants are precious. Adding a diagnostic check at key transitions (e.g., in rest_init) can catch regressions early without altering behavior.

Example: warn if IRQs aren’t in the expected state at the scheduling phase boundary.
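A minimal sketch of such a check, written as a userspace model rather than kernel code. The state names are modeled on the kernel's system_state values, but demo_set_state, demo_system_state, and demo_warnings are hypothetical, and which IRQ state to expect at a given boundary is left to the caller as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Modeled on the boot phases discussed above; the helper is hypothetical. */
enum demo_state { DEMO_BOOTING, DEMO_SCHEDULING,
		  DEMO_FREEING_INITMEM, DEMO_RUNNING };

static enum demo_state demo_system_state = DEMO_BOOTING;
static int demo_warnings;

/*
 * Move to the next state. Warn (but still proceed, so behavior is
 * unchanged) if the transition goes backwards or IRQs are not in the
 * state the caller expects at this boundary.
 */
static void demo_set_state(enum demo_state next, bool irqs_enabled,
			   bool expect_irqs_enabled)
{
	assert(next <= DEMO_RUNNING);

	if (next < demo_system_state) {
		fprintf(stderr, "warn: state went backwards (%d -> %d)\n",
			(int)demo_system_state, (int)next);
		demo_warnings++;
	}
	if (irqs_enabled != expect_irqs_enabled) {
		fprintf(stderr, "warn: unexpected IRQ state entering %d\n",
			(int)next);
		demo_warnings++;
	}
	demo_system_state = next;
}
```

Routing every transition through one helper like this is what makes the check cheap: the assertion lives in a single place, and a regression shows up as a warning at the exact boundary where the invariant broke.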

Test plan: KUnit + QEMU

Some of the trickiest bugs hide in parsing and in the interaction of multiple init-arg sources. The following cases are high value:

Unknown options pass-through: Boot a kernel with a cmdline like foo=bar baz quux.env=1 and verify that env/argv forwarding matches expectations, with a single log about unknown parameters passed to user space.
Bootconfig checksum and merge: Embed a bootconfig in initrd, pass bootconfig on the cmdline, validate checksum, and verify that kernel.* keys are merged into the command line and init.* into init args. Corrupt the checksum to observe the error path.
Initcall blacklist: With initcall_blacklist=, ensure the blacklisted initcall is skipped and reported.
PID 1 fallbacks: With a bad rdinit and no /sbin/init, confirm the final fallback to /bin/sh.

Performance at Scale

With modern kernels and rich hardware, boot performance hinges on initcall cost, firmware behavior, and I/O during initramfs/rootfs bring-up. Observability is your friend here.

Hot paths and latency risks

start_kernel: One-time, latency-critical setup.
do_initcalls: Linear in the number of initcalls; the cost is dominated by individual subsystem initialization work.
run_init_process: The transition to PID 1; failures or path search can show up as user-visible delays.

Risks include slow firmware/ACPI init, heavyweight device probing, long console output (on slow serial consoles), and insufficient entropy before crypto consumers start.

Metrics to instrument

boot.initcall_level_duration_seconds{level}: Track the duration of each initcall level. Establish baselines on reference hardware and alert on >2x regressions.
boot.initcall_failures_total: Should be zero; a non-zero value is a boot failure signal.
boot.time_to_pid1_seconds: End-to-end latency to executing PID 1. Maintain a regression budget (for example, ±5%).
boot.entropy_bits_available_at_random_init: Ensure entropy meets security thresholds before enabling dependent subsystems.
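The ">2x regression" rule for per-level durations can be expressed in a few lines. This is only a sketch of the comparison, assuming you already collect one baseline and one current duration per level; struct level_sample, is_regression, and count_regressions are hypothetical names, not part of any existing tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One sample per initcall level, in seconds. Names are hypothetical. */
struct level_sample {
	const char *level;
	double baseline_s;
	double current_s;
};

/* True when current exceeds factor times the baseline (e.g. factor 2.0). */
static bool is_regression(const struct level_sample *s, double factor)
{
	assert(s != NULL && s->baseline_s > 0.0);
	return s->current_s > s->baseline_s * factor;
}

/* Count how many levels regressed beyond the given factor. */
static size_t count_regressions(const struct level_sample *v, size_t n,
				double factor)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++)
		if (is_regression(&v[i], factor))
			hits++;
	return hits;
}
```

Comparing per level rather than on the total keeps a large regression in a short level from hiding inside normal variance in a long one.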

Logs, traces, and alerts

Logs: Kernel command line echo, unknown parameter forwarding notice, and any errors while opening /dev/console or executing init.
Tracepoints: initcall start/finish/level trace events and ftrace function graph around start_kernel and do_initcalls.
Alerts: Boot time regression against baseline, non-zero initcall failures, missing working init (panic), or entropy below threshold past random_init.

Security-minded performance

The file also finalizes memory protection—e.g., making rodata read-only and completing PTI setup—after freeing __init sections. These steps should be visible in boot logs and, if possible, reflected in a metric/event so security posture changes are auditable across builds.

Conclusion

We’ve walked from the boot CPU’s first moments to a running system, guided by init/main.c. Three takeaways stand out:

Clarity through structure: The template-method sequencing and initcall levels keep the kernel boot scalable and understandable, even across architectures.
Safety and observability: Guardrails in do_one_initcall, plus tracepoints and initcall_debug, reduce the blast radius of boot-time bugs and make regressions tractable.
Pragmatic refinements: Small extractions in start_kernel, explicit state transition checks, and targeted KUnit + QEMU tests will improve maintainability and DX without risking ordering guarantees.

If you contribute to boot-time code, keep the invariants close, add visibility when in doubt, and preserve order while extracting cohesive phases. Your future self—and the next engineer debugging a tricky boot—will thank you.
