Decoding Linux Boot: start_kernel
A modern Linux system brings up CPUs, memory, filesystems, and user space in seconds. Under the hood, a single C file directs this symphony. Let’s open it up.
Welcome! I’m Mahmoud Zalt. In this article, we’ll examine init/main.c from the Linux kernel—the boot-time conductor that parses the command line, initializes subsystems via initcall levels, and launches PID 1. Linux is primarily C, built with GCC/Clang for multiple architectures (x86, arm64, and beyond). This file matters because it sequences the earliest—and riskiest—moments of system life: from interrupts and scheduling to finally running init.
By the end, you’ll understand how this file works, what’s brilliant in its design, where to improve maintainability and developer experience, and how to watch performance at scale. Roadmap: How It Works → What’s Brilliant → Areas for Improvement → Performance at Scale → Conclusion.
How It Works
To appreciate the later guidance, let’s first see the structure of boot orchestration and the guarantees it enforces.
Primary responsibilities
- Parse early and normal kernel command-line parameters and optional bootconfig.
- Initialize subsystems in a defined sequence via ordered initcall levels.
- Carefully enable interrupts and progress the global system_state.
- Spawn fundamental kernel threads (notably kthreadd) and execute the userspace init process (PID 1).
- Finalize safety features (e.g., read-only rodata), free __init memory, and transition the kernel to running state.
kernel/arch entry
|
v
start_kernel()
|-- setup_arch() -> arch-specific
|-- setup_boot_config()/setup_command_line()
|-- parse_early_param()/parse_args()
|-- init of core subsystems (RCU, IRQ, timers, timekeeping, ...)
|-- console_init()
v
rest_init()
|-- user_mode_thread(kernel_init) --> PID 1 (init)
|-- kernel_thread(kthreadd) --> kthreadd
v
kernel_init_freeable()
|-- do_pre_smp_initcalls()
|-- smp_init()/sched_init_smp()
|-- do_basic_setup() -> do_initcalls() by level
|-- wait_for_initramfs(), console_on_rootfs()
|-- integrity_load_keys()
v
kernel_init()
|-- free_initmem(), mark_readonly(), pti_finalize()
|-- run_init_process() (rdinit/init fallbacks)
v
SYSTEM_RUNNING
Boot flow: start_kernel to PID 1.
Data flow and invariants
The raw command line (boot_command_line) plus optional bootconfig are combined in setup_command_line to produce saved_command_line and static_command_line. Early parameters are parsed via parse_early_param() and later arguments via parse_args(). Unrecognized options are forwarded to user space through argv_init/envp_init (both NULL-terminated, bounded by CONFIG_INIT_ENV_ARG_LIMIT). The system enforces invariants like:
- early_boot_irqs_disabled is true until the kernel deliberately enables interrupts.
- system_state monotonically progresses from SYSTEM_SCHEDULING to SYSTEM_RUNNING, with a SYSTEM_FREEING_INITMEM phase in between.
- PID 1 is always assigned to init.
- Initcalls must not return with IRQs disabled or with a preemption imbalance.
start_kernel: the boot-time template method
The main orchestration happens inside start_kernel: it disables interrupts, sets up CPU and memory basics, initializes logging/tracing, reads and parses parameters, and prepares core subsystems. Then it hands off to rest_init to spin up kthreadd and the init task.
asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector
void start_kernel(void)
{
	char *command_line;
	char *after_dashes;

	set_task_stack_end_magic(&init_task);
	smp_setup_processor_id();
	debug_objects_early_init();
	init_vmlinux_build_id();
	cgroup_init_early();

	local_irq_disable();
	early_boot_irqs_disabled = true;
	...
	console_init();
	if (panic_later)
		panic("Too many boot %s vars at `%s'", panic_later,
		      panic_param);
	...
	rest_init();
	...
}
start_kernel is the kernel’s template method for boot sequencing. It sets safety preconditions (IRQs off), performs core setup, parses params, and finally delegates to rest_init to begin life as a multitasking system.
rest_init: establishing PID 1 and kthreadd
rest_init pins the init task to the boot CPU, starts kthreadd, moves system_state to SYSTEM_SCHEDULING, and transitions to the CPU startup entry, letting the scheduler take over.
Initcalls and ordering guarantees
Subsystems register their initialization via initcall tables; the orchestrator calls them layer by layer. The kernel traces and guards each call.
int __init_or_module do_one_initcall(initcall_t fn)
{
	int count = preempt_count();
	char msgbuf[64];
	int ret;

	if (initcall_blacklisted(fn))
		return -EPERM;

	do_trace_initcall_start(fn);
	ret = fn();
	do_trace_initcall_finish(fn, ret);

	msgbuf[0] = 0;

	if (preempt_count() != count) {
		sprintf(msgbuf, "preemption imbalance ");
		preempt_count_set(count);
	}
	if (irqs_disabled()) {
		strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
		local_irq_enable();
	}
	WARN(msgbuf[0], "initcall %pS returned with %s\n", fn, msgbuf);

	add_latent_entropy();
	return ret;
}
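Each initcall is skipped if it is blacklisted, traced otherwise, and audited for IRQ/preemption invariants when it returns. Violations are repaired (the preempt count is restored and IRQs are re-enabled) and reported through a WARN, so one misbehaving initcall cannot quietly destabilize the rest of boot.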
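To make the guard pattern concrete, here is a minimal user-space sketch of the same idea: snapshot the state, run the callback, then repair and report any drift. It is only a model. The names model_preempt_count, model_irqs_disabled, model_do_one_initcall, and buggy_initcall are hypothetical stand-ins I made up for illustration, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for per-CPU kernel state; not kernel APIs. */
static int model_preempt_count;
static bool model_irqs_disabled;

typedef int (*model_initcall_t)(void);

/*
 * Run fn, then compare the modelled state with the snapshot taken before
 * the call. Any drift is repaired and described in report (which stays
 * empty when the call was clean). Returns fn's own return value.
 */
static int model_do_one_initcall(model_initcall_t fn, char *report, size_t len)
{
	int count = model_preempt_count;
	int ret;

	assert(fn && report && len > 0);
	report[0] = '\0';

	ret = fn();

	if (model_preempt_count != count) {
		snprintf(report, len, "preemption imbalance ");
		model_preempt_count = count;	/* restore the snapshot */
	}
	if (model_irqs_disabled) {
		size_t used = strlen(report);

		snprintf(report + used, len - used, "disabled interrupts ");
		model_irqs_disabled = false;	/* re-enable "interrupts" */
	}
	if (report[0])
		fprintf(stderr, "initcall returned with %s\n", report);
	return ret;
}

/* Deliberately buggy: leaks a preempt count and leaves IRQs off. */
static int buggy_initcall(void)
{
	model_preempt_count++;
	model_irqs_disabled = true;
	return 0;
}

/* Well-behaved: touches nothing. */
static int clean_initcall(void)
{
	return 0;
}
```

Calling model_do_one_initcall(buggy_initcall, buf, sizeof(buf)) leaves both modelled invariants restored and fills buf with the same two phrases the kernel would put in its warning. The point of the design is that the caller, not each callee, owns the invariant.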
What are initcall levels and why do they matter?
Initcalls are grouped into levels like pure, core, postcore, arch, subsys, fs, device, and late. The boot code iterates these in order. This declares coarse-grained dependencies without hard-coding function order. If your subsystem needs VFS, choose fs or later. If you depend on IRQs and timers, pick a level after they’re initialized. The framework scales across architectures and configurations without entangling modules.
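The level idea can be sketched in plain C. This is a simplified model under my own assumptions: the kernel collects initcalls through linker sections, while this sketch scans a small array, and every name here (model_level, model_initcall, model_do_initcalls, the three demo callbacks) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A reduced, hypothetical set of levels, listed in execution order. */
enum model_level {
	LEVEL_CORE,
	LEVEL_SUBSYS,
	LEVEL_FS,
	LEVEL_DEVICE,
	LEVEL_LATE,
	LEVEL_COUNT
};

struct model_initcall {
	enum model_level level;
	int (*fn)(void);
};

/* Tiny trace buffer so the execution order can be inspected afterwards. */
static char model_trace[16];
static size_t model_trace_len;

static void model_record(char tag)
{
	if (model_trace_len < sizeof(model_trace) - 1) {
		model_trace[model_trace_len++] = tag;
		model_trace[model_trace_len] = '\0';
	}
}

static int demo_core_init(void)   { model_record('C'); return 0; }
static int demo_fs_init(void)     { model_record('F'); return 0; }
static int demo_driver_init(void) { model_record('D'); return 0; }

/*
 * Run all calls level by level. Registration order only matters inside
 * one level. Returns the number of calls that reported failure.
 */
static int model_do_initcalls(const struct model_initcall *calls, size_t n)
{
	int failures = 0;

	assert(calls || n == 0);
	for (int level = 0; level < LEVEL_COUNT; level++)
		for (size_t i = 0; i < n; i++)
			if ((int)calls[i].level == level && calls[i].fn() != 0)
				failures++;
	return failures;
}
```

Even if the driver callback is registered first and the core callback last, the trace comes out as core, then fs, then driver. That is the whole contract: you declare a level, and the orchestrator supplies the ordering.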
What’s Brilliant
Having seen the flow, let’s spotlight several design choices that excel in reliability and extensibility.
- Inversion of control via initcall registry: Subsystems self-register. The boot orchestrator never needs to “know” every participant. This supports rich configurations without a combinatorial explosion of conditionals.
- Template method structure in start_kernel: The code reads like a boot checklist, enforcing an intentional order while isolating complexity to helpers. Even with inherent length, it remains followable through phases: early safety, arch setup, param parsing, core init, scheduling enablement, and hand-off.
- Observable by design: Tracepoints (initcall start/finish/level) and initcall_debug offer latency visibility for each stage. Developers can pinpoint slowdowns with confidence.
- Safety rails in do_one_initcall: Guards reset IRQ and preemption imbalances. A single errant initcall can’t silently poison the rest of boot.
- Bootconfig integration: Optional boot-time configuration can merge additional kernel.* params and init.* args, with checksum verification and clear precedence, which is useful for complex deployments or factory configurations.
- Thoughtful PID 1 fallback sequence: The kernel tries rdinit, then init=, then a series of well-known init paths, finally a shell. This prevents bricking a system due to misconfiguration.
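The PID 1 fallback order from the last bullet can be modelled as a small selection function. This is a sketch, not the kernel's code: model_pick_init and the can_exec probe are hypothetical, the default path list is the one I remember from init/main.c, and I am assuming (from my reading of recent kernels) that an explicit init= that fails is treated as fatal rather than falling through to the defaults. Please check your kernel version before relying on either detail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical probe: returns non-zero if path could be executed. */
typedef int (*can_exec_fn)(const char *path);

/* Default candidates, in the order I believe the kernel tries them. */
static const char *const model_default_inits[] = {
	"/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
};

/*
 * Pick the program that would become PID 1, or NULL if the model would
 * end in a panic. rdinit and init_arg may be NULL when not supplied.
 */
static const char *model_pick_init(const char *rdinit, const char *init_arg,
				   can_exec_fn can_exec)
{
	size_t n = sizeof(model_default_inits) / sizeof(model_default_inits[0]);

	assert(can_exec != NULL);

	if (rdinit && can_exec(rdinit))
		return rdinit;

	/* Assumption: a failing explicit init= does not fall through. */
	if (init_arg)
		return can_exec(init_arg) ? init_arg : NULL;

	for (size_t i = 0; i < n; i++)
		if (can_exec(model_default_inits[i]))
			return model_default_inits[i];

	return NULL;
}

/* Demo probes for experimentation: only a shell exists, or nothing does. */
static int model_only_shell_exists(const char *path)
{
	return strcmp(path, "/bin/sh") == 0;
}

static int model_nothing_exists(const char *path)
{
	(void)path;
	return 0;
}
```

With a bad rdinit and only a shell on disk, the model lands on /bin/sh, which matches the last-resort behavior described above.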
Developer experience: unknown options pass-through
Unknown kernel parameters aren’t discarded—they’re forwarded to user space via argv_init/envp_init, and the kernel logs a summary once parsing finishes. This default-to-safe policy keeps experimentation simple for operators and distro initramfs authors.
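A rough way to picture the pass-through rule is a classifier over one unrecognized token. This is a simplified sketch under my own assumptions: it ignores sysctl aliases, obsolete-style handlers, and the argument/environment limits, and the names param_dest and model_classify_unknown are hypothetical. The dotted-name rule reflects my understanding that such tokens are treated as parameters for a module that is not loaded yet.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Where an unrecognized command-line token ends up in this model. */
enum param_dest {
	DEST_IGNORED_MODULE_PARAM,	/* name contains '.', e.g. mod.opt=1 */
	DEST_INIT_ENV,			/* name=value -> environment of init */
	DEST_INIT_ARG			/* bare word  -> argv of init */
};

static enum param_dest model_classify_unknown(const char *token)
{
	const char *eq;
	size_t name_len;

	assert(token != NULL);
	eq = strchr(token, '=');
	name_len = eq ? (size_t)(eq - token) : strlen(token);

	/* Only the name part is checked for a dot, never the value. */
	if (memchr(token, '.', name_len))
		return DEST_IGNORED_MODULE_PARAM;

	return eq ? DEST_INIT_ENV : DEST_INIT_ARG;
}
```

Under this model, foo=bar lands in the environment of init, baz becomes an argument, and quux.env=1 is set aside as a module-style parameter.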
Extensibility hooks
- early_param(), __setup(), and boot-time static keys let you inject features without contorting the core boot flow.
- Weak hooks like arch_post_acpi_subsys_init allow architectures to customize behavior without forking the orchestrator.
- Initcall blacklisting provides a surgical switch-off lever during bisection and bring-up.
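The registration style behind __setup() can be approximated with a table of prefix/handler pairs. This is a user-space sketch only: the kernel builds its table through a dedicated linker section, and the parameter names demo_level= and demo_quiet, as well as model_setup and model_checksetup, are hypothetical examples of mine rather than real kernel parameters or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One registered handler: a prefix to match and a callback for the rest. */
struct model_setup {
	const char *str;
	int (*fn)(char *arg);
};

static long model_demo_level = -1;
static int model_demo_quiet;

static int set_demo_level(char *arg)
{
	model_demo_level = strtol(arg, NULL, 10);
	return 1;	/* 1 = handled */
}

static int set_demo_quiet(char *arg)
{
	(void)arg;
	model_demo_quiet = 1;
	return 1;
}

/* Features add a row here; the dispatcher below never changes. */
static const struct model_setup model_setup_table[] = {
	{ "demo_level=", set_demo_level },
	{ "demo_quiet", set_demo_quiet },
};

/* Returns 1 when a handler consumed the token, 0 when it is unknown. */
static int model_checksetup(char *token)
{
	size_t count = sizeof(model_setup_table) / sizeof(model_setup_table[0]);

	assert(token != NULL);
	for (size_t i = 0; i < count; i++) {
		size_t n = strlen(model_setup_table[i].str);

		if (strncmp(token, model_setup_table[i].str, n) == 0)
			return model_setup_table[i].fn(token + n);
	}
	return 0;
}
```

The dispatcher stays closed for modification while the table stays open for extension, which is the same trade the kernel makes with its setup table.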
Areas for Improvement
Even a workhorse like init/main.c benefits from continual polish. Here’s what I’d prioritize for maintainability and developer confidence.
| Smell | Impact | Fix |
|---|---|---|
| Very long function (start_kernel) | Higher cognitive load; subtle ordering bugs are harder to review. | Extract coherent phases into small helpers (e.g., early RNG/log/tracing setup). |
| Global mutable state (system_state, early_boot_irqs_disabled) | Tight coupling; risk of accidental misuse. | Constrain updates to narrow helpers and add assertions around transitions. |
| In-place command-line mutation | Harder to reason about parameter lifetimes and side effects. | Document invariants and expand KUnit coverage for edge cases. |
| Multiple init-arg sources (bootconfig, cmdline, “--”) | Operator confusion; potential conflicts. | Log a clear summary of merged sources and precedence at boot. |
Refactor: Extract early RNG/log/tracing setup
This small extraction shortens start_kernel and groups tightly related steps while preserving order. It’s a low-risk readability win.
--- a/init/main.c
+++ b/init/main.c
@@ void start_kernel(void)
-	random_init_early(command_line);
-	setup_log_buf(0);
-	ftrace_init();
-	early_trace_init();
+	init_early_rng_log_trace(command_line);
@@
+static __init void init_early_rng_log_trace(char *command_line)
+{
+	random_init_early(command_line);
+	setup_log_buf(0);
+	ftrace_init();
+	early_trace_init();
+}
Isolating a coherent phase reduces visual noise in start_kernel and makes future changes to early tracing/logging easier to reason about.
Guard transitions with assertions
Boot invariants are precious. Adding a diagnostic check at key transitions (e.g., in rest_init) can catch regressions early without altering behavior.
Example: warn if IRQs aren’t in the expected state at the scheduling phase boundary.
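One possible shape for such a check, written as a user-space model rather than a kernel patch, is a transition helper that only diagnoses. Everything here is hypothetical (model_system_state, model_set_system_state, model_warnings), and the rule that IRQs should already be enabled at the scheduling boundary is my assumption about the expected state, so adjust it to whatever invariant you actually want to pin down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical mirror of the kernel's boot states, in boot order. */
enum model_system_state {
	MODEL_BOOTING,
	MODEL_SCHEDULING,
	MODEL_FREEING_INITMEM,
	MODEL_RUNNING
};

static enum model_system_state model_state = MODEL_BOOTING;
static unsigned int model_warnings;

/*
 * Diagnostic-only helper: the new state is always applied, so behavior
 * is unchanged, but anything suspicious is logged and counted.
 * Returns true when the transition was clean.
 */
static bool model_set_system_state(enum model_system_state next,
				   bool irqs_enabled)
{
	bool clean = true;

	assert(next <= MODEL_RUNNING);

	if (next <= model_state) {
		fprintf(stderr, "warn: non-monotonic transition %d -> %d\n",
			(int)model_state, (int)next);
		clean = false;
	}
	/* Assumption: IRQs are expected to be on by the scheduling boundary. */
	if (next == MODEL_SCHEDULING && !irqs_enabled) {
		fprintf(stderr, "warn: IRQs off at scheduling boundary\n");
		clean = false;
	}
	if (!clean)
		model_warnings++;

	model_state = next;
	return clean;
}
```

Because the helper never refuses a transition, it is safe to add during a refactor: the worst case is a noisy log, which is the signal you want.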
Test plan: KUnit + QEMU
Some of the trickiest bugs hide in parsing and in the interaction of multiple init-arg sources. The following cases are high value:
- Unknown options pass-through: Boot a kernel with a cmdline like foo=bar baz quux.env=1 and verify that env/argv forwarding matches expectations, with a single log about unknown parameters passed to user space.
- Bootconfig checksum and merge: Embed a bootconfig in initrd, pass bootconfig on the cmdline, validate checksum, and verify that kernel.* keys are merged into the command line and init.* into init args. Corrupt the checksum to observe the error path.
- Initcall blacklist: With initcall_blacklist=, ensure the blacklisted initcall is skipped and reported.
- PID 1 fallbacks: With a bad rdinit and no /sbin/init, confirm the final fallback to /bin/sh.
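For the checksum case, a test fixture needs to compute the expected value and a deliberately wrong one. The sketch below assumes the bootconfig checksum is a plain 32-bit sum of the payload bytes, which is how I remember it being defined; treat that as an assumption and confirm it against the bootconfig documentation for your kernel. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed algorithm: unsigned 32-bit sum of every payload byte. */
static uint32_t model_bootconfig_checksum(const void *data, size_t size)
{
	const unsigned char *p = data;
	uint32_t sum = 0;

	assert(data || size == 0);
	while (size--)
		sum += *p++;
	return sum;
}

/* Returns non-zero when the payload matches the stored checksum. */
static int model_bootconfig_verify(const void *data, size_t size,
				   uint32_t stored)
{
	return model_bootconfig_checksum(data, size) == stored;
}
```

In a QEMU test, flipping one byte of the payload after the checksum is computed should be enough to drive the error path.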
Performance at Scale
With modern kernels and rich hardware, boot performance hinges on initcall cost, firmware behavior, and I/O during initramfs/rootfs bring-up. Observability is your friend here.
Hot paths and latency risks
- start_kernel: One-time, latency-critical setup.
- do_initcalls: Linear in the number of initcalls; the cost is dominated by individual subsystem initialization work.
- run_init_process: The transition to PID 1; failures or path search can show up as user-visible delays.
Risks include slow firmware/ACPI init, heavyweight device probing, long console output (on slow serial consoles), and insufficient entropy before crypto consumers start.
Metrics to instrument
- boot.initcall_level_duration_seconds{level}: Track the duration of each initcall level. Establish baselines on reference hardware and alert on >2x regressions.
- boot.initcall_failures_total: Should be zero; a non-zero value is a boot failure signal.
- boot.time_to_pid1_seconds: End-to-end latency to executing PID 1. Maintain a regression budget (for example, ±5%).
- boot.entropy_bits_available_at_random_init: Ensure entropy meets security thresholds before enabling dependent subsystems.
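The two thresholds mentioned above (a 2x factor per level and a ±5% budget end to end) reduce to a pair of small predicates. This is a sketch of how an alerting job might express them; the function names are hypothetical, and the thresholds are passed in rather than hard-coded because the right values depend on your hardware baseline.

```c
#include <assert.h>
#include <stdbool.h>

/* True when measured exceeds baseline by more than the given factor. */
static bool exceeds_regression_factor(double baseline_s, double measured_s,
				      double factor)
{
	assert(factor > 0.0);
	return baseline_s > 0.0 && measured_s > baseline_s * factor;
}

/*
 * True when measured differs from baseline by more than the budget,
 * in either direction (a sudden speed-up can also signal skipped work).
 */
static bool outside_budget(double baseline_s, double measured_s,
			   double budget_fraction)
{
	double delta = measured_s - baseline_s;

	assert(budget_fraction >= 0.0);
	if (delta < 0.0)
		delta = -delta;
	return baseline_s > 0.0 && delta > baseline_s * budget_fraction;
}
```

For example, a level with a 0.5 s baseline that now takes 1.2 s trips the 2x check, while a 4.0 s time-to-PID-1 that drifts to 4.1 s stays inside a 5% budget.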
Logs, traces, and alerts
- Logs: Kernel command line echo, unknown parameter forwarding notice, and any errors while opening /dev/console or executing init.
- Tracepoints: initcall start/finish/level trace events and ftrace function graph around start_kernel and do_initcalls.
- Alerts: Boot time regression against baseline, non-zero initcall failures, missing working init (panic), or entropy below threshold past random_init.
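If you collect per-initcall latency from logs rather than tracepoints, a small parser is usually enough. The sketch below assumes the initcall_debug completion line looks like "initcall NAME returned RET after N usecs", optionally preceded by a dmesg timestamp. The struct and function names are hypothetical, and the parser deliberately does not handle symbols that carry a trailing module name in brackets.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed completion record from an initcall_debug log. */
struct initcall_sample {
	char name[128];		/* symbol as printed, e.g. foo_init+0x0/0x40 */
	int ret;		/* initcall return value */
	long long usecs;	/* reported duration in microseconds */
};

/*
 * Parse one log line. Returns 1 and fills *out on a completion line,
 * 0 for anything else (including the matching "calling" line).
 */
static int parse_initcall_line(const char *line, struct initcall_sample *out)
{
	const char *p;

	assert(line && out);

	/* Skip any "[   0.123456] " timestamp prefix. */
	p = strstr(line, "initcall ");
	if (!p)
		return 0;

	return sscanf(p, "initcall %127s returned %d after %lld usecs",
		      out->name, &out->ret, &out->usecs) == 3;
}
```

Feeding every dmesg line through parse_initcall_line and summing usecs per name gives a quick ranking of the slowest initcalls, and a non-zero ret is a natural source for a failure counter.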
Security-minded performance
The file also finalizes memory protection—e.g., making rodata read-only and completing PTI setup—after freeing __init sections. These steps should be visible in boot logs and, if possible, reflected in a metric/event so security posture changes are auditable across builds.
Conclusion
We’ve walked from the boot CPU’s first moments to a running system, guided by init/main.c. Three takeaways stand out:
- Clarity through structure: The template-method sequencing and initcall levels keep the kernel boot scalable and understandable, even across architectures.
- Safety and observability: Guardrails in do_one_initcall, plus tracepoints and initcall_debug, reduce the blast radius of boot-time bugs and make regressions tractable.
- Pragmatic refinements: Small extractions in start_kernel, explicit state transition checks, and targeted KUnit + QEMU tests will improve maintainability and DX without risking ordering guarantees.
If you contribute to boot-time code, keep the invariants close, add visibility when in doubt, and preserve order while extracting cohesive phases. Your future self—and the next engineer debugging a tricky boot—will thank you.



