<header>
  <p>When we sketch LLM workflows, we draw tidy boxes and arrows. In production, we get retries, partial failures, streaming UIs, and users who expect to resume in the middle of everything. Somewhere between those diagrams and reality, we need a <mark>ledger</mark> that keeps the story straight. In LangGraph, that ledger lives in the Pregel runtime. In this article, I (Mahmoud Zalt) want us to see how Pregel turns a messy, concurrent LLM workflow into a sequence of consistent checkpoints, and what that design teaches us about building our own stateful systems.</p>
</header>

<nav aria-label="Mini table of contents">
  <ul>
    <li><a href="#pregel-as-ledger">Pregel as a state ledger</a></li>
    <li><a href="#nodes-and-channels">Nodes, channels, and the builder</a></li>
    <li><a href="#editing-the-ledger">Editing the ledger safely</a></li>
    <li><a href="#streaming-on-checkpoints">Streaming on top of checkpoints</a></li>
    <li><a href="#operational-guardrails">Operational guardrails and lessons</a></li>
  </ul>
</nav>

<section id="pregel-as-ledger">
  <h2>Pregel as a state ledger</h2>
  <p>To understand Pregel, it helps to stop thinking about it as “just an executor” and start seeing it as a <strong>state ledger</strong> for your graph. Every step, every task, every write is recorded, versioned, and replayable.</p>

  <figure>
    <pre><code>langgraph/
  pregel/
    _algo.py           # scheduling &amp; applying writes
    _loop.py           # SyncPregelLoop, AsyncPregelLoop (execution engine)
    _runner.py         # PregelRunner (per-task execution)
    main.py            # &lt;== this file: Pregel runtime &amp; NodeBuilder</code></pre>
    <figcaption>Pregel sits above the low-level loops and runners, and below the user-facing Graph APIs.</figcaption>
  </figure>

  <p class="why">Once you see Pregel as a ledger rather than a loop, its choices around checkpoints, streaming, and bulk updates become much easier to reason about.</p>

  <p>Conceptually, Pregel follows the <dfn>Bulk Synchronous Parallel</dfn> model: work happens in <em>steps</em>. In each step, a set of workers run in parallel; then everyone stops, applies their writes, and only then moves on. That barrier gives us a natural boundary where we can write a consistent snapshot through a checkpointer.</p>
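  <p>To make the barrier concrete, here is a minimal sketch in plain Python. It is not LangGraph's code: <code>run_superstep</code>, the two workers, and <code>snapshots</code> are hypothetical names, and the workers run one after another for simplicity. The point is only that every worker reads the same pre-step state, and writes are applied together once all of them have finished.</p>

```python
from copy import deepcopy
from typing import Callable

State = dict[str, int]
Worker = Callable[[State], dict[str, int]]


def run_superstep(state: State, workers: list[Worker]) -> State:
    """Run every worker against the same frozen state, then apply all writes."""
    frozen = deepcopy(state)
    # Phase 1: compute. No worker sees another worker's writes.
    pending = [worker(frozen) for worker in workers]
    # Phase 2: barrier reached, apply the buffered writes in one go.
    new_state = dict(state)
    for writes in pending:
        new_state.update(writes)
    return new_state


def double_a(state: State) -> dict[str, int]:
    return {"b": state["a"] * 2}


def read_b(state: State) -> dict[str, int]:
    # Reads "b" as it was before this step, not double_a's fresh write.
    return {"c": state.get("b", 0) + 1}


snapshots: list[State] = []  # one consistent "page" per step
state: State = {"a": 2}
for _ in range(2):
    state = run_superstep(state, [double_a, read_b])
    snapshots.append(deepcopy(state))
```

  <p>In the first step <code>read_b</code> still sees no <code>"b"</code>, so it writes <code>c = 1</code>; only in the second step does it observe the value <code>double_a</code> wrote earlier. Each entry in <code>snapshots</code> is taken at a barrier, which is what makes it safe to persist.</p>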

  <p>The <code>Pregel</code> class in <a href="https://github.com/langchain-ai/langgraph/blob/main/libs/langgraph/langgraph/pregel/main.py" target="_blank" rel="noopener">main.py</a> orchestrates this:</p>
  <ul>
    <li>Drives <code>SyncPregelLoop</code>/<code>AsyncPregelLoop</code> to step the graph.</li>
    <li>Uses <code>PregelRunner</code> to run node logic (often LLM calls).</li>
    <li>Reads and writes checkpoints through a <code>BaseCheckpointSaver</code>.</li>
    <li>Exposes a state API: <code>get_state</code>, <code>get_state_history</code>, <code>bulk_update_state</code>, and their async variants.</li>
  </ul>

  <aside class="callout">If we think of each checkpoint as a page in a ledger, Pregel’s job is to decide when to turn the page, what to record, and how to let us annotate or correct those pages later without breaking history.</aside>
</section>

<section id="nodes-and-channels">
  <h2>Nodes, channels, and the builder</h2>
  <p>Once we treat Pregel as a ledger, we need a concrete mental model for what moves through it. A useful analogy from the code review is: think of <strong>channels as stations</strong> and <strong>nodes as trains</strong>. Each step is a scheduled departure.</p>

  <p>Pregel exposes a <code>NodeBuilder</code>, a fluent API for defining these trains: what stations they read from, what they write to, and how they behave.</p>

  <figure>
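    <pre><code class="language-python">from langgraph.pregel import NodeBuilder

# Read from channel "a", double the value, and write the result to channel "b".
node1 = (
    NodeBuilder().subscribe_only("a")
    .do(lambda x: x + x)
    .write_to("b")
)</code></pre>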
    <figcaption>A minimal node: subscribe to channel <code>"a"</code>, process, and write to <code>"b"</code>.</figcaption>
  </figure>

  <p>The core of this builder is small and focused. Here is the essence of <code>subscribe_only</code> and <code>build</code>:</p>

  <figure>
    <pre><code class="language-python">def subscribe_only(self, channel: str) -&gt; Self:
    """Subscribe to a single channel."""
    if not self._channels:
        self._channels = channel
    else:
        raise ValueError(
            "Cannot subscribe to single channels when other channels are already subscribed to"
        )

    self._triggers.append(channel)
    return self


def build(self) -&gt; PregelNode:
    """Builds the node."""
    return PregelNode(
        channels=self._channels,
        triggers=self._triggers,
        tags=self._tags,
        metadata=self._metadata,
        writers=[ChannelWrite(self._writes)],
        bound=self._bound,
        retry_policy=self._retry_policy,
        cache_policy=self._cache_policy,
    )</code></pre>
    <figcaption><code>NodeBuilder</code> turns a fluent description into a concrete <code>PregelNode</code> with channels, triggers, and writers.</figcaption>
  </figure>

  <p>In the trains-and-stations analogy:</p>
  <ul>
    <li><code>channels</code> describe which stations the train can load from.</li>
    <li><code>triggers</code> describe which station arrivals should schedule that train in the next step.</li>
    <li><code>writers</code> describe which stations receive cargo when the train finishes.</li>
  </ul>
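  <p>The analogy can be sketched as a tiny scheduler. This is a toy model under stated assumptions, not how <code>PregelNode</code> is implemented: <code>ToyNode</code> and <code>nodes_to_schedule</code> are hypothetical names, and real triggering also depends on channel versions, which the sketch ignores.</p>

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node; not LangGraph's PregelNode."""
    name: str
    channels: list[str]   # stations the train loads from
    triggers: list[str]   # station arrivals that schedule the train
    writes_to: list[str] = field(default_factory=list)  # stations that receive cargo


def nodes_to_schedule(nodes: list[ToyNode], updated_channels: set[str]) -> list[str]:
    """Return names of nodes with at least one trigger among the updated channels."""
    return [n.name for n in nodes if updated_channels.intersection(n.triggers)]


nodes = [
    ToyNode("double", channels=["a"], triggers=["a"], writes_to=["b"]),
    ToyNode("report", channels=["a", "b"], triggers=["b"], writes_to=["out"]),
]

first_step = nodes_to_schedule(nodes, {"a"})   # an input arrives at station "a"
second_step = nodes_to_schedule(nodes, {"b"})  # "double" delivered cargo to "b"
```

  <p>Note that <code>report</code> reads from both <code>"a"</code> and <code>"b"</code> but is scheduled only by <code>"b"</code>: reading from a station and being triggered by it are separate decisions.</p>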

  <aside class="callout">The builder enforces invariants early via clear <code>ValueError</code>s: for example, <code>subscribe_only</code> refuses to run once other channels are already subscribed. Catching mismatches at construction time keeps the runtime ledger much easier to maintain.</aside>
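  <p>If we want the same fail-fast behavior in our own builders, the pattern is small. The sketch below is a hypothetical <code>ToyNodeBuilder</code>, written only to illustrate the idea; its method bodies and error messages are assumptions and do not reproduce <code>NodeBuilder</code>.</p>

```python
from typing import Union


class ToyNodeBuilder:
    """Hypothetical builder that mirrors the fail-fast checks, nothing more."""

    def __init__(self) -> None:
        self._channels: Union[str, list[str]] = []
        self._triggers: list[str] = []

    def subscribe_only(self, channel: str) -> "ToyNodeBuilder":
        if self._channels:
            raise ValueError("cannot subscribe to a single channel: others already subscribed")
        self._channels = channel
        self._triggers.append(channel)
        return self

    def subscribe_to(self, *channels: str) -> "ToyNodeBuilder":
        if isinstance(self._channels, str):
            raise ValueError("cannot add channels: already subscribed to a single channel")
        self._channels.extend(channels)
        self._triggers.extend(channels)
        return self


def try_build(make) -> str:
    """Return 'ok' or the validation message, so both outcomes are easy to inspect."""
    try:
        make()
        return "ok"
    except ValueError as exc:
        return str(exc)


valid = try_build(lambda: ToyNodeBuilder().subscribe_to("a", "b"))
mixed = try_build(lambda: ToyNodeBuilder().subscribe_to("a").subscribe_only("b"))
```

  <p>The invalid combination fails while the node is being described, long before any step runs, so a bad definition never reaches the ledger.</p>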
</section>

<section id="editing-the-ledger">
  <h2>Editing the ledger safely</h2>
  <p>Pregel becomes most interesting when we need to <strong>edit the ledger</strong> after the fact: fixing bad state, replaying parts of a run, or seeding new branches. That is what <code>bulk_update_state</code> and <code>abulk_update_state</code> are for.</p>

  <p><code>bulk_update_state</code> takes a series of <em>supersteps</em>, each a list of <code>StateUpdate</code> objects. A <code>StateUpdate</code> is essentially: “pretend node X, task Y wrote these values.” Internally, Pregel:</p>
  <ol>
    <li>Loads (and possibly migrates) the latest checkpoint.</li>
    <li>Resolves which tasks and writers correspond to each update.</li>
    <li>Reuses or creates task IDs so history stays coherent.</li>
    <li>Applies writes through the same machinery as normal execution.</li>
    <li>Persists a new checkpoint with updated channel versions.</li>
  </ol>
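  <p>The steps above can be sketched with an in-memory list standing in for the checkpointer. Everything here is an assumption made for illustration: <code>ToyUpdate</code>, <code>toy_bulk_update</code>, and the checkpoint shape are hypothetical, and task resolution (steps 2 and 3) is reduced to carrying the <code>as_node</code> label along.</p>

```python
from copy import deepcopy
from typing import Any, NamedTuple


class ToyUpdate(NamedTuple):
    """Hypothetical analogue of a state update: 'pretend as_node wrote values'."""
    values: dict[str, Any]
    as_node: str


def toy_bulk_update(
    history: list[dict[str, Any]], supersteps: list[list[ToyUpdate]]
) -> dict[str, Any]:
    """Append one new checkpoint per superstep; never mutate old ones."""
    for updates in supersteps:
        checkpoint = deepcopy(history[-1])  # load the latest page
        for update in updates:
            for channel, value in update.values.items():
                checkpoint["values"][channel] = value
                # Bump the channel version so readers can detect the change.
                checkpoint["versions"][channel] = checkpoint["versions"].get(channel, 0) + 1
        checkpoint["step"] += 1
        checkpoint["written_by"] = [u.as_node for u in updates]
        history.append(checkpoint)  # persist as a new page
    return history[-1]


history = [
    {"step": 0, "written_by": [], "values": {"a": 1}, "versions": {"a": 1}}
]
latest = toy_bulk_update(
    history,
    [
        [ToyUpdate({"a": 10}, as_node="double")],
        [ToyUpdate({"b": 20}, as_node="report")],
    ],
)
```

  <p>Two supersteps produce two new pages, and the original page is left untouched. That append-only behavior is what lets history stay coherent after a manual edit.</p>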

  <p>One subtle part is <strong>disambiguating</strong> which node a single update belongs to when the caller omits <code>as_node</code>. Here is the core resolution logic:</p>

  <figure>
    <pre><code class="language-python">valid_updates: list[tuple[str, dict[str, Any] | None, str | None]] = []
if len(updates) == 1:
    values, as_node, task_id = updates[0]
    # find last node that updated the state, if not provided
    if as_node is None and len(self.nodes) == 1:
        as_node = tuple(self.nodes)[0]
    elif as_node is None and not any(
        v
        for vv in checkpoint["versions_seen"].values()
        for v in vv.values()
    ):
        if (
            isinstance(self.input_channels, str)
            and self.input_channels in self.nodes
        ):
            as_node = self.input_channels
    elif as_node is None:
        last_seen_by_node = sorted(
            (v, n)
            for n, seen in checkpoint["versions_seen"].items()
            if n in self.nodes
            for v in seen.values()
        )
        # if two nodes updated the state at the same time, it's ambiguous
        if last_seen_by_node:
            if len(last_seen_by_node) == 1:
                as_node = last_seen_by_node[0][1]
            elif last_seen_by_node[-1][0] != last_seen_by_node[-2][0]:
                as_node = last_seen_by_node[-1][1]
    if as_node is None:
        raise InvalidUpdateError("Ambiguous update, specify as_node")</code></pre>
    <figcaption>When the caller omits <code>as_node</code>, Pregel tries to infer it from history; if it cannot do so safely, it fails loudly.</figcaption>
  </figure>

  <p>The important principle: <strong>never guess silently.</strong> Pregel only infers <code>as_node</code> when there is a single clear candidate. As soon as two nodes might have written at the same logical time, it raises <code>InvalidUpdateError</code>. That discipline keeps the ledger trustworthy.</p>
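  <p>To see the tie check in isolation, here is a simplified, standalone version of the last branch only. It assumes integer versions and a plain dictionary for <code>versions_seen</code>; <code>infer_as_node</code> is a hypothetical helper, and it returns <code>None</code> where Pregel would go on to raise <code>InvalidUpdateError</code>.</p>

```python
from typing import Optional


def infer_as_node(
    versions_seen: dict[str, dict[str, int]], nodes: set[str]
) -> Optional[str]:
    """Return the node that most recently saw a channel version, or None on a tie."""
    last_seen = sorted(
        (version, node)
        for node, seen in versions_seen.items()
        if node in nodes
        for version in seen.values()
    )
    if not last_seen:
        return None
    if len(last_seen) == 1:
        return last_seen[0][1]
    # Two entries share the highest version: refuse to pick a winner.
    if last_seen[-1][0] == last_seen[-2][0]:
        return None
    return last_seen[-1][1]


known = {"double", "report"}
clear = infer_as_node({"double": {"a": 1}, "report": {"b": 2}}, known)
tied = infer_as_node({"double": {"a": 2}, "report": {"b": 2}}, known)
```

  <p>With distinct versions the most recent writer wins; with equal versions the function declines to answer, and the caller has to pass <code>as_node</code> explicitly.</p>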

  <details>
    <summary>Special supersteps: <code>END</code>, <code>INPUT</code>, and <code>"__copy__"</code></summary>
    <p>Besides ordinary “as this node” updates, <code>bulk_update_state</code> supports three special patterns:</p>
    <ul>
      <li><strong>Clear everything:</strong> <code>values=None</code> with <code>as_node == END</code> clears all pending tasks: Pregel computes the writes that would flush the graph and applies them.</li>
      <li><strong>Act as input:</strong> <code>as_node == INPUT</code> feeds <code>values</code> through <code>map_input</code> and persists it as if it were a real graph input.</li>
      <li><strong>Fork a checkpoint:</strong> <code>as_node == "__copy__"</code> creates a new checkpoint (a fork in the ledger) and can chain subsequent updates on top of it in the same call.</li>
    </ul>
    <p>All three reuse the same <code>apply_writes</code> and <code>create_checkpoint</code> machinery as normal execution, so manual edits and standard runs share the same semantics.</p>
  </details>
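  <p>The fork pattern is the easiest of the three to picture. The sketch below is a hypothetical model, not the <code>"__copy__"</code> implementation: <code>new_checkpoint</code>, <code>fork</code>, and the <code>parent_id</code> field are assumptions chosen to show how a copy can branch the ledger without touching the original page.</p>

```python
import itertools
from copy import deepcopy
from typing import Any, Optional

_ids = itertools.count(1)  # toy ID generator for the sketch


def new_checkpoint(values: dict[str, Any], parent_id: Optional[int]) -> dict[str, Any]:
    """Create a page with its own ID and a pointer to the page it came from."""
    return {"id": next(_ids), "parent_id": parent_id, "values": deepcopy(values)}


def fork(checkpoint: dict[str, Any]) -> dict[str, Any]:
    """Copy a checkpoint into a new page that points back at its origin."""
    return new_checkpoint(checkpoint["values"], parent_id=checkpoint["id"])


root = new_checkpoint({"draft": "v1"}, parent_id=None)
branch = fork(root)
branch["values"]["draft"] = "v2"  # edits land on the fork only
```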

  <aside class="callout">This is “dangerous power with guardrails”: you can surgically edit live graph state, but the code defends itself with explicit validation and unambiguous error paths.</aside>
</section>

<section id="streaming-on-checkpoints">
  <h2>Streaming on top of checkpoints</h2>
  <p>So far, we have treated Pregel as a batch ledger. Most real workflows need streaming: partial outputs, token streams, and debug traces. The key design choice is that Pregel builds streaming <em>on top of</em> the same step-and-checkpoint model instead of inventing a separate pipeline.</p>

  <p>The synchronous <code>stream</code> method wires the pieces together. The inner loop looks like this:</p>

  <figure>
    <pre><code class="language-python">while loop.tick():
    for task in loop.match_cached_writes():
        loop.output_writes(task.id, task.writes, cached=True)
    for _ in runner.tick(
        [t for t in loop.tasks.values() if not t.writes],
        timeout=self.step_timeout,
        get_waiter=get_waiter,
        schedule_task=loop.accept_push,
    ):
        # emit output
        yield from _output(
            stream_mode, print_mode, subgraphs, stream.get, queue.Empty
        )
    loop.after_tick()
    # wait for checkpoint
    if durability_ == "sync":
        loop._put_checkpoint_fut.result()</code></pre>
    <figcaption>Streaming is reading from a queue that the loop fills as steps and tasks complete.</figcaption>
  </figure>

  <p>This enforces two invariants:</p>
  <ul>
    <li><strong>Step boundaries are explicit.</strong> Channel updates from step N become visible only when we move to step N+1. That is why we can take consistent snapshots at each step.</li>
    <li><strong>Streaming is event-based.</strong> The loop pushes <code>StreamChunk</code>s into a queue; <code>_output</code> drains that queue, optionally prints for debug, and yields to the caller.</li>
  </ul>
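  <p>Both invariants fit in a few lines of plain Python. This is a simplified sketch, not the real loop: <code>toy_stream</code> and <code>drain</code> are hypothetical, tasks run sequentially, and the <code>(namespace, mode, payload)</code> tuple shape is borrowed from the custom-stream tuples as an assumption about what a chunk looks like.</p>

```python
import queue
from typing import Callable, Iterator

Chunk = tuple[str, str, object]  # (namespace, mode, payload), assumed shape


def drain(q: "queue.SimpleQueue[Chunk]") -> Iterator[Chunk]:
    """Yield everything currently queued without blocking."""
    while True:
        try:
            yield q.get_nowait()
        except queue.Empty:
            return


def toy_stream(steps: list[list[Callable[[], object]]]) -> Iterator[Chunk]:
    q: "queue.SimpleQueue[Chunk]" = queue.SimpleQueue()
    for step_no, tasks in enumerate(steps):
        for task in tasks:
            q.put(("", "updates", task()))   # the loop produces events
            yield from drain(q)              # the caller consumes them as they arrive
        q.put(("", "checkpoints", step_no))  # step boundary: one snapshot per step
        yield from drain(q)


events = list(toy_stream([[lambda: "a1", lambda: "a2"], [lambda: "b1"]]))
```

  <p>The consumer sees task output as soon as it is queued, yet a checkpoint event appears only after every task in that step has reported. Streaming and snapshots share one timeline.</p>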

  <p>A dedicated <code>StreamMessagesHandler</code> is attached when you enable <code>stream_mode="messages"</code>, streaming LLM tokens and metadata. Custom streaming goes through a <code>Runtime.stream_writer</code> callback that pushes <code>(namespace, "custom", payload)</code> tuples into the same queue.</p>

  <aside class="callout">One improvement proposed in the review is to replace raw <code>print</code> calls in <code>_output</code> with a logger, so production services can control verbosity and protect PII through centralized logging policy.</aside>
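  <p>What that proposal could look like, using only the standard <code>logging</code> module. The logger name <code>pregel.output</code> and the <code>emit_debug</code> helper are hypothetical; this is a sketch of the idea, not a patch against <code>_output</code>.</p>

```python
import logging

logger = logging.getLogger("pregel.output")  # hypothetical logger name


def emit_debug(mode: str, payload: object) -> None:
    """Route debug output through logging so handlers and levels decide its fate."""
    # Lazy %-style arguments: the message is only formatted if DEBUG is enabled.
    logger.debug("stream chunk mode=%s payload=%r", mode, payload)
```

  <p>With this in place, a service can silence the output by raising the logger's level, or attach a <code>logging.Filter</code> that redacts sensitive fields, without touching the streaming code.</p>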
</section>

<section id="operational-guardrails">
  <h2>Operational guardrails and lessons</h2>
  <p>Once we see Pregel as a ledger with streaming layered on top, operational concerns become easy to frame: how many pages we add, how big they are, and how long each write and stream takes.</p>

  <table>
    <thead>
      <tr>
        <th>Metric</th>
        <th>What it tells us</th>
        <th>Suggested target</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><code>pregel_steps_per_run</code></td>
        <td>How many steps each execution needs; high values hint at inefficient or looping graphs.</td>
        <td>Most workflows &lt; 50 steps; investigate if &gt; 200 regularly.</td>
      </tr>
      <tr>
        <td><code>checkpoint_write_latency_ms</code></td>
        <td>How long <code>checkpointer.put/aput</code> takes; directly affects latency in <code>durability="sync"</code> mode.</td>
        <td>P50 &lt; 50ms, P95 &lt; 200ms.</td>
      </tr>
      <tr>
        <td><code>stream_queue_depth</code></td>
        <td>Current size of the <code>SyncQueue</code>/<code>AsyncQueue</code>; a proxy for backpressure.</td>
        <td>Keep under ~100 items under normal load.</td>
      </tr>
      <tr>
        <td><code>bulk_update_superstep_duration_ms</code></td>
        <td>How long a single bulk update superstep takes; useful when external tools edit graph state.</td>
        <td>&lt; 200ms per superstep in interactive scenarios.</td>
      </tr>
    </tbody>
  </table>
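  <p>Pregel does not emit these metrics for us, so we have to measure them ourselves. As a rough sketch of checking the <code>checkpoint_write_latency_ms</code> targets from the table: <code>timed_checkpoint_write</code>, <code>latencies_ms</code>, and <code>within_target</code> are hypothetical names, and the in-memory list stands in for whatever metrics client you already use.</p>

```python
import statistics
import time
from contextlib import contextmanager
from typing import Iterator

latencies_ms: list[float] = []  # hypothetical in-memory sink


@contextmanager
def timed_checkpoint_write() -> Iterator[None]:
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies_ms.append((time.perf_counter() - start) * 1000.0)


def p95(samples: list[float]) -> float:
    """95th percentile via statistics.quantiles (needs at least two samples)."""
    return statistics.quantiles(samples, n=100)[94]


def within_target(
    samples: list[float], p50_ms: float = 50.0, p95_ms: float = 200.0
) -> bool:
    """True when both the median and the 95th percentile are under their targets."""
    return p50_ms > statistics.median(samples) and p95_ms > p95(samples)
```

  <p>Wrapping the call to the checkpointer in <code>with timed_checkpoint_write():</code> collects the samples; <code>within_target(latencies_ms)</code> can then feed an alert or a dashboard.</p>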

  <p>On the safety side, Pregel encodes several guardrails:</p>
  <ul>
    <li><strong>Recursion limit:</strong> if a run exhausts its configured <code>recursion_limit</code> without reaching a stop condition, Pregel raises <code>GraphRecursionError</code> with a dedicated error code instead of looping indefinitely.</li>
    <li><strong>Durability contract:</strong> <code>durability</code> has no effect unless a checkpointer is configured, and deprecated options like <code>checkpoint_during</code> are guarded explicitly.</li>
    <li><strong>Namespace hygiene:</strong> subgraph checkpoint namespaces are normalized via <code>recast_checkpoint_ns</code> so parent and child graphs do not overwrite each other’s history.</li>
  </ul>
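  <p>The recursion guardrail is worth copying into any step-based runtime. Here is a minimal sketch of the idea, with hypothetical names (<code>run_with_limit</code>, <code>ToyRecursionError</code>) and a step function that signals completion by returning <code>None</code>, which is an assumption of this toy and not how Pregel decides to stop.</p>

```python
class ToyRecursionError(RuntimeError):
    """Hypothetical analogue of a recursion-limit error."""


def run_with_limit(step, state, recursion_limit: int):
    """Call step(state) until it returns None, or fail after recursion_limit steps."""
    for _ in range(recursion_limit):
        next_state = step(state)
        if next_state is None:  # the workflow reached a stop condition
            return state
        state = next_state
    raise ToyRecursionError(
        f"Recursion limit of {recursion_limit} reached without hitting a stop condition"
    )


def count_to_three(n: int):
    return None if n >= 3 else n + 1


def loop_forever(n: int):
    return n + 1
```

  <p>Calling <code>run_with_limit(count_to_three, 0, recursion_limit=10)</code> finishes normally, while <code>loop_forever</code> hits the limit and fails with a message that names the cause. A bounded loop with a specific error is far easier to operate than a run that never returns.</p>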

  <aside class="callout">Operational work is much easier when the runtime is opinionated: Pregel refuses to run some flows without a checkpointer and loudly complains about ambiguous bulk updates. That is policy enforced by code, not just by documentation.</aside>

  <h3>Design lessons you can reuse</h3>
  <ol>
    <li>
      <strong>Make step boundaries explicit.</strong>
      <p>Just like Pregel’s “channel updates from step N become visible in N+1”, define clear phases in your workflows (plan → execute → commit). It simplifies reasoning about concurrency and makes checkpointing natural.</p>
    </li>
    <li>
      <strong>Expose a safe manual-edit path.</strong>
      <p><code>bulk_update_state</code> is a controlled edit interface into the ledger. Consider offering a similar API for your state: it lets operators fix issues and build admin tools without poking your database directly, provided you enforce strong validation and unambiguous semantics.</p>
    </li>
    <li>
      <strong>Design sync and async together.</strong>
      <p>Pregel keeps tight parity between <code>stream/astream</code>, <code>invoke/ainvoke</code>, and <code>bulk_update_state/abulk_update_state</code>. When you add features, design the sync/async story together so you do not end up with two subtly different runtimes.</p>
    </li>
    <li>
      <strong>Treat observability as part of the API.</strong>
      <p>Stream modes (<code>"values"</code>, <code>"updates"</code>, <code>"messages"</code>, <code>"tasks"</code>, <code>"checkpoints"</code>, <code>"debug"</code>) are part of the public surface, not bolted on later. Think of logs, metrics, and streams as first-class outputs of your system, not just side effects.</p>
    </li>
  </ol>
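  <p>For the sync/async lesson in particular, one way to keep the two paths from drifting is to put the decision logic in a shared, I/O-free core and make both entry points thin wrappers around it. The sketch below uses hypothetical names (<code>_plan_steps</code>, <code>run</code>, <code>arun</code>) and is not a description of how Pregel structures its loops.</p>

```python
import asyncio
from typing import Iterator


def _plan_steps(n: int) -> Iterator[str]:
    """Shared, I/O-free core: both entry points consume the same plan."""
    for i in range(n):
        yield f"step-{i}"


def run(n: int) -> list[str]:
    """Synchronous entry point."""
    return [label.upper() for label in _plan_steps(n)]


async def arun(n: int) -> list[str]:
    """Asynchronous entry point with the same semantics as run()."""
    results = []
    for label in _plan_steps(n):
        await asyncio.sleep(0)  # stand-in for awaiting real async I/O
        results.append(label.upper())
    return results
```

  <p>Because both wrappers consume the same plan, a parity test as simple as <code>run(3) == asyncio.run(arun(3))</code> catches most divergence before it ships.</p>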

  <p>Viewed as a ledger, Pregel turns a complex LLM workflow into a sequence of carefully written pages you can always come back to: clear step boundaries, explicit state transitions, and safe ways to read and edit history. If we design our own runtimes with the same mindset, they become much easier to scale, debug, and evolve.</p>
</section>