Zalt Blog

Deep Dives into Code & Architecture


How Prometheus Keeps Its TSDB Sane

By Mahmoud Zalt
Code Cracking
30m read


Every successful system eventually hits the same problem: the storage layer turns into a beast. Prometheus is no exception. Its time-series database (TSDB) ingests unbounded streams of metrics, answers arbitrary queries, repairs itself after crashes, and still has to stay fast and safe. Here, we’ll walk through how Prometheus’ core DB type keeps that beast under control.

We’ll focus on tsdb/db.go as a case study in how to orchestrate a complex storage engine without losing your sanity. The TSDB’s DB doesn’t implement the low-level data structures; it coordinates them. Understanding that coordination is the main lesson.

I’m Mahmoud Zalt, an AI solutions architect. I help engineering leaders turn complex systems—especially those touched by AI and data—into something they can reason about and evolve. Prometheus’ TSDB is a great example of that kind of deliberate design.

DB as an air-traffic controller

Prometheus’ TSDB is not one monolith; it’s a set of components that each do one thing well:

  • Head is the busy runway and terminal — fresh data in memory plus the write-ahead log (WAL).
  • Blocks on disk are the hangars — immutable archives of older samples.
  • Compactor is ground control moving planes from the runway to hangars, merging and cleaning up.
  • Retention is airport capacity planning — deciding which old planes to scrap.

The DB type in tsdb/db.go is the air-traffic controller that coordinates all of this. It doesn’t implement the details of Head or Block, but it decides when and how they move and interact.

tsdb/
  db.go                # Core DB orchestration (this file)
  head.go              # In-memory head block & WAL logic
  block.go             # On-disk block representation
  chunks/              # Chunk files and mmap helpers
  wlog/                # WAL and WBL implementation

Open DB ->
  +-> DirLocker, WAL/WBL
  +-> Compactor
  +-> Head.Init (WAL replay)
  +-> reloadBlocks
  +-> go db.run()
DB sits above Head, Block, WAL/WBL, and Compactor, orchestrating their lifecycles.
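The open sequence in the diagram follows a classic shape: acquire resources in dependency order and, on failure, unwind the ones already started in reverse. Here is a minimal sketch of that shape; the step names and the openDB helper are illustrative, not the Prometheus API:

```go
package main

import (
	"errors"
	"fmt"
)

// openDB runs startup steps in order; if one fails, it unwinds the
// already-completed steps in reverse, like a stack of deferred closes.
func openDB(steps []func() error, undo []func()) (err error) {
	for i, step := range steps {
		if err = step(); err != nil {
			for j := i - 1; j >= 0; j-- {
				undo[j]()
			}
			return fmt.Errorf("open step %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	var log []string
	mk := func(name string, fail bool) (func() error, func()) {
		open := func() error {
			log = append(log, "open "+name)
			if fail {
				return errors.New(name + " failed")
			}
			return nil
		}
		undo := func() { log = append(log, "close "+name) }
		return open, undo
	}

	lockO, lockC := mk("dirlock", false)
	walO, walC := mk("wal", false)
	headO, headC := mk("head", true) // simulate a failing WAL replay

	err := openDB([]func() error{lockO, walO, headO}, []func(){lockC, walC, headC})
	fmt.Println(err != nil) // the failed head step surfaces as an error
	fmt.Println(log)        // wal and dirlock are closed, in reverse order
}
```

The payoff is that a half-opened DB never leaks a directory lock or an open WAL.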

The central story in this file is not about a clever data structure; it’s about coordinating many moving parts safely: writes, compactions, deletions, queries, crashes, and out-of-order data. Everything else in this article is in service of that orchestration lesson.

Lifecycles, locks, and background routines

Once we see DB as an orchestrator, the next question is how it stays sane at runtime: how it protects shared state, runs background work, and shuts down cleanly. This is where the design either gives us confidence or keeps us awake at night.

The core DB state and lock partitioning

At the heart of DB is a set of fields and mutexes that encode its responsibilities:

type DB struct {
    dir    string
    locker *tsdbutil.DirLocker

    logger  *slog.Logger
    metrics *dbMetrics
    opts    *Options

    chunkPool      chunkenc.Pool
    compactor      Compactor
    blocksToDelete BlocksToDeleteFunc

    // mtx protects the block list and mmap GC state.
    mtx    sync.RWMutex
    blocks []*Block

    lastGarbageCollectedMmapRef chunks.ChunkDiskMapperRef

    head *Head

    compactc chan struct{}
    donec    chan struct{}
    stopc    chan struct{}

    // cmtx ensures that compactions and deletions don't run simultaneously.
    cmtx sync.Mutex

    // autoCompactMtx protects autoCompaction toggling.
    autoCompactMtx sync.Mutex
    autoCompact    bool

    // retentionMtx protects retention config values updated at runtime.
    retentionMtx sync.RWMutex

    compactCancel context.CancelFunc

    timeWhenCompactionDelayStarted time.Time
}

Three design ideas carry most of the weight here:

  1. Explicit mutex partitioning. mtx guards the block layout and mmap GC ref, cmtx serializes compaction and deletion, retentionMtx protects retention settings, autoCompactMtx guards the auto-compaction flag. Each lock has a clearly scoped concern, which controls contention and makes concurrency intent obvious.
  2. Channels as signals, not work queues. compactc is a “you should compact” signal, not a job queue. Writers send to a buffered channel, but actual compaction is serialized behind cmtx. Intent and execution are decoupled.
  3. Cancellation is baked in. compactCancel, stopc, and donec give long‑running tasks a clear, centralized shutdown path.
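The "signal, not work queue" idea from point 2 can be sketched in isolation. This is a hedged reconstruction of the pattern, not Prometheus code, and the triggerCompact name is mine:

```go
package main

import "fmt"

// triggerCompact nudges the background loop without blocking the caller.
// A buffered channel of size 1 coalesces any number of pending requests
// into a single "you should compact" signal.
func triggerCompact(compactc chan struct{}) bool {
	select {
	case compactc <- struct{}{}:
		return true // signal registered
	default:
		return false // a signal is already pending; nothing to do
	}
}

func main() {
	compactc := make(chan struct{}, 1)

	// Many writers can fire the trigger; only one signal is ever queued.
	fmt.Println(triggerCompact(compactc)) // true: channel was empty
	fmt.Println(triggerCompact(compactc)) // false: coalesced with pending signal

	<-compactc // the run loop drains the single signal
	fmt.Println(triggerCompact(compactc)) // true again after the drain
}
```

Because intent and execution are decoupled, a burst of a thousand writers costs the compactor exactly one compaction.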

The background run loop

When a DB is opened, it launches a single caretaker goroutine, run, that coordinates periodic work and reacts to compaction signals:

func (db *DB) run(ctx context.Context) {
    defer close(db.donec)

    backoff := time.Duration(0)

    for {
        select {
        case <-db.stopc:
            return
        case <-time.After(backoff):
        }

        select {
        case <-time.After(db.opts.BlockReloadInterval):
            db.cmtx.Lock()
            if err := db.reloadBlocks(); err != nil {
                db.logger.Error("reloadBlocks", "err", err)
            }
            db.cmtx.Unlock()

            // Nudge compaction if needed.
            select {
            case db.compactc <- struct{}{}:
            default:
            }

            db.head.mmapHeadChunks()

            // Potentially trigger stale-series compaction here.

        case <-db.compactc:
            db.metrics.compactionsTriggered.Inc()

            db.autoCompactMtx.Lock()
            if db.autoCompact {
                if err := db.Compact(ctx); err != nil {
                    db.logger.Error("compaction failed", "err", err)
                    backoff = exponential(backoff, time.Second, time.Minute)
                } else {
                    backoff = 0
                }
            } else {
                db.metrics.compactionsSkipped.Inc()
            }
            db.autoCompactMtx.Unlock()
        case <-db.stopc:
            return
        }
    }
}

In plain language, this loop:

  • Periodically reloads blocks from disk under cmtx, nudges compaction by sending on compactc, and mmaps head chunks to control memory usage.
  • Listens for compaction signals from writers or from the periodic tick, and runs Compact with exponential backoff on failure.
  • Stops cleanly when stopc is closed, signaling all background work to exit.

This pattern — a single background loop that owns scheduling and coordination of maintenance tasks — is one of the key reusable ideas in this file.

Compaction and retention as safe garbage collection

With the runtime model in place, we can zoom in on the most delicate work: turning in‑memory data into immutable blocks, merging older blocks, and safely deleting what we no longer need. Prometheus treats this as a kind of garbage collection cycle, not just housekeeping.

Compaction as a GC cycle

A useful mental model is generational garbage collection:

  • The Head is the “young generation” where new samples arrive and change quickly.
  • On-disk blocks are “older generations” that change only via compaction.
  • Compaction periodically promotes data from head to blocks and merges older blocks.

The top-level GC cycle is Compact:

// Compact data if possible.
func (db *DB) Compact(ctx context.Context) (returnErr error) {
    db.cmtx.Lock()
    defer db.cmtx.Unlock()
    defer func() {
        if returnErr != nil && !errors.Is(returnErr, context.Canceled) {
            db.metrics.compactionsFailed.Inc()
        }
    }()

    lastBlockMaxt := int64(math.MinInt64)
    defer func() {
        if err := db.head.truncateWAL(lastBlockMaxt); err != nil {
            returnErr = errors.Join(returnErr, fmt.Errorf("WAL truncation in Compact defer: %w", err))
        }
    }()

    for {
        // Stop if shutting down.
        select {
        case <-db.stopc:
            return nil
        default:
        }

        if !db.head.compactable() {
            if !db.timeWhenCompactionDelayStarted.IsZero() {
                db.timeWhenCompactionDelayStarted = time.Time{}
            }
            break
        }

        if db.timeWhenCompactionDelayStarted.IsZero() {
            db.timeWhenCompactionDelayStarted = time.Now()
        }
        if db.waitingForCompactionDelay() {
            break
        }

        mint := db.head.MinTime()
        maxt := rangeForTimestamp(mint, db.head.chunkRange.Load())
        rh := NewRangeHeadWithIsolationDisabled(db.head, mint, maxt-1)

        db.head.WaitForAppendersOverlapping(rh.MaxTime())

        if err := db.compactHead(rh); err != nil {
            return fmt.Errorf("compact head: %w", err)
        }
        lastBlockMaxt = maxt
    }

    if err := db.head.truncateWAL(lastBlockMaxt); err != nil {
        return fmt.Errorf("WAL truncation in Compact: %w", err)
    }

    if lastBlockMaxt != math.MinInt64 {
        if err := db.compactOOOHead(ctx); err != nil {
            return fmt.Errorf("compact ooo head: %w", err)
        }
    }

    return db.compactBlocks()
}

Conceptually, Compact does three things:

  1. Compact the head into new blocks, in time windows derived from chunkRange, waiting for any overlapping appenders to finish.
  2. Truncate the WAL to the maximum time we know has been safely persisted as blocks, tracking that via lastBlockMaxt and a defer.
  3. Compact out-of-order data and older blocks via compactOOOHead and compactBlocks, which handle different invariants.
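The window arithmetic in step 1 hinges on rangeForTimestamp, which isn't shown above. Reconstructed from its usage (so treat the body as a sketch), it returns the exclusive upper bound of the fixed-width window containing a timestamp:

```go
package main

import "fmt"

// rangeForTimestamp returns the end of the width-sized window that t falls
// into: the smallest multiple of width strictly greater than t (for
// non-negative t). Compact uses it to cut the head into aligned blocks.
func rangeForTimestamp(t, width int64) int64 {
	return (t/width)*width + width
}

func main() {
	const width = 2 * 60 * 60 * 1000 // a 2h block range, in milliseconds
	fmt.Println(rangeForTimestamp(0, width))       // 7200000: end of the first window
	fmt.Println(rangeForTimestamp(7199999, width)) // 7200000: still the first window
	fmt.Println(rangeForTimestamp(7200000, width)) // 14400000: the next window
}
```

Aligning blocks to multiples of the chunk range is what lets later compactions merge neighbors without re-splitting data.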

Out-of-order samples and mmap safety

Prometheus supports out-of-order (OOO) ingestion via a separate WAL (WBL) and an OOOCompactionHead. That introduces a subtle requirement: queries must not observe chunks that are about to be garbage-collected while still mmap’d.

DB enforces this with a shared reference:

  • lastGarbageCollectedMmapRef (under mtx) tracks the last safe mmap ref up to which old chunks have been reclaimed.
  • The OOO head exposes a minimum safe reference and the last WBL file for compaction to respect.
  • When building an OOO-aware querier, head.oooIso.TrackReadAfter(lastGarbageCollectedMmapRef) ensures we don’t hand out readers pointing into freed memory.

Compaction and querying coordinate through that single monotonic reference, which is a simple but powerful way to keep cross-cutting safety constraints under control.
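The monotonic-reference trick generalizes well. Below is a minimal sketch of the pattern with names of my own choosing (Prometheus's actual ref type carries file and offset information, not a bare integer): GC advances a watermark, readers snapshot it, and the only invariant is that it never moves backwards.

```go
package main

import (
	"fmt"
	"sync"
)

// watermark is a monotonically increasing reference: anything at or below
// it may have been reclaimed, so a reader must only rely on data above the
// snapshot it took when it started.
type watermark struct {
	mu  sync.RWMutex
	ref uint64
}

// advance moves the watermark forward; attempts to move it backwards are
// ignored, which is what makes snapshots safe to act on.
func (w *watermark) advance(ref uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if ref > w.ref {
		w.ref = ref
	}
}

// snapshot returns the current safe boundary for a new reader.
func (w *watermark) snapshot() uint64 {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return w.ref
}

func main() {
	var w watermark
	w.advance(10)
	w.advance(7) // ignored: the boundary never retreats
	fmt.Println(w.snapshot())
}
```

One shared number replaces a web of per-chunk bookkeeping between the compactor and every open querier.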

Retention: time and size without data loss

Compaction creates new blocks; retention decides when to remove old ones. Deleting the wrong block is catastrophic, so retention logic is conservative and explicit.

Time-based retention is implemented in BeyondTimeRetention:

// BeyondTimeRetention returns those blocks which are beyond the time retention.
func BeyondTimeRetention(db *DB, blocks []*Block) (deletable map[ulid.ULID]struct{}) {
    retentionDuration := db.getRetentionDuration()
    if len(blocks) == 0 || retentionDuration == 0 {
        return deletable
    }

    deletable = make(map[ulid.ULID]struct{})
    for i, block := range blocks {
        if i > 0 && blocks[0].Meta().MaxTime-block.Meta().MaxTime >= retentionDuration {
            for _, b := range blocks[i:] {
                deletable[b.meta.ULID] = struct{}{}
            }
            db.metrics.timeRetentionCount.Inc()
            break
        }
    }
    return deletable
}

In words:

  • blocks is assumed to be sorted newest-first, so blocks[0] has the largest MaxTime.
  • Scan until we find a block whose MaxTime is at least retentionDuration older than the newest block's MaxTime.
  • That block and everything after it in the slice (everything at least that old) is marked deletable.
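A worked example of that boundary arithmetic, using hypothetical block times. The helper mirrors the loop in BeyondTimeRetention on plain MaxTime values instead of *Block:

```go
package main

import "fmt"

// beyondTimeRetention mirrors BeyondTimeRetention's loop on bare MaxTime
// values, assuming maxTimes is sorted newest-first.
func beyondTimeRetention(maxTimes []int64, retention int64) []int64 {
	if len(maxTimes) == 0 || retention == 0 {
		return nil
	}
	for i, maxt := range maxTimes {
		if i > 0 && maxTimes[0]-maxt >= retention {
			return maxTimes[i:] // this block and everything older
		}
	}
	return nil
}

func main() {
	// Hypothetical block MaxTimes (in hours, newest first), 15h retention:
	// 100-98 = 2 (keep), 100-90 = 10 (keep), 100-84 = 16 >= 15 (drop).
	fmt.Println(beyondTimeRetention([]int64{100, 98, 90, 84}, 15))
}
```

Note the i > 0 guard: the newest block is never deletable by time retention, no matter how the arithmetic works out.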

Size-based retention layers on top and includes the head/WAL footprint:

// BeyondSizeRetention returns those blocks which are beyond the size retention.
func BeyondSizeRetention(db *DB, blocks []*Block) (deletable map[ulid.ULID]struct{}) {
    if len(blocks) == 0 {
        return deletable
    }

    maxBytes, maxPercentage := db.getRetentionSettings()

    if maxPercentage > 0 {
        diskSize := db.fsSizeFunc(db.dir)
        if diskSize <= 0 {
            db.logger.Warn("Unable to retrieve filesystem size...", "dir", db.dir)
        } else {
            maxBytes = int64(float64(diskSize) * maxPercentage / 100)
        }
    }

    if maxBytes <= 0 {
        return deletable
    }

    deletable = make(map[ulid.ULID]struct{})

    // Start with Head+WAL size.
    blocksSize := db.Head().Size()
    for i, block := range blocks {
        blocksSize += block.Size()
        if blocksSize > maxBytes {
            for _, b := range blocks[i:] {
                deletable[b.meta.ULID] = struct{}{}
            }
            db.metrics.sizeRetentionCount.Inc()
            break
        }
    }
    return deletable
}

Two design details matter here for safe orchestration:

  • Retention settings are read via getRetentionDuration/getRetentionSettings, which are guarded by retentionMtx. ApplyConfig can update retention at runtime without data races.
  • Size retention explicitly includes Head().Size() and WAL size; otherwise, disk usage would appear lower than it really is, and retention would under-delete.
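Both details can be seen in a small sketch. The helper names are mine; the arithmetic follows the function above:

```go
package main

import "fmt"

// effectiveMaxBytes resolves a percentage-based limit against the actual
// filesystem size, falling back to the absolute byte limit when the
// percentage is unset or the disk size is unknown.
func effectiveMaxBytes(maxBytes, diskSize int64, maxPercentage float64) int64 {
	if maxPercentage > 0 && diskSize > 0 {
		return int64(float64(diskSize) * maxPercentage / 100)
	}
	return maxBytes
}

// firstOverLimit returns the index of the first block (newest-first) at
// which the running total, seeded with the head+WAL size, exceeds the
// limit, or -1 if everything fits. Blocks from that index on are deletable.
func firstOverLimit(headSize int64, blockSizes []int64, limit int64) int {
	total := headSize
	for i, s := range blockSizes {
		total += s
		if total > limit {
			return i
		}
	}
	return -1
}

func main() {
	limit := effectiveMaxBytes(0, 1000, 50) // 50% of a 1000-byte disk
	fmt.Println(limit)                      // 500
	// Head of 200 bytes plus blocks of 150, 150, 100 (newest first):
	// 200+150 = 350, +150 = 500 (not over), +100 = 600 -> index 2 is over.
	fmt.Println(firstOverLimit(200, []int64{150, 150, 100}, limit))
}
```

Seeding the total with the head size is the whole point: without it, the last two blocks in this example would both survive even though the disk is over budget.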

Crash-safe deletions via atomic rename

Marking blocks as deletable is only half of retention. The IO pattern used to remove them from disk determines how resilient the system is to crashes and restarts.

// deleteBlocks closes the block if loaded and deletes blocks from disk.
func (db *DB) deleteBlocks(blocks map[ulid.ULID]*Block) error {
    for ulid, block := range blocks {
        if block != nil {
            if err := block.Close(); err != nil {
                db.logger.Warn("Closing block failed", "err", err, "block", ulid)
            }
        }

        toDelete := filepath.Join(db.dir, ulid.String())
        switch _, err := os.Stat(toDelete); {
        case os.IsNotExist(err):
            continue
        case err != nil:
            return fmt.Errorf("stat dir %v: %w", toDelete, err)
        }

        // Replace atomically to avoid partial block when process would crash during deletion.
        tmpToDelete := filepath.Join(db.dir, fmt.Sprintf("%s%s", ulid, tmpForDeletionBlockDirSuffix))
        if err := fileutil.Replace(toDelete, tmpToDelete); err != nil {
            return fmt.Errorf("replace of obsolete block for deletion %s: %w", ulid, err)
        }
        if err := os.RemoveAll(tmpToDelete); err != nil {
            return fmt.Errorf("delete obsolete block %s: %w", ulid, err)
        }
        db.logger.Info("Deleting obsolete block", "block", ulid)
    }
    return nil
}

The pattern is:

  1. Close any in‑memory representation so no new readers latch onto the block.
  2. Stat the directory to handle the case where a previous run already deleted it.
  3. Atomically rename the directory to a temporary “for-deletion” name.
  4. Recursively delete the temporary directory.

If Prometheus crashes half‑way through, the worst case is a .tmp-for-deletion directory, which is safe to clean up on the next startup. Multi-step deletion becomes an atomic intent switch (rename) followed by garbage collection (remove-all).

  • Choosing blocks to delete. Naïve: “delete anything older than retention”. TSDB: time and size retention over ordered blocks plus compaction metadata.
  • Deleting on disk. Naïve: os.RemoveAll(blockDir). TSDB: fileutil.Replace (rename), then RemoveAll.
  • Crash during delete. Naïve: risk of partial or corrupted blocks. TSDB: idempotent cleanup of .tmp-for-deletion dirs.

Querying consistently under change

While compaction and retention reshape the store, Prometheus still has to serve queries that behave as if they’re talking to a single, stable database. The Querier method is where that illusion is assembled.

Composing head and blocks

A query over [mint, maxt] should see:

  • All on-disk blocks overlapping that time range.
  • The head (and OOO data) for any time that hasn’t yet been compacted.

DB.Querier puts that together as follows:

func (db *DB) Querier(mint, maxt int64) (_ storage.Querier, err error) {
    var blocks []BlockReader

    db.mtx.RLock()
    for _, b := range db.blocks {
        if b.OverlapsClosedInterval(mint, maxt) {
            blocks = append(blocks, b)
        }
    }
    db.mtx.RUnlock()

    blockQueriers := make([]storage.Querier, 0, len(blocks)+1)
    defer func() {
        if err != nil {
            for _, q := range blockQueriers {
                _ = q.Close()
            }
        }
    }()

    overlapsOOO := overlapsClosedInterval(mint, maxt, db.head.MinOOOTime(), db.head.MaxOOOTime())
    var headQuerier storage.Querier
    inoMint := max(db.head.MinTime(), mint)

    if maxt >= db.head.MinTime() || overlapsOOO {
        rh := NewRangeHead(db.head, mint, maxt)
        headQuerier, err = db.blockQuerierFunc(rh, mint, maxt)
        if err != nil {
            return nil, fmt.Errorf("open block querier for head %s: %w", rh, err)
        }

        shouldClose, getNew, newMint := db.head.IsQuerierCollidingWithTruncation(mint, maxt)
        if shouldClose {
            if err := headQuerier.Close(); err != nil {
                return nil, fmt.Errorf("closing head block querier %s: %w", rh, err)
            }
            headQuerier = nil
        }
        if getNew {
            rh := NewRangeHead(db.head, newMint, maxt)
            headQuerier, err = db.blockQuerierFunc(rh, newMint, maxt)
            if err != nil {
                return nil, fmt.Errorf("open block querier for head while getting new querier %s: %w", rh, err)
            }
            inoMint = newMint
        }
    }

    if overlapsOOO {
        isoState := db.head.oooIso.TrackReadAfter(db.lastGarbageCollectedMmapRef)
        headQuerier = NewHeadAndOOOQuerier(inoMint, mint, maxt, db.head, isoState, headQuerier)
    }

    if headQuerier != nil {
        blockQueriers = append(blockQueriers, headQuerier)
    }

    for _, b := range blocks {
        q, err := db.blockQuerierFunc(b, mint, maxt)
        if err != nil {
            return nil, fmt.Errorf("open querier for block %s: %w", b, err)
        }
        blockQueriers = append(blockQueriers, q)
    }

    return storage.NewMergeQuerier(blockQueriers, nil, storage.ChainedSeriesMerge), nil
}

The coordination work here is subtle:

  • Block selection under a read lock. The iteration over db.blocks happens under mtx.RLock(), so concurrent reloadBlocks calls can’t change the list mid‑selection.
  • Head truncation awareness. IsQuerierCollidingWithTruncation decides whether the head querier might collide with future WAL truncation and, if needed, re-creates a safer querier with an updated mint.
  • OOO wrapping only when needed. If the query overlaps OOO time ranges, NewHeadAndOOOQuerier wraps the head querier together with an isolation state derived from lastGarbageCollectedMmapRef.
  • Merging via composition. All individual queriers are combined into a single MergeQuerier, which implements the same storage.Querier interface as any single backend.

From an API design perspective, this is a clean use of the decorator pattern: instead of bloating the core Head or Block types, cross-cutting concerns like OOO isolation and truncation safety are implemented by wrapping existing interfaces.
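The shape of that decorator composition fits in a toy example. This is deliberately not the storage.Querier API, just its silhouette: one small interface, a plain backend, a wrapper that narrows visibility, and a merger that unifies them:

```go
package main

import "fmt"

// Querier is a stand-in for storage.Querier: one method, many implementations.
type Querier interface {
	Select(mint, maxt int64) []string
}

// blockQuerier is a plain backend returning whatever it holds.
type blockQuerier struct{ series []string }

func (b blockQuerier) Select(mint, maxt int64) []string { return b.series }

// clampQuerier decorates another Querier, narrowing the visible time range.
// This is the same shape as RangeHead or NewHeadAndOOOQuerier wrapping the
// head: a cross-cutting concern layered on, not baked into the backend.
type clampQuerier struct {
	inner      Querier
	mint, maxt int64
}

func (c clampQuerier) Select(mint, maxt int64) []string {
	if mint < c.mint {
		mint = c.mint
	}
	if maxt > c.maxt {
		maxt = c.maxt
	}
	return c.inner.Select(mint, maxt)
}

// mergeQuerier composes several Queriers behind the same single interface.
type mergeQuerier struct{ parts []Querier }

func (m mergeQuerier) Select(mint, maxt int64) []string {
	var out []string
	for _, p := range m.parts {
		out = append(out, p.Select(mint, maxt)...)
	}
	return out
}

func main() {
	head := clampQuerier{inner: blockQuerier{[]string{"head"}}, mint: 100, maxt: 200}
	block := blockQuerier{[]string{"block"}}
	q := mergeQuerier{parts: []Querier{head, block}}
	fmt.Println(q.Select(0, 300)) // [head block]
}
```

The caller of q.Select never learns how many layers sit beneath it, which is exactly the illusion DB.Querier maintains.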

Operational sanity: metrics and observability

None of this orchestration is useful if operators can’t see whether it’s actually working. DB exposes Prometheus metrics that align directly with the mechanisms we’ve just walked through.

A few examples:

  • prometheus_tsdb_compactions_failed_total — incremented inside Compact whenever a non‑canceled error occurs. This tells you if the GC cycle is healthy.
  • prometheus_tsdb_storage_blocks_bytes — updated in reloadBlocks by summing block.Size(). This is your early warning for disk pressure.
  • prometheus_tsdb_lowest_timestamp — a gauge reporting the minimum time across blocks and head, effectively your real retention horizon.
  • prometheus_tsdb_reloads_failures_total — incremented whenever reloadBlocks fails, surfacing on-disk or filesystem issues.

These are wired exactly where decisions are made — compactions, reloads, block accounting — so the metrics reflect the actual control flow, not just high-level guesses. Alert rules can then be expressed in terms of those mechanisms (for example, a non‑zero rate of compaction failures over a few minutes).
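As a concrete illustration, an alert on the compaction-failure counter might look like the rule below. The threshold, durations, and labels are placeholders of my own, not recommendations from the Prometheus docs:

```yaml
groups:
  - name: tsdb
    rules:
      - alert: TSDBCompactionsFailing
        # Any compaction failures observed over the last 15 minutes.
        expr: increase(prometheus_tsdb_compactions_failed_total[15m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "TSDB compactions are failing on {{ $labels.instance }}"
```

Because the counter is incremented exactly where Compact fails, the alert maps one-to-one onto the mechanism it watches.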

What we should steal for our own systems

Stepping back, tsdb/db.go is not just “how Prometheus stores metrics”. It’s a blueprint for coordinating a complex, stateful subsystem in a way that remains legible over time. A few patterns are worth reusing directly.

1. Treat orchestration as a first-class responsibility

The TSDB’s DB has a large surface area, but its job is narrow: orchestrate lifecycles of focused components (Head, Block, Compactor, WALs). That works because:

  • Each sub-component owns its core logic (WAL, compaction algorithms, block format).
  • The orchestrator mainly sequences operations and enforces invariants between them.
  • Strategy hooks like NewCompactorFunc, BlockQuerierFunc, and FsSizeFunc keep it from being tightly coupled to specific implementations.

2. Design compaction like garbage collection

Whether you’re compacting events, logs, or metrics, a GC-style approach scales:

  • Define clear time windows and invariants for compaction (for example, only compact ranges that are sufficiently behind “now”).
  • Separate “decide what to compact” from “apply compaction” for testability.
  • Guard compaction and deletion behind a single mutex so they never interleave in unsafe ways.
  • Explicitly tie WAL/log truncation to successfully persisted ranges.

3. Make deletions crash-resilient and idempotent

Closing, atomically renaming, then recursively deleting block directories turns a dangerous multi-step operation into an idempotent, crash‑safe sequence. Any system deleting hierarchical or multi‑file artifacts benefits from the same pattern.

4. Build query isolation through composition

Instead of embedding every concern into a single data structure, Prometheus layers behavior:

  • Range views like RangeHead limit time visibility.
  • Wrappers like NewHeadAndOOOQuerier add OOO and isolation semantics on top of existing queriers.
  • MergeQuerier unifies multiple backends behind one interface.

This keeps the orchestrator in control of how components are combined, without forcing each component to know about every mode of operation.

5. Expose the health of each mechanism

Finally, metrics like prometheus_tsdb_compactions_failed_total, prometheus_tsdb_storage_blocks_bytes, and prometheus_tsdb_reloads_failures_total are not decoration; they’re part of the control loop.

  • Add counters for attempts and failures of each background job.
  • Add gauges for capacity: disk usage, time window covered, head size.
  • Document concrete alert conditions directly linked to those metrics.

The primary lesson from tsdb/db.go is that complex, stateful systems stay sane when orchestration is explicit, conservative, and observable. Clear ownership of responsibilities, carefully scoped locks, crash-safe IO patterns, and composable abstractions are what keep Prometheus’ TSDB from collapsing under its own weight — and they’re exactly the patterns we can apply to our own architectures.

Full Source Code

Here's the full source code of the file that inspired this article.
Read on GitHub

Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 16+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss anything.
