The Facade That Makes Pandas Feel Simple

By Mahmoud Zalt

Ever wondered why pandas feels so straightforward despite the complexity under the hood? This article dives into the facade that keeps it simple.

We’re examining how pandas manages to feel simple while sitting on top of a very complex engine. Pandas is the de‑facto data wrangling library in Python, widely used for analytics, ETL, and experimentation. At the center of its design is NDFrame in pandas/core/generic.py, the base class behind both Series and DataFrame.

I’m Mahmoud Zalt, an AI software engineer. We’ll walk through NDFrame as if we’re pair‑programming with the pandas core team, focusing on what makes it such an effective abstraction.

Our guiding idea: NDFrame is a facade—one class that hides enormous complexity behind a stable, friendly surface. We’ll see how that facade is built around axes and alignment, how it juggles Copy‑on‑Write and inplace, how IO hangs off the same surface, and how performance shaping shows up in its API.

NDFrame as the central facade

To understand the rest of pandas, we first need a mental model of where NDFrame sits in the architecture.

pandas/
  core/
    internals/        # BlockManager (storage backend)
    indexes/          # Index, MultiIndex, DatetimeIndex
    window/           # Rolling, Expanding, EWM
    generic.py  <---- # NDFrame base class
    series.py   ----> # Series(NDFrame)
    frame.py    ----> # DataFrame(NDFrame)
  io/
    formats/          # ExcelFormatter, DataFrameFormatter
    pickle.py         # to_pickle
    json/             # to_json
    sql.py            # to_sql

NDFrame
  |_ _mgr (Manager/BlockManager)
  |_ index / columns (axes)
  |_ attrs / flags
  |_ IO methods (to_csv, to_json, ...)
  |_ numeric & stat ops (sum, mean, ...)
  |_ alignment & indexing helpers
NDFrame as the core data-model facade between user APIs and storage internals.

Every Series and DataFrame instance is an NDFrame. Both inherit a large share of their behavior from this base class and layer shape‑specific logic on top, such as construction, display, and how axes are named.

At the core is a clear split:

  • NDFrame knows about labels, axes, metadata, and high‑level operations.
  • Manager / BlockManager owns the actual arrays and low‑level algorithms.

The constructor shows this boundary explicitly:

class NDFrame(PandasObject, indexing.IndexingMixin):
    _internal_names: list[str] = [
        "_mgr",
        "_cache",
        "_name",
        "_metadata",
        "_flags",
    ]
    ...

    def __init__(self, data: Manager) -> None:
        object.__setattr__(self, "_mgr", data)
        object.__setattr__(self, "_attrs", {})
        object.__setattr__(self, "_flags", Flags(self, allows_duplicate_labels=True))

Think of NDFrame as the spreadsheet “sheet” and _mgr as the storage engine. The sheet knows rows, columns, labels, and operations like reindex or fillna. The engine knows how to slice, reblock, and compute over memory.
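The shared base class is visible from the outside. The check below is a minimal sketch; it assumes NDFrame can be imported from pandas.core.generic, which is a private module path and not part of the supported public API.

```python
import pandas as pd
from pandas.core.generic import NDFrame  # private path; may move between releases

series = pd.Series([1, 2, 3], name="x")
frame = pd.DataFrame({"x": [1, 2, 3]})

# Both user-facing containers are NDFrame instances.
both_are_ndframes = isinstance(series, NDFrame) and isinstance(frame, NDFrame)

# to_csv is defined in the NDFrame class body and is available on both subclasses.
to_csv_on_base = "to_csv" in vars(NDFrame)
to_csv_everywhere = hasattr(series, "to_csv") and hasattr(frame, "to_csv")
```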

Axes and alignment: the real superpower

Once we see NDFrame as a facade, the next idea is that everything revolves around axes: index and columns. Most of pandas’ “it just works” behavior comes from axis handling and alignment.

Axis resolution: user names vs internal numbering

NDFrame accepts axes in all the ways users expect: 0, 1, "index", "columns", "rows". Internally, it needs a consistent representation and a mapping to storage layout.

_AXIS_ORDERS: list[Literal["index", "columns"]]
_AXIS_TO_AXIS_NUMBER: dict[Axis, AxisInt] = {0: 0, "index": 0, "rows": 0}
_info_axis_number: int
_info_axis_name: Literal["index", "columns"]
_AXIS_LEN: int

@final
@classmethod
def _get_axis_number(cls, axis: Axis) -> AxisInt:
    try:
        return cls._AXIS_TO_AXIS_NUMBER[axis]
    except KeyError as err:
        raise ValueError(
            f"No axis named {axis} for object type {cls.__name__}"
        ) from err

@final
def _get_axis(self, axis: Axis) -> Index:
    axis_number = self._get_axis_number(axis)
    assert axis_number in {0, 1}
    return self.index if axis_number == 0 else self.columns

@final
@classmethod
def _get_block_manager_axis(cls, axis: Axis) -> AxisInt:
    """Map the axis to the block_manager axis."""
    axis = cls._get_axis_number(axis)
    ndim = cls._AXIS_LEN
    if ndim == 2:
        # i.e. DataFrame
        return 1 - axis
    return axis

There are three axis spaces in play:

  • User axis: what you pass (0/1, "index", "columns").
  • Logical axis: NDFrame’s view (index, columns).
  • BlockManager axis: how storage is laid out (flipped for a 2D DataFrame, as the 1 - axis branch above shows).

This indirection lets pandas keep a stable API even if storage changes (for example, from column blocks to column‑per‑array backends). The rest of NDFrame calls _get_axis and _get_block_manager_axis instead of hard‑coding 0/1.
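From the user side, the axis resolution described above shows up as interchangeable spellings and a uniform error for unknown names. This is a small sketch using only public calls; the axis name "depth" is a deliberately invalid value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# Integer and string spellings resolve to the same logical axis.
by_number = df.sum(axis=0)
by_name = df.sum(axis="index")

# "columns" is the same as axis=1: one total per row.
row_totals = df.sum(axis="columns")

# Unknown axis names are rejected with a ValueError.
message = ""
try:
    df.sum(axis="depth")
except ValueError as err:
    message = str(err)
```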

Alignment as a primitive operation

Alignment is what makes operations like df1 + df2 behave by label, not by position. Instead of “position 0 plus position 0”, NDFrame thinks “row label A plus row label A, column label X plus column label X”.
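A two‑Series example makes the label‑based behavior concrete. The values and labels here are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "a", "d"])

# Values are matched by label, not by position.
# Labels present on only one side produce NaN.
total = s1 + s2
```

Here "a" pairs 1 with 20 and "c" pairs 3 with 10, while "b" and "d" have no partner and come back as missing.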

reindex is the public face of this idea:

def reindex(
    self,
    labels=None,
    *,
    index=None,
    columns=None,
    axis: Axis | None = None,
    method: ReindexMethod | None = None,
    copy: bool | lib.NoDefault = lib.no_default,
    level: Level | None = None,
    fill_value: Scalar | None = np.nan,
    limit: int | None = None,
    tolerance=None,
) -> Self:
    ...
    axes: dict[Literal["index", "columns"], Any] = {
        "index": index,
        "columns": columns,
    }
    method = clean_reindex_fill_method(method)

    if all(
        self._get_axis(axis_name).identical(ax)
        for axis_name, ax in axes.items()
        if ax is not None
    ):
        return self.copy(deep=False)

    if self._needs_reindex_multi(axes, method, level):
        return self._reindex_multi(axes, fill_value)

    return self._reindex_axes(
        axes, level, limit, tolerance, method, fill_value
    ).__finalize__(self, method="reindex")
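Seen from the caller's side, reindex conforms an object to a requested set of labels. The sketch below uses only the public method; the label names are arbitrary.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Same labels: the values come back unchanged in a new object.
same = s.reindex(["a", "b", "c"])

# New order plus an unknown label, filled with an explicit value.
reordered = s.reindex(["c", "a", "z"], fill_value=0)

# On a DataFrame, an axis keyword selects which labels to conform.
df = pd.DataFrame({"a": [1], "b": [2]})
widened = df.reindex(columns=["b", "missing"])
```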

The more interesting use of alignment is in internal helpers like _where, which powers where and mask:

@final
def _where(
    self,
    cond,
    other=lib.no_default,
    *,
    inplace: bool = False,
    axis: Axis | None = None,
    level=None,
) -> Self:
    ...
    cond = common.apply_if_callable(cond, self)
    if isinstance(cond, NDFrame):
        if cond.ndim == 1 and self.ndim == 2:
            cond = cond._constructor_expanddim(
                dict.fromkeys(range(len(self.columns)), cond),
                copy=False,
            )
            cond.columns = self.columns
        cond = cond.align(self, join="right")[0]
    else:
        ...  # coerce to array and wrap

Instead of assuming cond already matches the frame, NDFrame:

  1. Expands 1D conditions to 2D when necessary.
  2. Uses align with join="right" so the condition is reindexed to the frame's own labels.
  3. Then validates and applies the boolean condition block‑wise.

The same alignment machinery underpins arithmetic with another frame, where/mask, and many axis‑aware operations.
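The effect of aligning the condition is easy to observe with a condition that is out of order and incomplete. This is a sketch with made‑up labels; in this non‑inplace call, a label missing from the condition behaves as if the condition were not satisfied.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# The condition is in a different order and has no entry for "b".
cond = pd.Series([True, True], index=["c", "a"])

# where() aligns cond to s by label before applying it.
kept = s.where(cond)
```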

Copy-on-Write and inplace: the data‑integrity tightrope

Modern pandas leans on Copy‑on‑Write (CoW): shallow copies share data until one is mutated, at which point a copy is made. Users get cheap views without uncontrolled mutation.
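The core idea can be sketched in plain Python. This is a toy model, not how pandas implements it: the names _Buffer and CowList are hypothetical, and the owner counter stands in for whatever reference tracking a real implementation uses.

```python
class _Buffer:
    """Hypothetical shared storage with a count of objects pointing at it."""

    def __init__(self, values):
        self.values = list(values)
        self.owners = 1


class CowList:
    """Hypothetical copy-on-write list: views are cheap, writes copy if shared."""

    def __init__(self, values):
        self._buf = _Buffer(values)

    def view(self):
        # Share the buffer instead of copying it.
        other = CowList.__new__(CowList)
        other._buf = self._buf
        self._buf.owners += 1
        return other

    def __setitem__(self, position, value):
        # Copy only when someone else still points at the same buffer.
        if self._buf.owners > 1:
            self._buf.owners -= 1
            self._buf = _Buffer(self._buf.values)
        self._buf.values[position] = value

    def to_list(self):
        return list(self._buf.values)


original = CowList([1, 2, 3])
shallow = original.view()
shared_before_write = original._buf is shallow._buf

shallow[0] = 99  # triggers the copy; original is untouched
```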

At the same time, pandas has a long history of inplace=True methods like fillna, drop, and where. Making these semantics agree with CoW falls on NDFrame.

_update_inplace: a tiny but central hook

The simplest “in‑place” pattern in NDFrame is: compute a new frame, then swap out _mgr on self:

@final
def _update_inplace(self, result) -> None:
    """Replace self internals with result."""
    # NOTE: This does *not* call __finalize__
    self._mgr = result._mgr

Many mutating methods follow this shape: compute a functional result, then either return it or apply it to self when inplace=True. The code even hints at a central _maybe_apply_inplace helper to enforce consistent behavior across all such methods.
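The "compute functionally, then swap" shape looks roughly like this in isolation. Table and drop_none are hypothetical names for this sketch; returning self from the inplace branch mirrors the fillna excerpt shown in this article rather than any particular released pandas version.

```python
class Table:
    """Hypothetical facade whose storage lives in one private attribute."""

    def __init__(self, rows):
        self._mgr = list(rows)

    def _update_inplace(self, result):
        # Swap in the result's storage; the caller's object identity is kept.
        self._mgr = result._mgr

    def drop_none(self, inplace=False):
        # Always compute a fresh result first.
        result = Table(row for row in self._mgr if row is not None)
        if inplace:
            self._update_inplace(result)
            return self
        return result


table = Table([1, None, 2])
cleaned = table.drop_none()                  # table is unchanged
rows_after_functional_call = list(table._mgr)
returned = table.drop_none(inplace=True)     # table now holds the cleaned rows
```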

fillna: a case study in complexity

fillna is one of pandas’ most‑used APIs and also one of the most complex in NDFrame. It has to handle:

  • Multiple input types: scalar, dict, Series, DataFrame.
  • Axis choices (row‑wise vs column‑wise).
  • inplace vs non‑inplace semantics.
  • Both 1D (Series) and 2D (DataFrame) shapes.

Here is a condensed version of the method:

@final
def fillna(
    self,
    value: Hashable | Mapping | Series | DataFrame,
    *,
    axis: Axis | None = None,
    inplace: bool = False,
    limit: int | None = None,
) -> Self:
    inplace = validate_bool_kwarg(inplace, "inplace")

    if isinstance(value, (list, tuple)):
        raise TypeError(
            '"value" parameter must be a scalar or dict, '
            f'but you passed a "{type(value).__name__}"'
        )

    if axis is None:
        axis = 0
    axis = self._get_axis_number(axis)

    if self.ndim == 1:
        ...  # Series-specific path

    elif isinstance(value, (dict, ABCSeries)):
        result = self if inplace else self.copy(deep=False)
        if axis == 1:
            ...  # column-wise dict fill
        else:
            for k, v in value.items():
                if k not in result:
                    continue
                res_k = result[k].fillna(v, limit=limit)
                ...  # assign back, respecting inplace
        return result

    elif not is_list_like(value):
        if axis == 1:
            result = self.T.fillna(value=value, limit=limit).T
            new_data = result._mgr
        else:
            new_data = self._mgr.fillna(value=value, limit=limit, inplace=inplace)

    elif isinstance(value, ABCDataFrame) and self.ndim == 2:
        new_data = self.where(self.notna(), value)._mgr
    else:
        raise ValueError(f"invalid fill value with a {type(value)}")

    result = self._constructor_from_mgr(new_data, axes=new_data.axes)
    if inplace:
        self._update_inplace(result)
        return self
    return result.__finalize__(self, method="fillna")

Instead of one linear flow, fillna branches by dimensionality, value type, axis, and inplace. The complexity is real: this is a heavily used public API that needs to preserve long‑standing semantics while working with CoW and multiple shapes.
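Some of those branches are visible through the public method. The sketch below exercises the scalar, dict, and rejected‑list paths on a small made‑up frame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None], "b": [None, 2.0]})

# Scalar: every missing cell gets the same value.
filled_scalar = df.fillna(0.0)

# Dict: only the listed columns are filled; "b" keeps its missing value.
filled_by_col = df.fillna({"a": -1.0})

# List: rejected up front with a TypeError.
list_error = ""
try:
    df.fillna([0.0, 0.0])
except TypeError as err:
    list_error = str(err)
```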

The static analysis report behind this walkthrough flags this as a code smell: high cyclomatic and cognitive complexity, plus mixed responsibilities. The suggested direction is to split it into internal helpers such as _fillna_series and _fillna_frame, keeping the public method thin and behavior constrained inside smaller, easier‑to‑test functions.

IO and mixins: when the facade gets too wide

NDFrame doesn’t just cover core data operations; it also exposes the high‑level IO APIs most users reach for:

  • to_csv, to_json, to_excel, to_latex
  • to_hdf, to_sql, to_pickle, to_clipboard
  • to_xarray and others

Conceptually, they all mean “serialize this NDFrame somewhere”. Implementation‑wise, each one delegates into a specialized IO module, but they are all presented as methods on the same facade.

to_csv as a thin delegation layer

to_csv is a good example of how NDFrame keeps IO logic thin at the facade level:

@final
def to_csv(
    self,
    path_or_buf: FilePath | WriteBuffer[bytes] | WriteBuffer[str] | None = None,
    *,
    sep: str = ",",
    na_rep: str = "",
    float_format: str | Callable | None = None,
    columns: Sequence[Hashable] | None = None,
    header: bool | list[str] = True,
    index: bool = True,
    index_label: IndexLabel | None = None,
    mode: str = "w",
    encoding: str | None = None,
    compression: CompressionOptions = "infer",
    quoting: int | None = None,
    quotechar: str = '"',
    lineterminator: str | None = None,
    chunksize: int | None = None,
    date_format: str | None = None,
    doublequote: bool = True,
    escapechar: str | None = None,
    decimal: str = ".",
    errors: OpenFileErrors = "strict",
    storage_options: StorageOptions | None = None,
) -> str | None:
    df = self if isinstance(self, ABCDataFrame) else self.to_frame()

    formatter = DataFrameFormatter(
        frame=df,
        header=header,
        index=index,
        na_rep=na_rep,
        float_format=float_format,
        decimal=decimal,
    )

    return DataFrameRenderer(formatter).to_csv(
        path_or_buf,
        lineterminator=lineterminator,
        sep=sep,
        encoding=encoding,
        errors=errors,
        compression=compression,
        quoting=quoting,
        columns=columns,
        index_label=index_label,
        mode=mode,
        chunksize=chunksize,
        quotechar=quotechar,
        date_format=date_format,
        doublequote=doublequote,
        escapechar=escapechar,
        storage_options=storage_options,
    )

The method normalizes Series vs DataFrame, constructs a formatter, and defers everything else to DataFrameRenderer. The facade stays thin; dedicated IO code handles the details.
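Both normalizations are observable from the caller's side: a Series and a DataFrame go through the same method, and omitting the target returns the text. The data here is illustrative; splitlines() is used so the comparison does not depend on the platform's line terminator.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, None]})

# No path or buffer: the CSV text is returned as a string.
text = df.to_csv(index=False, na_rep="NA")
frame_lines = text.splitlines()

# A Series goes through the same method; its name becomes the header.
series_lines = pd.Series([1, 2], name="x").to_csv(index=False).splitlines()

# With a buffer, the method writes into it and returns None.
buffer = io.StringIO()
returned = df.to_csv(buffer, index=False)
```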

The cost is that NDFrame now carries dozens of methods whose primary responsibility is IO, not data modeling. That’s where the “monolithic NDFrame” smell comes from in the report. The proposed remedy is to extract them to an IOOpsMixin, making the class composition explicit:

--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -200,6 +200,8 @@
-class NDFrame(PandasObject, indexing.IndexingMixin):
+from pandas.core.mixins import IOOpsMixin
+
+class NDFrame(PandasObject, indexing.IndexingMixin, IOOpsMixin):
@@
-    @final
-    def to_excel(...):
-        ...
-
-    @final
-    def to_json(...):
-        ...
-
-    # similarly move to_hdf, to_sql, to_pickle, to_clipboard, to_xarray,
-    # to_latex, to_csv into IOOpsMixin
+    # IO methods are now mixed in from IOOpsMixin to keep NDFrame lean.

From a user’s perspective, nothing changes: df.to_csv() still exists. For maintainers, IO responsibilities are now separated from the core data model.
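Outside pandas, the same split can be sketched in a few lines. Everything here is hypothetical, including the IOOpsMixin body, MiniFrame, and the two‑method contract (_column_names and _rows) the mixin relies on; it shows the shape of the idea, not the proposed pandas change.

```python
class IOOpsMixin:
    """Hypothetical mixin: serialization built on a small host contract."""

    def to_csv_text(self, sep=","):
        # Relies only on _column_names() and _rows() from the host class.
        header = sep.join(self._column_names())
        lines = [sep.join(str(value) for value in row) for row in self._rows()]
        return "\n".join([header, *lines])


class MiniFrame(IOOpsMixin):
    """Hypothetical data model: knows its data, not how to serialize it."""

    def __init__(self, data):
        self._data = dict(data)

    def _column_names(self):
        return list(self._data)

    def _rows(self):
        return list(zip(*self._data.values()))


csv_text = MiniFrame({"a": [1, 2], "b": [3, 4]}).to_csv_text()
```

Callers still write frame.to_csv_text(), while the serialization code can be read and tested without the data model.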

Performance and scale: when clean APIs meet big data

NDFrame also serves as the point where performance considerations surface in the API:

  • Vectorized operations via NumPy and extension arrays.
  • Blockwise algorithms implemented in BlockManager.
  • Copy‑on‑Write to avoid unnecessary data copies.

The static analysis highlights hot paths like reductions (sum, mean), alignment (reindex, where), and large‑frame IO. NDFrame consistently pushes work down to the manager and array level to avoid Python loops.

Reductions via a generic helper

Several statistical methods—mean, median, min, max, skew, kurt—share the same structure. Rather than duplicate logic, NDFrame centralizes it in _stat_function:

@final
def _stat_function(
    self,
    name: str,
    func,
    axis: Axis | None = 0,
    skipna: bool = True,
    numeric_only: bool = False,
    **kwargs,
):
    assert name in ["median", "mean", "min", "max", "kurt", "skew"], name
    nv.validate_func(name, (), kwargs)
    validate_bool_kwarg(skipna, "skipna", none_allowed=False)

    return self._reduce(
        func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
    )

The public methods become thin wrappers:

def mean(
    self,
    *,
    axis: Axis | None = 0,
    skipna: bool = True,
    numeric_only: bool = False,
    **kwargs,
) -> Series | float:
    return self._stat_function(
        "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
    )

_reduce then delegates to the manager, which performs blockwise operations over homogeneous chunks. The facade layer validates arguments and names the operation; the storage layer actually computes.
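The arguments that the shared helper validates and forwards are the ones callers tune. A short sketch with made‑up data:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, None], "y": [10.0, 20.0, 30.0]})

# Default: reduce down each column, skipping missing values.
col_means = df.mean()

# skipna=False lets a missing value propagate into the result.
strict_means = df.mean(skipna=False)

# axis="columns" reduces across each row instead.
row_means = df.mean(axis="columns")
```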

Observability: treating NDFrame methods as units of work

The report suggests concrete metrics for production pipelines that lean heavily on pandas:

  • ndframe_op_duration_seconds{op_name, ndim} – time per high‑level op (for example, reindex, fillna, to_csv).
  • ndframe_memory_bytes – approximate memory footprint before/after key operations.
  • ndframe_io_bytes{op} – bytes written for IO‑heavy calls like to_csv, to_json, to_sql.

If you treat each NDFrame method as a unit of work, these metrics quickly show where your pipelines spend time and memory, and where accidental copies or misaligned operations hurt you.
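One way to collect the first two metrics is a small context manager around each call. This is a sketch under stated assumptions: METRICS is a plain list standing in for a real metrics client, track_op is a hypothetical helper, and the memory figure is the footprint of the input object as reported by memory_usage(deep=True).

```python
import time
from contextlib import contextmanager

import pandas as pd

METRICS = []  # stand-in for a real metrics client


@contextmanager
def track_op(op_name, obj):
    """Record duration and input memory footprint for one high-level op."""
    usage = obj.memory_usage(deep=True)
    # Series.memory_usage returns an int; DataFrame.memory_usage returns a Series.
    memory_bytes = int(usage if obj.ndim == 1 else usage.sum())
    start = time.perf_counter()
    try:
        yield
    finally:
        METRICS.append(
            {
                "op_name": op_name,
                "ndim": obj.ndim,
                "ndframe_op_duration_seconds": time.perf_counter() - start,
                "ndframe_memory_bytes": memory_bytes,
            }
        )


df = pd.DataFrame({"a": [1.0, None, 3.0]})
with track_op("fillna", df):
    filled = df.fillna(0.0)
```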

Lessons you can apply today

Spending time with pandas/core/generic.py is like reading a live case study in facade design for data libraries. The primary lesson is that a single, carefully designed facade can hide huge internal complexity while still scaling in features and performance.

1. Put a deliberate facade in front of complexity

NDFrame gives users one coherent object with natural methods: reindex, fillna, to_csv, rolling, resample. Under the hood it:

  • Delegates storage to Manager/BlockManager.
  • Delegates IO to pandas.io.* modules.
  • Delegates windowing to Rolling, Expanding, and ExponentialMovingWindow.

In your own systems, identify the single type most users should touch, then push everything else behind it.

2. Make alignment and axis semantics explicit

With labeled or multidimensional data, don’t scatter axis logic:

  • Provide helpers like _get_axis_number, _get_axis, and _get_block_manager_axis.
  • Expose alignment operations (like align) and reuse them consistently.
  • Centralize ambiguous semantics (label vs level, index vs columns) in a small set of helpers.

3. Centralize cross‑cutting behavior: CoW, inplace, metadata

NDFrame depends heavily on three cross‑cutting concerns:

  • Copy‑on‑Write: when data is actually copied.
  • inplace semantics: how modifier methods behave, especially under CoW.
  • Metadata propagation: via attrs, flags, and __finalize__.

Instead of open‑coding these behaviors in every method, NDFrame uses hooks like _update_inplace, _check_copy_deprecation, and __finalize__. If you’re evolving a large API, investing in these central hooks early pays off.

4. Split large methods by shape and type

Methods like fillna and where naturally accumulate branching for different shapes and input types. The report’s proposed refactor—separate helpers for series vs data frame paths—is a pattern worth using: keep the public signature stable, and dispatch immediately to small, specialized helpers.
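The dispatch shape can be sketched without pandas. All names here are hypothetical (fill_missing, _fill_column, _fill_frame), with a list standing in for a Series and a dict of lists standing in for a DataFrame.

```python
def fill_missing(obj, value):
    """Public entry point: validate once, then dispatch on shape."""
    if isinstance(value, (list, tuple)):
        raise TypeError("value must be a scalar or dict")
    if isinstance(obj, dict):
        return _fill_frame(obj, value)
    if isinstance(value, dict):
        raise TypeError("dict values are only supported for frames in this sketch")
    return _fill_column(obj, value)


def _fill_column(column, value):
    # 1D path: replace None cells with a single scalar.
    return [value if cell is None else cell for cell in column]


def _fill_frame(frame, value):
    # 2D path: a scalar applies to every column, a dict only to listed ones.
    per_column = value if isinstance(value, dict) else dict.fromkeys(frame, value)
    return {
        name: _fill_column(col, per_column[name]) if name in per_column else list(col)
        for name, col in frame.items()
    }


filled_column = fill_missing([1, None], 0)
filled_frame = fill_missing({"a": [None, 1], "b": [None]}, {"a": 9})
```

Each helper can be tested on its own, and the public function stays short enough to read in one pass.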

5. Use mixins when a class starts to sprawl

When a core class starts to host IO, formatting, windowing, and data‑model behavior, you’re approaching “god class” territory. NDFrame mitigates this with indexing mixins today, and the report suggests going further with an IOOpsMixin.

In your own code, consider mixins for IO (serialization/deserialization), visualization or formatting, and domain‑specific utilities. Callers still see one facade; maintainers see a set of focused components.


NDFrame is the beating heart of pandas. It’s large and dense, but it’s also a concise demonstration of how a well‑designed facade can make a massive codebase feel approachable from the outside. The file shows how to separate user‑friendly semantics from storage, centralize axis and alignment logic, reconcile CoW with inplace, and keep IO on a short leash.

Next time you call df.to_csv() or df.fillna(...), there is a lot of choreography happening just beneath that friendly surface. Understanding how NDFrame pulls this off gives you concrete patterns you can apply to your own data‑heavy systems.

Thanks for reading! I hope this was useful. If you have questions or thoughts, feel free to reach out.

Content Creation Process: This article was generated via a semi-automated workflow using AI tools. I prepared the strategic framework, including specific prompts and data sources. From there, the automation system conducted the research, analysis, and writing. The content passed through automated verification steps before being finalized and published without manual intervention.

Mahmoud Zalt

About the Author

I’m Zalt, a technologist with 16+ years of experience, passionate about designing and building AI systems that move us closer to a world where machines handle everything and humans reclaim wonder.

Let's connect if you're working on interesting AI projects, looking for technical advice or want to discuss anything.