1. What Problem Are We Solving?
We follow Li et al. (arXiv:2010.08895), referred to throughout as the FNO paper. The central problem is this: traditional PDE solvers (finite elements, finite differences, spectral methods) are accurate but slow. Every time you change a parameter — a boundary condition, an initial condition, a material property — you must re-solve from scratch. What if a neural network could learn the mapping from parameters to solutions, and then predict solutions instantly for any new set of parameters?
Our running example throughout this course is 2D Darcy flow, a PDE that describes pressure-driven flow through porous media (think: groundwater through rock, or oil through a reservoir). The permeability of the rock varies spatially — some regions are porous and easy to flow through, others are dense and resist flow. Given a permeability field $a(x)$, we want to find the resulting pressure field $u(x)$.
On the domain $D = (0,1)^2$ with homogeneous Dirichlet boundary conditions:
$$ -\nabla \cdot \bigl(a(x)\,\nabla u(x)\bigr) = f(x), \qquad u\big|_{\partial D} = 0, $$

where $a(x) > 0$ is the permeability (input), $u(x)$ is the pressure (output), and $f(x) = 1$ is a constant forcing.
In words: the divergence of the flux $a(x)\nabla u(x)$ equals the source term $f$. Where permeability $a$ is high, flow moves easily and pressure gradients are small. Where $a$ is low, flow is impeded and pressure builds up.
Physics intuition: Every time you change the rock permeability $a(x)$ — say, by drilling a new well or discovering a different geological layer — you need to re-solve the PDE. A traditional solver might take minutes or hours for a high-resolution 3D problem. What if a neural network could instantly predict $u(x)$ for any $a(x)$?
This is not a toy problem. Darcy flow is the workhorse PDE for subsurface modeling, reservoir engineering, and groundwater hydrology. The FNO paper uses it as a primary benchmark, and so will we.
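To make the forward problem concrete, here is a minimal finite-difference sketch of a Darcy solver. Everything in it (the name `solve_darcy`, the arithmetic averaging of $a$ at cell interfaces, the small dense linear solve) is an illustrative choice, not the paper's data-generation code; it is only meant to show what "solving the PDE for one $a$" involves.

```python
import numpy as np

def solve_darcy(a, f=1.0):
    """Solve -div(a grad u) = f on (0,1)^2 with u = 0 on the boundary.

    a: (n, n) permeability samples on a uniform grid (boundary included).
    Returns u of the same shape (zero on the boundary).
    Uses a 5-point finite-difference stencil; the permeability at each
    cell interface is approximated by the arithmetic average of its
    two neighboring values (one of several common choices).
    """
    n = a.shape[0]
    h = 1.0 / (n - 1)
    m = n - 2                                  # interior points per axis
    A = np.zeros((m * m, m * m))
    b = np.full(m * m, f)
    idx = lambda i, j: (i - 1) * m + (j - 1)   # interior (i, j) -> row
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            k = idx(i, j)
            aN = 0.5 * (a[i, j] + a[i - 1, j])
            aS = 0.5 * (a[i, j] + a[i + 1, j])
            aW = 0.5 * (a[i, j] + a[i, j - 1])
            aE = 0.5 * (a[i, j] + a[i, j + 1])
            A[k, k] = (aN + aS + aW + aE) / h**2
            if i > 1:     A[k, idx(i - 1, j)] = -aN / h**2
            if i < n - 2: A[k, idx(i + 1, j)] = -aS / h**2
            if j > 1:     A[k, idx(i, j - 1)] = -aW / h**2
            if j < n - 2: A[k, idx(i, j + 1)] = -aE / h**2
    u = np.zeros((n, n))
    u[1:-1, 1:-1] = np.linalg.solve(A, b).reshape(m, m)
    return u

# Constant permeability a = 1: pressure is positive inside, zero on the edge.
u = solve_darcy(np.ones((17, 17)))
```

Note the cost: even this toy version assembles and solves an $n^2 \times n^2$ linear system per input field. A production solver is far faster, but the per-$a$ cost is exactly what the neural operator is meant to amortize.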
2. The Spatial Domain D
The paper begins: "Let $D$ be a bounded, open subset of $\R^d$." In our concrete example:
$D = (0,1)^2 \subset \R^2$ — the open unit square. We have $d=2$ spatial dimensions, and each point $x = (x_1, x_2) \in D$ is a location in the square.
The word open means we exclude the boundary (the edges of the square). The word bounded means the domain doesn't extend to infinity. These are standard assumptions that guarantee the PDE is well-posed.
3. Function Spaces A and U
The paper defines: "$\mathcal{A} = \mathcal{A}(D; \R^{\da})$" and "$\mathcal{U} = \mathcal{U}(D; \R^{\du})$." These are the spaces of input and output functions respectively.
For Darcy flow:
- $\mathcal{A}$ = the space of all valid permeability fields. Each $a \in \mathcal{A}$ is a function $a: D \to \R^+$ (a positive scalar at every point). So $\da = 1$.
- $\mathcal{U}$ = the space of all valid pressure fields. Each $u \in \mathcal{U}$ is a function $u: D \to \R$ (a scalar at every point). So $\du = 1$.
In general, $\da$ and $\du$ can be larger than 1. For a vector PDE like Navier–Stokes, the output might be a velocity field $u: D \to \R^2$, giving $\du = 2$. But for Darcy flow, both are scalar: one number per spatial point.
$\mathcal{A}$ and $\mathcal{U}$ are Banach spaces — complete normed function spaces. For our purposes, think of them as "$L^2(D)$: the space of square-integrable functions on $D$," equipped with the norm $\|a\|_{L^2}^2 = \int_D |a(x)|^2\,dx$. Completeness means: limits of convergent sequences stay in the space.
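On a computer, the $L^2$ norm of a function sampled on a uniform grid reduces to a quadrature sum. A minimal sketch (the function name `l2_norm` and the plain Riemann-sum rule are illustrative choices):

```python
import numpy as np

def l2_norm(u, h):
    """Approximate ||u||_{L^2(D)} = sqrt(int_D |u(x)|^2 dx) for a function
    sampled on a uniform 2D grid with spacing h: each sample is treated
    as owning one cell of area h^2 (simplest quadrature rule)."""
    return np.sqrt(np.sum(np.abs(u) ** 2) * h ** 2)

# Sanity check: the constant function u(x) = 1 on (0,1)^2 has norm exactly 1.
n = 64
h = 1.0 / n
u = np.ones((n, n))
norm = l2_norm(u, h)   # → 1.0
```

The factor $h^2$ is what makes this a discretization of the integral rather than a plain Euclidean norm of the sample vector; it will matter again when we compare functions across resolutions.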
4. The Operator G†
The paper states: "$G^\dagger: \mathcal{A} \to \mathcal{U}$ maps input functions to solutions." This is the conceptual core of the entire paper.
A regular function $f: \R \to \R$ takes a number and returns a number. An operator $G^\dagger: \mathcal{A} \to \mathcal{U}$ takes an entire function $a(\cdot)$ and returns an entire function $u(\cdot)$.
Key idea: This is the conceptual leap. We are not learning $f(x) = y$ where $x$ and $y$ are numbers or finite-dimensional vectors. We are learning $G^\dagger(a) = u$ where both $a$ and $u$ are functions defined over the whole domain $D$. The input is infinite-dimensional (a function has a value at every point), and so is the output.
For Darcy flow: $G^\dagger$ is the operator that takes any permeability field $a \in \mathcal{A}$ and returns the corresponding pressure solution $u \in \mathcal{U}$. Internally, $G^\dagger$ is defined implicitly by the PDE — $u$ is whatever function satisfies $-\nabla\cdot(a\nabla u) = f$ with the given boundary conditions.
5. The Parametric Approximation
We want to learn $G^\dagger$ from data. The paper introduces a parametric approximation $G_\theta: \mathcal{A} \to \mathcal{U}$ with parameters $\theta \in \Theta$ (the neural network weights). The goal is to find $\theta$ such that $G_\theta \approx G^\dagger$.
We assume access to $N$ training pairs $\{(a_j, u_j)\}_{j=1}^N$ where each $u_j = G^\dagger(a_j)$ is obtained by solving the PDE with a traditional solver. The optimization problem is:
$$ \min_{\theta \in \Theta}\; \mathbb{E}_{a \sim \mu}\Bigl[\mathcal{C}\bigl(G_\theta(a),\, G^\dagger(a)\bigr)\Bigr], $$

where $\mu$ is a probability measure on $\mathcal{A}$ (the distribution of input functions we care about) and $\mathcal{C}$ is a cost functional. The paper uses the relative $L^2$ error:
$$ \mathcal{C}(G_\theta(a),\, G^\dagger(a)) = \frac{\|G_\theta(a) - G^\dagger(a)\|_{L^2(D)}}{\|G^\dagger(a)\|_{L^2(D)}} $$

The full pipeline, step by step:

- Generate data: Sample $N$ random permeability fields $a_1, \ldots, a_N$ from the distribution $\mu$.
- Solve PDEs: For each $a_j$, solve the Darcy equation to get $u_j = G^\dagger(a_j)$. This is expensive (the whole reason we're doing this).
- Train: Optimize neural operator $G_\theta$ to minimize the average relative $L^2$ error between $G_\theta(a_j)$ and $u_j$.
- Deploy: For any new permeability $a_*$, predict $u_* \approx G_\theta(a_*)$ in milliseconds, without re-solving the PDE.
Concretely for Darcy flow: generate 1000 random permeability fields, solve the PDE for each (taking perhaps hours of compute), then train $G_\theta$ on those 1000 pairs. Once trained, $G_\theta$ can predict pressure for a new permeability in a single forward pass.
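The relative $L^2$ error above is simple to implement; a sketch (names are illustrative):

```python
import numpy as np

def relative_l2_error(u_pred, u_true):
    """Relative L^2 error ||u_pred - u_true||_{L^2} / ||u_true||_{L^2}.
    On a uniform grid the quadrature weight h^2 cancels between numerator
    and denominator, so plain Euclidean norms of the sample vectors suffice."""
    return np.linalg.norm(u_pred - u_true) / np.linalg.norm(u_true)

# A prediction that is uniformly 5% too large has relative error 0.05.
u_true = np.random.default_rng(0).random((64, 64)) + 1.0
u_pred = 1.05 * u_true
err = relative_l2_error(u_pred, u_true)   # → 0.05 (up to rounding)
```

Because the error is relative, it is comparable across inputs whose pressure fields have very different magnitudes, which is why the paper reports it rather than an absolute loss.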
6. Discretization and Resolution Invariance
Functions live in continuous space, but computers work with discrete grids. The paper addresses this carefully: we evaluate each function on a finite set of grid points $D_j = \{x_1, \ldots, x_n\} \subset D$.
On this grid, the continuous function $a: D \to \R^{\da}$ becomes a finite vector:
$$ a\big|_{D_j} \in \R^{n \times \da}, \qquad u\big|_{D_j} \in \R^{n \times \du}. $$

For Darcy on a $16 \times 16$ grid: $n = 256$ points, so $a|_{D_j} \in \R^{256 \times 1}$ and $u|_{D_j} \in \R^{256 \times 1}$. On a $64 \times 64$ grid: $n = 4096$, and the same function is represented by 4096 values.
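The point is worth seeing in code: one continuous function, two grids, two different sample counts (the sampled function below is an arbitrary illustrative choice):

```python
import numpy as np

def sample(n):
    """Evaluate the continuous function a(x) = sin(2πx₁)·cos(2πx₂)
    on an n×n uniform grid over the closed unit square."""
    xs = np.linspace(0.0, 1.0, n)
    x1, x2 = np.meshgrid(xs, xs, indexing="ij")
    return np.sin(2 * np.pi * x1) * np.cos(2 * np.pi * x2)

a_coarse = sample(16)   # a|_{D_j}: 16×16 grid, 256 values
a_fine   = sample(64)   # a|_{D_j}: 64×64 grid, 4096 values
```

The underlying mathematical object — the function $a$ — is the same in both cases; only its finite representation changes.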
Now here is the crucial property that distinguishes neural operators from standard neural networks:
Resolution invariance: The operator $G^\dagger$ lives in continuous function space. It doesn't care about grids. A neural operator should inherit this property: train on a $16\times 16$ grid, and evaluate on a $64\times 64$ or $256\times 256$ grid — without retraining. This is impossible for a standard neural network, whose architecture is tied to a fixed input size.
A standard CNN or MLP takes a fixed-size input (say, a $64 \times 64$ image) and cannot handle a $128 \times 128$ image without architectural changes. A neural operator, by contrast, is designed so that its learned parameters are independent of the grid resolution. We will see exactly how FNO achieves this in Chapters 3–4, through its Fourier-space parameterization.
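As a preview of the mechanism (covered properly in Chapters 3–4), here is a heavily simplified, single-channel spectral convolution in NumPy. It is a sketch, not the paper's layer: a real FNO keeps both positive and negative frequency blocks and uses per-channel complex weights, whereas this version truncates only the lowest $k_{\max} \times k_{\max}$ block. The key property survives the simplification: the weight tensor's shape depends on $k_{\max}$, never on the grid size.

```python
import numpy as np

def spectral_conv(u, weights, k_max):
    """Simplified spectral convolution: FFT the input, multiply the lowest
    k_max×k_max Fourier modes by complex weights, zero the rest, inverse FFT.
    (A real FNO layer also keeps the negative-frequency block along axis 0;
    this sketch drops it for brevity.)"""
    n = u.shape[0]
    u_hat = np.fft.rfft2(u)
    out_hat = np.zeros_like(u_hat)
    out_hat[:k_max, :k_max] = weights * u_hat[:k_max, :k_max]
    return np.fft.irfft2(out_hat, s=(n, n))

rng = np.random.default_rng(0)
# 8×8 complex weights — a fixed parameter count, independent of resolution.
weights = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))

# The SAME weights act on a 32×32 grid and on a 128×128 grid.
y32  = spectral_conv(rng.standard_normal((32, 32)),  weights, k_max=8)
y128 = spectral_conv(rng.standard_normal((128, 128)), weights, k_max=8)
```

This is precisely why a trained FNO can be evaluated at resolutions it never saw during training: its parameters live in Fourier space, attached to mode indices rather than to grid points.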
This completes our setup of the problem. We have:
- A domain $D = (0,1)^2$
- Function spaces $\mathcal{A}$ (permeabilities) and $\mathcal{U}$ (pressures)
- A ground-truth operator $G^\dagger: \mathcal{A} \to \mathcal{U}$ defined by the Darcy PDE
- A training objective: find $G_\theta \approx G^\dagger$
- A key desideratum: resolution invariance
In the next chapter, we build the neural network architecture that will approximate $G^\dagger$.