Chapter 2

The Neural Operator Architecture

Lift, iterate, project: the three-stage pipeline from input function to output function, with kernel integral operators at its core.

In this chapter
  1. Architecture Overview: Lift → Iterate → Project
  2. The Lifting Operator P
  3. The Iterative Update (Definition 1)
  4. The Kernel Integral Operator (Definition 2)
  5. The Projection Operator Q
  6. Putting It Together

Architecture Overview: Lift → Iterate → Project

The neural operator has three stages, applied to the input function $a(x)$ at every point $x \in D$:

  1. Lifting ($P$): Map the input from $\R^{\da}$ to a higher-dimensional representation $\R^{\dv}$. This is a pointwise fully-connected layer applied independently at each spatial location.
  2. Iterative layers ($v_0 \to v_1 \to \cdots \to v_T$): Apply $T$ integral operator layers, each of which combines local information (at each point) with global information (from the entire domain).
  3. Projection ($Q$): Map back from $\R^{\dv}$ to $\R^{\du}$. Again pointwise, applied independently at each spatial location.

In symbols:

$$ a(x) \;\xrightarrow{\;P\;}\; v_0(x) \;\xrightarrow{\;\text{Layer 1}\;}\; v_1(x) \;\xrightarrow{\;\text{Layer 2}\;}\; \cdots \;\xrightarrow{\;\text{Layer }T\;}\; v_T(x) \;\xrightarrow{\;Q\;}\; u(x) $$
Figure 2.1. The neural operator pipeline. $P$ lifts from $\R^{\da}$ to $\R^{\dv}$, four iterative layers transform in $\R^{\dv}$, and $Q$ projects back to $\R^{\du}$. For Darcy flow: $\da = 1$, $\dv = 32$, $\du = 1$.

The Lifting Operator P

The lifting operator $P$ is a pointwise fully-connected network that maps each input value from $\R^{\da}$ to a higher-dimensional latent space $\R^{\dv}$:

$$ v_0(x) = P\bigl(a(x)\bigr), \qquad P: \R^{\da} \to \R^{\dv}. $$

"Pointwise" means $P$ is applied independently at each spatial location $x$ — it does not look at neighboring points. It is the same linear map (same weights) at every $x$.

Lifting Operator

For Darcy flow: $P: \R^1 \to \R^{32}$. At each grid point $x$, the single scalar value $a(x)$ (permeability) is mapped to a 32-dimensional feature vector $v_0(x) \in \R^{32}$. This is equivalent to a fully-connected layer with weight matrix $W_P \in \R^{32 \times 1}$ and bias $b_P \in \R^{32}$.

Figure 2.2. The lifting operator $P$ at a single grid point: one scalar value fans out to $\dv = 32$ channels. Applied identically at every point $x \in D$.

Think of this like the first convolutional layer in image processing, but with a $1 \times 1$ kernel: no spatial mixing, just channel expansion. If the input were a grayscale image ($\da = 1$), $P$ converts it to a 32-channel feature map.
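The pointwise nature of $P$ is easy to see in code. Below is a minimal NumPy sketch (the shapes match the Darcy flow example, but the random weights are purely illustrative — this is not the paper's implementation): the same $32 \times 1$ affine map is applied at every grid point simultaneously.

```python
import numpy as np

# Sketch of the lifting operator P: the same affine map d_a=1 -> d_v=32
# applied independently at each of n grid points (weights are illustrative).
rng = np.random.default_rng(0)
n, d_a, d_v = 64 * 64, 1, 32          # 64x64 grid, flattened

a = rng.standard_normal((n, d_a))     # input function a(x) sampled on the grid
W_P = rng.standard_normal((d_v, d_a)) # shared weight matrix: 32 x 1
b_P = rng.standard_normal(d_v)        # shared bias: 32

v0 = a @ W_P.T + b_P                  # pointwise: no spatial mixing at all
assert v0.shape == (n, d_v)           # one 32-channel vector per grid point
```

Because the matrix multiply acts only on the channel axis, reordering or resampling the grid points would not change the per-point outputs — exactly the "no spatial mixing, just channel expansion" property described above.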


The Iterative Update (Definition 1)

Each iterative layer updates the hidden representation $v_t(x) \in \R^{\dv}$ to $v_{t+1}(x) \in \R^{\dv}$ using two parallel paths:

Definition 1: Iterative Update
$$ v_{t+1}(x) = \sigma\!\Bigl(\underbrace{W\, v_t(x)}_{\text{local}} + \underbrace{\bigl(\mathcal{K}(a;\,\varphi)\, v_t\bigr)(x)}_{\text{global}} + b\Bigr) \tag{2} $$

where $W \in \R^{\dv \times \dv}$ is a pointwise linear map, $\mathcal{K}(a;\varphi)$ is a kernel integral operator (defined next), and $\sigma$ is a nonlinear activation (ReLU).

The two paths do fundamentally different things. The local path $W\,v_t(x)$ is a pointwise linear map: it mixes channels using only the value at $x$, with no spatial communication. The global path $\bigl(\mathcal{K}(a;\varphi)\,v_t\bigr)(x)$ is an integral over the entire domain: it gathers information from every point $y \in D$, weighted by a learned kernel.

The outputs are summed, a bias $b$ is added, and the result passes through ReLU.

Figure 2.3. One iterative layer: the input $v_t(x)$ splits into a local path ($W$, pointwise) and a global path ($\mathcal{K}$, integral over the domain). The results are summed and passed through ReLU.
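One layer of Equation (2) can be sketched in a few lines. The kernel integral is replaced here by a hypothetical fixed averaging operator `K_mat`, purely to show how the local and global paths combine; in the real architecture $\mathcal{K}$ is learned and depends on the input $a$.

```python
import numpy as np

# Sketch of one iterative update: v_{t+1} = ReLU(W v_t + K v_t + b).
# K_mat is a stand-in for the kernel integral (here: uniform averaging).
rng = np.random.default_rng(1)
n, d_v = 100, 32

v_t = rng.standard_normal((n, d_v))
W = rng.standard_normal((d_v, d_v)) / np.sqrt(d_v)  # local pointwise map
b = np.zeros(d_v)

K_mat = np.full((n, n), 1.0 / n)       # hypothetical kernel weights (uniform)
global_path = K_mat @ v_t              # mixes information across all points
local_path = v_t @ W.T                 # acts at each point independently

v_next = np.maximum(local_path + global_path + b, 0)  # ReLU
assert v_next.shape == (n, d_v)
```

Note how the two paths differ in code: `local_path` multiplies along the channel axis only, while `global_path` multiplies along the spatial axis, so every output point sees every input point.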

The Kernel Integral Operator (Definition 2)

The global path is an integral operator. The paper defines it precisely:

Definition 2: Kernel Integral Operator
$$ \bigl(\mathcal{K}(a;\,\varphi)\, v_t\bigr)(x) = \int_D \kappa\bigl(x,\, y,\, a(x),\, a(y);\, \varphi\bigr)\, v_t(y)\, dy \tag{3} $$

where $\kappa: \R^{2(d + \da)} \to \R^{\dv \times \dv}$ is a kernel function parameterized by a neural network with weights $\varphi$.

Let's break down every piece of this integral:

  1. $x$: the point at which the output is evaluated.
  2. $y$: the integration variable, ranging over the entire domain $D$.
  3. $\kappa(x, y, a(x), a(y); \varphi)$: a learned kernel, implemented as a neural network with weights $\varphi$, that outputs a $\dv \times \dv$ matrix for each pair $(x, y)$. Because it sees $a(x)$ and $a(y)$, the operator adapts to the input function.
  4. $v_t(y)$: the hidden representation at $y$, multiplied by that matrix.
  5. $dy$: the integral over $D$ aggregates the weighted contributions from all points.

Concretely: at point $x = (0.5, 0.5)$, we sum up contributions from every other grid point $y$, each weighted by the kernel $\kappa$. Points where $\kappa$ is large contribute more; points where $\kappa$ is small contribute less.

Figure 2.4. The kernel integral at a single point $x$ (red). Every other point $y$ sends a contribution weighted by $\kappa(x,y,a(x),a(y))$. Arrow thickness represents kernel weight — nearby points typically contribute more.

Computational cost: This integral is $O(n^2)$. For each of $n$ grid points $x$, we sum over $n$ points $y$, and at each pair we evaluate $\kappa$ and multiply by $v_t(y)$. On a $64 \times 64$ grid ($n = 4096$), that's $\sim 16.8$ million kernel evaluations per layer. This is the bottleneck that the Fourier approach will solve in Chapter 3.
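The $O(n^2)$ cost is visible in a direct quadrature implementation. The sketch below uses a hypothetical scalar Gaussian kernel in place of the learned $\kappa$ (which would output a $\dv \times \dv$ matrix per pair), on a 1-D grid: for each of the $n$ output points we perform $n$ kernel evaluations.

```python
import numpy as np

# Direct O(n^2) quadrature of the kernel integral (Equation 3) on a 1-D grid.
# A fixed Gaussian stands in for the learned kernel kappa; in the paper kappa
# is matrix-valued (d_v x d_v), simplified here to a scalar weight per pair.
n, d_v = 128, 4
xs = np.linspace(0.0, 1.0, n)
dy = 1.0 / n                               # quadrature weight, uniform grid
rng = np.random.default_rng(2)
v = rng.standard_normal((n, d_v))          # hidden representation v_t(y)

out = np.zeros((n, d_v))
for i, x in enumerate(xs):                 # n outer points ...
    w = np.exp(-((x - xs) ** 2) / 0.01)    # ... times n kernel evals each
    out[i] = (w[:, None] * v).sum(axis=0) * dy
```

The nested structure (a loop over $x$ containing a vectorized sum over all $y$) is exactly the $n \times n$ pairing that the Fourier approach in Chapter 3 avoids.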


The Projection Operator Q

After $T$ iterative layers, the hidden representation $v_T(x) \in \R^{\dv}$ must be mapped back to the output space $\R^{\du}$. The projection operator $Q$ mirrors the lifting operator $P$:

$$ u(x) = Q\bigl(v_T(x)\bigr), \qquad Q: \R^{\dv} \to \R^{\du}. $$

Like $P$, this is pointwise: the same linear map applied independently at every spatial location. For Darcy flow: $Q: \R^{32} \to \R^1$ — 32 channels collapse to a single scalar pressure value. In the FNO paper, $Q$ is implemented as two FC layers with ReLU between them (a small MLP).
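A pointwise two-layer MLP for $Q$ can be sketched as below; the hidden width of 128 is an assumption for illustration, not a value taken from the paper.

```python
import numpy as np

# Sketch of the projection Q as a small pointwise MLP: two FC layers with
# ReLU in between, collapsing d_v=32 channels to d_u=1 at every point.
# The hidden width (128) and random weights are illustrative assumptions.
rng = np.random.default_rng(3)
n, d_v, d_hidden, d_u = 64 * 64, 32, 128, 1

v_T = rng.standard_normal((n, d_v))                     # output of layer T
W1, b1 = rng.standard_normal((d_hidden, d_v)), np.zeros(d_hidden)
W2, b2 = rng.standard_normal((d_u, d_hidden)), np.zeros(d_u)

u = np.maximum(v_T @ W1.T + b1, 0) @ W2.T + b2          # same map at every x
assert u.shape == (n, d_u)                              # one scalar per point
```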


Putting It Together

Let's trace the full forward pass for our Darcy flow example with concrete dimensions. We use $T=4$ layers and $\dv = 32$ hidden channels.

| Stage   | Operation                                 | Output at each point $x$ |
|---------|-------------------------------------------|--------------------------|
| Input   |                                           | $a(x) \in \R^1$          |
| Lift    | $v_0 = P(a)$                              | $v_0(x) \in \R^{32}$     |
| Layer 1 | $v_1 = \sigma(Wv_0 + \mathcal{K}v_0)$     | $v_1(x) \in \R^{32}$     |
| Layer 2 | $v_2 = \sigma(Wv_1 + \mathcal{K}v_1)$     | $v_2(x) \in \R^{32}$     |
| Layer 3 | $v_3 = \sigma(Wv_2 + \mathcal{K}v_2)$     | $v_3(x) \in \R^{32}$     |
| Layer 4 | $v_4 = \sigma(Wv_3 + \mathcal{K}v_3)$     | $v_4(x) \in \R^{32}$     |
| Project | $u = Q(v_4)$                              | $u(x) \in \R^1$          |
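The whole trace can be wired up end to end in a few lines. This is a shape check, not a trained model: the kernel integral is replaced by a stand-in averaging operator, the weights are random, and bias terms are omitted for brevity.

```python
import numpy as np

# End-to-end shape trace of the T=4 pipeline with d_a=1, d_v=32, d_u=1.
# The kernel integral is replaced by spatial averaging; weights are random.
rng = np.random.default_rng(4)
n, d_a, d_v, d_u, T = 64 * 64, 1, 32, 1, 4

a = rng.standard_normal((n, d_a))
P = rng.standard_normal((d_v, d_a)) * 0.1   # lifting weights
W = rng.standard_normal((d_v, d_v)) * 0.1   # shared local map
Q = rng.standard_normal((d_u, d_v)) * 0.1   # projection weights

v = a @ P.T                                 # lift: (n, 1) -> (n, 32)
for _ in range(T):                          # four iterative layers
    global_path = v.mean(axis=0, keepdims=True)  # stand-in for K v_t
    v = np.maximum(v @ W.T + global_path, 0)     # ReLU(local + global)
u = v @ Q.T                                 # project: (n, 32) -> (n, 1)

assert v.shape == (n, d_v) and u.shape == (n, d_u)
```

Every intermediate lives in $\R^{32}$ per point, matching the table: only $P$ and $Q$ change the channel count.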

Now let's compare this to a standard neural network to see what's different:

| Property              | Standard NN                              | Neural Operator                               |
|-----------------------|------------------------------------------|-----------------------------------------------|
| Input                 | Vector $x \in \R^n$ (fixed size)         | Function $a: D \to \R^{\da}$ (any resolution) |
| Output                | Vector $y \in \R^m$ (fixed size)         | Function $u: D \to \R^{\du}$ (any resolution) |
| Spatial communication | Via weight matrix (all-to-all, fixed)    | Via kernel integral (learned, continuous)     |
| Resolution            | Fixed: retrain for different input size  | Flexible: same weights, any grid              |
| Parameters scale with | Input dimension $n$                      | Channel width $\dv$ (independent of $n$)      |

Physics connection: The kernel integral operator is structurally identical to a Green's function representation. In classical PDE theory, solutions can be written as $u(x) = \int_D G(x,y)\,f(y)\,dy$, where $G$ is the Green's function. The neural operator learns a generalized, parameterized version of this integral representation — but one that adapts to the input function $a$, not just a fixed source $f$.
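The Green's function representation can be verified numerically in a simple case. For the 1-D Poisson problem $-u'' = f$ on $[0,1]$ with $u(0) = u(1) = 0$, the Green's function is $G(x,y) = x(1-y)$ for $x \le y$ and $y(1-x)$ otherwise, and for $f \equiv 1$ the exact solution is $u(x) = x(1-x)/2$. The check below (a standalone sketch, not part of the neural operator) evaluates the integral by midpoint quadrature:

```python
import numpy as np

# Verify u(x) = ∫ G(x,y) f(y) dy for -u'' = f on [0,1], u(0) = u(1) = 0,
# with f ≡ 1, against the exact solution u(x) = x(1-x)/2.
n = 2000
ys = (np.arange(n) + 0.5) / n        # midpoint quadrature nodes
dy = 1.0 / n
f = np.ones(n)                       # constant source term

def u_green(x):
    # Green's function of the 1-D Dirichlet Laplacian
    G = np.where(x <= ys, x * (1 - ys), ys * (1 - x))
    return (G * f).sum() * dy        # quadrature of ∫ G(x,y) f(y) dy

x = 0.3
assert abs(u_green(x) - x * (1 - x) / 2) < 1e-4
```

The kernel integral operator has this same form, with the fixed $G(x,y)$ replaced by the learned, input-dependent $\kappa(x, y, a(x), a(y); \varphi)$.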

The architecture is now fully specified — except for one critical question: how do we actually compute the kernel integral $\mathcal{K}$ efficiently? The naive approach costs $O(n^2)$, which is prohibitive. In the next chapter, we show how the Fourier transform reduces this to $O(n \log n)$.