7. Architecture Overview: Lift → Iterate → Project
The neural operator has three stages, applied to the input function $a(x)$ at every point $x \in D$:
- Lifting ($P$): Map the input from $\R^{\da}$ to a higher-dimensional representation $\R^{\dv}$. This is a pointwise fully-connected layer applied independently at each spatial location.
- Iterative layers ($v_0 \to v_1 \to \cdots \to v_T$): Apply $T$ integral operator layers, each of which combines local information (at each point) with global information (from the entire domain).
- Projection ($Q$): Map back from $\R^{\dv}$ to $\R^{\du}$. Again pointwise, applied independently at each spatial location.
In symbols:
$$ a(x) \;\xrightarrow{\;P\;}\; v_0(x) \;\xrightarrow{\;\text{Layer 1}\;}\; v_1(x) \;\xrightarrow{\;\text{Layer 2}\;}\; \cdots \;\xrightarrow{\;\text{Layer }T\;}\; v_T(x) \;\xrightarrow{\;Q\;}\; u(x) $$

8. The Lifting Operator P
The lifting operator $P$ is a pointwise fully-connected network that maps each input value from $\R^{\da}$ to a higher-dimensional latent space $\R^{\dv}$:
$$ v_0(x) = P\bigl(a(x)\bigr), \qquad P: \R^{\da} \to \R^{\dv}. $$

"Pointwise" means $P$ is applied independently at each spatial location $x$ — it does not look at neighboring points. It is the same linear map (same weights) at every $x$.
For Darcy flow: $P: \R^1 \to \R^{32}$. At each grid point $x$, the single scalar value $a(x)$ (permeability) is mapped to a 32-dimensional feature vector $v_0(x) \in \R^{32}$. This is equivalent to a fully-connected layer with weight matrix $W_P \in \R^{32 \times 1}$ and bias $b_P \in \R^{32}$.
Think of this like the first convolutional layer in image processing, but with a $1 \times 1$ kernel: no spatial mixing, just channel expansion. If the input were a grayscale image ($\da = 1$), $P$ converts it to a 32-channel feature map.
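A minimal NumPy sketch of the lift for the Darcy dimensions above; the random weights `W_P`, `b_P` stand in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# P: R^1 -> R^32, with weight matrix W_P and bias b_P as in the text.
d_a, d_v = 1, 32
W_P = rng.standard_normal((d_v, d_a))
b_P = rng.standard_normal(d_v)

# Input function a sampled on a 64x64 grid, one channel (permeability).
a = rng.standard_normal((64, 64, d_a))

# Pointwise lift: the same affine map at every grid location, no spatial mixing.
v0 = a @ W_P.T + b_P
print(v0.shape)   # (64, 64, 32)
```

Because the map touches only the channel axis, the same code works unchanged on any grid resolution.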
9. The Iterative Update (Definition 1)
Each iterative layer updates the hidden representation $v_t(x) \in \R^{\dv}$ to $v_{t+1}(x) \in \R^{\dv}$ using two parallel paths:

$$ v_{t+1}(x) = \sigma\Bigl( W v_t(x) + \bigl(\mathcal{K}(a;\varphi)\, v_t\bigr)(x) + b \Bigr), $$

where $W \in \R^{\dv \times \dv}$ is a pointwise linear map, $\mathcal{K}(a;\varphi)$ is a kernel integral operator (defined next), $b \in \R^{\dv}$ is a bias, and $\sigma$ is a nonlinear activation (ReLU).
The two paths do fundamentally different things:
- Local path: $W v_t(x)$ — a matrix multiply that only uses information at point $x$. It mixes channels but has zero spatial awareness. This is a $1 \times 1$ convolution.
- Global path: $(\mathcal{K} v_t)(x)$ — an integral over the entire domain, gathering information from all points $y \in D$. This is where spatial communication happens.
The outputs are summed, a bias $b$ is added, and the result passes through ReLU.
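The two-path update can be sketched in NumPy. The global path below is a toy stand-in (one matrix applied to the domain mean) rather than the real kernel integral, purely to make the local-plus-global structure concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_v = 4096, 32                      # grid points (64x64 flattened), channels

v_t = rng.standard_normal((n, d_v))
W = rng.standard_normal((d_v, d_v)) / np.sqrt(d_v)   # local path: 1x1 conv
b = rng.standard_normal(d_v)

# Toy stand-in for (K v_t)(x): a fixed matrix applied to the domain mean,
# so every point receives the same global summary. NOT the real operator.
K_mat = rng.standard_normal((d_v, d_v)) / np.sqrt(d_v)
global_path = np.tile(v_t.mean(axis=0) @ K_mat.T, (n, 1))

# ReLU(W v_t + K v_t + b): sum the two paths, add bias, apply the activation.
v_next = np.maximum(0.0, v_t @ W.T + global_path + b)
print(v_next.shape)   # (4096, 32)
```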
10. The Kernel Integral Operator (Definition 2)
The global path is an integral operator. The paper defines it precisely:

$$ \bigl(\mathcal{K}(a;\varphi)\, v_t\bigr)(x) = \int_D \kappa\bigl(x, y, a(x), a(y); \varphi\bigr)\, v_t(y)\, dy, $$

where $\kappa: \R^{2(d + \da)} \to \R^{\dv \times \dv}$ is a kernel function parameterized by a neural network with weights $\varphi$.
Let's break down every piece of this integral:
- For a fixed point $x$, we integrate over all points $y \in D$.
- The kernel $\kappa(x, y, a(x), a(y); \varphi)$ depends on:
  - Where $x$ is (the "query" point)
  - Where $y$ is (the "source" point)
  - The input function at both points: $a(x)$ and $a(y)$
  - Learned parameters $\varphi$
- $\kappa$ outputs a $\dv \times \dv$ matrix (for Darcy: $32 \times 32$), which multiplies $v_t(y) \in \R^{\dv}$ to produce a $\dv$-dimensional contribution.
- We integrate (sum) these contributions over all $y$, giving the result at $x$.
Concretely: at point $x = (0.5, 0.5)$, we sum up contributions from every other grid point $y$, each weighted by the kernel $\kappa$. Points where $\kappa$ is large contribute more; points where $\kappa$ is small contribute less.
Computational cost: This integral is $O(n^2)$. For each of $n$ grid points $x$, we sum over $n$ points $y$, and at each pair we evaluate $\kappa$ and multiply by $v_t(y)$. On a $64 \times 64$ grid ($n = 4096$), that's $\sim 16.8$ million kernel evaluations per layer. This is the bottleneck that the Fourier approach will solve in Chapter 3.
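A minimal NumPy sketch of the naive integral, with toy sizes (an $8 \times 8$ grid, 8 channels) so the $O(n^2)$ double loop finishes quickly; `kappa` here is one hypothetical linear layer standing in for the small network with weights $\varphi$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes so the O(n^2) loop is cheap: an 8x8 grid, 8 channels.
s, d_v, d_a = 8, 8, 1
n = s * s
xs, ys = np.meshgrid(np.linspace(0, 1, s), np.linspace(0, 1, s), indexing="ij")
pts = np.stack([xs.ravel(), ys.ravel()], axis=1)   # (n, 2) grid coordinates
a = rng.standard_normal((n, d_a))                  # input function samples
v_t = rng.standard_normal((n, d_v))

# kappa: R^{2(d + d_a)} -> R^{d_v x d_v}. One linear layer stands in for the
# small neural network with weights phi (an illustration, not the paper's net).
feat_dim = 2 * (2 + d_a)                           # features: (x, y, a(x), a(y))
W_phi = rng.standard_normal((d_v * d_v, feat_dim)) / feat_dim

def kappa(x, y, ax, ay):
    z = np.concatenate([x, y, ax, ay])
    return (W_phi @ z).reshape(d_v, d_v)

# Naive kernel integral: for each x, sum kappa(x,y,...) v_t(y) over all y.
out = np.zeros((n, d_v))
dy = 1.0 / n                                       # quadrature weight per point
for i in range(n):
    for j in range(n):
        out[i] += kappa(pts[i], pts[j], a[i], a[j]) @ v_t[j]
    out[i] *= dy
print(out.shape)   # (64, 8)
```

The double loop makes the $O(n^2)$ cost visible: $n^2$ kernel evaluations, each producing a $\dv \times \dv$ matrix.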
11. The Projection Operator Q
After $T$ iterative layers, the hidden representation $v_T(x) \in \R^{\dv}$ must be mapped back to the output space $\R^{\du}$. The projection operator $Q$ mirrors the lifting operator $P$:
$$ u(x) = Q\bigl(v_T(x)\bigr), \qquad Q: \R^{\dv} \to \R^{\du}. $$

Like $P$, this is pointwise: the same map (shared weights) applied independently at every spatial location. For Darcy flow: $Q: \R^{32} \to \R^1$ — 32 channels collapse to a single scalar pressure value. In the FNO paper, $Q$ is implemented as two FC layers with ReLU between them (a small MLP).
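A sketch of the projection as a pointwise two-layer MLP; the hidden width of 128 is an assumption, and the random weights stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_v, d_u, d_hidden = 32, 1, 128    # d_hidden is an assumed width for this sketch

v_T = rng.standard_normal((64, 64, d_v))
W1 = rng.standard_normal((d_hidden, d_v)) / np.sqrt(d_v)
b1 = np.zeros(d_hidden)
W2 = rng.standard_normal((d_u, d_hidden)) / np.sqrt(d_hidden)
b2 = np.zeros(d_u)

# Pointwise two-layer MLP with ReLU: same weights at every grid location.
u = np.maximum(0.0, v_T @ W1.T + b1) @ W2.T + b2
print(u.shape)   # (64, 64, 1)
```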
12. Putting It Together
Let's trace the full forward pass for our Darcy flow example with concrete dimensions. We use $T=4$ layers and $\dv = 32$ hidden channels.
| Stage | Operation | Output at each point $x$ |
|---|---|---|
| Input | — | $a(x) \in \R^1$ |
| Lift | $v_0 = P(a)$ | $v_0(x) \in \R^{32}$ |
| Layer 1 | $v_1 = \sigma(Wv_0 + \mathcal{K}v_0)$ | $v_1(x) \in \R^{32}$ |
| Layer 2 | $v_2 = \sigma(Wv_1 + \mathcal{K}v_1)$ | $v_2(x) \in \R^{32}$ |
| Layer 3 | $v_3 = \sigma(Wv_2 + \mathcal{K}v_2)$ | $v_3(x) \in \R^{32}$ |
| Layer 4 | $v_4 = \sigma(Wv_3 + \mathcal{K}v_3)$ | $v_4(x) \in \R^{32}$ |
| Project | $u = Q(v_4)$ | $u(x) \in \R^1$ |
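The full pipeline in the table above can be sketched end-to-end in NumPy, under a strong simplification: the kernel integral is collapsed to a fixed all-pairs spatial matrix `A` times a channel mix `M` (illustration only, not the paper's parameterization of $\kappa$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_a, d_v, d_u, T = 256, 1, 32, 1, 4      # 16x16 grid flattened; toy sizes

a = rng.standard_normal((n, d_a))

# Random stand-ins for trained parameters. A holds all-pairs spatial weights,
# M mixes channels; together they crudely imitate the kernel integral K.
W_P = rng.standard_normal((d_v, d_a)); b_P = np.zeros(d_v)
W_Q = rng.standard_normal((d_u, d_v)) / np.sqrt(d_v); b_Q = np.zeros(d_u)
layers = [(rng.standard_normal((d_v, d_v)) / np.sqrt(d_v),   # W (local path)
           rng.standard_normal((n, n)) / n,                  # A (spatial weights)
           rng.standard_normal((d_v, d_v)) / np.sqrt(d_v),   # M (channel mix)
           np.zeros(d_v)) for _ in range(T)]

v = a @ W_P.T + b_P                                          # lift: v_0 = P(a)
for W, A, M, b in layers:                                    # T iterative layers
    v = np.maximum(0.0, v @ W.T + A @ v @ M.T + b)           # ReLU(Wv + Kv + b)
u = v @ W_Q.T + b_Q                                          # project: u = Q(v_T)
print(u.shape)   # (256, 1)
```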
Now let's compare this to a standard neural network to see what's different:
| Property | Standard NN | Neural Operator |
|---|---|---|
| Input | Vector $x \in \R^n$ (fixed size) | Function $a: D \to \R^{\da}$ (any resolution) |
| Output | Vector $y \in \R^m$ (fixed size) | Function $u: D \to \R^{\du}$ (any resolution) |
| Spatial communication | Via weight matrix (all-to-all, fixed) | Via kernel integral (learned, continuous) |
| Resolution | Fixed: retrain for different input size | Flexible: same weights, any grid |
| Parameters scale with | Input dimension $n$ | Channel width $\dv$ (independent of $n$) |
Physics connection: The kernel integral operator is structurally identical to a Green's function representation. In classical PDE theory, solutions can be written as $u(x) = \int_D G(x,y)\,f(y)\,dy$, where $G$ is the Green's function. The neural operator learns a generalized, parameterized version of this integral representation — but one that adapts to the input function $a$, not just a fixed source $f$.
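For a concrete classical instance (a standard textbook fact, not taken from the FNO paper): the 1D Poisson problem $-u''(x) = f(x)$ on $[0,1]$ with $u(0) = u(1) = 0$ has the explicit Green's function

$$ G(x, y) = \begin{cases} y\,(1 - x), & y \le x, \\ x\,(1 - y), & y > x, \end{cases} \qquad u(x) = \int_0^1 G(x, y)\, f(y)\, dy. $$

The kernel integral operator plays the role of $G$, except that $\kappa$ is learned and may additionally depend on $a(x)$ and $a(y)$.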
The architecture is now fully specified — except for one critical question: how do we actually compute the kernel integral $\mathcal{K}$ efficiently? The naive approach costs $O(n^2)$, which is prohibitive. In the next chapter, we show how the Fourier transform reduces this to $O(n \log n)$.