7. Architecture Overview: Lift → Iterate → Project
The neural operator has three stages, applied to the input function $a(x)$ at every point $x \in D$:
- Lifting ($P$): Map the input from $\R^{\da}$ to a higher-dimensional representation $\R^{\dv}$. This is a pointwise fully-connected layer applied independently at each spatial location.
- Iterative layers ($v_0 \to v_1 \to \cdots \to v_T$): Apply $T$ integral operator layers, each of which combines local information (at each point) with global information (from the entire domain).
- Projection ($Q$): Map back from $\R^{\dv}$ to $\R^{\du}$. Again pointwise, applied independently at each spatial location.
In symbols:
$$ a(x) \;\xrightarrow{\;P\;}\; v_0(x) \;\xrightarrow{\;\text{Layer 1}\;}\; v_1(x) \;\xrightarrow{\;\text{Layer 2}\;}\; \cdots \;\xrightarrow{\;\text{Layer }T\;}\; v_T(x) \;\xrightarrow{\;Q\;}\; u(x) $$

8. The Lifting Operator P
The lifting operator $P$ is a pointwise fully-connected network that maps each input value from $\R^{\da}$ to a higher-dimensional latent space $\R^{\dv}$:
$$ v_0(x) = P\bigl(a(x)\bigr), \qquad P: \R^{\da} \to \R^{\dv}. $$

"Pointwise" means $P$ is applied independently at each spatial location $x$ — it does not look at neighboring points. It is the same linear map (same weights) at every $x$.
For Darcy flow: $P: \R^1 \to \R^{32}$. At each grid point $x$, the single scalar value $a(x)$ (permeability) is mapped to a 32-dimensional feature vector $v_0(x) \in \R^{32}$. This is equivalent to a fully-connected layer with weight matrix $W_P \in \R^{32 \times 1}$ and bias $b_P \in \R^{32}$.
Think of this like the first convolutional layer in image processing, but with a $1 \times 1$ kernel: no spatial mixing, just channel expansion. If the input were a grayscale image ($\da = 1$), $P$ converts it to a 32-channel feature map.
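A minimal NumPy sketch of the lift for the Darcy dimensions above; the random weights `W_P`, `b_P` stand in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# P: R^1 -> R^32, with weight matrix W_P and bias b_P as in the text.
d_a, d_v = 1, 32
W_P = rng.standard_normal((d_v, d_a))
b_P = rng.standard_normal(d_v)

# Input function a sampled on a 64x64 grid, one channel (permeability).
a = rng.standard_normal((64, 64, d_a))

# Pointwise lift: the same affine map at every grid location, no spatial mixing.
v0 = a @ W_P.T + b_P
print(v0.shape)   # (64, 64, 32)
```

Because the map touches only the channel axis, the same code works unchanged on any grid resolution.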
9. The Iterative Update (Definition 1)
Each iterative layer updates the hidden representation $v_t(x) \in \R^{\dv}$ to $v_{t+1}(x) \in \R^{\dv}$ using two parallel paths:

$$ v_{t+1}(x) = \sigma\Bigl( W v_t(x) + \bigl(\mathcal{K}(a;\varphi)\, v_t\bigr)(x) + b \Bigr), $$

where $W \in \R^{\dv \times \dv}$ is a pointwise linear map, $\mathcal{K}(a;\varphi)$ is a kernel integral operator (defined next), $b \in \R^{\dv}$ is a bias, and $\sigma$ is a nonlinear activation (ReLU).
The two paths do fundamentally different things:
- Local path: $W v_t(x)$ — a matrix multiply that only uses information at point $x$. It mixes channels but has zero spatial awareness. This is a $1 \times 1$ convolution.
- Global path: $(\mathcal{K} v_t)(x)$ — an integral over the entire domain, gathering information from all points $y \in D$. This is where spatial communication happens.
The outputs are summed, a bias $b$ is added, and the result passes through ReLU.
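The two-path update can be sketched in NumPy. The global path below is a toy stand-in (one matrix applied to the domain mean) rather than the real kernel integral, purely to make the local-plus-global structure concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_v = 4096, 32                      # grid points (64x64 flattened), channels

v_t = rng.standard_normal((n, d_v))
W = rng.standard_normal((d_v, d_v)) / np.sqrt(d_v)   # local path: 1x1 conv
b = rng.standard_normal(d_v)

# Toy stand-in for (K v_t)(x): a fixed matrix applied to the domain mean,
# so every point receives the same global summary. NOT the real operator.
K_mat = rng.standard_normal((d_v, d_v)) / np.sqrt(d_v)
global_path = np.tile(v_t.mean(axis=0) @ K_mat.T, (n, 1))

# ReLU(W v_t + K v_t + b): sum the two paths, add bias, apply the activation.
v_next = np.maximum(0.0, v_t @ W.T + global_path + b)
print(v_next.shape)   # (4096, 32)
```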
10. The Kernel Integral Operator (Definition 2)
The global path is an integral operator. The paper defines it precisely:

$$ \bigl(\mathcal{K}(a;\varphi)\, v_t\bigr)(x) = \int_D \kappa\bigl(x, y, a(x), a(y); \varphi\bigr)\, v_t(y)\, dy, $$

where $\kappa: \R^{2(d + \da)} \to \R^{\dv \times \dv}$ is a kernel function parameterized by a neural network with weights $\varphi$.
Let's break down every piece of this integral:
- For a fixed point $x$, we integrate over all points $y \in D$.
- The kernel $\kappa(x, y, a(x), a(y); \varphi)$ depends on:
  - Where $x$ is (the "query" point)
  - Where $y$ is (the "source" point)
  - The input function at both points: $a(x)$ and $a(y)$
  - Learned parameters $\varphi$
- $\kappa$ outputs a $\dv \times \dv$ matrix (for Darcy: $32 \times 32$), which multiplies $v_t(y) \in \R^{\dv}$ to produce a $\dv$-dimensional contribution.
- We integrate (sum) these contributions over all $y$, giving the result at $x$.
Concretely: at point $x = (0.5, 0.5)$, we sum up contributions from every other grid point $y$, each weighted by the kernel $\kappa$. Points where $\kappa$ is large contribute more; points where $\kappa$ is small contribute less.
Computational cost: This integral is $O(n^2)$. For each of $n$ grid points $x$, we sum over $n$ points $y$, and at each pair we evaluate $\kappa$ and multiply by $v_t(y)$. On a $64 \times 64$ grid ($n = 4096$), that's $\sim 16.8$ million kernel evaluations per layer. This is the bottleneck that the Fourier approach will solve in Chapter 3.
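A minimal NumPy sketch of the naive integral, with toy sizes (an $8 \times 8$ grid, 8 channels) so the $O(n^2)$ double loop finishes quickly; `kappa` here is one hypothetical linear layer standing in for the small network with weights $\varphi$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes so the O(n^2) loop is cheap: an 8x8 grid, 8 channels.
s, d_v, d_a = 8, 8, 1
n = s * s
xs, ys = np.meshgrid(np.linspace(0, 1, s), np.linspace(0, 1, s), indexing="ij")
pts = np.stack([xs.ravel(), ys.ravel()], axis=1)   # (n, 2) grid coordinates
a = rng.standard_normal((n, d_a))                  # input function samples
v_t = rng.standard_normal((n, d_v))

# kappa: R^{2(d + d_a)} -> R^{d_v x d_v}. One linear layer stands in for the
# small neural network with weights phi (an illustration, not the paper's net).
feat_dim = 2 * (2 + d_a)                           # features: (x, y, a(x), a(y))
W_phi = rng.standard_normal((d_v * d_v, feat_dim)) / feat_dim

def kappa(x, y, ax, ay):
    z = np.concatenate([x, y, ax, ay])
    return (W_phi @ z).reshape(d_v, d_v)

# Naive kernel integral: for each x, sum kappa(x,y,...) v_t(y) over all y.
out = np.zeros((n, d_v))
dy = 1.0 / n                                       # quadrature weight per point
for i in range(n):
    for j in range(n):
        out[i] += kappa(pts[i], pts[j], a[i], a[j]) @ v_t[j]
    out[i] *= dy
print(out.shape)   # (64, 8)
```

The double loop makes the $O(n^2)$ cost visible: $n^2$ kernel evaluations, each producing a $\dv \times \dv$ matrix.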
11. The Projection Operator Q
After $T$ iterative layers, the hidden representation $v_T(x) \in \R^{\dv}$ must be mapped back to the output space $\R^{\du}$. The projection operator $Q$ mirrors the lifting operator $P$:
$$ u(x) = Q\bigl(v_T(x)\bigr), \qquad Q: \R^{\dv} \to \R^{\du}. $$

Like $P$, this is pointwise: the same map (shared weights) applied independently at every spatial location. For Darcy flow: $Q: \R^{32} \to \R^1$ — 32 channels collapse to a single scalar pressure value. In the FNO paper, $Q$ is implemented as two FC layers with ReLU between them (a small MLP).
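A sketch of the projection as a pointwise two-layer MLP; the hidden width of 128 is an assumption, and the random weights stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_v, d_u, d_hidden = 32, 1, 128    # d_hidden is an assumed width for this sketch

v_T = rng.standard_normal((64, 64, d_v))
W1 = rng.standard_normal((d_hidden, d_v)) / np.sqrt(d_v)
b1 = np.zeros(d_hidden)
W2 = rng.standard_normal((d_u, d_hidden)) / np.sqrt(d_hidden)
b2 = np.zeros(d_u)

# Pointwise two-layer MLP with ReLU: same weights at every grid location.
u = np.maximum(0.0, v_T @ W1.T + b1) @ W2.T + b2
print(u.shape)   # (64, 64, 1)
```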
12. Putting It Together
Let's trace the full forward pass for our Darcy flow example with concrete dimensions. We use $T=4$ layers and $\dv = 32$ hidden channels.
| Stage | Operation | Output at each point $x$ |
|---|---|---|
| Input | — | $a(x) \in \R^1$ |
| Lift | $v_0 = P(a)$ | $v_0(x) \in \R^{32}$ |
| Layer 1 | $v_1 = \sigma(Wv_0 + \mathcal{K}v_0)$ | $v_1(x) \in \R^{32}$ |
| Layer 2 | $v_2 = \sigma(Wv_1 + \mathcal{K}v_1)$ | $v_2(x) \in \R^{32}$ |
| Layer 3 | $v_3 = \sigma(Wv_2 + \mathcal{K}v_2)$ | $v_3(x) \in \R^{32}$ |
| Layer 4 | $v_4 = \sigma(Wv_3 + \mathcal{K}v_3)$ | $v_4(x) \in \R^{32}$ |
| Project | $u = Q(v_4)$ | $u(x) \in \R^1$ |
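The full pipeline in the table above can be sketched end-to-end in NumPy, under a strong simplification: the kernel integral is collapsed to a fixed all-pairs spatial matrix `A` times a channel mix `M` (illustration only, not the paper's parameterization of $\kappa$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_a, d_v, d_u, T = 256, 1, 32, 1, 4      # 16x16 grid flattened; toy sizes

a = rng.standard_normal((n, d_a))

# Random stand-ins for trained parameters. A holds all-pairs spatial weights,
# M mixes channels; together they crudely imitate the kernel integral K.
W_P = rng.standard_normal((d_v, d_a)); b_P = np.zeros(d_v)
W_Q = rng.standard_normal((d_u, d_v)) / np.sqrt(d_v); b_Q = np.zeros(d_u)
layers = [(rng.standard_normal((d_v, d_v)) / np.sqrt(d_v),   # W (local path)
           rng.standard_normal((n, n)) / n,                  # A (spatial weights)
           rng.standard_normal((d_v, d_v)) / np.sqrt(d_v),   # M (channel mix)
           np.zeros(d_v)) for _ in range(T)]

v = a @ W_P.T + b_P                                          # lift: v_0 = P(a)
for W, A, M, b in layers:                                    # T iterative layers
    v = np.maximum(0.0, v @ W.T + A @ v @ M.T + b)           # ReLU(Wv + Kv + b)
u = v @ W_Q.T + b_Q                                          # project: u = Q(v_T)
print(u.shape)   # (256, 1)
```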
Now let's compare this to a standard neural network to see what's different:
| Property | Standard NN | Neural Operator |
|---|---|---|
| Input | Vector $x \in \R^n$ (fixed size) | Function $a: D \to \R^{\da}$ (any resolution) |
| Output | Vector $y \in \R^m$ (fixed size) | Function $u: D \to \R^{\du}$ (any resolution) |
| Spatial communication | Via weight matrix (all-to-all, fixed) | Via kernel integral (learned, continuous) |
| Resolution | Fixed: retrain for different input size | Flexible: same weights, any grid |
| Parameters scale with | Input dimension $n$ | Channel width $\dv$ (independent of $n$) |
Physics connection: The kernel integral operator is structurally identical to a Green's function representation. In classical PDE theory, solutions can be written as $u(x) = \int_D G(x,y)\,f(y)\,dy$, where $G$ is the Green's function. The neural operator learns a generalized, parameterized version of this integral representation — but one that adapts to the input function $a$, not just a fixed source $f$.
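For a concrete classical instance (a standard textbook fact, not taken from the FNO paper): the 1D Poisson problem $-u''(x) = f(x)$ on $[0,1]$ with $u(0) = u(1) = 0$ has the explicit Green's function

$$ G(x, y) = \begin{cases} y\,(1 - x), & y \le x, \\ x\,(1 - y), & y > x, \end{cases} \qquad u(x) = \int_0^1 G(x, y)\, f(y)\, dy. $$

The kernel integral operator plays the role of $G$, except that $\kappa$ is learned and may additionally depend on $a(x)$ and $a(y)$.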
The architecture is now fully specified — except for one critical question: how do we actually compute the kernel integral $\mathcal{K}$ efficiently? The naive approach costs $O(n^2)$, which is prohibitive. In the next chapter, we show how the Fourier transform reduces this to $O(n \log n)$.