## 18. The Full Fourier Layer
We now combine Definition 1 (the iterative update, Eq. 2) with Definition 3 (the Fourier integral operator, Eq. 4) to get the FNO layer — the core equation of the entire paper:

$$v_{t+1}(x) = \sigma\Big( W v_t(x) + \mathcal{F}^{-1}\big( R \cdot (\mathcal{F} v_t) \big)(x) \Big) \qquad (5)$$

This is Definition 1 with the kernel integral $\mathcal{K}$ replaced by the Fourier parameterization from Definition 3.
Equation (5) is the complete specification of one FNO layer. Every implementation detail follows from this equation plus the choice of hyperparameters ($\dv$, $k_{\max}$, activation $\sigma$).
## 19. Two Parallel Paths
Each FNO layer has two parallel data paths that serve complementary roles: the Fourier path $\mathcal{F}^{-1}(R \cdot \mathcal{F} v_t)$ and the local $W$ path $W v_t(x)$.

Why do we need both paths?
- Fourier path: Efficient global communication — every point talks to every other point via the Fourier transform. But it only captures low frequencies (due to mode truncation) and implicitly assumes periodic boundary conditions (a property of the discrete FFT).
- $W$ path: Captures high-frequency and localized features. It operates at full resolution with no periodicity assumption. It can represent sharp boundaries, discontinuities, and features that the truncated Fourier spectrum misses.
Key idea: $W$ is the unsung hero of FNO. Without it, FNO would be a pure low-pass filter — unable to represent sharp features or correct for the periodic boundary assumption. The Fourier path does the heavy lifting for smooth global structure; $W$ handles the local corrections.
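The low-pass claim is easy to see numerically. A minimal sketch (an illustration, not from the paper): truncate the FFT of a 1D step function to its lowest $k_{\max}$ modes and reconstruct; the sharp jump cannot be represented and the reconstruction rings around it.

```python
import numpy as np

# Illustrative sketch: mode truncation acts as a low-pass filter, so a
# spectral-only layer smooths sharp features (hypothetical 1D example).
s, k_max = 64, 12
x = np.arange(s) / s
step = (x > 0.5).astype(float)            # sharp discontinuity at x = 0.5

coeffs = np.fft.fft(step)
truncated = np.zeros_like(coeffs)
truncated[:k_max] = coeffs[:k_max]        # positive low frequencies
truncated[-k_max:] = coeffs[-k_max:]      # negative low frequencies (wrapped)
smoothed = np.fft.ifft(truncated).real

# Large error near the jump (Gibbs ringing); the pointwise W path has no
# such limitation because it never leaves physical space.
print(np.max(np.abs(step - smoothed)))
```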
## 20. Discrete Implementation with FFT
On a uniform $s_1 \times s_2$ grid (for Darcy: $64 \times 64$), the Fourier path is implemented with the discrete FFT. The paper provides the discrete formulas:

$$(\mathcal{F} v_t)(l_1, l_2) = \sum_{x_1=0}^{s_1-1} \sum_{x_2=0}^{s_2-1} v_t(x_1, x_2)\, e^{-2\pi i \left( \frac{x_1 l_1}{s_1} + \frac{x_2 l_2}{s_2} \right)},$$

$$(\mathcal{F}^{-1} \hat{v}_t)(x_1, x_2) = \frac{1}{s_1 s_2} \sum_{l_1=0}^{s_1-1} \sum_{l_2=0}^{s_2-1} \hat{v}_t(l_1, l_2)\, e^{2\pi i \left( \frac{x_1 l_1}{s_1} + \frac{x_2 l_2}{s_2} \right)},$$

where $l = (l_1, l_2)$ is the discrete frequency index, and $v_t(x_1, x_2) \in \R^{\dv}$ is the hidden representation at grid point $(x_1, x_2)$.
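The discrete forward transform is the standard unnormalized 2D DFT, which is exactly the convention `np.fft.fft2` uses. A quick sanity check (an illustrative sketch, not from the paper) evaluates the double sum directly and compares against the library:

```python
import numpy as np

# Verify the 2D DFT sum against the library FFT on a small grid.
s1, s2 = 4, 4
rng = np.random.default_rng(0)
v = rng.standard_normal((s1, s2))

# Direct evaluation of the sum over grid points for each frequency (l1, l2)
x1, x2 = np.meshgrid(np.arange(s1), np.arange(s2), indexing="ij")
direct = np.empty((s1, s2), dtype=complex)
for l1 in range(s1):
    for l2 in range(s2):
        phase = np.exp(-2j * np.pi * (x1 * l1 / s1 + x2 * l2 / s2))
        direct[l1, l2] = np.sum(v * phase)

# Same values, same frequency ordering, same (absent) normalization
assert np.allclose(direct, np.fft.fft2(v))
```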
The complete algorithm for the Fourier path in one layer:
1. Apply 2D FFT: Transform $v_t \in \R^{s_1 \times s_2 \times \dv}$ to $\hat{v}_t \in \mathbb{C}^{s_1 \times s_2 \times \dv}$. This uses a standard 2D FFT applied independently to each of the $\dv$ channels.
2. Extract low modes: Keep only the $(k_{\max,1}, k_{\max,2})$ lowest-frequency coefficients. In the FFT output, modes $0, 1, \ldots, k_{\max}-1$ are the positive low frequencies. (Negative low frequencies wrap to the end of the array: indices $s - k_{\max}, \ldots, s-1$.) Extract this block: $\hat{v}_t^{\text{low}} \in \mathbb{C}^{k_{\max} \times k_{\max} \times \dv}$.
3. Multiply by $R$: At each retained mode $(l_1, l_2)$, multiply $\hat{v}_t^{\text{low}}(l_1, l_2) \in \mathbb{C}^{\dv}$ by $R(l_1, l_2) \in \mathbb{C}^{\dv \times \dv}$. Result: $\hat{w}(l_1, l_2) \in \mathbb{C}^{\dv}$.
4. Zero-pad: Create a full-size tensor $\hat{w}_{\text{full}} \in \mathbb{C}^{s_1 \times s_2 \times \dv}$, initialized to zero. Insert $\hat{w}$ at the low-mode positions.
5. Apply inverse 2D FFT: Transform back to physical space. Take the real part: the Fourier path output $\in \R^{s_1 \times s_2 \times \dv}$.
Simultaneously, the local path computes $W \cdot v_t(x)$ at every grid point (a batched matrix-vector multiply). The two results are summed, bias is added, and ReLU is applied.
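The five steps plus the local path can be sketched in a few lines of NumPy (an illustrative sketch, not the paper's code; for brevity it keeps only the positive-frequency $k_{\max} \times k_{\max}$ block, whereas a faithful implementation also retains the wrapped negative-frequency blocks, typically via a real FFT):

```python
import numpy as np

def fourier_layer(v, R, W, b, k_max):
    """One FNO layer on an (s1, s2, dv) grid. Illustrative NumPy sketch."""
    v_hat = np.fft.fft2(v, axes=(0, 1))               # step 1: 2D FFT per channel
    low = v_hat[:k_max, :k_max]                       # step 2: low-frequency block
    mixed = np.einsum("xyc,xycd->xyd", low, R)        # step 3: per-mode dv x dv multiply
    w_hat = np.zeros_like(v_hat)
    w_hat[:k_max, :k_max] = mixed                     # step 4: zero-pad to full size
    spectral = np.fft.ifft2(w_hat, axes=(0, 1)).real  # step 5: inverse FFT, real part
    local = v @ W                                     # local path: pointwise W v_t(x)
    return np.maximum(spectral + local + b, 0.0)      # sum, add bias, ReLU

# Shapes for the Darcy configuration used in the text:
s, dv, k = 64, 32, 12
rng = np.random.default_rng(0)
v = rng.standard_normal((s, s, dv))
R = rng.standard_normal((k, k, dv, dv)) + 1j * rng.standard_normal((k, k, dv, dv))
W = rng.standard_normal((dv, dv)) * 0.1
b = np.zeros(dv)
out = fourier_layer(v, R, W, b, k)
assert out.shape == (s, s, dv)
```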
## 21. Full Forward Pass with Dimensions
Let's trace the complete FNO forward pass for Darcy flow with concrete tensor shapes. Grid: $64 \times 64$, channels $\dv = 32$, modes $k_{\max} = 12$, layers $T = 4$.
| Stage | Operation | Tensor Shape |
|---|---|---|
| Input | Permeability field $a$ on grid | $(64, 64, 1)$ |
| Lift $P$ | Pointwise FC: $\R^1 \to \R^{32}$ | $(64, 64, 32)$ |
| — Fourier Layer 1 — | | |
| FFT | 2D FFT per channel | $(64, 64, 32)$ complex |
| Extract | Keep low modes | $(12, 12, 32)$ complex |
| $\times R_1$ | $R_1 \in \mathbb{C}^{12 \times 12 \times 32 \times 32}$ | $(12, 12, 32)$ complex |
| Zero-pad | Insert into full-size tensor | $(64, 64, 32)$ complex |
| IFFT | Inverse 2D FFT, take real part | $(64, 64, 32)$ real |
| $+ W_1 v_0$ | Sum with local path | $(64, 64, 32)$ |
| ReLU | Activation | $(64, 64, 32)$ |
| — Fourier Layers 2–4: same shapes — | | |
| Project $Q$ | Pointwise MLP: $\R^{32} \to \R^{128} \to \R^1$ | $(64, 64, 1)$ |
| Output | Predicted pressure $u$ | $(64, 64, 1)$ |
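The table's shape bookkeeping can be checked end to end with randomly initialized weights (an illustrative NumPy sketch; like the table, it keeps only the positive-frequency mode block, and the weight names are ours):

```python
import numpy as np

# Trace the Darcy forward pass shapes from the table with random weights.
s, dv, k, T = 64, 32, 12, 4
rng = np.random.default_rng(0)

a = rng.standard_normal((s, s, 1))                 # input permeability field
P = rng.standard_normal((1, dv))                   # lift: R^1 -> R^32
v = a @ P
assert v.shape == (s, s, dv)                       # (64, 64, 32)

for t in range(T):                                 # four Fourier layers
    R = rng.standard_normal((k, k, dv, dv)) + 0j   # per-mode spectral weights
    W = rng.standard_normal((dv, dv)) * 0.05       # local path weights
    v_hat = np.fft.fft2(v, axes=(0, 1))
    assert v_hat.shape == (s, s, dv)               # (64, 64, 32) complex
    low = v_hat[:k, :k]
    assert low.shape == (k, k, dv)                 # (12, 12, 32) complex
    w_hat = np.zeros_like(v_hat)
    w_hat[:k, :k] = np.einsum("xyc,xycd->xyd", low, R)
    v = np.maximum(np.fft.ifft2(w_hat, axes=(0, 1)).real + v @ W, 0.0)
    assert v.shape == (s, s, dv)                   # (64, 64, 32) real

Q1 = rng.standard_normal((dv, 128))                # project: R^32 -> R^128 -> R^1
Q2 = rng.standard_normal((128, 1))
u = np.maximum(v @ Q1, 0.0) @ Q2
assert u.shape == (s, s, 1)                        # predicted pressure field
```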
## 22. Complexity Analysis
Let $n = s_1 \times s_2$ be the total number of grid points. For one layer:
| Method | Kernel Integration Cost | With $n=4096$ |
|---|---|---|
| Naive (Def. 2) | $O(n^2 \cdot \dv^2)$ | $4096^2 \times 32^2 \approx 1.7 \times 10^{10}$ |
| GNO (graph) | $O(n \cdot m \cdot \dv^2)$ ($m$ = neighbors) | $4096 \times 16 \times 1024 \approx 6.7 \times 10^{7}$ |
| FNO (Fourier) | $O(n \log n \cdot \dv + k_{\max}^d \cdot \dv^2)$ | $4096 \times 12 \times 32 + 144 \times 1024 \approx 1.7 \times 10^{6}$ |
Breaking down the FNO cost:
- FFT: $O(n \log n)$ per channel, times $\dv$ channels. For $n=4096$, $\dv=32$: $\approx 4096 \times 12 \times 32 \approx 1.6 \times 10^6$ operations.
- $R$ multiplication: At each of $k_{\max}^2 = 144$ retained modes, a $32 \times 32$ matrix–vector product: $144 \times 1024 \approx 1.5 \times 10^5$ operations.
- $W$ path: $n \times \dv^2 = 4096 \times 1024 \approx 4.2 \times 10^6$ operations.
- Total per layer: $\sim 5.9 \times 10^6$ — about 3000× cheaper than the naive kernel integral.
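The arithmetic above fits in a few lines (assuming $\log_2 4096 = 12$ as the FFT constant, as in the table):

```python
# Reproduce the per-layer operation counts from the breakdown above.
n, dv, k_max = 4096, 32, 12           # 64 x 64 grid, 32 channels, 12 modes

naive = n**2 * dv**2                  # dense kernel integral
fft = n * 12 * dv                     # n log2(n) per channel, log2(4096) = 12
r_mul = k_max**2 * dv**2              # per-mode dv x dv matrix-vector products
w_path = n * dv**2                    # pointwise linear map
fno_total = fft + r_mul + w_path

print(f"naive: {naive:.1e}")          # ~1.7e10
print(f"FNO:   {fno_total:.1e}")      # ~5.9e6
print(f"ratio: {naive / fno_total:.0f}x")
```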
The scaling story: as $n$ grows (finer grids), the FFT cost grows as $O(n \log n)$ while the $R$ multiplication stays constant (it depends only on $k_{\max}$, not $n$). The dominant cost shifts to the FFT, giving quasi-linear overall scaling.
Why this is remarkable: A traditional PDE solver (e.g., multigrid for Darcy) also scales roughly as $O(n)$ or $O(n \log n)$ per solve. But FNO does this for inference (one forward pass), while the traditional solver does it for one specific input. FNO trains once and then handles any input at the same cost. The expensive part — solving 1000 PDEs to generate training data — is a one-time upfront cost.
With the architecture fully specified and the complexity understood, we turn in the final chapter to how FNO performs on real PDE benchmarks — starting with Burgers' equation, then our Darcy flow running example, and finally Navier–Stokes.