Chapter 4

The Complete FNO Layer & Forward Pass

Combining the Fourier path and the local path into one layer, the discrete FFT implementation, a full dimensional walkthrough, and complexity analysis.

In this chapter
  1. The Full Fourier Layer
  2. Two Parallel Paths
  3. Discrete Implementation with FFT
  4. Full Forward Pass with Dimensions
  5. Complexity Analysis

18. The Full Fourier Layer

We now combine Definition 1 (the iterative update, Eq. 2) with Definition 3 (the Fourier integral operator, Eq. 4) to get the FNO layer — the core equation of the entire paper:

The FNO Layer
$$ v_{t+1}(x) = \sigma\!\Bigl(W\, v_t(x) \;+\; \Fti\!\bigl(R_\varphi \cdot (\Ft\, v_t)\bigr)(x) \;+\; b\Bigr) \tag{5} $$

This is Definition 1 with the kernel integral $\mathcal{K}$ replaced by the Fourier parameterization from Definition 3.

Equation (5) is the complete specification of one FNO layer. Every implementation detail follows from this equation plus the choice of hyperparameters ($\dv$, $k_{\max}$, activation $\sigma$).
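As a concrete illustration, Eq. (5) can be sketched in NumPy on a 1D grid. This is a minimal sketch, not the paper's implementation: the function name `fno_layer_1d`, the tensor layouts, and keeping only the positive low modes are assumptions made for brevity.

```python
import numpy as np

def fno_layer_1d(v, R, W, b, k_max):
    """One FNO layer, Eq. (5), on a 1D grid.

    v: (s, d_v) real      -- hidden representation at s grid points
    R: (k_max, d_v, d_v)  -- complex, learned per-mode channel mixing
    W: (d_v, d_v) real    -- local pointwise linear map
    b: (d_v,) real        -- bias
    """
    # Global path: FFT along the grid axis, independently per channel.
    v_hat = np.fft.fft(v, axis=0)                  # (s, d_v) complex
    w_hat = np.zeros_like(v_hat)
    # Multiply the k_max lowest positive modes by R; zero out the rest.
    w_hat[:k_max] = np.einsum('kij,kj->ki', R, v_hat[:k_max])
    spectral = np.fft.ifft(w_hat, axis=0).real     # back to physical space
    # Local path W v_t(x), plus bias, then the activation sigma = ReLU.
    return np.maximum(0.0, spectral + v @ W.T + b)
```

The `einsum` is the per-mode matrix-vector product $R(l)\,\hat{v}_t(l)$; everything outside the first `k_max` rows of `w_hat` stays zero, which is the low-pass truncation.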


19. Two Parallel Paths

Each FNO layer has two parallel data paths that serve complementary roles:

[Diagram: $v_t$ → FFT → keep low $k_{\max}$ modes → multiply by learned $R$ → zero-pad high modes → IFFT (global Fourier path); in parallel, $v_t$ → $W$ (local pointwise path); the two outputs sum and pass through $\sigma$ to produce $v_{t+1}$.]
Figure 4.1. Data flow in one FNO layer. The global Fourier path (top) handles low-frequency communication across the entire domain. The local $W$ path (bottom) handles high-frequency and pointwise features. They sum before the activation.

Why do we need both paths?

Key idea: $W$ is the unsung hero of FNO. Without it, FNO would be a pure low-pass filter — unable to represent sharp features or correct for the periodic boundary assumption. The Fourier path does the heavy lifting for smooth global structure; $W$ handles the local corrections.


20. Discrete Implementation with FFT

On a uniform $s_1 \times s_2$ grid (for Darcy: $64 \times 64$), the Fourier path is implemented with the discrete FFT. The paper provides the discrete formulas:

Discrete Fourier Transform (2D)
$$ (\Ft\, v_t)_l = \sum_{x_1=0}^{s_1-1} \sum_{x_2=0}^{s_2-1} v_t(x_1, x_2)\, e^{-2\pi i\left(\frac{x_1 l_1}{s_1} + \frac{x_2 l_2}{s_2}\right)} \tag{6} $$

where $l = (l_1, l_2)$ is the discrete frequency index, and $v_t(x_1, x_2) \in \R^{\dv}$ is the hidden representation at grid point $(x_1, x_2)$.
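Eq. (6) uses the same sign convention as NumPy's `fft2`, which can be checked directly on a tiny grid (the helper `dft2_direct` below is ours, written only for this check):

```python
import numpy as np

def dft2_direct(v, l1, l2):
    """Evaluate Eq. (6) for one frequency pair (l1, l2) by direct summation."""
    s1, s2 = v.shape
    x1 = np.arange(s1)[:, None]
    x2 = np.arange(s2)[None, :]
    phase = np.exp(-2j * np.pi * (x1 * l1 / s1 + x2 * l2 / s2))
    return (v * phase).sum()

rng = np.random.default_rng(1)
v = rng.standard_normal((4, 4))         # one channel of v_t on a 4x4 grid
v_hat = np.fft.fft2(v)
assert np.allclose(v_hat[2, 3], dft2_direct(v, 2, 3))
```

In practice the sum in Eq. (6) is never evaluated directly; the FFT computes all $s_1 s_2$ coefficients at once in $O(s_1 s_2 \log(s_1 s_2))$.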

The complete algorithm for the Fourier path in one layer:

  1. Apply 2D FFT: Transform $v_t \in \R^{s_1 \times s_2 \times \dv}$ to $\hat{v}_t \in \mathbb{C}^{s_1 \times s_2 \times \dv}$. This uses a standard 2D FFT applied independently to each of the $\dv$ channels.
  2. Extract low modes: Keep only the $(k_{\max,1}, k_{\max,2})$ lowest-frequency coefficients. In the FFT output, modes $0, 1, \ldots, k_{\max}-1$ are the positive low frequencies. (Negative low frequencies wrap to the end of the array: indices $s - k_{\max}, \ldots, s-1$.) Extract this block: $\hat{v}_t^{\text{low}} \in \mathbb{C}^{k_{\max} \times k_{\max} \times \dv}$.
  3. Multiply by $R$: At each retained mode $(l_1, l_2)$, multiply $\hat{v}_t^{\text{low}}(l_1, l_2) \in \mathbb{C}^{\dv}$ by $R(l_1, l_2) \in \mathbb{C}^{\dv \times \dv}$. Result: $\hat{w}(l_1, l_2) \in \mathbb{C}^{\dv}$.
  4. Zero-pad: Create a full-size tensor $\hat{w}_{\text{full}} \in \mathbb{C}^{s_1 \times s_2 \times \dv}$, initialized to zero. Insert $\hat{w}$ at the low-mode positions.
  5. Apply inverse 2D FFT: Transform back to physical space. Take the real part: the Fourier path output $\in \R^{s_1 \times s_2 \times \dv}$.

Simultaneously, the local path computes $W \cdot v_t(x)$ at every grid point (a batched matrix-vector multiply). The two results are summed, bias is added, and ReLU is applied.
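The five steps plus the local path can be sketched in NumPy. This is a simplified sketch, not reference code: for brevity it keeps only the positive-frequency corner, whereas a full implementation would also retain the negative-frequency block at indices $s - k_{\max}, \ldots, s-1$ as described in step 2.

```python
import numpy as np

def fourier_layer_2d(v, R, W, b, k_max):
    """One 2D FNO layer (positive modes only, for brevity).

    v: (s1, s2, d_v) real
    R: (k_max, k_max, d_v, d_v) complex
    W: (d_v, d_v) real, b: (d_v,) real
    """
    # Step 1: 2D FFT, applied independently to each of the d_v channels.
    v_hat = np.fft.fft2(v, axes=(0, 1))            # (s1, s2, d_v) complex
    # Step 2: extract the k_max x k_max lowest positive modes.
    low = v_hat[:k_max, :k_max, :]                 # (k_max, k_max, d_v)
    # Step 3: per-mode channel mixing, w(l) = R(l) v_hat(l).
    w_low = np.einsum('xyij,xyj->xyi', R, low)
    # Step 4: zero-pad back to full size.
    w_hat = np.zeros_like(v_hat)
    w_hat[:k_max, :k_max, :] = w_low
    # Step 5: inverse 2D FFT, take the real part.
    spectral = np.fft.ifft2(w_hat, axes=(0, 1)).real
    # Local path W v_t(x) at every grid point, bias, ReLU.
    return np.maximum(0.0, spectral + v @ W.T + b)
```

The local path `v @ W.T` broadcasts over the two grid axes, so it is exactly the batched matrix-vector multiply described above.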


21. Full Forward Pass with Dimensions

Let's trace the complete FNO forward pass for Darcy flow with concrete tensor shapes. Grid: $64 \times 64$, channels $\dv = 32$, modes $k_{\max} = 12$, layers $T = 4$.

Stage | Operation | Tensor Shape
Input | Permeability field $a$ on grid | $(64, 64, 1)$
Lift $P$ | Pointwise FC: $\R^1 \to \R^{32}$ | $(64, 64, 32)$
— Fourier Layer 1 — | |
FFT | 2D FFT per channel | $(64, 64, 32)$ complex
Extract | Keep low modes | $(12, 12, 32)$ complex
$\times R_1$ | $R_1 \in \mathbb{C}^{12 \times 12 \times 32 \times 32}$ | $(12, 12, 32)$ complex
Zero-pad | Insert into full-size tensor | $(64, 64, 32)$ complex
IFFT | Inverse 2D FFT, take real part | $(64, 64, 32)$ real
$+ W_1 v_0$ | Sum with local path | $(64, 64, 32)$
ReLU | Activation | $(64, 64, 32)$
— Fourier Layers 2–4: same shapes — | |
Project $Q$ | Pointwise MLP: $\R^{32} \to \R^{128} \to \R^1$ | $(64, 64, 1)$
Output | Predicted pressure $u$ | $(64, 64, 1)$
[Diagram: $64\times 64\times 1$ input → $P$ (32 params) → Fourier Layers 1–4, each $64\times 64\times 32$ (~296K params each, $4 \times 296\text{K} \approx 1.18$M) → $Q$ (4.2K params) → $64\times 64\times 1$ output. Total: ~1.19M parameters, independent of grid resolution — same weights for $16\times 16$, $64\times 64$, or $256\times 256$.]
Figure 4.2. Complete FNO architecture for 2D Darcy flow. Four Fourier layers with 32 channels, $k_{\max}=12$ modes. Total ~1.19M parameters, independent of the spatial grid resolution.
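The parameter count in Figure 4.2 can be reproduced with a few lines of arithmetic. This is our own tally under stated assumptions: each complex entry of $R$ counts as two real parameters, and $P$, $W$, and $Q$ each carry a bias; exact totals shift slightly with different bias conventions.

```python
# Parameter count for the Darcy FNO: d_v = 32, k_max = 12, T = 4 layers.
d_v, k_max, T = 32, 12, 4

p_lift = 1 * d_v + d_v                        # P: R^1 -> R^32, weights + bias
p_R = 2 * k_max * k_max * d_v * d_v           # R per layer (complex -> x2 real)
p_W = d_v * d_v + d_v                         # local path W + bias
p_layer = p_R + p_W                           # ~296K per Fourier layer
p_proj = (d_v * 128 + 128) + (128 * 1 + 1)    # Q: 32 -> 128 -> 1, ~4.2K

total = p_lift + T * p_layer + p_proj         # ~1.19M
print(p_layer, total)
```

Nothing in the count involves the grid size $s_1 \times s_2$, which is why the same weights apply at any resolution.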

22. Complexity Analysis

Let $n = s_1 \times s_2$ be the total number of grid points. For one layer:

Method | Kernel Integration Cost | With $n = 4096$
Naive (Def. 2) | $O(n^2 \cdot \dv^2)$ | $4096^2 \times 32^2 \approx 1.7 \times 10^{10}$
GNO (graph) | $O(n \cdot m \cdot \dv^2)$ ($m$ = neighbors) | $4096 \times 16 \times 1024 \approx 6.7 \times 10^{7}$
FNO (Fourier) | $O(n \log n \cdot \dv + k_{\max}^d \cdot \dv^2)$ | $4096 \times 12 \times 32 + 144 \times 1024 \approx 1.7 \times 10^{6}$
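
The table's last column is simple arithmetic and can be checked directly (using $\log_2 4096 = 12$ and $m = 16$ neighbors for GNO):

```python
import math

n, d_v, k_max, m = 4096, 32, 12, 16

naive = n**2 * d_v**2                                 # dense kernel integration
gno = n * m * d_v**2                                  # graph kernel, m neighbors
fno = n * math.log2(n) * d_v + k_max**2 * d_v**2      # FFT + mode mixing

print(f"naive={naive:.1e}  gno={gno:.1e}  fno={fno:.1e}")
# FNO is roughly four orders of magnitude cheaper than the naive integral.
```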

Breaking down the FNO cost into its three terms:

  1. FFT and inverse FFT: $O(\dv \cdot n \log n)$, one transform per channel.
  2. Mode-wise multiplication by $R$: $O(k_{\max}^d \cdot \dv^2)$ — independent of $n$.
  3. Local path $W$: $O(n \cdot \dv^2)$, a pointwise matrix-vector multiply at each grid point.

The scaling story: as $n$ grows (finer grids), the FFT cost grows as $O(n \log n)$ while the $R$ multiplication stays constant (it depends only on $k_{\max}$, not $n$). The dominant cost shifts to the FFT, giving quasi-linear overall scaling.

Why this is remarkable: A traditional PDE solver (e.g., multigrid for Darcy) also scales roughly as $O(n)$ or $O(n \log n)$ — but per solve, for one specific input. FNO pays the same cost per forward pass yet, once trained, handles any input at that cost. The expensive part — solving 1000 PDEs to generate training data — is a one-time upfront cost.

With the architecture fully specified and the complexity understood, we turn in the final chapter to how FNO performs on real PDE benchmarks — starting with Burgers' equation, then our Darcy flow running example, and finally Navier–Stokes.