## 18. The Full Fourier Layer
We now combine Definition 1 (the iterative update, Eq. 2) with Definition 3 (the Fourier integral operator, Eq. 4) to get the FNO layer — the core equation of the entire paper:

$$v_{t+1}(x) = \sigma\Big( W v_t(x) + \mathcal{F}^{-1}\big( R \cdot (\mathcal{F} v_t) \big)(x) \Big) \qquad (5)$$

This is Definition 1 with the kernel integral $\mathcal{K}$ replaced by the Fourier parameterization from Definition 3.
Equation (5) is the complete specification of one FNO layer. Every implementation detail follows from this equation plus the choice of hyperparameters ($\dv$, $k_{\max}$, activation $\sigma$).
## 19. Two Parallel Paths
Each FNO layer has two parallel data paths that serve complementary roles: the Fourier path $\mathcal{F}^{-1}(R \cdot \mathcal{F} v_t)$ and the local $W$ path $W v_t(x)$.

Why do we need both paths?
- Fourier path: Efficient global communication — every point talks to every other point via the Fourier transform. But it only captures low frequencies (due to mode truncation) and implicitly assumes periodic boundary conditions (a property of the discrete FFT).
- $W$ path: Captures high-frequency and localized features. It operates at full resolution with no periodicity assumption. It can represent sharp boundaries, discontinuities, and features that the truncated Fourier spectrum misses.
Key idea: $W$ is the unsung hero of FNO. Without it, FNO would be a pure low-pass filter — unable to represent sharp features or correct for the periodic boundary assumption. The Fourier path does the heavy lifting for smooth global structure; $W$ handles the local corrections.
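The low-pass claim is easy to see numerically. A minimal sketch (an illustration, not from the paper): truncate the FFT of a 1D step function to its lowest $k_{\max}$ modes and reconstruct; the sharp jump cannot be represented and the reconstruction rings around it.

```python
import numpy as np

# Illustrative sketch: mode truncation acts as a low-pass filter, so a
# spectral-only layer smooths sharp features (hypothetical 1D example).
s, k_max = 64, 12
x = np.arange(s) / s
step = (x > 0.5).astype(float)            # sharp discontinuity at x = 0.5

coeffs = np.fft.fft(step)
truncated = np.zeros_like(coeffs)
truncated[:k_max] = coeffs[:k_max]        # positive low frequencies
truncated[-k_max:] = coeffs[-k_max:]      # negative low frequencies (wrapped)
smoothed = np.fft.ifft(truncated).real

# Large error near the jump (Gibbs ringing); the pointwise W path has no
# such limitation because it never leaves physical space.
print(np.max(np.abs(step - smoothed)))
```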
## 20. Discrete Implementation with FFT
On a uniform $s_1 \times s_2$ grid (for Darcy: $64 \times 64$), the Fourier path is implemented with the discrete FFT. The paper provides the discrete formulas:

$$(\mathcal{F} v_t)(l_1, l_2) = \sum_{x_1=0}^{s_1-1} \sum_{x_2=0}^{s_2-1} v_t(x_1, x_2)\, e^{-2\pi i \left( \frac{x_1 l_1}{s_1} + \frac{x_2 l_2}{s_2} \right)},$$

$$(\mathcal{F}^{-1} \hat{v}_t)(x_1, x_2) = \frac{1}{s_1 s_2} \sum_{l_1=0}^{s_1-1} \sum_{l_2=0}^{s_2-1} \hat{v}_t(l_1, l_2)\, e^{2\pi i \left( \frac{x_1 l_1}{s_1} + \frac{x_2 l_2}{s_2} \right)},$$

where $l = (l_1, l_2)$ is the discrete frequency index, and $v_t(x_1, x_2) \in \R^{\dv}$ is the hidden representation at grid point $(x_1, x_2)$.
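The discrete forward transform is the standard unnormalized 2D DFT, which is exactly the convention `np.fft.fft2` uses. A quick sanity check (an illustrative sketch, not from the paper) evaluates the double sum directly and compares against the library:

```python
import numpy as np

# Verify the 2D DFT sum against the library FFT on a small grid.
s1, s2 = 4, 4
rng = np.random.default_rng(0)
v = rng.standard_normal((s1, s2))

# Direct evaluation of the sum over grid points for each frequency (l1, l2)
x1, x2 = np.meshgrid(np.arange(s1), np.arange(s2), indexing="ij")
direct = np.empty((s1, s2), dtype=complex)
for l1 in range(s1):
    for l2 in range(s2):
        phase = np.exp(-2j * np.pi * (x1 * l1 / s1 + x2 * l2 / s2))
        direct[l1, l2] = np.sum(v * phase)

# Same values, same frequency ordering, same (absent) normalization
assert np.allclose(direct, np.fft.fft2(v))
```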
The complete algorithm for the Fourier path in one layer:
1. Apply 2D FFT: Transform $v_t \in \R^{s_1 \times s_2 \times \dv}$ to $\hat{v}_t \in \mathbb{C}^{s_1 \times s_2 \times \dv}$. This uses a standard 2D FFT applied independently to each of the $\dv$ channels.
2. Extract low modes: Keep only the $(k_{\max,1}, k_{\max,2})$ lowest-frequency coefficients. In the FFT output, modes $0, 1, \ldots, k_{\max}-1$ are the positive low frequencies. (Negative low frequencies wrap to the end of the array: indices $s - k_{\max}, \ldots, s-1$.) Extract this block: $\hat{v}_t^{\text{low}} \in \mathbb{C}^{k_{\max} \times k_{\max} \times \dv}$.
3. Multiply by $R$: At each retained mode $(l_1, l_2)$, multiply $\hat{v}_t^{\text{low}}(l_1, l_2) \in \mathbb{C}^{\dv}$ by $R(l_1, l_2) \in \mathbb{C}^{\dv \times \dv}$. Result: $\hat{w}(l_1, l_2) \in \mathbb{C}^{\dv}$.
4. Zero-pad: Create a full-size tensor $\hat{w}_{\text{full}} \in \mathbb{C}^{s_1 \times s_2 \times \dv}$, initialized to zero. Insert $\hat{w}$ at the low-mode positions.
5. Apply inverse 2D FFT: Transform back to physical space. Take the real part: the Fourier path output $\in \R^{s_1 \times s_2 \times \dv}$.
Simultaneously, the local path computes $W \cdot v_t(x)$ at every grid point (a batched matrix-vector multiply). The two results are summed, bias is added, and ReLU is applied.
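The five steps plus the local path can be sketched in a few lines of NumPy (an illustrative sketch, not the paper's code; for brevity it keeps only the positive-frequency $k_{\max} \times k_{\max}$ block, whereas a faithful implementation also retains the wrapped negative-frequency blocks, typically via a real FFT):

```python
import numpy as np

def fourier_layer(v, R, W, b, k_max):
    """One FNO layer on an (s1, s2, dv) grid. Illustrative NumPy sketch."""
    v_hat = np.fft.fft2(v, axes=(0, 1))               # step 1: 2D FFT per channel
    low = v_hat[:k_max, :k_max]                       # step 2: low-frequency block
    mixed = np.einsum("xyc,xycd->xyd", low, R)        # step 3: per-mode dv x dv multiply
    w_hat = np.zeros_like(v_hat)
    w_hat[:k_max, :k_max] = mixed                     # step 4: zero-pad to full size
    spectral = np.fft.ifft2(w_hat, axes=(0, 1)).real  # step 5: inverse FFT, real part
    local = v @ W                                     # local path: pointwise W v_t(x)
    return np.maximum(spectral + local + b, 0.0)      # sum, add bias, ReLU

# Shapes for the Darcy configuration used in the text:
s, dv, k = 64, 32, 12
rng = np.random.default_rng(0)
v = rng.standard_normal((s, s, dv))
R = rng.standard_normal((k, k, dv, dv)) + 1j * rng.standard_normal((k, k, dv, dv))
W = rng.standard_normal((dv, dv)) * 0.1
b = np.zeros(dv)
out = fourier_layer(v, R, W, b, k)
assert out.shape == (s, s, dv)
```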
## 21. Full Forward Pass with Dimensions
Let's trace the complete FNO forward pass for Darcy flow with concrete tensor shapes. Grid: $64 \times 64$, channels $\dv = 32$, modes $k_{\max} = 12$, layers $T = 4$.
| Stage | Operation | Tensor Shape |
|---|---|---|
| Input | Permeability field $a$ on grid | $(64, 64, 1)$ |
| Lift $P$ | Pointwise FC: $\R^1 \to \R^{32}$ | $(64, 64, 32)$ |
| — Fourier Layer 1 — | | |
| FFT | 2D FFT per channel | $(64, 64, 32)$ complex |
| Extract | Keep low modes | $(12, 12, 32)$ complex |
| $\times R_1$ | $R_1 \in \mathbb{C}^{12 \times 12 \times 32 \times 32}$ | $(12, 12, 32)$ complex |
| Zero-pad | Insert into full-size tensor | $(64, 64, 32)$ complex |
| IFFT | Inverse 2D FFT, take real part | $(64, 64, 32)$ real |
| $+ W_1 v_0$ | Sum with local path | $(64, 64, 32)$ |
| ReLU | Activation | $(64, 64, 32)$ |
| — Fourier Layers 2–4: same shapes — | | |
| Project $Q$ | Pointwise MLP: $\R^{32} \to \R^{128} \to \R^1$ | $(64, 64, 1)$ |
| Output | Predicted pressure $u$ | $(64, 64, 1)$ |
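The table's shape bookkeeping can be checked end to end with randomly initialized weights (an illustrative NumPy sketch; like the table, it keeps only the positive-frequency mode block, and the weight names are ours):

```python
import numpy as np

# Trace the Darcy forward pass shapes from the table with random weights.
s, dv, k, T = 64, 32, 12, 4
rng = np.random.default_rng(0)

a = rng.standard_normal((s, s, 1))                 # input permeability field
P = rng.standard_normal((1, dv))                   # lift: R^1 -> R^32
v = a @ P
assert v.shape == (s, s, dv)                       # (64, 64, 32)

for t in range(T):                                 # four Fourier layers
    R = rng.standard_normal((k, k, dv, dv)) + 0j   # per-mode spectral weights
    W = rng.standard_normal((dv, dv)) * 0.05       # local path weights
    v_hat = np.fft.fft2(v, axes=(0, 1))
    assert v_hat.shape == (s, s, dv)               # (64, 64, 32) complex
    low = v_hat[:k, :k]
    assert low.shape == (k, k, dv)                 # (12, 12, 32) complex
    w_hat = np.zeros_like(v_hat)
    w_hat[:k, :k] = np.einsum("xyc,xycd->xyd", low, R)
    v = np.maximum(np.fft.ifft2(w_hat, axes=(0, 1)).real + v @ W, 0.0)
    assert v.shape == (s, s, dv)                   # (64, 64, 32) real

Q1 = rng.standard_normal((dv, 128))                # project: R^32 -> R^128 -> R^1
Q2 = rng.standard_normal((128, 1))
u = np.maximum(v @ Q1, 0.0) @ Q2
assert u.shape == (s, s, 1)                        # predicted pressure field
```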
## 22. Complexity Analysis
Let $n = s_1 \times s_2$ be the total number of grid points. For one layer:
| Method | Kernel Integration Cost | With $n=4096$ |
|---|---|---|
| Naive (Def. 2) | $O(n^2 \cdot \dv^2)$ | $4096^2 \times 32^2 \approx 1.7 \times 10^{10}$ |
| GNO (graph) | $O(n \cdot m \cdot \dv^2)$ ($m$ = neighbors) | $4096 \times 16 \times 1024 \approx 6.7 \times 10^{7}$ |
| FNO (Fourier) | $O(n \log n \cdot \dv + k_{\max}^d \cdot \dv^2)$ | $4096 \times 12 \times 32 + 144 \times 1024 \approx 1.7 \times 10^{6}$ |
Breaking down the FNO cost:
- FFT: $O(n \log n)$ per channel, times $\dv$ channels. For $n=4096$, $\dv=32$: $\approx 4096 \times 12 \times 32 \approx 1.6 \times 10^6$ operations.
- $R$ multiplication: At each of $k_{\max}^2 = 144$ retained modes, a $32 \times 32$ matrix–vector product: $144 \times 1024 \approx 1.5 \times 10^5$ operations.
- $W$ path: $n \times \dv^2 = 4096 \times 1024 \approx 4.2 \times 10^6$ operations.
- Total per layer: $\sim 5.9 \times 10^6$ — about 3000× cheaper than the naive kernel integral.
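The arithmetic above fits in a few lines (assuming $\log_2 4096 = 12$ as the FFT constant, as in the table):

```python
# Reproduce the per-layer operation counts from the breakdown above.
n, dv, k_max = 4096, 32, 12           # 64 x 64 grid, 32 channels, 12 modes

naive = n**2 * dv**2                  # dense kernel integral
fft = n * 12 * dv                     # n log2(n) per channel, log2(4096) = 12
r_mul = k_max**2 * dv**2              # per-mode dv x dv matrix-vector products
w_path = n * dv**2                    # pointwise linear map
fno_total = fft + r_mul + w_path

print(f"naive: {naive:.1e}")          # ~1.7e10
print(f"FNO:   {fno_total:.1e}")      # ~5.9e6
print(f"ratio: {naive / fno_total:.0f}x")
```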
The scaling story: as $n$ grows (finer grids), the FFT cost grows as $O(n \log n)$ while the $R$ multiplication stays constant (it depends only on $k_{\max}$, not $n$). The dominant cost shifts to the FFT, giving quasi-linear overall scaling.
Why this is remarkable: A traditional PDE solver (e.g., multigrid for Darcy) also scales roughly as $O(n)$ or $O(n \log n)$ per solve. But FNO does this for inference (one forward pass), while the traditional solver does it for one specific input. FNO trains once and then handles any input at the same cost. The expensive part — solving 1000 PDEs to generate training data — is a one-time upfront cost.
With the architecture fully specified and the complexity understood, we turn in the final chapter to how FNO performs on real PDE benchmarks — starting with Burgers' equation, then our Darcy flow running example, and finally Navier–Stokes.