13 Brief Fourier Refresher
The Fourier transform decomposes a function into its frequency components. For a function $f: D \to \R$ on a periodic domain, the continuous Fourier transform pair is:
$$ (\Ft f)(k) = \int_D f(x)\, e^{-2\pi i \langle k, x \rangle}\, dx, \qquad (\Fti \hat{f})(x) = \sum_{k \in \Z^d} \hat{f}(k)\, e^{2\pi i \langle k, x \rangle}. $$Here $k \in \Z^d$ is the frequency (or mode) index. Low $|k|$ corresponds to smooth, slowly-varying components; high $|k|$ corresponds to rapidly oscillating components.
The single most important fact for this chapter:
Convolution in physical space is equivalent to pointwise multiplication in Fourier space:
$$ \Ft(f * g) = \Ft(f) \cdot \Ft(g) $$Equivalently: $f * g = \Fti\bigl(\Ft(f) \cdot \Ft(g)\bigr)$.
Why does this matter? A convolution integral $\int f(x-y)\,g(y)\,dy$ costs $O(n^2)$ in physical space. But by the convolution theorem, we can compute it as: FFT both functions ($O(n\log n)$), multiply pointwise ($O(n)$), inverse FFT ($O(n\log n)$). Total: $O(n \log n)$ instead of $O(n^2)$.
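This speedup is easy to verify numerically. A minimal numpy sketch, using circular convolution to match the periodic-domain setting (sizes and data are our own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
f, g = rng.standard_normal(n), rng.standard_normal(n)

# Direct circular convolution: a length-n sum for each of the n outputs -> O(n^2).
direct = np.array([np.sum(f[(i - np.arange(n)) % n] * g) for i in range(n)])

# Convolution theorem: FFT, pointwise multiply, inverse FFT -> O(n log n).
fast = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

assert np.allclose(direct, fast)
```

For $n = 256$ the direct sum already does $65{,}536$ multiplications; the FFT route does a few thousand.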
14 Translation-Invariant Kernels
Recall from Definition 2 that the kernel integral is:
$$ (\mathcal{K}\, v_t)(x) = \int_D \kappa(x, y, a(x), a(y))\, v_t(y)\, dy. $$This kernel depends on both $x$ and $y$ independently, plus $a(x)$ and $a(y)$. Now the paper makes a simplifying assumption: what if the kernel depends only on the relative position $x - y$?
$$ \kappa(x, y) \;\to\; \kappa(x - y) $$This is called translation invariance: the kernel doesn't care about absolute position, only the displacement between points. With this simplification, the integral becomes a convolution:
$$ (\mathcal{K}\, v_t)(x) = \int_D \kappa(x - y)\, v_t(y)\, dy = (\kappa * v_t)(x). $$And by the convolution theorem:
$$ \mathcal{K}\, v_t = \Fti\!\bigl(\Ft(\kappa) \cdot \Ft(v_t)\bigr). $$Note the trade-off: we dropped the dependence on $a(x)$ and $a(y)$ from the kernel. This means the kernel can no longer adapt to the input function. However, the $W\,v_t(x)$ path (the local, pointwise path from Definition 1) still sees the lifted representation of $a$, so information about $a$ is not entirely lost.
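On a discrete periodic grid this structure is easy to see: a translation-invariant kernel $\kappa(x - y)$ produces a circulant kernel matrix, and applying it agrees exactly with multiplication in Fourier space. A small numpy check (grid size and values are arbitrary choices of ours):

```python
import numpy as np

n, rng = 8, np.random.default_rng(0)
kappa = rng.standard_normal(n)    # kernel values kappa(x - y) on a periodic grid
v = rng.standard_normal(n)

# Translation invariance makes the kernel matrix circulant: K[i, j] = kappa[(i - j) % n].
K = np.array([[kappa[(i - j) % n] for j in range(n)] for i in range(n)])

# The kernel integral (here a matrix-vector product) equals an FFT-space multiplication.
assert np.allclose(K @ v, np.fft.ifft(np.fft.fft(kappa) * np.fft.fft(v)).real)
```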
15 The Key Insight (Definition 3)
The convolution theorem says we could learn $\kappa$ in physical space, Fourier-transform it, multiply, and inverse-transform. But the paper makes a bolder move:
Key idea: Don't learn $\kappa$ in physical space at all. Learn the Fourier-space representation $R_\varphi$ directly. The kernel $\kappa$ never needs to exist — we parameterize $R = \Ft(\kappa)$ as a learnable tensor and work entirely in frequency space.
$$ (\mathcal{K}\, v_t)(x) = \Fti\!\bigl(R_\varphi \cdot (\Ft\, v_t)\bigr)(x), $$where $R_\varphi$ is a learnable complex-valued tensor parameterized by $\varphi$. For each frequency mode $k$: $R_\varphi(k) \in \mathbb{C}^{\dv \times \dv}$, and $(\Ft\, v_t)(k) \in \mathbb{C}^{\dv}$. The "multiplication" $R \cdot \Ft(v_t)$ is a matrix–vector product at each frequency mode.
Let's unpack the computation step by step:
- Fourier transform $v_t$: At each frequency mode $k$, get $(\Ft\, v_t)(k) \in \mathbb{C}^{\dv}$ — a complex vector of $\dv$ channels.
- Multiply by $R_\varphi(k)$: At each mode $k$, apply the learned $\dv \times \dv$ complex matrix $R_\varphi(k)$ to the vector $(\Ft\, v_t)(k)$. This mixes channels at that frequency.
- Inverse Fourier transform: Transform back to physical space to get $(\mathcal{K}\, v_t)(x)$.
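The three steps can be sketched in a few lines of numpy. This is a deliberately minimal 1D version with random weights and invented names (`spectral_conv`, `R`), not the paper's implementation; it uses the real FFT, so only non-negative modes appear:

```python
import numpy as np

def spectral_conv(v, R):
    """One Fourier integral operator applied to v of shape (d_v, n).

    R has shape (k_max, d_v, d_v): one complex channel-mixing matrix
    per retained frequency mode.
    """
    k_max = R.shape[0]
    v_hat = np.fft.rfft(v, axis=-1)          # step 1: (d_v, n//2 + 1) complex coefficients
    out_hat = np.zeros_like(v_hat)
    for k in range(k_max):                   # step 2: mix channels at each retained mode
        out_hat[:, k] = R[k] @ v_hat[:, k]   # modes beyond k_max stay zero (truncation)
    return np.fft.irfft(out_hat, n=v.shape[-1], axis=-1)  # step 3: back to physical space

rng = np.random.default_rng(0)
d_v, n, k_max = 4, 64, 8
v = rng.standard_normal((d_v, n))
R = rng.standard_normal((k_max, d_v, d_v)) + 1j * rng.standard_normal((k_max, d_v, d_v))
out = spectral_conv(v, R)                    # real array, shape (4, 64)
```

Note that `R` is indexed by mode, not by grid point: nothing in its shape mentions `n`.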
The crucial point: $R_\varphi$ is defined on frequency modes, not on spatial grid points. Its shape depends on how many modes we keep (a hyperparameter), not on the grid resolution $n$. This is the source of resolution invariance.
16 What R Looks Like Concretely
For 2D Darcy flow with $\dv = 32$ channels and $k_{\max} = 12$ retained modes in each spatial dimension:
$$ R_\varphi \in \mathbb{C}^{k_{\max} \times k_{\max} \times \dv \times \dv} = \mathbb{C}^{12 \times 12 \times 32 \times 32}. $$At each of the $12 \times 12 = 144$ retained frequency modes, $R$ is a $32 \times 32$ complex-valued matrix. This matrix mixes channels — it decides how the 32 features interact at that particular frequency.
Parameter count per layer: Since each complex number has a real and imaginary part, the total number of real-valued parameters in $R$ is:
$$ 2 \times 12 \times 12 \times 32 \times 32 = 294{,}912 \text{ parameters.} $$Add the local path's $W \in \R^{32 \times 32}$, another $1{,}024$ parameters, and each Fourier layer has about 296K parameters total. With 4 layers, that's roughly 1.2M parameters — small by modern standards.
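The arithmetic in one short sketch (values from this section):

```python
k_max, d_v, n_layers = 12, 32, 4

r_params = 2 * k_max * k_max * d_v * d_v   # real + imaginary parts of R
w_params = d_v * d_v                       # local linear map W
per_layer = r_params + w_params

print(r_params)              # 294912
print(per_layer)             # 295936
print(n_layers * per_layer)  # 1183744  (~1.2M)
```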
Mini numerical example
To make this completely concrete, consider a tiny case with $\dv = 2$ channels and $k_{\max} = 2$, so the two lowest modes $k = 0, 1$ are retained in 1D. The $R$ tensor is:
$$ R(k) = \begin{pmatrix} r_{11}(k) & r_{12}(k) \\ r_{21}(k) & r_{22}(k) \end{pmatrix} \in \mathbb{C}^{2 \times 2}, \qquad k = 0, 1. $$Suppose at mode $k=1$:
$$ R(1) = \begin{pmatrix} 0.5 + 0.1i & -0.3 + 0.2i \\ 0.1 - 0.4i & 0.6 + 0.0i \end{pmatrix}, \qquad (\Ft\, v_t)(1) = \begin{pmatrix} 1.0 + 0.5i \\ -0.2 + 0.8i \end{pmatrix}. $$Then:
$$ R(1) \cdot (\Ft\, v_t)(1) = \begin{pmatrix} (0.5+0.1i)(1.0+0.5i) + (-0.3+0.2i)(-0.2+0.8i) \\ (0.1-0.4i)(1.0+0.5i) + (0.6+0.0i)(-0.2+0.8i) \end{pmatrix}. $$Each entry is a complex multiply-and-accumulate — standard linear algebra, just with complex numbers instead of reals. This happens independently at each frequency mode $k$.
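Carrying out the multiply-and-accumulate, which numpy does in one line (values copied from above):

```python
import numpy as np

R1 = np.array([[0.5 + 0.1j, -0.3 + 0.2j],
               [0.1 - 0.4j,  0.6 + 0.0j]])
v_hat1 = np.array([1.0 + 0.5j, -0.2 + 0.8j])

print(R1 @ v_hat1)  # approximately [0.35+0.07j, 0.18+0.13j]
```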
17 Mode Truncation
On a $64 \times 64$ grid, the FFT produces $64 \times 64 = 4096$ frequency modes. But FNO only learns $R$ for the lowest $k_{\max}$ modes in each dimension. The paper defines the retained set:
$$ \Z_{k_{\max}} = \bigl\{k \in \Z^d : |k_j| \leq k_{\max,j} \text{ for } j = 1, \ldots, d\bigr\}. $$With $k_{\max} = 12$ on a $64 \times 64$ grid, we keep the $12 \times 12 = 144$ lowest-frequency modes and discard the remaining $4096 - 144 = 3952$ high-frequency modes. The high-frequency Fourier coefficients are set to zero.
Why does this work? Two reasons:
- PDE solutions are smooth. Solutions to elliptic PDEs like Darcy flow are analytic away from coefficient discontinuities. Smooth functions have Fourier spectra that decay rapidly — most of the energy is in the low-frequency modes. Keeping $k_{\max} = 12$ modes per dimension captures the dominant features of the solution.
- The $W$ path handles the rest. The pointwise linear map $W\,v_t(x)$ in Definition 1 operates in physical space at full resolution. It can represent high-frequency, localized features that the truncated Fourier path misses. The two paths are complementary.
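The decay claim is easy to check: truncate a smooth periodic function to its lowest modes and measure what is lost. A 1D numpy sketch with an arbitrary smooth test function of our choosing:

```python
import numpy as np

# A smooth periodic test function; its Fourier coefficients decay extremely fast.
n = 256
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
f = np.exp(np.sin(x))

f_hat = np.fft.fft(f)
k_max = 12
f_hat_trunc = np.zeros_like(f_hat)
f_hat_trunc[:k_max + 1] = f_hat[:k_max + 1]   # modes 0 .. k_max
f_hat_trunc[-k_max:] = f_hat[-k_max:]         # modes -k_max .. -1
f_trunc = np.fft.ifft(f_hat_trunc).real

rel_err = np.linalg.norm(f - f_trunc) / np.linalg.norm(f)
print(rel_err)  # tiny: the 231 discarded modes carry almost no energy
```

For a discontinuous or turbulent field the same truncation would lose far more, which is exactly the regime distinction drawn above.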
Physics intuition: This is exactly the physicist's intuition about spectral methods. Smooth fields (temperature, pressure, low-Reynolds fluid flow) are well-described by a handful of Fourier modes. Turbulent flows need more modes; shocks and discontinuities need many. FNO works best when the solution is smooth — which is precisely the regime where spectral methods excel.
Mode truncation also explains resolution invariance. The tensor $R$ is defined on the $12 \times 12$ lowest modes. If we evaluate on a $128 \times 128$ grid instead of $64 \times 64$, the FFT produces $128 \times 128$ modes, but we still only touch the same $12 \times 12$ lowest ones. The same $R$ works on any grid — the mode indices don't change.
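This can be checked directly: with resolution-normalized Fourier coefficients, the same multiplier applied on a 64-point and a 128-point grid yields samples of the same output function. A single-channel 1D numpy sketch (our own simplification, with an invented `apply_R`):

```python
import numpy as np

rng = np.random.default_rng(0)
k_max = 4
# One learned complex weight per retained mode (single channel, for clarity).
R = rng.standard_normal(k_max + 1) + 1j * rng.standard_normal(k_max + 1)

def apply_R(f_samples):
    n = len(f_samples)
    f_hat = np.fft.rfft(f_samples) / n            # normalize: coefficients independent of n
    out_hat = np.zeros_like(f_hat)
    out_hat[:k_max + 1] = R * f_hat[:k_max + 1]   # touch only the lowest modes
    return np.fft.irfft(out_hat * n, n=n)

def f(x):
    return np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)

x64 = np.arange(64) / 64
x128 = np.arange(128) / 128
out64, out128 = apply_R(f(x64)), apply_R(f(x128))

# The coarse-grid output coincides with every other sample of the fine-grid output:
print(np.allclose(out64, out128[::2]))  # True
```

The same `R` touched both grids without any retraining or interpolation, which is the resolution-invariance property in miniature.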
We now have all the ingredients. In the next chapter, we combine Definition 1 (the iterative update) with Definition 3 (the Fourier integral operator) into the complete FNO layer, and trace a full forward pass with explicit tensor dimensions.