Chapter 3

From Kernels to Fourier Space

The convolution theorem, translation-invariant kernels, and the key insight: learn the kernel directly in Fourier space, never in physical space.

In this chapter
  1. Brief Fourier Refresher
  2. Translation-Invariant Kernels
  3. The Key Insight (Definition 3)
  4. What R Looks Like Concretely
  5. Mode Truncation

Brief Fourier Refresher

The Fourier transform decomposes a function into its frequency components. For a function $f: D \to \R$ on a periodic domain, the continuous Fourier transform pair is:

Fourier Transform Pair
$$ \hat{f}(k) = (\Ft f)(k) = \int_D f(x)\, e^{-2\pi i \langle k, x \rangle}\, dx $$

$$ f(x) = (\Fti \hat{f})(x) = \sum_{k \in \Z^d} \hat{f}(k)\, e^{2\pi i \langle k, x \rangle} $$

Here $k \in \Z^d$ is the frequency (or mode) index. Low $|k|$ corresponds to smooth, slowly-varying components; high $|k|$ corresponds to rapidly oscillating components.
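A small NumPy sketch (illustrative, not from the paper) makes the discrete analogue concrete: a smooth signal's energy concentrates in a few low-$|k|$ modes.

```python
import numpy as np

n = 64
x = np.linspace(0, 1, n, endpoint=False)

# A smooth periodic signal built from two low-frequency components.
f = np.sin(2 * np.pi * x) + 0.5 * np.cos(2 * np.pi * 3 * x)
f_hat = np.fft.fft(f) / n          # discrete analogue of the Fourier coefficients

# Energy sits only at |k| = 1 and |k| = 3; all other modes are numerically zero.
mags = np.abs(f_hat)
nonzero_modes = np.flatnonzero(mags > 1e-10)
# modes 1, 3, 61, 63 — where 61 and 63 are -3 and -1 modulo 64
```

Note the FFT's mode ordering: negative frequencies appear wrapped around at the end of the array, which is why $k = -1$ shows up as index 63.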

The single most important fact for this chapter:

Convolution Theorem

Convolution in physical space is equivalent to pointwise multiplication in Fourier space:

$$ \Ft(f * g) = \Ft(f) \cdot \Ft(g) $$

Equivalently: $f * g = \Fti\bigl(\Ft(f) \cdot \Ft(g)\bigr)$.

Why does this matter? A convolution integral $\int f(x-y)\,g(y)\,dy$ costs $O(n^2)$ in physical space. But by the convolution theorem, we can compute it as: FFT both functions ($O(n\log n)$), multiply pointwise ($O(n)$), inverse FFT ($O(n\log n)$). Total: $O(n \log n)$ instead of $O(n^2)$.
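This equivalence is easy to verify numerically. A minimal NumPy sketch (illustrative, not from the paper) comparing the direct $O(n^2)$ circular convolution with the FFT route:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
f = rng.standard_normal(n)
g = rng.standard_normal(n)

# Direct circular convolution on a periodic grid: O(n^2) operations.
direct = np.array(
    [sum(f[(i - j) % n] * g[j] for j in range(n)) for i in range(n)]
)

# Convolution theorem: FFT -> pointwise multiply -> inverse FFT, O(n log n).
via_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

# The two agree to floating-point precision.
assert np.allclose(direct, via_fft)
```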


Translation-Invariant Kernels

Recall from Definition 2 that the kernel integral is:

$$ (\mathcal{K}\, v_t)(x) = \int_D \kappa(x, y, a(x), a(y))\, v_t(y)\, dy. $$

This kernel depends on both $x$ and $y$ independently, plus $a(x)$ and $a(y)$. Now the paper makes a simplifying assumption: what if the kernel depends only on the relative position $x - y$?

$$ \kappa(x, y) \;\to\; \kappa(x - y) $$

This is called translation invariance: the kernel doesn't care about absolute position, only the displacement between points. With this simplification, the integral becomes a convolution:

$$ (\mathcal{K}\, v_t)(x) = \int_D \kappa(x - y)\, v_t(y)\, dy = (\kappa * v_t)(x). $$

And by the convolution theorem:

$$ \mathcal{K}\, v_t = \Fti\!\bigl(\Ft(\kappa) \cdot \Ft(v_t)\bigr). $$
Figure 3.1. The convolution theorem at work: an $O(n^2)$ convolution in physical space becomes $O(n\log n)$ via FFT → pointwise multiply → IFFT.

Note the trade-off: we dropped the dependence on $a(x)$ and $a(y)$ from the kernel. This means the kernel can no longer adapt to the input function. However, the $W\,v_t(x)$ path (the local, pointwise path from Definition 1) still sees the lifted representation of $a$, so information about $a$ is not entirely lost.


The Key Insight (Definition 3)

The convolution theorem says we could learn $\kappa$ in physical space, Fourier-transform it, multiply, and inverse-transform. But the paper makes a bolder move:

Key idea: Don't learn $\kappa$ in physical space at all. Learn the Fourier-space representation $R_\varphi$ directly. The kernel $\kappa$ never needs to exist — we parameterize $R = \Ft(\kappa)$ as a learnable tensor and work entirely in frequency space.

Definition 3: Fourier Integral Operator
$$ \bigl(\mathcal{K}(\varphi)\, v_t\bigr)(x) = \Fti\!\Bigl(R_\varphi \cdot (\Ft\, v_t)\Bigr)(x) \tag{4} $$

where $R_\varphi$ is a learnable complex-valued tensor parameterized by $\varphi$. For each frequency mode $k$: $R_\varphi(k) \in \mathbb{C}^{\dv \times \dv}$, and $(\Ft\, v_t)(k) \in \mathbb{C}^{\dv}$. The "multiplication" $R \cdot \Ft(v_t)$ is a matrix–vector product at each frequency mode.

Let's unpack the computation step by step:

  1. Fourier transform $v_t$: At each frequency mode $k$, get $(\Ft\, v_t)(k) \in \mathbb{C}^{\dv}$ — a complex vector of $\dv$ channels.
  2. Multiply by $R_\varphi(k)$: At each mode $k$, apply the learned $\dv \times \dv$ complex matrix $R_\varphi(k)$ to the vector $(\Ft\, v_t)(k)$. This mixes channels at that frequency.
  3. Inverse Fourier transform: Transform back to physical space to get $(\mathcal{K}\, v_t)(x)$.
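The three steps above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the reference implementation: it keeps only the lowest-frequency corner block of modes, matching this chapter's simplified $k_{\max} \times k_{\max}$ picture.

```python
import numpy as np

d_v, k_max, n = 32, 12, 64          # channels, retained modes per dim, grid size
rng = np.random.default_rng(0)

v = rng.standard_normal((d_v, n, n))                        # v_t on a 64x64 grid
R = rng.standard_normal((k_max, k_max, d_v, d_v)) \
    + 1j * rng.standard_normal((k_max, k_max, d_v, d_v))    # stand-in for R_phi

# 1. Fourier transform v_t (real FFT over the two spatial dimensions).
v_hat = np.fft.rfft2(v)                                     # (d_v, n, n//2 + 1)

# 2. At each retained mode k, apply the d_v x d_v complex matrix R(k):
#    out_o(k) = sum_i R(k)_{oi} * v_hat_i(k)  -- channel mixing per frequency.
out_hat = np.zeros_like(v_hat)
out_hat[:, :k_max, :k_max] = np.einsum(
    "xyoi,ixy->oxy", R, v_hat[:, :k_max, :k_max]
)

# 3. Inverse transform back to physical space.
out = np.fft.irfft2(out_hat, s=(n, n))                      # (d_v, n, n), real
```

In a trained model `R` would be a learned parameter; here it is random, since only the data flow is being illustrated.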

The crucial point: $R_\varphi$ is defined on frequency modes, not on spatial grid points. Its shape depends on how many modes we keep (a hyperparameter), not on the grid resolution $n$. This is the source of resolution invariance.


What R Looks Like Concretely

For 2D Darcy flow with $\dv = 32$ channels and $k_{\max} = 12$ retained modes in each spatial dimension:

$$ R_\varphi \in \mathbb{C}^{k_{\max} \times k_{\max} \times \dv \times \dv} = \mathbb{C}^{12 \times 12 \times 32 \times 32}. $$

At each of the $12 \times 12 = 144$ retained frequency modes, $R$ is a $32 \times 32$ complex-valued matrix. This matrix mixes channels — it decides how the 32 features interact at that particular frequency.

Parameter count per layer: Since each complex number has a real and imaginary part, the total number of real-valued parameters in $R$ is:

$$ 2 \times 12 \times 12 \times 32 \times 32 = 294{,}912 \text{ parameters.} $$

Add the local path's $W \in \R^{32 \times 32}$ ($1{,}024$ parameters), and each Fourier layer has about 296K parameters total. With 4 layers, that's roughly 1.2M parameters — small by modern standards.
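A quick arithmetic check of these counts:

```python
d_v, k_max, n_layers = 32, 12, 4

r_params = 2 * k_max * k_max * d_v * d_v   # real + imaginary parts of R
w_params = d_v * d_v                       # local linear path W
per_layer = r_params + w_params
total = n_layers * per_layer

# r_params = 294912, per_layer = 295936, total = 1183744 (~1.2M)
```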

Figure 3.2. The tensor $R_\varphi$ has shape $(12, 12, 32, 32)$ complex. At each frequency mode $k = (k_1, k_2)$, there is a $32 \times 32$ complex matrix that mixes channels by multiplying the Fourier coefficient vector $(\Ft\,v_t)(k) \in \mathbb{C}^{32}$.

Mini numerical example

To make this completely concrete, consider a tiny case with $\dv = 2$ channels and two retained modes, $k = 0$ and $k = 1$, in 1D. The $R$ tensor is:

$$ R(k) = \begin{pmatrix} r_{11}(k) & r_{12}(k) \\ r_{21}(k) & r_{22}(k) \end{pmatrix} \in \mathbb{C}^{2 \times 2}, \qquad k = 0, 1. $$

Suppose at mode $k=1$:

$$ R(1) = \begin{pmatrix} 0.5 + 0.1i & -0.3 + 0.2i \\ 0.1 - 0.4i & 0.6 + 0.0i \end{pmatrix}, \qquad (\Ft\, v_t)(1) = \begin{pmatrix} 1.0 + 0.5i \\ -0.2 + 0.8i \end{pmatrix}. $$

Then:

$$ R(1) \cdot (\Ft\, v_t)(1) = \begin{pmatrix} (0.5+0.1i)(1.0+0.5i) + (-0.3+0.2i)(-0.2+0.8i) \\ (0.1-0.4i)(1.0+0.5i) + (0.6+0.0i)(-0.2+0.8i) \end{pmatrix}. $$

Each entry is a complex multiply-and-accumulate — standard linear algebra, just with complex numbers instead of reals. This happens independently at each frequency mode $k$.
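The mini example is two lines of NumPy to check:

```python
import numpy as np

# R(1) and the Fourier coefficient vector from the example above.
R1 = np.array([[0.5 + 0.1j, -0.3 + 0.2j],
               [0.1 - 0.4j,  0.6 + 0.0j]])
v1 = np.array([1.0 + 0.5j, -0.2 + 0.8j])

result = R1 @ v1   # approximately [0.35 + 0.07j, 0.18 + 0.13j]
```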


Mode Truncation

On a $64 \times 64$ grid, the FFT produces $64 \times 64 = 4096$ frequency modes. But FNO only learns $R$ for the lowest $k_{\max}$ modes in each dimension. The paper defines the retained set:

$$ \Z_{k_{\max}} = \bigl\{k \in \Z^d : |k_j| \leq k_{\max,j} \text{ for } j = 1, \ldots, d\bigr\}. $$

With $k_{\max} = 12$ on a $64 \times 64$ grid, we keep the $12 \times 12 = 144$ lowest-frequency modes and discard the remaining $4096 - 144 = 3952$ high-frequency modes. The high-frequency Fourier coefficients are set to zero.

Figure 3.3. Mode truncation: of the $64 \times 64$ FFT modes, only the $12 \times 12$ lowest-frequency modes (blue box) are retained. The learned tensor $R$ exists only at these 144 modes. All other modes are set to zero.

Why does this work? Two reasons:

Physics intuition: This is exactly the physicist's intuition behind spectral methods. Smooth fields (temperature, pressure, low-Reynolds-number fluid flow) are well described by a handful of Fourier modes; turbulent flows need more modes, and shocks or discontinuities need many. FNO works best when the solution is smooth — which is precisely the regime where spectral methods excel.

Resolution invariance: Mode truncation is also what makes $R$ resolution-independent. The tensor $R$ is defined on the $12 \times 12$ lowest modes. If we evaluate on a $128 \times 128$ grid instead of $64 \times 64$, the FFT produces $128 \times 128$ modes, but we still only touch the same $12 \times 12$ lowest ones. The same $R$ works on any grid — the mode indices don't change.
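This can be demonstrated directly. The sketch below (illustrative, keeping only the low-frequency corner block of the FFT output as in Figure 3.3's simplified picture) applies one and the same $R$ on two different grid sizes:

```python
import numpy as np

def apply_R(v, R):
    """Fourier integral operator on a grid of any size: FFT, mix the
    retained low-frequency modes with R, inverse FFT."""
    d_v, n1, n2 = v.shape
    k_max = R.shape[0]
    v_hat = np.fft.rfft2(v)                    # grid-sized spectrum
    out_hat = np.zeros_like(v_hat)
    out_hat[:, :k_max, :k_max] = np.einsum(    # only the k_max x k_max corner
        "xyoi,ixy->oxy", R, v_hat[:, :k_max, :k_max]
    )
    return np.fft.irfft2(out_hat, s=(n1, n2))

rng = np.random.default_rng(0)
R = rng.standard_normal((12, 12, 32, 32)) \
    + 1j * rng.standard_normal((12, 12, 32, 32))   # stand-in for a learned R

coarse = apply_R(rng.standard_normal((32, 64, 64)), R)    # 64 x 64 grid
fine = apply_R(rng.standard_normal((32, 128, 128)), R)    # 128 x 128 grid
```

The same 12×12×32×32 tensor produces valid output at both resolutions; only the FFT and inverse FFT sizes change with the grid.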

We now have all the ingredients. In the next chapter, we combine Definition 1 (the iterative update) with Definition 3 (the Fourier integral operator) into the complete FNO layer, and trace a full forward pass with explicit tensor dimensions.