18 The Graph Fourier Transform
In the previous chapter we built the graph Laplacian $L$ and observed that it is real, symmetric, and positive semidefinite. The spectral theorem guarantees a decomposition:
$$ L = U \Lambda U^T, \qquad U = (u_0 \mid u_1 \mid \cdots \mid u_{N-1}), \qquad \Lambda = \operatorname{diag}(\lambda_0, \lambda_1, \ldots, \lambda_{N-1}), $$
where $u_0, u_1, \ldots, u_{N-1}$ are orthonormal eigenvectors and $0 = \lambda_0 \leq \lambda_1 \leq \cdots \leq \lambda_{N-1}$ are the corresponding eigenvalues. The columns of $U$ form an orthonormal basis for $\R^N$, meaning $U^T U = U U^T = I$.
This eigendecomposition gives us a new basis in which to express any graph signal $x \in \R^N$. The Graph Fourier Transform (GFT) projects a signal onto these eigenvectors:
$$ \hat{x} = U^T x. $$
The coefficient $\hat{x}_k = u_k^T x$ measures how much of the signal $x$ lies along the $k$-th eigenvector $u_k$. To recover the original signal from its transform, we simply invert:
$$ x = U \hat{x}. $$
This works because $U$ is orthogonal: $x = UU^T x = U\hat{x}$. The signal $x$ can be reconstructed as a linear combination of eigenvectors: $x = \sum_{k=0}^{N-1} \hat{x}_k \, u_k$.
The eigenvectors of $L$ serve as graph frequencies, with the eigenvalue $\lambda_k$ measuring the "frequency" or oscillation level of each mode:
- $u_0$ ($\lambda_0 = 0$): the constant signal $u_0 \propto (1,1,1,1,1,1)^T$. This is the lowest frequency — maximally smooth, with zero variation across every edge. The constant vector is an eigenvector of every graph's Laplacian, because $L\mathbf{1} = (D - A)\mathbf{1} = D\mathbf{1} - A\mathbf{1} = d - d = 0$, where $d$ is the degree vector; connectedness is what makes $\lambda_0 = 0$ a simple eigenvalue.
- $u_1$ ($\lambda_1 \approx 0.44$): a slowly varying signal that takes similar values within each triangle but changes sign across the bridge edge $(2,3)$. This is the Fiedler vector — it reveals the graph's bottleneck.
- $u_2, u_3, u_4$ ($\lambda_2 = \lambda_3 = \lambda_4 = 3$): a degenerate eigenspace. These signals oscillate within the triangles, with higher variation than $u_1$.
- $u_5$ ($\lambda_5 \approx 4.56$): the highest frequency — maximally oscillatory, with the greatest total variation across edges.
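This mode structure is easy to check numerically. A minimal sketch with NumPy, rebuilding the running graph (two triangles $\{0,1,2\}$ and $\{3,4,5\}$ joined by the bridge edge $(2,3)$, matching the Laplacian shown in the degree-1 example later in this chapter):

```python
import numpy as np

# Running 6-vertex graph: triangles {0,1,2} and {3,4,5}, bridge edge (2,3)
edges = [(0,1),(0,2),(1,2),(2,3),(3,4),(3,5),(4,5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

lam, U = np.linalg.eigh(L)   # ascending eigenvalues, orthonormal eigenvector columns

assert np.allclose(lam, [0, 0.44, 3, 3, 3, 4.56], atol=1e-2)
assert np.allclose(U[:, 0]**2, 1/6)   # u0 is constant (up to sign)
assert U[2, 1] * U[3, 1] < 0          # Fiedler vector flips sign across the bridge
```

Because `numpy.linalg.eigh` sorts eigenvalues in ascending order, column $k$ of `U` is exactly the mode $u_k$ discussed above (up to sign, and up to an arbitrary basis choice inside the degenerate $\lambda = 3$ eigenspace).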
The analogy to the classical DFT is exact in structure: on $\ZN$, the DFT projects a signal onto the complex exponentials $e^{i2\pi m k/N}$, which are the eigenvectors of the shift operator $S$. On a graph, the GFT projects onto the eigenvectors of $L$. The frequencies $\lambda_k$ play the role of the classical frequency indices.
Given $L = U\Lambda U^T$, the GFT of a signal $x \in \R^N$ is $\hat{x} = U^T x$. The coefficient $\hat{x}_k = u_k^T x$ measures the component of $x$ along the $k$-th graph frequency $u_k$. The inverse GFT is $x = U\hat{x}$.
Worked example. Consider the signal $x = (1, 1, 0, 0, -1, -1)^T$ on our 6-vertex graph. This signal assigns $+1$ to the first triangle's vertices 0 and 1, $0$ to the bridge vertices 2 and 3, and $-1$ to the second triangle's vertices 4 and 5. Its GFT coefficients $\hat{x} = U^T x$ tell us how this signal decomposes into graph frequencies. The signal has no DC component (it sums to zero, so $\hat{x}_0 = 0$) and loads heavily onto the Fiedler mode $u_1$ because it varies primarily across the bridge.
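The worked example can be reproduced in a few lines; a sketch, assuming the same triangle-bridge-triangle layout as the running graph:

```python
import numpy as np

# Running graph: triangles {0,1,2} and {3,4,5}, bridge edge (2,3)
edges = [(0,1),(0,2),(1,2),(2,3),(3,4),(3,5),(4,5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)

x = np.array([1., 1., 0., 0., -1., -1.])
xhat = U.T @ x                      # GFT: project onto the Laplacian eigenbasis

assert abs(xhat[0]) < 1e-12         # no DC component: x sums to zero
assert abs(xhat[1]) > 1.8           # most of the energy sits on the Fiedler mode
assert np.allclose(U @ xhat, x)     # inverse GFT reconstructs x exactly
```

Note that $|\hat{x}_1|^2 \approx 3.46$ out of $\|x\|^2 = 4$: over 85% of the signal's energy lies on the Fiedler mode, confirming that $x$ varies primarily across the bridge.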
Physics intuition: The GFT decomposes a graph signal into its "vibration modes." Small eigenvalues correspond to slow, smooth variations (like the fundamental mode of a drum). Large eigenvalues correspond to rapid, oscillatory variations (like high harmonics). The Laplacian eigenvectors are literally the normal modes of a network of springs connected along the graph's edges.
19 Spectral Graph Convolutions
With the GFT in hand, we can define graph convolutions by exactly the same recipe used on $\ZN$: transform to the frequency domain, multiply by a spectral response, and transform back.
A spectral graph convolution filters a signal $x$ using a spectral response $\hat{h} = (\hat{h}_0, \hat{h}_1, \ldots, \hat{h}_{N-1})^T$. The filter independently scales each graph frequency:
$$ y = U \operatorname{diag}(\hat{h}_0, \ldots, \hat{h}_{N-1}) \, U^T x, \qquad \text{i.e.} \quad \hat{y}_k = \hat{h}_k \, \hat{x}_k. $$
The recipe proceeds in three steps, directly paralleling the classical Fourier filtering pipeline:
- GFT: compute $\hat{x} = U^T x$ — project the signal onto graph frequencies.
- Multiply: compute $\hat{y}_k = \hat{h}_k \cdot \hat{x}_k$ for each $k$ — scale each frequency independently.
- Inverse GFT: compute $y = U \hat{y}$ — map back to the vertex domain.
This is the spectral characterization: graph convolutions are precisely those linear operators that are diagonal in the Laplacian eigenbasis. They act on each graph frequency independently, just as classical convolutions act on each Fourier mode independently.
The choice of $\hat{h}$ determines the filter's behavior. A low-pass filter sets $\hat{h}_k \approx 1$ for small $\lambda_k$ and $\hat{h}_k \approx 0$ for large $\lambda_k$, retaining smooth components and suppressing oscillatory ones. A high-pass filter does the opposite. A band-pass filter retains only an intermediate range.
Key idea: The recipe is identical to classical Fourier filtering: transform, multiply, invert. Only the basis has changed — from complex exponentials to graph Laplacian eigenvectors.
Example: low-pass filter on the running graph. Consider a spectral response that keeps low frequencies and kills high ones: $\hat{h} = (1, 0.9, 0, 0, 0, 0)^T$. This retains the constant mode fully, keeps 90% of the Fiedler mode, and zeroes out everything at $\lambda \geq 3$. Applied to the signal $x = (1,1,0,0,-1,-1)^T$, this filter would produce a smoothed version that preserves the gross "left-positive, right-negative" structure while eliminating any within-triangle variation.
20 Polynomial Graph Convolutions
There is a second, purely algebraic way to define graph convolutions — one that never mentions eigenvalues or eigenvectors. A polynomial graph convolution of degree $d$ is:
$$ p_\theta(L) = \sum_{k=0}^{d} \theta_k L^k = \theta_0 I + \theta_1 L + \cdots + \theta_d L^d. $$
This is a matrix polynomial in $L$ with learnable coefficients $\theta = (\theta_0, \theta_1, \ldots, \theta_d)^T$. The output for signal $x$ is $y = p_\theta(L)\, x$. The crucial property is locality: the matrix $L^k$ connects $k$-hop neighbors, and nothing farther.
To see why, recall that $L_{ij} \neq 0$ only when $i = j$ or $i \sim j$ (i.e., $i$ and $j$ are adjacent). Therefore $(L^2)_{ij} = \sum_m L_{im} L_{mj}$ is nonzero only when there exists some vertex $m$ that is adjacent to both $i$ and $j$ — meaning $i$ and $j$ are at most 2 hops apart. By induction:
$$ (L^k)_{ij} \neq 0 \;\Longrightarrow\; i \text{ and } j \text{ are at most } k \text{ hops apart}. $$
A degree-$d$ polynomial filter only mixes information within $d$-hop neighborhoods. This is the algebraic characterization: graph convolutions are polynomials in the graph operator $L$, and the polynomial degree controls the spatial reach.
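The hop-count claim can be checked entry by entry on the running graph. For instance, vertices 0 and 4 are 3 hops apart (path 0-2-3-4), so they are invisible to $L^2$ but coupled by $L^3$:

```python
import numpy as np

# Running graph: triangles {0,1,2} and {3,4,5}, bridge edge (2,3)
edges = [(0,1),(0,2),(1,2),(2,3),(3,4),(3,5),(4,5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

L2 = L @ L
L3 = L2 @ L

# vertices 0 and 3 are 2 hops apart (via vertex 2)
assert L[0, 3] == 0 and L2[0, 3] != 0
# vertices 0 and 4 are 3 hops apart (path 0-2-3-4)
assert L2[0, 4] == 0 and L3[0, 4] != 0
```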
Two advantages of the polynomial view are paramount:
Parameter efficiency. A degree-$d$ polynomial filter has $d+1$ learnable parameters $\theta_0, \ldots, \theta_d$, regardless of the graph size $N$. A spectral filter requires $N$ parameters $\hat{h}_0, \ldots, \hat{h}_{N-1}$ — one per eigenvalue. For a graph with a million nodes, the polynomial filter might use 5 parameters; the spectral filter would need a million.
No eigendecomposition. Computing $p_\theta(L)\, x$ requires only repeated sparse matrix-vector products: compute $Lx$, then $L(Lx) = L^2 x$, and so on, accumulating $\theta_k L^k x$ at each step. Since $L$ is sparse (only $O(|E|)$ nonzeros for $|E|$ edges), each multiplication costs $O(|E|)$, and the total cost is $O(d|E|)$. No eigendecomposition — no $O(N^3)$ bottleneck.
$p_\theta(L) = \sum_{k=0}^{d} \theta_k L^k$. Parameters: $\theta_0, \ldots, \theta_d$ ($d+1$ numbers). Locality: mixes information only within $d$-hop neighborhoods. Computation: $O(d|E|)$ via sparse matrix-vector products — no eigenvectors needed.
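The $O(d|E|)$ accumulation is a short loop of sparse matrix-vector products. A sketch with SciPy (`poly_filter` is an illustrative helper name, not from the text):

```python
import numpy as np
import scipy.sparse as sp

def poly_filter(L, theta, x):
    """Compute sum_k theta_k L^k x using only sparse mat-vec products."""
    y = theta[0] * x
    z = x
    for t in theta[1:]:
        z = L @ z          # z now holds L^k x
        y = y + t * z
    return y

# Running graph as a sparse Laplacian
edges = [(0,1),(0,2),(1,2),(2,3),(3,4),(3,5),(4,5)]
r, c = zip(*edges)
A = sp.coo_matrix((np.ones(len(edges)), (r, c)), shape=(6, 6))
A = (A + A.T).tocsr()
deg = np.asarray(A.sum(axis=1)).ravel()
L = sp.diags(deg) - A

x = np.array([1., 1., 0., 0., -1., -1.])
y = poly_filter(L, [0.5, -0.2], x)   # the degree-1 filter used in the text
assert np.allclose(y, [0.3, 0.3, 0.4, -0.4, -0.3, -0.3])
```

The eigendecomposition never appears: each iteration touches only the $O(|E|)$ stored nonzeros of $L$, so the same loop scales to graphs with millions of edges.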
How high can $d$ go? On $\ZN$, the answer was clean: $S^N = I$, so polynomials of degree $N-1$ already span all circulants, and higher powers cycle back. On graphs, $L^k$ never cycles back to $I$ — but the Cayley–Hamilton theorem guarantees that $L^N$ is a linear combination of $I, L, \ldots, L^{N-1}$, so degree $N-1$ is a hard ceiling. In practice the effective bound is lower: if $L$ has $r$ distinct eigenvalues, a degree-$(r{-}1)$ polynomial can interpolate any spectral filter exactly (one free parameter per distinct eigenvalue). In our running example, $r = 4$ (eigenvalues $0, 0.44, 3, 4.56$), so degree 3 already exhausts the polynomial filter space — degrees 4 and 5 add no new expressiveness.
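The degree ceiling can be observed numerically: with 4 distinct eigenvalues, the minimal polynomial of $L$ has degree 4, so $L^4$ already lies in the span of $I, L, L^2, L^3$. A sketch that recovers the combination by least squares:

```python
import numpy as np

# Running graph: triangles {0,1,2} and {3,4,5}, bridge edge (2,3)
edges = [(0,1),(0,2),(1,2),(2,3),(3,4),(3,5),(4,5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Columns: vec(I), vec(L), vec(L^2), vec(L^3); target: vec(L^4)
P = np.stack([np.linalg.matrix_power(L, k).ravel() for k in range(4)], axis=1)
target = np.linalg.matrix_power(L, 4).ravel()
coef, *_ = np.linalg.lstsq(P, target, rcond=None)

assert np.allclose(P @ coef, target)   # L^4 is a combination of I, L, L^2, L^3
```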
Running example. On our 6-vertex graph, take a degree-1 polynomial $p_\theta(L) = \theta_0 I + \theta_1 L$ with $\theta_0 = 0.5$ and $\theta_1 = -0.2$:
$$ p_\theta(L) = 0.5 \begin{pmatrix} 1&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \end{pmatrix} - 0.2 \begin{pmatrix} 2&-1&-1&0&0&0 \\ -1&2&-1&0&0&0 \\ -1&-1&3&-1&0&0 \\ 0&0&-1&3&-1&-1 \\ 0&0&0&-1&2&-1 \\ 0&0&0&-1&-1&2 \end{pmatrix} = \begin{pmatrix} 0.1&0.2&0.2&0&0&0 \\ 0.2&0.1&0.2&0&0&0 \\ 0.2&0.2&-0.1&0.2&0&0 \\ 0&0&0.2&-0.1&0.2&0.2 \\ 0&0&0&0.2&0.1&0.2 \\ 0&0&0&0.2&0.2&0.1 \end{pmatrix} $$

Notice the sparsity pattern: $p_\theta(L)$ is nonzero only where $L$ is nonzero (diagonal and adjacent entries). The filter is local — it mixes each vertex's value only with its direct neighbors.
21 Connecting Spectral and Polynomial
The spectral and polynomial characterizations are not independent — they are two views of the same object, connected by a direct algebraic identity. If the spectral filter values are evaluations of a polynomial at the eigenvalues,
$$ \hat{h}_k = p_\theta(\lambda_k) = \sum_{j=0}^{d} \theta_j \lambda_k^j, $$
then the spectral filter and the polynomial filter are the same operator: $U \operatorname{diag}(\hat{h}) \, U^T = p_\theta(L)$.
Proof. Expand the spectral form using the polynomial expression for $\hat{h}_k$:
$$ U \operatorname{diag}(p(\lambda_0), \ldots, p(\lambda_{N-1})) U^T = U \operatorname{diag}\!\left(\sum_j \theta_j \lambda_0^j, \ldots, \sum_j \theta_j \lambda_{N-1}^j\right) U^T $$

By linearity of the diagonal:
$$ = \sum_{j=0}^{d} \theta_j \, U \operatorname{diag}(\lambda_0^j, \ldots, \lambda_{N-1}^j) \, U^T = \sum_{j=0}^{d} \theta_j \, (U \Lambda U^T)^j = \sum_{j=0}^{d} \theta_j \, L^j = p_\theta(L) $$

The critical step is recognizing that $U \operatorname{diag}(\lambda_0^j, \ldots, \lambda_{N-1}^j) U^T = U \Lambda^j U^T = (U\Lambda U^T)^j = L^j$. This is the functional calculus of symmetric matrices: for any function $f$, we define $f(L) = U \operatorname{diag}(f(\lambda_0), \ldots, f(\lambda_{N-1})) U^T$.
The equivalence holds in one direction unconditionally: every polynomial filter is a spectral filter with $\hat{h}_k = p(\lambda_k)$. But the converse fails when eigenvalues repeat.
In our running example, $\lambda_2 = \lambda_3 = \lambda_4 = 3$. A polynomial $p$ evaluated at these three eigenvalues necessarily yields the same value, $p(3)$, three times. But a general spectral filter can assign values $\hat{h}_2, \hat{h}_3, \hat{h}_4$ to these three modes that are not all equal. Such a filter cannot be represented as any polynomial in $L$.
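Such a non-polynomial spectral filter is easy to exhibit numerically (note that NumPy's choice of basis inside the $\lambda = 3$ eigenspace is arbitrary, which is precisely why a filter that distinguishes these modes is not intrinsic to the graph):

```python
import numpy as np

# Running graph: triangles {0,1,2} and {3,4,5}, bridge edge (2,3)
edges = [(0,1),(0,2),(1,2),(2,3),(3,4),(3,5),(4,5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)

# A spectral filter that treats the three lambda = 3 modes differently
h = np.array([0., 0., 1., 0., 0., 0.])
M = U @ np.diag(h) @ U.T

assert np.allclose(M @ L, L @ M)   # M commutes with L ...
# ... but any p(L) acts as p(3) * I on the lambda = 3 eigenspace, i.e. with
# equal eigenbasis-diagonal entries at positions 2, 3, 4. M has 1, 0, 0 there.
d = np.diag(U.T @ M @ U)
assert not np.isclose(d[2], d[3])
```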
The practical lesson: spectral filters are more expressive in principle, but this extra expressiveness is illusory for machine learning. The "extra" spectral filters distinguish between eigenvectors with the same eigenvalue — they break the symmetry of the operator $L$ in arbitrary ways. Polynomial filters are the physically natural subclass.
Key idea: Two views, one object. Spectral filtering tells you what the filter does (which frequencies it passes). Polynomial filtering tells you how to compute it (sparse matrix multiplications). For practical GNNs, the polynomial view wins.
Running example. Take the degree-1 polynomial $p(\lambda) = 0.5 - 0.2\lambda$. Its spectral filter values on our eigenvalues are:
- $\hat{h}_0 = p(0) = 0.5$
- $\hat{h}_1 = p(0.44) \approx 0.412$
- $\hat{h}_2 = \hat{h}_3 = \hat{h}_4 = p(3) = -0.1$
- $\hat{h}_5 = p(4.56) \approx -0.412$
This is a low-pass filter: it retains low frequencies ($\hat{h}_0 = 0.5$, $\hat{h}_1 \approx 0.41$) and attenuates or inverts high frequencies ($\hat{h}_5 \approx -0.41$). We can verify that $U\operatorname{diag}(\hat{h})U^T$ equals the matrix $p_\theta(L) = 0.5I - 0.2L$ we computed in the previous section.
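The claimed equality of the two forms can be verified in a few lines; a sketch:

```python
import numpy as np

# Running graph: triangles {0,1,2} and {3,4,5}, bridge edge (2,3)
edges = [(0,1),(0,2),(1,2),(2,3),(3,4),(3,5),(4,5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)

h = 0.5 - 0.2 * lam                        # evaluate p(lambda) = 0.5 - 0.2*lambda
spectral = U @ np.diag(h) @ U.T            # spectral form
polynomial = 0.5 * np.eye(6) - 0.2 * L     # polynomial form

assert np.allclose(spectral, polynomial)   # same operator, two views
```

Even the arbitrary basis choice inside the degenerate $\lambda = 3$ eigenspace does not matter here: the filter assigns all three modes the same value $p(3) = -0.1$, so the operator is basis-independent.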
22 The Lost Characterization
On $\ZN$, we had three equivalent characterizations of convolution:
1. Spectral: diagonal in the DFT basis.
2. Algebraic: polynomial in the shift operator $S$.
3. Group-theoretic: commutes with $S$ (i.e., $TS = ST$).
The equivalence of (2) and (3) relied on a crucial property of $S$: its eigenvalues, the $N$-th roots of unity $\rho_0, \rho_1, \ldots, \rho_{N-1}$ (Eq. 11), are all distinct. When a matrix has distinct eigenvalues, its commutant — the set of all matrices commuting with it — equals its polynomial algebra. This is a standard theorem in linear algebra.
On graphs, this equivalence breaks down. The natural candidate for the "shift" is the Laplacian $L$, and "commutes with $L$" would be the natural graph analog of "commutes with $S$." But when $L$ has repeated eigenvalues, the commutant of $L$ is strictly larger than the polynomial algebra:
$$ \{\, p(L) : p \text{ a polynomial} \,\} \;\subsetneq\; \{\, T : TL = LT \,\}. $$
In our running example, $L$ has the eigenvalue $\lambda = 3$ with multiplicity 3. Any matrix that commutes with $L$ must preserve each eigenspace of $L$, but within the 3-dimensional eigenspace for $\lambda = 3$, it can act as an arbitrary $3 \times 3$ matrix. A polynomial $p(L)$ acts as $p(3) \cdot I_3$ on this eigenspace — a scalar multiple of the identity. The commutant is much larger: it includes non-local operators that act nontrivially within the degenerate eigenspace.
Concretely, the commutant of $L$ has dimension $1^2 + 1^2 + 3^2 + 1^2 = 12$ (summing the squares of the multiplicities: $\lambda_0$ has multiplicity 1, $\lambda_1$ has multiplicity 1, $\lambda_{2,3,4}$ have multiplicity 3, and $\lambda_5$ has multiplicity 1). The polynomial algebra has dimension at most equal to the number of distinct eigenvalues, which is 4. So the commutant is a 12-dimensional space, while the polynomial algebra is only 4-dimensional. The 8 "extra" dimensions correspond to operators that commute with $L$ but are not polynomials in it — and these include non-local operators we would not want to call convolutions.
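Both dimensions can be confirmed with a vectorization trick: $TL = LT$ is the linear system $(L^T \otimes I - I \otimes L)\,\mathrm{vec}(T) = 0$, whose null space is the commutant. A sketch:

```python
import numpy as np

# Running graph: triangles {0,1,2} and {3,4,5}, bridge edge (2,3)
edges = [(0,1),(0,2),(1,2),(2,3),(3,4),(3,5),(4,5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# vec(TL - LT) = (L^T kron I - I kron L) vec(T); null space = commutant of L
K = np.kron(L.T, np.eye(6)) - np.kron(np.eye(6), L)
commutant_dim = 36 - np.linalg.matrix_rank(K)
assert commutant_dim == 12   # 1^2 + 1^2 + 3^2 + 1^2

# Polynomial algebra: span of vec(I), vec(L), ..., vec(L^5)
P = np.stack([np.linalg.matrix_power(L, k).ravel() for k in range(6)], axis=1)
assert np.linalg.matrix_rank(P) == 4   # one dimension per distinct eigenvalue
```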
A separate concept sometimes confused with the commutativity condition is permutation equivariance. A polynomial filter $p_\theta(L)$ satisfies:
$$ p_\theta(P L P^T)\,(P x) = P \, \bigl( p_\theta(L)\, x \bigr) \qquad \text{for every permutation matrix } P. $$
This says: if you relabel the vertices of the graph (applying permutation $P$ to the adjacency, degree, and Laplacian matrices), the filter output transforms consistently. It is a sanity check — the computation does not depend on the arbitrary numbering of vertices. But it is fundamentally different from the commutativity condition on $\ZN$:
- Commutativity ($TS = ST$): one fixed operator $S$, asking if $T$ commutes with it. This is a strong algebraic constraint that, when $S$ has distinct eigenvalues, forces $T$ to be a polynomial in $S$. It gives a complete characterization of convolution.
- Permutation equivariance: relabeling the entire graph. It says the filter is "intrinsic" to the graph's structure, not dependent on labeling. But it does not single out polynomial filters — many non-polynomial operators are also permutation-equivariant.
Warning: Permutation equivariance is not the graph analog of shift commutativity. It is a much weaker condition — essentially just saying "the computation does not depend on arbitrary vertex labeling." The group-theoretic characterization of convolutions is genuinely gone on graphs. We retain the spectral and polynomial characterizations, but the algebraic correspondence that made all three equivalent on $\ZN$ is lost when eigenvalues repeat.
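The equivariance property itself is one line to check; a sketch with an arbitrary degree-2 polynomial filter (the coefficients here are illustrative, not from the text):

```python
import numpy as np

# Running graph: triangles {0,1,2} and {3,4,5}, bridge edge (2,3)
edges = [(0,1),(0,2),(1,2),(2,3),(3,4),(3,5),(4,5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

def p(L):
    """An arbitrary degree-2 polynomial filter (illustrative coefficients)."""
    return 0.5 * np.eye(len(L)) - 0.2 * L + 0.1 * (L @ L)

rng = np.random.default_rng(0)
P = np.eye(6)[rng.permutation(6)]   # a random permutation matrix
x = rng.standard_normal(6)

# Relabel the graph and the signal together: the output relabels consistently
assert np.allclose(p(P @ L @ P.T) @ (P @ x), P @ (p(L) @ x))
```

This holds for any polynomial and any permutation, because $p(PLP^T) = P\,p(L)\,P^T$ for orthogonal $P$; it is the weak "sanity check" property, not the strong commutant condition.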
On $\ZN$, the rich group structure of the cyclic group gave us three equivalent ways to think about convolution: frequency filtering, polynomial algebra, and symmetry commutation. On general graphs, we keep two of these (and they are still deeply connected, as Section 21 showed). The third — the one identifying convolutions with symmetry-respecting operators — is the one we lose.
What remains is powerful enough: spectral filtering gives us intuition about what a graph convolution does (which frequencies it passes), and polynomial filtering gives us a practical algorithm (sparse matrix products). In the next chapter, we will see how these ideas become neural network layers.