Chapter 6

From Spectral Theory to Neural Networks

The computational bottleneck, Chebyshev filters, the GCN simplification, message passing, and oversmoothing.

In this chapter
  23. The Computational Problem
  24. ChebNet: Chebyshev Polynomial Filters
  25. GCN: The Degree-1 Simplification
  26. Message Passing as a Consequence
  27. Receptive Fields and Oversmoothing

23. The Computational Problem

The spectral convolution $y = U \operatorname{diag}(\hat{h})\, U^T x$ from Chapter 5 is mathematically clean, but computationally disastrous. Three operations dominate the cost:

  1. Computing $U$: eigendecomposition of $L$ costs $O(N^3)$ — this must be done once, but for a graph with millions of nodes it is prohibitive.
  2. Multiplying by $U^T$: a dense $N \times N$ matrix-vector product costs $O(N^2)$.
  3. Multiplying by $U$: another $O(N^2)$ dense product.
Eq. 37 — Spectral cost
$$ \text{Spectral: } O(N^3) \text{ setup} + O(N^2) \text{ per signal} \tag{37} $$

For a social network with $N = 10^6$ nodes, the eigendecomposition alone would require $\sim 10^{18}$ operations. Even storing $U$ as a dense matrix would need $10^{12}$ floating-point numbers ($\sim$8 TB at double precision). This is completely impractical.

Polynomial filters $p_\theta(L)\, x = \sum_{k=0}^d \theta_k L^k x$ avoid all of this. The Laplacian $L$ is a sparse matrix: it has at most $2|E| + N$ nonzeros (two per edge, plus the diagonal). The key insight is that we never need to form $L^k$ explicitly. Instead, we compute the sequence of matrix-vector products iteratively:

$$ z_0 = x, \qquad z_{k+1} = L\, z_k \quad (k = 0, \ldots, d-1) $$

Each step is a single sparse matrix-vector product costing $O(|E|)$ operations, and $z_k = L^k x$. The output is $y = \theta_0 z_0 + \theta_1 z_1 + \cdots + \theta_d z_d$, which requires $O(dN)$ additional operations for the weighted sum. The total cost:

Eq. 38 — Polynomial cost
$$ \text{Polynomial: } O(d|E|) \text{ per signal, no eigendecomposition} \tag{38} $$

For a sparse graph where $|E| = O(N)$ (e.g., bounded-degree graphs, which are the norm in practice), this becomes $O(dN)$ — linear in the number of nodes, compared to the spectral approach's $O(N^2)$ per signal and $O(N^3)$ setup.

The remaining question is: how should we choose the polynomial coefficients $\theta_0, \ldots, \theta_d$? One approach would be to hand-design them (e.g., approximate a desired spectral response). The machine learning approach is more powerful: make them learnable parameters of a neural network, and let gradient descent find the optimal filter from data.
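The iterative scheme above fits in a few lines. The sketch below (pure NumPy, on the 6-vertex running graph; the coefficients $\theta$ are illustrative) applies $p_\theta(L)$ using only sparse matrix-vector products over the edge list, never forming $L^k$:

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N = 6
deg = np.zeros(N)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

def lap_matvec(x):
    """y = (D - A) x in one pass over the edge list: O(|E| + N)."""
    y = deg * x
    for i, j in edges:
        y[i] -= x[j]
        y[j] -= x[i]
    return y

def poly_filter(x, theta):
    """y = sum_k theta[k] L^k x via d sparse mat-vecs (z_{k+1} = L z_k)."""
    z = x.copy()                  # z_0 = x
    y = theta[0] * z
    for t in theta[1:]:
        z = lap_matvec(z)         # z_{k+1} = L z_k
        y = y + t * z
    return y

theta = [1.0, -0.5, 0.25]         # illustrative coefficients
x = np.arange(N, dtype=float)
y = poly_filter(x, theta)

# Cross-check against the dense computation theta0*I + theta1*L + theta2*L^2
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(deg) - A
y_dense = (theta[0] * np.eye(N) + theta[1] * L + theta[2] * L @ L) @ x
print(np.allclose(y, y_dense))   # True
```

Note that `lap_matvec` touches each edge twice and each vertex once, which is exactly the $O(|E|)$ per-multiply cost claimed in Eq. 38.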


24. ChebNet: Chebyshev Polynomial Filters

Defferrard, Bresson, and Vandergheynst (2016) proposed ChebNet, the first practical spectral graph neural network. Their key insight: parameterize the polynomial filter not in the monomial basis $\{1, x, x^2, \ldots\}$ but in the Chebyshev basis $\{T_0, T_1, T_2, \ldots\}$.

The Chebyshev polynomials of the first kind are defined by the recurrence:

Eq. 39 — Chebyshev recurrence
$$ T_0(x) = 1, \quad T_1(x) = x, \quad T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x) \tag{39} $$

The first few: $T_0(x) = 1$, $T_1(x) = x$, $T_2(x) = 2x^2 - 1$, $T_3(x) = 4x^3 - 3x$. These polynomials oscillate between $-1$ and $+1$ on the interval $[-1, 1]$, and they are orthogonal with respect to the weight $(1 - x^2)^{-1/2}$.
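The recurrence, the closed forms, and the boundedness claim can all be cross-checked numerically; a minimal sketch:

```python
import numpy as np

def cheb(k, x):
    """Evaluate T_k(x) via the three-term recurrence (Eq. 39)."""
    t_prev, t_curr = np.ones_like(x), x
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
    return t_curr

x = np.linspace(-1, 1, 101)

# Closed forms quoted in the text
assert np.allclose(cheb(2, x), 2 * x**2 - 1)
assert np.allclose(cheb(3, x), 4 * x**3 - 3 * x)

# Equivalent trigonometric characterization: T_k(cos t) = cos(k t)
t = np.arccos(x)
assert np.allclose(cheb(3, x), np.cos(3 * t))

# Bounded by 1 in absolute value on [-1, 1]
assert np.all(np.abs(cheb(5, x)) <= 1 + 1e-12)
```

The same recurrence, with $x$ replaced by the matrix $\tilde{L}$ and scalars by vectors, is exactly what the ChebNet layer below computes.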

Figure 6.1. Chebyshev polynomials $T_0$ through $T_3$ on $[-1,1]$. All oscillate between $-1$ and $+1$, providing a numerically stable basis for polynomial approximation. Compare with the monomial basis $\{1, x, x^2, x^3\}$, where $x^k$ grows unboundedly for $|x| > 1$.

Why Chebyshev? The monomial basis $\{I, L, L^2, \ldots\}$ is numerically unstable: the entries of $L^k$ grow exponentially with $k$ (roughly as $\lambda_{\max}^k$), leading to catastrophic cancellation when forming the weighted sum. Chebyshev polynomials, by contrast, stay bounded on $[-1, 1]$, so if we first rescale the eigenvalues to $[-1, 1]$, the computation is numerically stable.

The rescaling maps eigenvalues from $[0, \lambda_{\max}]$ to $[-1, 1]$:

$$ \tilde{L} = \frac{2L}{\lambda_{\max}} - I $$

The eigenvalues of $\tilde{L}$ are $\tilde{\lambda}_k = 2\lambda_k / \lambda_{\max} - 1 \in [-1, 1]$. The ChebNet filter is then:

Eq. 40 — ChebNet filter
$$ p_\theta(\tilde{L}) = \sum_{k=0}^{K} \theta_k\, T_k(\tilde{L}), \qquad \tilde{L} = \frac{2L}{\lambda_{\max}} - I \tag{40} $$

The Chebyshev recurrence translates directly to matrices: $T_0(\tilde{L}) = I$, $T_1(\tilde{L}) = \tilde{L}$, and $T_{k+1}(\tilde{L}) = 2\tilde{L}\, T_k(\tilde{L}) - T_{k-1}(\tilde{L})$. Each step requires one sparse matrix-vector multiply (by $\tilde{L}$), so computing $p_\theta(\tilde{L})\, x$ costs $O(K|E|)$ — the same as the monomial basis, but with much better numerical behavior.

The ChebNet architecture stacks these polynomial filters with nonlinearities. Each layer applies a Chebyshev polynomial filter of degree $K$ to each input channel, sums across channels, and applies a pointwise nonlinearity (ReLU). The filter coefficients $\theta_k$ are the learnable parameters, trained by backpropagation. The degree $K$ controls the receptive field: a $K$-degree polynomial mixes information within $K$-hop neighborhoods.

In PyTorch pseudocode, a single ChebNet layer looks like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChebNetLayer(nn.Module):
    def __init__(self, in_ch, out_ch, K):
        super().__init__()
        # One set of Chebyshev coefficients per input-output channel pair
        self.theta = nn.Parameter(torch.randn(K + 1, in_ch, out_ch))

    def forward(self, x, L_tilde):
        # x: (N, in_ch)    L_tilde: (N, N) sparse, rescaled to [-1, 1]

        # Build Chebyshev basis signals via recurrence
        T_0 = x                                   # T_0(L̃) x = x
        T_1 = L_tilde @ x                         # T_1(L̃) x = L̃ x
        out = T_0 @ self.theta[0] + T_1 @ self.theta[1]

        T_prev, T_curr = T_0, T_1
        for k in range(2, self.theta.size(0)):
            T_next = 2 * L_tilde @ T_curr - T_prev  # Chebyshev recurrence
            out = out + T_next @ self.theta[k]
            T_prev, T_curr = T_curr, T_next

        return F.relu(out)                        # (N, out_ch)

The key feature: the only graph operation is the sparse multiply L_tilde @ T_curr, repeated $K$ times. No eigendecomposition is needed, and the cost is $O(K|E|)$ per layer.

ChebNet in practice. Defferrard et al. used just 2 ChebNet layers with a large polynomial degree $K = 25$, followed by graph coarsening and a fully connected layer. On MNIST (reshaped into an 8-nearest-neighbor graph), this achieved $\sim$99.2% accuracy. A single layer of degree $K$ already has a $K$-hop receptive field, so two layers with $K = 25$ can reach 50-hop neighborhoods — far more than is needed for most graphs. This highlights the tradeoff at the heart of the next section: polynomial degree $K$ and network depth $L$ both enlarge the receptive field, with total reach $\sim K \times L$ hops. You can go wide-and-shallow (large $K$, few layers) or narrow-and-deep (small $K$, many layers).

Curiously, later benchmarks on node classification tasks (Cora, Citeseer) found that ChebNet's accuracy actually degrades when $K$ increases beyond $\sim$3. He et al. (2022) showed that ChebNet on Cora drops from 80.5% at $K = 2$ to 74.9% at $K = 10$ — the learned Chebyshev coefficients become unstable at higher orders, a form of spectral overfitting. This finding motivated several fixes (ChebNetII's interpolation reparameterization, Stable-ChebNet's antisymmetric weight constraints), but it also partly explains why Kipf and Welling's radical simplification to $K = 1$ worked so well: for many tasks, a large receptive field per layer is unnecessary, and the instability it introduces can hurt more than the expressiveness helps.


25. GCN: The Degree-1 Simplification

Kipf and Welling (2017) made a bold simplification: take ChebNet with $K = 1$ (a degree-1 polynomial) and stack multiple layers instead of using a single high-degree polynomial. This trades filter expressiveness per layer for depth.

With $K = 1$, the filter has only two parameters:

$$ p_\theta(\tilde{L}) = \theta_0\, T_0(\tilde{L}) + \theta_1\, T_1(\tilde{L}) = \theta_0\, I + \theta_1\, \tilde{L} $$

Substituting $\tilde{L} = 2L/\lambda_{\max} - I$:

$$ p_\theta = \theta_0 I + \theta_1 \left(\frac{2L}{\lambda_{\max}} - I\right) = (\theta_0 - \theta_1) I + \frac{2\theta_1}{\lambda_{\max}} L $$

Kipf and Welling apply a series of further simplifications, working with the normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$, whose eigenvalues lie in $[0, 2]$. First, they approximate $\lambda_{\max} \approx 2$ (a reasonable approximation for many graphs, and exact for bipartite graphs). This gives:

$$ p_\theta \approx (\theta_0 - \theta_1) I + \theta_1 L = (\theta_0 - \theta_1) I + \theta_1 \left(I - D^{-1/2} A D^{-1/2}\right) = \theta_0 I - \theta_1 D^{-1/2} A D^{-1/2} $$

They further reduce to a single parameter by setting $\theta = \theta_0 = -\theta_1$, giving:

$$ y = \theta\left(I + D^{-1/2} A D^{-1/2}\right) x $$

where $D^{-1/2} A D^{-1/2}$ is the symmetrically normalized adjacency matrix (entry $(i,j)$ is $1/\sqrt{d_i d_j}$ if $i \sim j$, and 0 otherwise). Its eigenvalues lie in $[-1, 1]$, and $I + D^{-1/2} A D^{-1/2}$ has eigenvalues in $[0, 2]$.

The final ingredient is the renormalization trick: replace $I + D^{-1/2} A D^{-1/2}$ with the normalized adjacency of the graph-with-self-loops:

Eq. 41 — Renormalization trick
$$ \tilde{A} = A + I_N, \quad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} \tag{41} $$

Adding self-loops to every vertex ($\tilde{A} = A + I_N$) means each vertex is now connected to itself, and the updated degree matrix is $\tilde{D} = D + I_N$. The renormalized propagation matrix $\tilde{D}^{-1/2} \tilde{A}\, \tilde{D}^{-1/2}$ has better spectral properties than $I + D^{-1/2} A D^{-1/2}$: its eigenvalues lie in $(-1, 1]$ rather than $[0, 2]$, avoiding the problematic eigenvalue at 2, which would amplify its component of the signal at every layer.
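As a sanity check on these spectral claims, one can compare the two propagation matrices on the 6-vertex running graph (a sketch, with the adjacency hard-coded):

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N = 6
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

def sym_norm(M):
    """D^{-1/2} M D^{-1/2} where D is the diagonal of row sums of M."""
    d_inv_sqrt = 1.0 / np.sqrt(M.sum(axis=1))
    return d_inv_sqrt[:, None] * M * d_inv_sqrt[None, :]

pre = np.eye(N) + sym_norm(A)          # I + D^{-1/2} A D^{-1/2}
post = sym_norm(A + np.eye(N))         # D̃^{-1/2} Ã D̃^{-1/2} (renormalized)

print("before:", np.round(np.sort(np.linalg.eigvalsh(pre)), 3))
print("after: ", np.round(np.sort(np.linalg.eigvalsh(post)), 3))
```

For any connected graph, the largest eigenvalue of the pre-renormalization matrix is exactly 2, while the renormalized matrix tops out at 1 and stays strictly above $-1$.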

The full GCN layer, operating on multi-channel node features $H^{(\ell)} \in \R^{N \times F}$ (where $F$ is the number of features per node), is:

Eq. 42 — GCN layer
$$ H^{(\ell+1)} = \sigma\!\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}\, H^{(\ell)}\, W^{(\ell)}\right) \tag{42} $$

Here $W^{(\ell)} \in \R^{F_\ell \times F_{\ell+1}}$ is a learnable weight matrix that mixes the $F_\ell$ input features into $F_{\ell+1}$ output features, and $\sigma$ is a pointwise nonlinearity (typically ReLU). The graph-dependent part $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ is fixed (computed once from the graph); only $W^{(\ell)}$ is learned.

GCN Layer (Kipf & Welling 2017)

$\tilde{A} = A + I$, $\tilde{D}$ = degree matrix of $\tilde{A}$, then $H^{(\ell+1)} = \sigma(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}\, H^{(\ell)}\, W^{(\ell)})$. This is a degree-1 polynomial in the normalized Laplacian with self-loops. Each layer aggregates 1-hop neighbor features, transforms with a learnable weight, and applies a nonlinearity. Stacking $K$ layers gives a $K$-hop receptive field.

In PyTorch pseudocode:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.W = nn.Parameter(torch.randn(in_ch, out_ch))

    def forward(self, H, A_hat):
        # H: (N, in_ch)    A_hat: (N, N) sparse, = D̃^{-1/2} Ã D̃^{-1/2}
        return F.relu(A_hat @ H @ self.W)         # (N, out_ch)


# Precompute the normalized adjacency (done once)
A_tilde = A + torch.eye(N)                     # add self-loops
deg_tilde = A_tilde.sum(1)                     # diagonal of D̃
D_inv_sqrt = torch.diag(deg_tilde.pow(-0.5))   # D̃^{-1/2}
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt      # fixed, not learned

The entire graph-dependent computation is the single sparse multiply A_hat @ H — a weighted average of neighbor features. The learnable part is the dense multiply by W, which is graph-independent. Compare with ChebNet: GCN trades the $K$-step Chebyshev recurrence for a single propagation step, relying on depth (stacking layers) instead of filter degree for multi-hop reach.

The progression from theory to practice follows a clear chain of simplifications: general spectral filter ($N$ parameters, $O(N^3)$ cost) $\to$ polynomial filter ($d+1$ parameters, $O(d|E|)$ cost) $\to$ Chebyshev polynomial ($K+1$ parameters, numerically stable) $\to$ degree-1 Chebyshev with renormalization (weight matrix $W$ parameters, $O(|E|)$ per layer).


26. Message Passing as a Consequence

Let us unpack what the GCN layer actually computes for each individual node. Writing out row $i$ of the matrix equation $H^{(\ell+1)} = \sigma(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} H^{(\ell)} W^{(\ell)})$:

Eq. 43 — GCN per-node update
$$ h_i^{(\ell+1)} = \sigma\!\left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{\sqrt{\tilde{d}_i\, \tilde{d}_j}}\, h_j^{(\ell)}\, W^{(\ell)}\right) \tag{43} $$

Here $\mathcal{N}(i)$ is the set of neighbors of $i$ in the original graph, $\tilde{d}_i = d_i + 1$ is the degree in the self-loop-augmented graph, and $h_j^{(\ell)} \in \R^{F_\ell}$ is the feature vector of node $j$ at layer $\ell$. The sum includes $j = i$ because of the self-loops.
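Eq. 43 is just row $i$ of the matrix form in Eq. 42, which is easy to confirm numerically; the random features and weights in this sketch are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N, F_in, F_out = 6, 2, 3
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

A_t = A + np.eye(N)                      # Ã = A + I
d_t = A_t.sum(axis=1)                    # d̃_i = d_i + 1
A_hat = A_t / np.sqrt(np.outer(d_t, d_t))  # entry (i,j) = 1/sqrt(d̃_i d̃_j)

H = rng.standard_normal((N, F_in))
W = rng.standard_normal((F_in, F_out))
relu = lambda z: np.maximum(z, 0.0)

# Matrix form (Eq. 42)
H_next = relu(A_hat @ H @ W)

# Per-node form (Eq. 43): sum over N(i) ∪ {i} with 1/sqrt(d̃_i d̃_j) weights
for i in range(N):
    nbrs = [j for j in range(N) if A_t[i, j] > 0]   # includes i (self-loop)
    msg = sum(H[j] / np.sqrt(d_t[i] * d_t[j]) for j in nbrs)
    assert np.allclose(relu(msg @ W), H_next[i])
```

The per-node loop is the message-passing view; the single line `A_hat @ H @ W` is the same computation vectorized over all nodes.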

This computation has a natural interpretation as message passing:

  1. AGGREGATE: Each node $i$ collects ("receives messages from") the features of its neighbors and itself. The weighting $1/\sqrt{\tilde{d}_i \tilde{d}_j}$ normalizes by the geometric mean of the degrees — high-degree nodes send weaker messages to prevent them from dominating.
  2. TRANSFORM: The aggregated features are multiplied by the learnable weight matrix $W^{(\ell)}$, which mixes and transforms the feature channels.
  3. ACTIVATE: A pointwise nonlinearity $\sigma$ (e.g., ReLU) is applied, introducing the non-linearity needed for the network to learn complex functions.
Figure 6.2. The GCN update for a single node $i$: (1) aggregate degree-normalized features from neighbors and self, (2) transform by the learned weight matrix $W$, (3) apply a pointwise nonlinearity $\sigma$.

Spectral origins, pragmatic arrival. The path from spectral convolutions to the GCN layer passed through several approximations ($\lambda_{\max} \approx 2$, collapsing two parameters to one, the renormalization trick) that each lack a strong a priori justification. Message passing was not deduced from spectral theory — it was reached by a chain of simplifications that happen to work well in practice. Still, the spectral starting point is what suggested the form: without polynomial filters in the Laplacian, there would be no reason to expect that "average your neighbors' features and apply a linear map" is a sensible layer design.

Many papers present message passing as a starting axiom ("nodes should aggregate information from their neighbors"). The spectral perspective does not derive this axiom, but it does provide a useful lens: the GCN layer is the vertex-domain implementation of a degree-1 polynomial filter in the Laplacian, and the aggregation weights $1/\sqrt{\tilde{d}_i \tilde{d}_j}$ come from the symmetric normalization, not from an ad hoc design choice.

Worked Example: A Full GCN Layer on the Running Graph

Let us trace a complete forward pass through one GCN layer on our 6-vertex graph. This is the "two-triangles-with-bridge" graph with edges $\{0,1\}$, $\{0,2\}$, $\{1,2\}$, $\{2,3\}$, $\{3,4\}$, $\{3,5\}$, $\{4,5\}$.

Step 1: Input features. Suppose each node has a 2-dimensional feature vector. The input feature matrix $H^{(0)} \in \R^{6 \times 2}$:

$$ H^{(0)} = \begin{pmatrix} 1.0 & 0.0 \\ 0.5 & 0.5 \\ 0.0 & 1.0 \\ 1.0 & 1.0 \\ 0.0 & 0.5 \\ 0.5 & 0.0 \end{pmatrix} $$

Step 2: Compute $\tilde{A} = A + I$. Add self-loops to the adjacency matrix:

$$ A = \begin{pmatrix} 0&1&1&0&0&0 \\ 1&0&1&0&0&0 \\ 1&1&0&1&0&0 \\ 0&0&1&0&1&1 \\ 0&0&0&1&0&1 \\ 0&0&0&1&1&0 \end{pmatrix} \quad \Rightarrow \quad \tilde{A} = \begin{pmatrix} 1&1&1&0&0&0 \\ 1&1&1&0&0&0 \\ 1&1&1&1&0&0 \\ 0&0&1&1&1&1 \\ 0&0&0&1&1&1 \\ 0&0&0&1&1&1 \end{pmatrix} $$

Step 3: Compute $\tilde{D}$. The degree matrix of $\tilde{A}$: each diagonal entry is the original degree plus 1 (from the self-loop).

$$ \tilde{D} = \operatorname{diag}(3, 3, 4, 4, 3, 3) $$

Step 4: Compute $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$. We have $\tilde{D}^{-1/2} = \operatorname{diag}(1/\sqrt{3},\, 1/\sqrt{3},\, 1/2,\, 1/2,\, 1/\sqrt{3},\, 1/\sqrt{3})$. The normalized matrix entries are:

$$ \hat{A} = \begin{pmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{2\sqrt{3}} & 0 & 0 & 0 \\[4pt] \frac{1}{3} & \frac{1}{3} & \frac{1}{2\sqrt{3}} & 0 & 0 & 0 \\[4pt] \frac{1}{2\sqrt{3}} & \frac{1}{2\sqrt{3}} & \frac{1}{4} & \frac{1}{4} & 0 & 0 \\[4pt] 0 & 0 & \frac{1}{4} & \frac{1}{4} & \frac{1}{2\sqrt{3}} & \frac{1}{2\sqrt{3}} \\[4pt] 0 & 0 & 0 & \frac{1}{2\sqrt{3}} & \frac{1}{3} & \frac{1}{3} \\[4pt] 0 & 0 & 0 & \frac{1}{2\sqrt{3}} & \frac{1}{3} & \frac{1}{3} \end{pmatrix} $$

where $1/(2\sqrt{3}) \approx 0.2887$.

Step 5: Aggregate. Compute $Z = \hat{A}\, H^{(0)}$. For node 0, this is:

$$ z_0 = \frac{1}{3}\, h_0 + \frac{1}{3}\, h_1 + \frac{1}{2\sqrt{3}}\, h_2 = \frac{1}{3}(1.0, 0.0) + \frac{1}{3}(0.5, 0.5) + \frac{1}{2\sqrt{3}}(0.0, 1.0) $$ $$ = (0.3333 + 0.1667 + 0,\; 0 + 0.1667 + 0.2887) = (0.500,\; 0.455) $$

Computing all rows:

$$ Z = \hat{A}\, H^{(0)} \approx \begin{pmatrix} 0.500 & 0.455 \\ 0.500 & 0.455 \\ 0.683 & 0.644 \\ 0.394 & 0.644 \\ 0.455 & 0.455 \\ 0.455 & 0.455 \end{pmatrix} $$

Notice the smoothing effect: the features are now more similar across neighboring vertices than the original $H^{(0)}$.

Step 6: Transform. Choose a weight matrix $W^{(0)} \in \R^{2 \times 3}$ (mapping 2 input features to 3 output features):

$$ W^{(0)} = \begin{pmatrix} 1.0 & 0.5 & -0.5 \\ -0.5 & 1.0 & 0.5 \end{pmatrix} $$

Compute $Z W^{(0)}$. For node 0:

$$ z_0 W^{(0)} = (0.500, 0.455) \begin{pmatrix} 1.0 & 0.5 & -0.5 \\ -0.5 & 1.0 & 0.5 \end{pmatrix} = (0.500 - 0.228,\; 0.250 + 0.455,\; -0.250 + 0.228) = (0.272,\; 0.705,\; -0.022) $$

The full pre-activation matrix:

$$ Z W^{(0)} \approx \begin{pmatrix} 0.272 & 0.705 & -0.022 \\ 0.272 & 0.705 & -0.022 \\ 0.361 & 0.986 & -0.019 \\ 0.072 & 0.842 & 0.125 \\ 0.228 & 0.683 & 0.000 \\ 0.228 & 0.683 & 0.000 \end{pmatrix} $$

Step 7: Activate. Apply ReLU (set negative values to zero):

$$ H^{(1)} = \operatorname{ReLU}(Z W^{(0)}) \approx \begin{pmatrix} 0.272 & 0.705 & 0.000 \\ 0.272 & 0.705 & 0.000 \\ 0.361 & 0.986 & 0.000 \\ 0.072 & 0.842 & 0.125 \\ 0.228 & 0.683 & 0.000 \\ 0.228 & 0.683 & 0.000 \end{pmatrix} $$

This is the output of one GCN layer: each node now has a 3-dimensional feature vector that incorporates information from its 1-hop neighborhood. Note the effect of the third column of $W^{(0)}$: its pre-activation is $0.5(z_{j,2} - z_{j,1})$, so the ReLU zeroes the third channel everywhere except at vertex 3, the only node whose aggregated second feature exceeds its first.

Notice that nodes 0 and 1 have identical output features (they have the same neighborhood structure and the same aggregated input). Similarly, nodes 4 and 5 share the same output. This reflects the graph's symmetry — the GCN respects the automorphisms of the graph.
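The whole worked example fits in a few lines of NumPy, which makes it easy to double-check both the arithmetic and the symmetry claim:

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N = 6
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

A_t = A + np.eye(N)                        # Step 2: Ã = A + I
d_t = A_t.sum(axis=1)                      # Step 3: diagonal of D̃
A_hat = A_t / np.sqrt(np.outer(d_t, d_t))  # Step 4: D̃^{-1/2} Ã D̃^{-1/2}

H0 = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0],
               [1.0, 1.0], [0.0, 0.5], [0.5, 0.0]])
W0 = np.array([[1.0, 0.5, -0.5],
               [-0.5, 1.0, 0.5]])

H1 = np.maximum(A_hat @ H0 @ W0, 0.0)      # Steps 5-7 in one line
print(np.round(H1, 3))

# The symmetry observed in the text: node 0 matches node 1, node 4 matches 5
assert np.allclose(H1[0], H1[1])
assert np.allclose(H1[4], H1[5])
```

Running this reproduces the matrices above to three decimal places, and the assertions confirm that the layer respects the graph's automorphisms.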


27. Receptive Fields and Oversmoothing

Each GCN layer aggregates features from 1-hop neighbors (and self). Stacking $K$ layers gives each node access to its $K$-hop neighborhood — analogous to how a degree-$K$ polynomial in $L$ reaches $K$ hops. This is the receptive field of the network.

In our running example, vertex 0 is at distance 3 from vertex 5 (path: 0$\to$2$\to$3$\to$5). So after 3 GCN layers, vertex 0's feature incorporates information from vertex 5. After 2 layers, it does not. For a 6-vertex graph, 3 layers suffice to give every node a global view (the diameter is 3).
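This hop-counting can be verified directly. Since $\hat{A}$ has nonnegative entries (so no cancellation can occur), entry $(i, j)$ of $\hat{A}^k$ is nonzero exactly when $j$ is within $k$ hops of $i$; a sketch on the running graph:

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N = 6
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A_t = A + np.eye(N)
d_t = A_t.sum(axis=1)
A_hat = A_t / np.sqrt(np.outer(d_t, d_t))

# Row 0 of A_hat^k: which vertices can influence vertex 0 after k layers?
reach = {}
P = np.eye(N)
for k in range(1, 4):
    P = A_hat @ P                              # P = A_hat^k
    reach[k] = np.flatnonzero(P[0] > 1e-12).tolist()
    print(f"after {k} layer(s), vertex 0 sees: {reach[k]}")
```

After one layer vertex 0 sees only $\{0, 1, 2\}$; after two it gains vertex 3; after three (the diameter) it sees the whole graph, including vertex 5.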

But there is a problem. The aggregation step applies the normalized adjacency matrix $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, which is a smoothing operator: it replaces each node's features with a weighted average of its own and its neighbors' features. Iterated application of a smoothing operator converges to a uniform state.

Mathematically, the eigenvalues of $\hat{A}$ lie in $(-1, 1]$ for connected graphs with self-loops. The largest eigenvalue is $\lambda = 1$, with eigenvector proportional to $(\sqrt{\tilde{d}_0}, \sqrt{\tilde{d}_1}, \ldots, \sqrt{\tilde{d}_{N-1}})^T$. All other eigenvalues satisfy $|\lambda| < 1$. Therefore:

$$ \hat{A}^k \xrightarrow{k \to \infty} \text{projection onto the dominant eigenvector} $$

After many layers of aggregation, all node features converge to (approximately) the same vector. This is oversmoothing — the fundamental limitation of deep GNNs.

This is not a new phenomenon — it is a classical result about random walks on graphs. The matrix $\hat{A}$ is (up to the symmetric normalization) the transition matrix of a lazy random walk: at each step, a walker at node $i$ moves to a random neighbor or stays put. The stationary distribution is reached when the walk “mixes” — forgets where it started. The rate of mixing is controlled by the spectral gap $1 - \lambda_2$, where $\lambda_2$ is the second-largest eigenvalue of $\hat{A}$. A large spectral gap means fast mixing (features homogenize quickly, so even shallow GCNs oversmooth), while a small gap means slow mixing (more layers before information is lost). Oversmoothing in GCNs is, precisely, the Markov chain reaching stationarity.

In our worked example, notice how after just one layer, nodes 0 and 1 already had identical features, and nodes 4 and 5 were identical. After two or three more layers, all six nodes would have nearly the same feature vector — at which point the node features contain no useful information for distinguishing vertices.
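The mixing argument can be checked numerically on the running graph; the starting signal in this sketch is an arbitrary illustrative choice:

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N = 6
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A_t = A + np.eye(N)
d_t = A_t.sum(axis=1)
A_hat = A_t / np.sqrt(np.outer(d_t, d_t))

# Spectral gap: the second-largest eigenvalue controls the mixing rate
lam = np.linalg.eigvalsh(A_hat)          # sorted ascending
print("lambda_2 =", round(lam[-2], 4))

# Repeatedly propagate an arbitrary signal and watch it homogenize.
# Dividing by sqrt(d̃) removes the shape of the dominant eigenvector,
# so the remaining spread measures the distance from stationarity.
x = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
spreads = []
for _ in range(80):
    ratio = x / np.sqrt(d_t)
    spreads.append(ratio.max() - ratio.min())
    x = A_hat @ x

print("initial spread:", round(spreads[0], 3), " final spread:", spreads[-1])
```

The spread decays roughly like $\lambda_2^k$, so the printed final value is tiny: after a few dozen propagation steps the signal is indistinguishable from the stationary one, which is oversmoothing in miniature.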

Oversmoothing means that simply stacking more GCN layers does not always improve performance. Beyond a certain depth (typically 2-4 layers for many benchmarks), additional layers actually hurt because they wash out node-level information. This is in stark contrast to CNNs on images, where dozens or hundreds of layers are standard.

The most direct remedy is skip connections, inspired by ResNet. Instead of the standard GCN update, add a residual path:

Eq. 44 — Skip connection
$$ h_i^{(\ell+1)} = h_i^{(\ell)} + \sigma\!\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}\, H^{(\ell)}\, W^{(\ell)}\right)_i \tag{44} $$

The additive $h_i^{(\ell)}$ ensures that the original features are always accessible, preventing them from being completely averaged away. Other approaches to oversmoothing include:

  1. DropEdge: randomly delete a fraction of edges at each training step, slowing the mixing of the underlying random walk.
  2. PairNorm: renormalize features after each layer so that the total pairwise feature distance stays constant.
  3. Initial residual connections (as in GCNII): mix a fraction of the input features $H^{(0)}$ back in at every layer.
  4. Jumping knowledge: combine the representations from all layers at the output, so information from shallow layers survives.

The oversmoothing phenomenon has a clean spectral interpretation. The GCN propagation is a low-pass filter (it preserves the low-frequency components and attenuates the high-frequency ones). Stacking layers applies this low-pass filter repeatedly, progressively killing off all frequencies except the DC component. The remedy is to either (a) preserve high-frequency information through skip connections, or (b) limit the depth of the network and compensate with wider layers.

This completes the journey from spectral theory to practical neural networks. We began with the graph Laplacian's eigendecomposition, defined spectral convolutions, discovered that polynomial filters are computationally tractable, specialized to Chebyshev and then degree-1 polynomials, and arrived at the GCN layer — which, unpacked, is simply message passing. Along the way, we saw the fundamental tradeoff: depth gives larger receptive fields, but too much depth destroys information through oversmoothing.

In the next chapter, we will step back and survey the complete chain of analogies between $\ZN$ and general graphs, assess what was preserved and what was lost in the transition, and preview the modern landscape of graph neural networks that has grown from these foundations.