Chapter 02

Message Passing on Graphs

From feedforward networks to graph neural networks — the GCN layer deconstructed with a full numerical walkthrough on our running example.

In this chapter
  1. Quick Review: Feedforward Neural Networks
  2. From Feedforward to Graph
  3. Building the Normalized Adjacency
  4. Full Numerical Walkthrough
  5. Stacking Layers and the Receptive Field
  6. The GCN-Laplacian Connection

Quick Review: Feedforward Neural Networks

Before diving into GNNs, let's recall the simplest neural network architecture — the feedforward network (multilayer perceptron). If this is already familiar, skim through; the purpose is to establish the notation and identify exactly what changes when we move to graphs.

A Single Feedforward Layer

Given an input vector $\mathbf{x} \in \mathbb{R}^{d_{\text{in}}}$, one layer computes:

$$\mathbf{y} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$$

where $\mathbf{W} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is a learnable weight matrix, $\mathbf{b} \in \mathbb{R}^{d_{\text{out}}}$ is a learnable bias, and $\sigma$ is a nonlinear activation function (ReLU, tanh, etc.).
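As a concrete reference point, here is a minimal NumPy sketch of such a layer (the shapes and values are illustrative, not taken from the running example):

```python
import numpy as np

def feedforward_layer(x, W, b):
    """One feedforward layer: y = sigma(W x + b), with sigma = ReLU."""
    return np.maximum(0.0, W @ x + b)

rng = np.random.default_rng(0)
d_in, d_out = 3, 2
W = rng.standard_normal((d_out, d_in))  # learnable weights
b = np.zeros(d_out)                     # learnable bias
x = np.array([1.0, 0.5, -0.3])          # one input vector

y = feedforward_layer(x, W, b)
print(y.shape)  # (2,)
```

Note that the function sees one vector at a time: nothing in it could ever mix information between two different inputs.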

[Figure 3.1: a feedforward layer. Input x = (x₁, x₂, x₃), weights W and bias b, then σ(·) (ReLU), giving output y = (y₁, y₂). The key ingredients: 1. linear transform (W, b); 2. nonlinearity (σ). What's missing? Each input is processed independently; there is no communication between nodes.]
Figure 3.1. A single feedforward layer transforms each input independently: linear map + nonlinearity. If we applied this to each vertex of a graph, vertex 0 would have no idea what vertex 1's features look like. A GNN adds exactly one ingredient: aggregation from neighbors.

The critical observation: a feedforward network applied to graph vertices would process each vertex in isolation. Vertex $v_0$ at 50°C would have no idea that its neighbor $v_1$ is at 20°C. This is useless for tasks like heat conduction or social influence — where the whole point is that neighbors affect each other.


From Feedforward to Graph: Adding Neighborhood Aggregation

A graph neural network adds one fundamental operation before the linear transform: aggregate features from neighbors. Here is the general message passing framework:

GNN Message Passing — One Layer (General Form)
$$\mathbf{h}_v^{(\ell+1)} = \phi\!\left(\mathbf{h}_v^{(\ell)},\; \bigoplus_{w \in \mathcal{N}(v)} \psi\!\left(\mathbf{h}_v^{(\ell)}, \mathbf{h}_w^{(\ell)}\right)\right)$$

Three components:

$\psi$ = message function: computes a "message" from neighbor $w$ to node $v$
$\bigoplus$ = aggregation: combines all incoming messages (sum, mean, max, ...)
$\phi$ = update function: combines old features with aggregated messages to produce new features
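A minimal sketch of this framework in NumPy, using sum as $\bigoplus$ and deliberately simple choices for $\psi$ and $\phi$ (the functions are illustrative; in a trained GNN they would contain learnable parameters):

```python
import numpy as np

# The running 4-node graph as an adjacency list, with 2-D features.
neighbors = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
h = {0: np.array([1.0, 0.2]), 1: np.array([0.4, 0.8]),
     2: np.array([0.7, 0.3]), 3: np.array([0.2, 0.9])}

def psi(h_v, h_w):
    """Message function: here, simply forward the neighbor's features."""
    return h_w

def phi(h_v, m):
    """Update function: here, average old features with the messages."""
    return 0.5 * (h_v + m)

# One layer of message passing; the sum over neighbors is the aggregation.
h_next = {v: phi(h[v], sum(psi(h[v], h[w]) for w in neighbors[v]))
          for v in neighbors}
```

For vertex 1, for instance, this computes $\phi(h_1, h_0 + h_2)$: old features blended with the summed messages from its two neighbors.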

This looks abstract, so let's see the most concrete and widely-used instantiation: the Graph Convolutional Network (GCN) of Kipf & Welling (2017).

The GCN Layer — Making it Concrete

In a GCN, the message, aggregation, and update functions are collapsed into a single elegant matrix operation:

GCN Layer (Kipf & Welling, 2017)
$$\mathbf{H}^{(\ell+1)} = \sigma\!\left(\hat{\mathbf{A}} \, \mathbf{H}^{(\ell)} \, \mathbf{W}^{(\ell)}\right)$$

where:

$\mathbf{H}^{(\ell)} \in \mathbb{R}^{n \times d_\ell}$ = feature matrix (row $v$ = feature vector of vertex $v$ at layer $\ell$)
$\mathbf{W}^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell+1}}$ = learnable weight matrix (shared across all vertices)
$\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$ = normalized adjacency with self-loops
$\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ = adjacency + self-loops,   $\tilde{\mathbf{D}}_{ii} = \sum_j \tilde{A}_{ij}$
$\sigma$ = activation function (typically ReLU)

Connecting to feedforward intuition: Think of the feature matrix $\mathbf{H}$ as a batch of $n$ samples, one per vertex. Without any graph structure ($\mathbf{A} = 0$, so $\hat{\mathbf{A}} = \mathbf{I}$), the GCN equation reduces to $\sigma(\mathbf{I} \cdot \mathbf{H} \cdot \mathbf{W}) = \sigma(\mathbf{H}\mathbf{W})$ — each row of $\mathbf{H}$ gets multiplied by $\mathbf{W}$ independently. That's just a standard feedforward layer applied to each vertex as if it were a separate sample in a batch. No vertex knows any other vertex exists.

Now add edges. The matrix $\hat{\mathbf{A}}$ is no longer the identity — it has off-diagonal entries wherever edges exist. Multiplying $\hat{\mathbf{A}}\mathbf{H}$ mixes the rows of $\mathbf{H}$ together before the linear transform. It's as if the samples in your batch started talking to each other: "Hey, I'm at 50°C, what about you?" The graph structure determines who talks to whom, and the normalization determines how much each neighbor's voice counts.

In short: a GCN is a feedforward network where the batch samples are allowed to exchange information along edges before each layer.
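This reduction is easy to verify. A sketch (random values, illustrative shapes) showing that with $\hat{\mathbf{A}} = \mathbf{I}$ the GCN layer is exactly a per-vertex feedforward layer:

```python
import numpy as np

def gcn_layer(A_hat, H, W):
    """One GCN layer: ReLU(A_hat @ H @ W)."""
    return np.maximum(0.0, A_hat @ H @ W)

rng = np.random.default_rng(1)
n, d_in, d_out = 4, 2, 3
H = rng.standard_normal((n, d_in))
W = rng.standard_normal((d_in, d_out))

# No edges: A_hat = I, so the layer ignores the graph entirely and each
# row of H is transformed on its own, like a sample in a batch.
out = gcn_layer(np.eye(n), H, W)
assert np.allclose(out, np.maximum(0.0, H @ W))
```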

The other extreme — a fully connected graph: What if every node is connected to every other? With self-loops added, every $\tilde{d}_i = n$ (degree $n-1$ plus the self-loop), so the symmetric normalization gives $\hat{A}_{ij} = 1/n$ for every entry. Now $\hat{\mathbf{A}}\mathbf{H}$ replaces each vertex's features with the uniform average over all vertices. Every vertex gets the same aggregated representation — all local structure is erased.

Physicists will recognize this as the mean field approximation. In the Ising model, each spin interacts with specific neighbors through the lattice. Mean field theory replaces those local interactions with a single effective field: the average over all spins. This is mathematically equivalent to putting the system on a fully connected graph — every spin talks to every other spin equally. The approximation becomes exact in the limit of infinite neighbors (infinite dimensions), precisely because the true local field converges to the global average.

This gives us three regimes to reason about what $\hat{\mathbf{A}}$ does:

No edges ($\hat{\mathbf{A}} = \mathbf{I}$): independent particles — each vertex ignores all others
Sparse graph ($\hat{\mathbf{A}}$ has structure): real interactions — local neighborhood determines each vertex's update
Fully connected ($\hat{\mathbf{A}} = \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$): mean field — every vertex sees the same global average, local structure is lost

The power of GNNs lies in the middle regime: the graph encodes which interactions matter, preserving local structure that mean field theory deliberately throws away.
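The mean-field regime can be checked directly; a small sketch using the feature matrix from Figure 6.1:

```python
import numpy as np

n = 4
A_hat_mf = np.full((n, n), 1.0 / n)   # fully connected: every entry 1/n
H = np.array([[1.0, 0.2], [0.4, 0.8], [0.7, 0.3], [0.2, 0.9]])

mixed = A_hat_mf @ H
# Every row is now the same global average, (0.575, 0.55):
# local structure is erased in a single step.
assert np.allclose(mixed, H.mean(axis=0))
```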

Let's unpack what each piece does. Reading the equation left to right:

[Figure 4.1: reading σ(Â · H · W) left to right. ① Â·H: mix each vertex with its neighbors via a normalized weighted average; this is the graph part, structure-aware aggregation. ② (result)·W: transform each vertex's aggregated features, d_in → d_out; same as feedforward, a linear map per vertex. ③ σ(·): element-wise nonlinearity (ReLU keeps positives, zeros out negatives); same as feedforward.]
Figure 4.1. The three steps of a GCN layer. Step ① is the only step that uses the graph structure — it mixes each vertex's features with its neighbors'. Step ② is a standard feedforward transform (same weights for every vertex). Step ③ is a standard nonlinearity. Note: because matrix multiplication is associative, $\hat{\mathbf{A}}(\mathbf{H}\mathbf{W}) = (\hat{\mathbf{A}}\mathbf{H})\mathbf{W}$ — you could equally transform first and aggregate second with the same result. The aggregate-first order shown here is the standard convention.

Building $\hat{\mathbf{A}}$: The Normalized Adjacency

Before we can run a GCN layer, we need to construct $\hat{\mathbf{A}}$. Let's build it step by step for our Figure 2.1 graph.

Step 1: Add Self-Loops

Without self-loops, the aggregation would only look at neighbors, forgetting the vertex's own features. Adding $\mathbf{I}$ ensures each vertex also "sends a message to itself":

$$\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I} = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix} + \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \end{pmatrix}$$

Row sums give $\tilde{\mathbf{d}} = (4, 3, 4, 3)$: the original degree plus 1 for the self-loop.
Figure 5.1. Adding self-loops. The diagonal of $\mathbf{A}$ was zero (a vertex isn't its own neighbor); adding $\mathbf{I}$ puts 1s on the diagonal. Now each row sums to $\deg(v) + 1$.

Step 2: Symmetric Normalization

Without normalization, high-degree vertices would dominate (they sum more neighbors). The symmetric normalization $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$ produces a weighted average where each edge is downweighted by both endpoint degrees:

Normalization coefficient
$$\hat{A}_{ij} = \frac{\tilde{A}_{ij}}{\sqrt{\tilde{d}_i}\sqrt{\tilde{d}_j}}$$

What does $\mathbf{D}^{-1/2}$ actually mean? This notation looks intimidating, but for a diagonal matrix it's trivial. A diagonal matrix $\mathbf{D}$ only has nonzero entries on the diagonal: $D_{ii} = d_i$. Every matrix operation reduces to a scalar operation on each diagonal entry independently:

$\mathbf{D}^{1/2}$: take the square root → $D^{1/2}_{ii} = \sqrt{d_i}$
$\mathbf{D}^{-1}$: take the reciprocal → $D^{-1}_{ii} = 1/d_i$
$\mathbf{D}^{-1/2}$: combine both → $D^{-1/2}_{ii} = 1/\sqrt{d_i}$

For our graph with $\tilde{d} = (4, 3, 4, 3)$:

$\tilde{\mathbf{D}}^{-1/2} = \text{diag}\!\left(\frac{1}{\sqrt{4}},\; \frac{1}{\sqrt{3}},\; \frac{1}{\sqrt{4}},\; \frac{1}{\sqrt{3}}\right) = \text{diag}(0.500,\; 0.577,\; 0.500,\; 0.577)$

This only works because $\mathbf{D}$ is diagonal. For a general matrix, $\mathbf{M}^{-1/2}$ requires eigendecomposition — but degree matrices are always diagonal, so we never need that here.
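Putting steps 1 and 2 together, a short sketch that builds $\hat{\mathbf{A}}$ for the running 4-node graph:

```python
import numpy as np

A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

A_tilde = A + np.eye(4)                       # step 1: add self-loops
d_tilde = A_tilde.sum(axis=1)                 # [4, 3, 4, 3]
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))  # diagonal: just 1/sqrt(d_i)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # step 2: normalize

print(np.round(A_hat, 3))
```

The printed matrix matches Figure 5.2: 0.250 on the degree-4 diagonal entries, 0.333 on the degree-3 ones, 0.289 on the cross terms.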

Normalization coefficients: $\hat{A}_{ij} = 1/(\sqrt{\tilde{d}_i}\,\sqrt{\tilde{d}_j})$ wherever $\tilde{A}_{ij} = 1$. For row 0:

$\hat{A}_{00} = 1/(\sqrt{4}\cdot\sqrt{4}) = 1/4 = 0.250$ (self)
$\hat{A}_{01} = 1/(\sqrt{4}\cdot\sqrt{3}) = 1/(2\sqrt{3}) \approx 0.289$ ($v_0 \leftrightarrow v_1$)
$\hat{A}_{02} = 1/(\sqrt{4}\cdot\sqrt{4}) = 1/4 = 0.250$ ($v_0 \leftrightarrow v_2$)
$\hat{A}_{03} = 1/(\sqrt{4}\cdot\sqrt{3}) \approx 0.289$ ($v_0 \leftrightarrow v_3$)

The row sum for $v_0$ is $0.250 + 0.289 + 0.250 + 0.289 = 1.077 \approx 1$ (an approximate average). The full matrix:

$$\hat{\mathbf{A}} = \begin{pmatrix} 0.250 & 0.289 & 0.250 & 0.289 \\ 0.289 & 0.333 & 0.289 & 0 \\ 0.250 & 0.289 & 0.250 & 0.289 \\ 0.289 & 0 & 0.289 & 0.333 \end{pmatrix}$$
Figure 5.2. The normalized adjacency matrix $\hat{\mathbf{A}}$. Each entry is $1/\sqrt{\tilde{d}_i \tilde{d}_j}$ where a connection exists, 0 otherwise. Row sums are approximately 1, so $\hat{\mathbf{A}}$ acts as a soft averaging operator. The degree-3 vertices (v1, v3) get slightly higher self-weight (0.333 vs 0.250) because they have fewer neighbors to dilute over.

Why symmetric? Left-multiplying by $\tilde{\mathbf{D}}^{-1}$ alone (i.e., simple row-normalization) would give an exact row-stochastic average, but it breaks the symmetry of the operator. The symmetric form $\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$ keeps $\hat{\mathbf{A}}$ symmetric, which preserves the spectral theory from Chapter 1: the eigenvalues are real, the eigenvectors are orthogonal, and the connection to the normalized Laplacian ($\mathbf{L}_0 = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$ in the self-loop-free case) is maintained.


Full Numerical Walkthrough: One GCN Layer

Now let's run a complete GCN layer on our graph, tracking every number. We'll use 2-dimensional input features (for readability) and map to 3-dimensional outputs.

Setup: Initial Features

Assign each vertex a 2D feature vector — think of these as (normalized temperature, thermal conductivity):

$$\mathbf{H}^{(0)} = \begin{pmatrix} 1.0 & 0.2 \\ 0.4 & 0.8 \\ 0.7 & 0.3 \\ 0.2 & 0.9 \end{pmatrix} \in \mathbb{R}^{4 \times 2}$$

Row $v$ holds vertex $v$'s features $(f_1, f_2)$, from $v_0$ at the top to $v_3$ at the bottom.
Figure 6.1. The initial features. Each vertex carries a 2D vector. The feature matrix $\mathbf{H}^{(0)}$ stacks all vertex features into a $4 \times 2$ matrix.

Step 1: Aggregate — $\hat{\mathbf{A}} \cdot \mathbf{H}^{(0)}$

Multiplying $\hat{\mathbf{A}} \cdot \mathbf{H}^{(0)}$ computes a normalized weighted average of each vertex's own features and its neighbors' features. Let's trace vertex $v_0$ in full detail:

The aggregated feature for $v_0$ is row 0 of $\hat{\mathbf{A}} \cdot \mathbf{H}^{(0)}$; $v_0$ receives from all 3 neighbors plus itself:

$$\hat{\mathbf{h}}(v_0) = 0.250\,\mathbf{h}(v_0) + 0.289\,\mathbf{h}(v_1) + 0.250\,\mathbf{h}(v_2) + 0.289\,\mathbf{h}(v_3)$$

self: $0.250 \times [1.0, 0.2] = [0.250, 0.050]$
from $v_1$: $0.289 \times [0.4, 0.8] = [0.115, 0.231]$
from $v_2$: $0.250 \times [0.7, 0.3] = [0.175, 0.075]$
from $v_3$: $0.289 \times [0.2, 0.9] = [0.058, 0.260]$
sum: $[0.598, 0.616]$

The input feature for $v_0$ was $[1.0, 0.2]$; after aggregation it is $[0.598, 0.616]$. Feature 1 dropped (diluted by lower-valued neighbors) while feature 2 rose (pulled up by higher-valued neighbors).
Figure 6.2. Tracing vertex $v_0$ through the aggregation step. Each neighbor contributes its feature vector, weighted by the normalization coefficient. The result is a smoothed feature that blends $v_0$'s own values with its neighborhood — exactly the kind of local averaging that makes GNNs powerful for tasks like heat prediction or community detection.

Repeating for all vertices, the full aggregated matrix is:

Aggregated features: $\hat{\mathbf{A}} \cdot \mathbf{H}^{(0)}$
$$\hat{\mathbf{A}} \cdot \mathbf{H}^{(0)} = \begin{pmatrix} 0.598 & 0.616 \\ 0.624 & 0.411 \\ 0.598 & 0.616 \\ 0.557 & 0.444 \end{pmatrix}$$

Notice that vertices $v_0$ and $v_2$ get identical aggregated features. This makes sense: both have degree 3 and the same closed neighborhood $\{v_0, v_1, v_2, v_3\}$ (each is the other's neighbor, and both also connect to $v_1$ and $v_3$), so the weighted averages coincide. This is a general property of message passing with symmetric normalization: structurally equivalent vertices get identical representations.

Step 2: Linear Transform — $(\hat{\mathbf{A}} \mathbf{H}^{(0)}) \cdot \mathbf{W}$

The weight matrix $\mathbf{W}^{(0)} \in \mathbb{R}^{2 \times 3}$ maps from 2 features to 3 features. These are the learnable parameters — the only part of the GCN that training adjusts:

The weight matrix, learned during training:

$$\mathbf{W}^{(0)} = \begin{pmatrix} 0.5 & -0.3 & 0.8 \\ -0.2 & 0.7 & 0.4 \end{pmatrix}$$

(row 1 weights input feature 1, row 2 weights input feature 2). For vertex $v_0$, $\hat{\mathbf{h}}(v_0) \cdot \mathbf{W} = [0.598, 0.616] \cdot \mathbf{W}$:

out₁ = 0.598·(0.5) + 0.616·(−0.2) = 0.299 − 0.123 = 0.176
out₂ = 0.598·(−0.3) + 0.616·(0.7) = −0.179 + 0.431 = 0.252
out₃ = 0.598·(0.8) + 0.616·(0.4) = 0.478 + 0.246 = 0.725
Figure 6.3. The linear transform projects the aggregated 2D features into a 3D output space. The same weight matrix is applied at every vertex — this is the parameter sharing that makes GNNs efficient and equivariant.

Step 3: Activation — $\sigma(\cdot)$

Apply ReLU element-wise: $\text{ReLU}(x) = \max(0, x)$. In our example, all pre-activation values happen to be positive, so nothing gets zeroed:

Final output: $\mathbf{H}^{(1)} = \text{ReLU}(\hat{\mathbf{A}} \mathbf{H}^{(0)} \mathbf{W}^{(0)})$
$$\mathbf{H}^{(1)} = \text{ReLU}\begin{pmatrix} 0.176 & 0.252 & 0.725 \\ 0.230 & 0.101 & 0.664 \\ 0.176 & 0.252 & 0.725 \\ 0.190 & 0.144 & 0.624 \end{pmatrix} = \begin{pmatrix} 0.176 & 0.252 & 0.725 \\ 0.230 & 0.101 & 0.664 \\ 0.176 & 0.252 & 0.725 \\ 0.190 & 0.144 & 0.624 \end{pmatrix}$$

Each vertex now has a 3-dimensional feature vector that encodes both its own original features and information from its neighbors. Vertex $v_0$ "knows" something about $v_1$, $v_2$, and $v_3$, even though it never saw their features directly — it received them through the aggregation step.
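The entire walkthrough fits in a few lines of NumPy; a sketch reproducing the numbers above (matrices from Figures 5.2, 6.1, and 6.3):

```python
import numpy as np

# Normalized adjacency of the running graph, built as in Section 5.
A = np.array([[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 1], [1, 0, 1, 0]], float)
A_tilde = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

H0 = np.array([[1.0, 0.2], [0.4, 0.8], [0.7, 0.3], [0.2, 0.9]])
W0 = np.array([[0.5, -0.3, 0.8],
               [-0.2, 0.7, 0.4]])

agg = A_hat @ H0                 # step 1: aggregate neighbors
H1 = np.maximum(0.0, agg @ W0)   # steps 2 and 3: transform, then ReLU

print(np.round(agg, 3))  # row 0: [0.598 0.616]
print(np.round(H1, 3))   # row 0: [0.176 0.252 0.725]
```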

The Complete Pipeline — One Picture

[Figure 6.4 pipeline: H⁰ (4×2, each vertex isolated) → Â· → Â·H⁰ (4×2, neighbors mixed in) → ×W → (Â·H⁰)·W (4×3, projected to new dimension) → ReLU → H¹ (4×3, nonlinearity applied). After 1 layer, each vertex knows about its immediate neighbors (1-hop); after 2 layers, 2-hop; after k layers, k-hop. This is the "receptive field."]
Figure 6.4. The complete data flow of one GCN layer. The 4×2 input is aggregated (mixed with neighbors), projected (2D → 3D by the learned weight matrix), and passed through ReLU. After one layer, each vertex's representation encodes its 1-hop neighborhood.

Stacking Layers and the Receptive Field

A single layer gives each vertex access to its immediate neighbors. What happens with multiple layers?

[Figure 7.1: receptive field growth for v₀. Layer 0 (input): v₀ sees only itself. After 1 layer: 1-hop (v₁, v₂, v₃). After 2 layers: the entire graph. This graph has diameter 2, so 2 layers suffice for global information; too many layers lead to "oversmoothing".]
Figure 7.1. Receptive field growth. After $k$ GCN layers, each vertex has aggregated information from all vertices within $k$ hops. On our small graph (diameter 2), two layers already give every vertex a global view. On large graphs, this creates the classic GNN trade-off: more layers = wider receptive field, but also more risk of "oversmoothing" where all vertex representations converge.
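The k-hop claim can be checked mechanically: with self-loops, the $(i, j)$ entry of $\tilde{\mathbf{A}}^k$ is nonzero exactly when $j$ is within $k$ hops of $i$. A sketch on the running graph:

```python
import numpy as np

# A + I for the running 4-node graph (Figure 5.1).
A_tilde = np.array([[1, 1, 1, 1],
                    [1, 1, 1, 0],
                    [1, 1, 1, 1],
                    [1, 0, 1, 1]], dtype=float)

reach1 = A_tilde > 0                             # 1-hop receptive fields
reach2 = np.linalg.matrix_power(A_tilde, 2) > 0  # 2-hop receptive fields

# v1 and v3 are not adjacent, so after one layer v1 does not yet see v3...
assert not reach1[1, 3]
# ...but the diameter is 2, so two layers cover the whole graph.
assert reach2.all()
```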

A Larger Example: The Two-Community Graph

Our 4-node graph has diameter 2, so the receptive field saturates after just 2 layers — every vertex already sees everything. Let's see what happens on a larger graph where depth genuinely matters: a 32-node "two-community" graph with diameter ≈ 7.

[Figure 7.2: two-community graph, 32 nodes, ~60 edges. Cluster A (v₀–v₁₅) and cluster B (v₁₆–v₃₁), 16 nodes per cluster, average degree ≈ 4 within a cluster, 2 bridge edges between clusters. Diameter ≈ 7: a path must traverse one cluster, cross a bridge, then traverse the other.]
Figure 7.2. A two-community graph with 32 vertices. Each cluster has dense internal connections (avg degree ~4), but only two bridge edges connect the clusters. Target vertex $v_0$ (highlighted, deep inside cluster A) would need 7+ hops to reach the far side of cluster B. This makes the receptive field concept non-trivial: 3 layers is not enough for $v_0$ to see the whole graph.

Architecture: A 3-Layer GCN (1 → 3 → 3 → 1)

Each vertex starts with a single scalar feature (e.g., temperature). We stack three GCN layers in an "expand → refine → compress" pattern:

[Figure 7.3: tensor shapes through the 3-layer pipeline. Input H⁰ (32×1) → layer 1 "expand" (Â, then W⁰: 1→3; output 32×3) → layer 2 "refine" (Â, then W¹: 3→3; output 32×3) → layer 3 "compress" (Â, then W²: 3→1; output 32×1), with σ after each layer.]
Figure 7.3. Data flow through 3 GCN layers. Each layer applies the same $\hat{\mathbf{A}}$ (the 32×32 normalized adjacency — fixed, not learned) followed by a different $\mathbf{W}^{(\ell)}$ (learned). The weight matrices control the dimension changes: $\mathbf{W}^{(0)} \in \mathbb{R}^{1 \times 3}$ expands from scalar to 3D, $\mathbf{W}^{(1)} \in \mathbb{R}^{3 \times 3}$ refines within 3D, and $\mathbf{W}^{(2)} \in \mathbb{R}^{3 \times 1}$ compresses back to a scalar prediction.

Tracking $v_0$'s Receptive Field Across 3 Layers

Let's follow vertex $v_0$ (deep inside cluster A) and track exactly which vertices contribute to its representation after each layer:

[Figure 7.4: v₀'s receptive field after each layer. Input (layer 0): 1 vertex, v₀ knows only itself. After layer 1: ~5 vertices (1-hop, all in cluster A). After layer 2: ~12–15 vertices (2-hop, still cluster A only). After layer 3: ~18–20 vertices (3-hop, just crossing the bridge v₇ → v₁₆ into cluster B). With diameter ≈ 7, v₀ would need 7+ layers to see the entire graph; 3 layers give ≈ 50% coverage. Contrast with the 4-node graph, where diameter 2 meant 2 layers gave 100% coverage. Partial coverage is often a feature, not a bug: for tasks like "predict v₀'s community label," local information may be all you need.]
Figure 7.4. Receptive field of $v_0$ after each GCN layer on the two-community graph. Layer 1 reaches direct neighbors (~5 vertices). Layer 2 extends to 2-hop (~12–15 vertices, still entirely within cluster A). Layer 3 finally reaches the bridge vertex $v_7$, crossing into cluster B for the first time. With diameter ≈ 7, three layers give only partial coverage — and for many tasks, this local view is sufficient and even preferable to a global one.

What the hidden dimension does: The 1 → 3 → 3 → 1 funnel isn't arbitrary. Layer 1 expands each scalar to a 3D vector so the network can learn three different "views" of the aggregated neighborhood — perhaps one channel capturing magnitude, another gradient, another curvature. Layer 2 mixes these views with the next ring of neighbors' 3D representations, allowing richer feature interactions. Layer 3 compresses back to a scalar prediction. The hidden dimension (3 here) controls the network's capacity — its ability to represent different mixing patterns. Too small and it can't distinguish neighborhoods; too large and it overfits.
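The shape bookkeeping of the 1 → 3 → 3 → 1 pipeline can be sketched as follows. The chapter's two-community edge list is not reproduced here, so a simple 32-node ring stands in as a placeholder graph; only the tensor shapes are the point:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32

# Placeholder graph: a 32-node ring (the two-community edge list from
# Figure 7.2 is not reproduced here).
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

A_tilde = A + np.eye(n)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # fixed, not learned

# Expand -> refine -> compress: W shapes (1x3), (3x3), (3x1), all learned.
weights = [rng.standard_normal((1, 3)),
           rng.standard_normal((3, 3)),
           rng.standard_normal((3, 1))]

H = rng.standard_normal((n, 1))              # one scalar feature per vertex
for W in weights:
    H = np.maximum(0.0, A_hat @ H @ W)       # one GCN layer per W

print(H.shape)  # (32, 1)
```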

Feature Smoothing and the Heat Diffusion Connection

Each application of $\hat{\mathbf{A}}$ averages a vertex's features with its neighbors' — this is literally a diffusion step. Let's visualize what happens to the feature distribution across vertices as we stack layers:

[Figure 7.5: feature smoothing across layers, one bar per vertex. Layer 0 (input): high variance, features are diverse. After layer 1: reduced variance, local neighborhoods smoothed. After layer 2: low variance, cluster-level patterns. After layer 3: very low variance, differences washing out. Physics connection: this IS heat diffusion. The aggregation Â·H diffuses signals across the graph; each layer is one time step. The learnable weights W choose what to amplify or suppress at each step, but Â always smooths. Without W, stacking k layers would give Â^k·H, literally the heat kernel at time k on the graph.]
Figure 7.5. Feature smoothing across layers. Each bar shows one vertex's feature value. Layer 0: features are diverse (original signal). Layer 1: local averaging reduces variance within neighborhoods. Layer 2: broader smoothing produces cluster-level patterns. Layer 3: inter-cluster differences begin to erode. The learnable weights $\mathbf{W}$ partially counteract this smoothing by selectively amplifying useful patterns, but the trend toward uniformity is inherent in repeated averaging.

The oversmoothing problem: If we stacked 10 layers on this 32-node graph, every vertex would see the entire graph multiple times over. All representations would converge toward the graph-wide average — vertices in cluster A would become indistinguishable from vertices in cluster B. This is the oversmoothing problem, and it's why most practical GCNs use only 2–4 layers. The learnable weights $\mathbf{W}$ can slow the convergence but cannot fully prevent it when the number of layers far exceeds the graph diameter.
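Repeatedly applying $\hat{\mathbf{A}}$ alone (no weights, no nonlinearity) makes the convergence concrete. One nuance worth knowing: the limit of $\hat{\mathbf{A}}^k \mathbf{H}$ is proportional to $\sqrt{\tilde{d}}$, a degree-weighted version of the global average, so representations end up determined by degree alone rather than by the initial features. A sketch on the 4-node graph:

```python
import numpy as np

A = np.array([[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 1], [1, 0, 1, 0]], float)
A_tilde = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

H = np.array([[1.0], [0.4], [0.7], [0.2]])   # initial scalar features
diffs = []
for _ in range(50):                          # 50 "layers" of pure smoothing
    H_next = A_hat @ H
    diffs.append(np.linalg.norm(H_next - H))
    H = H_next

# Updates die out, and H aligns with sqrt(d_tilde): the initial features
# are forgotten -- oversmoothing in its purest form.
limit_dir = np.sqrt(A_tilde.sum(axis=1)).reshape(-1, 1)
cos = (H.ravel() @ limit_dir.ravel()) / (
    np.linalg.norm(H) * np.linalg.norm(limit_dir))
print(diffs[-1], cos)
```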

Summary: Small Graph vs. Large Graph

Aspect                             | 4-node example (§6) | 32-node example (§7)
-----------------------------------|---------------------|---------------------
Graph diameter                     | 2                   | ≈ 7
Layers to saturate receptive field | 2                   | 7+
Layers used                        | 1                   | 3
Coverage after all layers          | 100% (global)       | ≈ 50% (still local)
Feature dimensions                 | 2 → 3               | 1 → 3 → 3 → 1
Key lesson                         | GCN mechanics       | Depth vs. receptive field trade-off

The GCN–Laplacian Connection

There's a deep connection between the GCN layer and the Laplacian from Chapter 1. The original (unnormalized) adjacency acts as a diffusion step, and the GCN normalization is a specific choice of diffusion rate:

GCN as a spectral filter
$$\hat{\mathbf{A}} = \mathbf{I} - \hat{\mathbf{L}}_0 \qquad \text{where}\qquad \hat{\mathbf{L}}_0 = \mathbf{I} - \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$$

So $\hat{\mathbf{A}}\mathbf{H} = (\mathbf{I} - \hat{\mathbf{L}}_0)\mathbf{H} = \mathbf{H} - \hat{\mathbf{L}}_0\mathbf{H}$. This is exactly one step of heat diffusion: the output is the input minus a Laplacian-weighted correction. High-frequency components (large eigenvalues of $\hat{\mathbf{L}}_0$) get attenuated more — the GCN layer is a low-pass filter.
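This spectral claim can be verified numerically. For the running 4-node graph, the eigenvalues of $\hat{\mathbf{L}}_0$ work out to $\{0,\ 2/3,\ 1,\ 7/6\}$, and applying $\hat{\mathbf{A}} = \mathbf{I} - \hat{\mathbf{L}}_0$ scales each eigencomponent by $1 - \lambda$, shrinking the high-frequency components the most:

```python
import numpy as np

A = np.array([[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 1], [1, 0, 1, 0]], float)
A_tilde = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

L_hat = np.eye(4) - A_hat                 # normalized Laplacian w/ self-loops
evals = np.sort(np.linalg.eigvalsh(L_hat))
print(np.round(evals, 3))                 # eigenvalues 0, 2/3, 1, 7/6

# Applying A_hat multiplies the eigencomponent for eigenvalue lam by
# (1 - lam): larger lam (higher frequency) means a smaller gain.
gains = 1.0 - evals
print(np.round(gains, 3))
```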

The bridge to TDL: A GCN layer uses $\hat{\mathbf{A}}$ (derived from the graph Laplacian $\mathbf{L}_0$) to pass messages between vertices along edges. In Chapters 3–5, we'll build boundary operators $\mathbf{B}_{1,2}$, $\mathbf{B}_{2,3}$, ... that connect edges to faces to volumes. The higher-order message passing framework replaces $\hat{\mathbf{A}}$ with Hodge Laplacians and incidence matrices, enabling messages to flow between objects of any rank — not just between adjacent vertices. That's the core innovation of topological deep learning.