§3 Quick Review: Feedforward Neural Networks
Before diving into GNNs, let's recall the simplest neural network architecture — the feedforward network (multilayer perceptron). If this is already familiar, skim through; the purpose is to establish the notation and identify exactly what changes when we move to graphs.
Given an input vector $\mathbf{x} \in \mathbb{R}^{d_{\text{in}}}$, one layer computes:
$$\mathbf{y} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$$where $\mathbf{W} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is a learnable weight matrix, $\mathbf{b} \in \mathbb{R}^{d_{\text{out}}}$ is a learnable bias, and $\sigma$ is a nonlinear activation function (ReLU, tanh, etc.).
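As a quick concrete check, here is one such layer in NumPy. The shapes and values are illustrative placeholders, not trained parameters:

```python
import numpy as np

# One feedforward layer: y = sigma(W x + b), with d_in = 2, d_out = 3.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))        # learnable weight matrix, d_out x d_in
b = np.zeros(3)                    # learnable bias
x = np.array([0.5, -1.0])          # input vector

def relu(z):                       # sigma: element-wise nonlinearity
    return np.maximum(0.0, z)

y = relu(W @ x + b)                # the layer's output, shape (3,)
```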
The critical observation: a feedforward network applied to graph vertices would process each vertex in isolation. Vertex $v_0$ at 50°C would have no idea that its neighbor $v_1$ is at 20°C. This is useless for tasks like heat conduction or social influence — where the whole point is that neighbors affect each other.
§4 From Feedforward to Graph: Adding Neighborhood Aggregation
A graph neural network adds one fundamental operation before the linear transform: aggregate features from neighbors. In the general message passing framework, layer $\ell$ updates each node $v$ as
$$\mathbf{h}_v^{(\ell+1)} = \phi\!\left(\mathbf{h}_v^{(\ell)},\; \bigoplus_{w \in \mathcal{N}(v)} \psi\big(\mathbf{h}_v^{(\ell)}, \mathbf{h}_w^{(\ell)}\big)\right)$$
Three components:
$\psi$ = message function: computes a "message" from neighbor $w$ to node $v$
$\bigoplus$ = aggregation: combines all incoming messages (sum, mean, max, ...)
$\phi$ = update function: combines old features with aggregated messages to produce new features
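The three components can be sketched directly in code. This is a toy instantiation with the simplest possible choices for $\psi$, $\bigoplus$, and $\phi$; the graph, features, and function choices are all illustrative:

```python
import numpy as np

# Toy message passing step for one node, mirroring the three components.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
h = {v: np.array([float(v), 1.0]) for v in adj}   # toy 2D features

def psi(h_v, h_w):
    return h_w                         # message: just the neighbor's features

def aggregate(messages):
    return np.mean(messages, axis=0)   # the aggregation: mean over messages

def phi(h_v, m_v):
    return 0.5 * h_v + 0.5 * m_v       # update: mix old features with aggregate

v = 0
m_v = aggregate([psi(h[v], h[w]) for w in adj[v]])
h_new = phi(h[v], m_v)                 # -> array([1., 1.])
```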
This looks abstract, so let's see the most concrete and widely-used instantiation: the Graph Convolutional Network (GCN) of Kipf & Welling (2017).
The GCN Layer — Making it Concrete
In a GCN, the message, aggregation, and update functions are collapsed into a single elegant matrix operation:
$$\mathbf{H}^{(\ell+1)} = \sigma\big(\hat{\mathbf{A}}\,\mathbf{H}^{(\ell)}\,\mathbf{W}^{(\ell)}\big)$$
where:
$\mathbf{H}^{(\ell)} \in \mathbb{R}^{n \times d_\ell}$ = feature matrix (row $v$ = feature vector of vertex $v$ at layer $\ell$)
$\mathbf{W}^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell+1}}$ = learnable weight matrix (shared across all vertices)
$\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$ = normalized adjacency with self-loops
$\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ = adjacency + self-loops, $\tilde{\mathbf{D}}_{ii} = \sum_j \tilde{A}_{ij}$
$\sigma$ = activation function (typically ReLU)
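In code, the whole layer is one expression. A minimal sketch, assuming $\hat{\mathbf{A}}$ has already been constructed (the next section builds it step by step); with $\hat{\mathbf{A}} = \mathbf{I}$, each row is transformed independently:

```python
import numpy as np

# One GCN layer as a single matrix expression: sigma(A_hat @ H @ W).
def gcn_layer(A_hat, H, W):
    return np.maximum(0.0, A_hat @ H @ W)   # sigma = ReLU

# Sanity check: A_hat = I (no edges) reduces the layer to a plain
# feedforward layer applied row by row. Values are illustrative.
A_hat = np.eye(4)
H = np.arange(8.0).reshape(4, 2)   # 4 vertices, 2 features each
W = np.ones((2, 3))                # maps 2 features -> 3 features
H_next = gcn_layer(A_hat, H, W)    # shape (4, 3); row i depends only on H[i]
```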
Connecting to feedforward intuition: Think of the feature matrix $\mathbf{H}$ as a batch of $n$ samples, one per vertex. Without any graph structure ($\mathbf{A} = 0$, so $\hat{\mathbf{A}} = \mathbf{I}$), the GCN equation reduces to $\sigma(\mathbf{I} \cdot \mathbf{H} \cdot \mathbf{W}) = \sigma(\mathbf{H}\mathbf{W})$ — each row of $\mathbf{H}$ gets multiplied by $\mathbf{W}$ independently. That's just a standard feedforward layer applied to each vertex as if it were a separate sample in a batch. No vertex knows any other vertex exists.
Now add edges. The matrix $\hat{\mathbf{A}}$ is no longer the identity — it has off-diagonal entries wherever edges exist. Multiplying $\hat{\mathbf{A}}\mathbf{H}$ mixes the rows of $\mathbf{H}$ together before the linear transform. It's as if the samples in your batch started talking to each other: "Hey, I'm at 50°C, what about you?" The graph structure determines who talks to whom, and the normalization determines how much each neighbor's voice counts.
In short: a GCN is a feedforward network where the batch samples are allowed to exchange information along edges before each layer.
The other extreme — a fully connected graph: What if every node is connected to every other? All degrees equal $n$, so the symmetric normalization gives $\hat{A}_{ij} = 1/n$ for every entry. Now $\hat{\mathbf{A}}\mathbf{H}$ replaces each vertex's features with the uniform average over all vertices. Every vertex gets the same aggregated representation — all local structure is erased.
Physicists will recognize this as the mean field approximation. In the Ising model, each spin interacts with specific neighbors through the lattice. Mean field theory replaces those local interactions with a single effective field: the average over all spins. This is mathematically equivalent to putting the system on a fully connected graph — every spin talks to every other spin equally. The approximation becomes exact in the limit of infinite neighbors (infinite dimensions), precisely because the true local field converges to the global average.
This gives us three regimes to reason about what $\hat{\mathbf{A}}$ does:
No edges ($\hat{\mathbf{A}} = \mathbf{I}$): independent particles — each vertex ignores all others
Sparse graph ($\hat{\mathbf{A}}$ has structure): real interactions — local neighborhood determines each vertex's update
Fully connected ($\hat{\mathbf{A}} = \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$): mean field — every vertex sees the same global average, local structure is lost
The power of GNNs lies in the middle regime: the graph encodes which interactions matter, preserving local structure that mean field theory deliberately throws away.
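The two extreme regimes are easy to verify numerically, on $n = 4$ vertices with a single scalar feature each (the temperature values are made up for illustration):

```python
import numpy as np

# The two extreme regimes of A_hat.
n = 4
H = np.array([[50.0], [20.0], [30.0], [40.0]])

A_hat_none = np.eye(n)                  # no edges: identity
A_hat_full = np.full((n, n), 1.0 / n)   # fully connected + self-loops: 1/n everywhere

print(A_hat_none @ H)   # unchanged: each vertex keeps its own temperature
print(A_hat_full @ H)   # every vertex collapses to the global mean, 35.0
```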
Let's unpack what each piece does, starting with how $\hat{\mathbf{A}}$ itself is constructed.
§5 Building $\hat{\mathbf{A}}$: The Normalized Adjacency
Before we can run a GCN layer, we need to construct $\hat{\mathbf{A}}$. Let's build it step by step for our Figure 2.1 graph.
Step 1: Add Self-Loops
Without self-loops, the aggregation would only look at neighbors, forgetting the vertex's own features. Adding $\mathbf{I}$ ensures each vertex also "sends a message to itself". For our graph (edges $v_0v_1$, $v_0v_2$, $v_0v_3$, $v_1v_2$, $v_2v_3$):
$$\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \end{pmatrix}$$
Step 2: Symmetric Normalization
Without normalization, high-degree vertices would dominate (they sum more neighbors). The symmetric normalization $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$ produces a weighted average where each edge is downweighted by both endpoint degrees, $\hat{A}_{ij} = \tilde{A}_{ij}/\sqrt{\tilde{d}_i \tilde{d}_j}$:
$$\hat{\mathbf{A}} = \begin{pmatrix} 0.250 & 0.289 & 0.250 & 0.289 \\ 0.289 & 0.333 & 0.289 & 0 \\ 0.250 & 0.289 & 0.250 & 0.289 \\ 0.289 & 0 & 0.289 & 0.333 \end{pmatrix}$$
What does $\mathbf{D}^{-1/2}$ actually mean? This notation looks intimidating, but for a diagonal matrix it's trivial. A diagonal matrix $\mathbf{D}$ only has nonzero entries on the diagonal: $D_{ii} = d_i$. Every matrix operation reduces to a scalar operation on each diagonal entry independently:
$\mathbf{D}^{1/2}$: take the square root → $D^{1/2}_{ii} = \sqrt{d_i}$
$\mathbf{D}^{-1}$: take the reciprocal → $D^{-1}_{ii} = 1/d_i$
$\mathbf{D}^{-1/2}$: combine both → $D^{-1/2}_{ii} = 1/\sqrt{d_i}$
For our graph with $\tilde{d} = (4, 3, 4, 3)$:
$\tilde{\mathbf{D}}^{-1/2} = \text{diag}\!\left(\frac{1}{\sqrt{4}},\; \frac{1}{\sqrt{3}},\; \frac{1}{\sqrt{4}},\; \frac{1}{\sqrt{3}}\right) = \text{diag}(0.500,\; 0.577,\; 0.500,\; 0.577)$
This only works because $\mathbf{D}$ is diagonal. For a general matrix, $\mathbf{M}^{-1/2}$ requires eigendecomposition — but degree matrices are always diagonal, so we never need that here.
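A one-line check that the element-wise rule reproduces the numbers above:

```python
import numpy as np

# D_tilde^{-1/2} for the degree vector (4, 3, 4, 3): element-wise on the diagonal.
d = np.array([4.0, 3.0, 4.0, 3.0])
D_inv_sqrt = np.diag(d ** -0.5)
print(np.round(np.diag(D_inv_sqrt), 3))   # 0.5, 0.577, 0.5, 0.577 (matches the text)
```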
Why symmetric? Left-multiplying by $\tilde{\mathbf{D}}^{-1}$ alone (i.e., simple row-normalization) would give an exact row-stochastic average, but it breaks the symmetry of the operator. The symmetric form $\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$ keeps $\hat{\mathbf{A}}$ symmetric, which preserves the spectral theory from Chapter 1 — the eigenvalues are real, the eigenvectors are orthogonal, and the connection to the Laplacian ($\mathbf{L}_0 = \mathbf{I} - \hat{\mathbf{A}}_{\text{(without self-loops)}}$) is maintained.
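A small numerical check of the symmetry claim on the 4-vertex example. The edge set used here ($v_0v_1$, $v_0v_2$, $v_0v_3$, $v_1v_2$, $v_2v_3$) is inferred from the degrees and neighbor descriptions in this chapter:

```python
import numpy as np

# Row normalization vs symmetric normalization.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
d = A_tilde.sum(axis=1)                          # (4, 3, 4, 3)

A_row = np.diag(1.0 / d) @ A_tilde               # rows sum to 1, but asymmetric
A_sym = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)

print(np.allclose(A_row, A_row.T))   # False: row normalization breaks symmetry
print(np.allclose(A_sym, A_sym.T))   # True: symmetric form keeps a real spectrum
```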
§6 Full Numerical Walkthrough: One GCN Layer
Now let's run a complete GCN layer on our graph, tracking every number. We'll use 2-dimensional input features (for readability) and map to 3-dimensional outputs.
Setup: Initial Features
Assign each vertex a 2D feature vector — think of these as (normalized temperature, thermal conductivity):
Step 1: Aggregate — $\hat{\mathbf{A}} \cdot \mathbf{H}^{(0)}$
Multiplying $\hat{\mathbf{A}} \cdot \mathbf{H}^{(0)}$ computes a normalized weighted average of each vertex's own features and its neighbors' features. Let's trace vertex $v_0$ in full detail:
Repeating for all vertices, the full aggregated matrix is:
Notice that vertices $v_0$ and $v_2$ get identical aggregated features. This makes sense — they have the same degree (3) and the same set of neighbors ($\{v_0/v_2, v_1, v_3\}$ plus self), so the weighted average is the same. This is a feature of GCN's symmetric normalization: structurally equivalent vertices get identical representations.
Step 2: Linear Transform — $(\hat{\mathbf{A}} \mathbf{H}^{(0)}) \cdot \mathbf{W}$
The weight matrix $\mathbf{W}^{(0)} \in \mathbb{R}^{2 \times 3}$ maps from 2 features to 3 features. These are the learnable parameters — the only part of the GCN that training adjusts:
Step 3: Activation — $\sigma(\cdot)$
Apply ReLU element-wise: $\text{ReLU}(x) = \max(0, x)$. In our example, all pre-activation values happen to be positive, so nothing gets zeroed:
Each vertex now has a 3-dimensional feature vector that encodes both its own original features and information from its neighbors. Vertex $v_0$ "knows" something about $v_1$, $v_2$, and $v_3$, even though it never saw their features directly — it received them through the aggregation step.
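The three steps can be replayed end to end in NumPy. The feature and weight values below are hypothetical placeholders rather than the walkthrough's numbers, but the structural conclusion (identical rows for $v_0$ and $v_2$) holds for any input:

```python
import numpy as np

# Replaying the three steps: aggregate, transform, activate.
# Edge set inferred from earlier sections; H0 and W0 are hypothetical.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
d = A_tilde.sum(axis=1)
A_hat = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)

H0 = np.array([[0.9, 0.2],       # hypothetical (temperature, conductivity)
               [0.1, 0.4],
               [0.5, 0.8],
               [0.3, 0.6]])
W0 = np.array([[1.0, 0.0, 0.5],  # hypothetical 2 -> 3 weight matrix
               [0.0, 1.0, 0.5]])

H1 = np.maximum(0.0, A_hat @ H0 @ W0)   # Step 1 + Step 2 + Step 3 (ReLU)
print(np.allclose(H1[0], H1[2]))        # True: v0 and v2 are structurally equivalent
```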
The Complete Pipeline — One Picture
§7 Stacking Layers and the Receptive Field
A single layer gives each vertex access to its immediate neighbors. What happens with multiple layers?
A Larger Example: The Two-Community Graph
Our 4-node graph has diameter 2, so the receptive field saturates after just 2 layers — every vertex already sees everything. Let's see what happens on a larger graph where depth genuinely matters: a 32-node "two-community" graph with diameter ≈ 7.
Architecture: A 3-Layer GCN (1 → 3 → 3 → 1)
Each vertex starts with a single scalar feature (e.g., temperature). We stack three GCN layers in an "expand → refine → compress" pattern:
Tracking $v_0$'s Receptive Field Across 3 Layers
Let's follow vertex $v_0$ (deep inside cluster A) and track exactly which vertices contribute to its representation after each layer:
What the hidden dimension does: The 1 → 3 → 3 → 1 funnel isn't arbitrary. Layer 1 expands each scalar to a 3D vector so the network can learn three different "views" of the aggregated neighborhood — perhaps one channel capturing magnitude, another gradient, another curvature. Layer 2 mixes these views with the next ring of neighbors' 3D representations, allowing richer feature interactions. Layer 3 compresses back to a scalar prediction. The hidden dimension (3 here) controls the network's capacity — its ability to represent different mixing patterns. Too small and it can't distinguish neighborhoods; too large and it overfits.
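A sketch of the funnel as three stacked layers. Since the exact two-community graph isn't reproduced here, a 32-node ring stands in for it, and random weights stand in for trained parameters:

```python
import numpy as np

# The 1 -> 3 -> 3 -> 1 funnel as three stacked GCN layers.
def gcn_layer(A_hat, H, W):
    return np.maximum(0.0, A_hat @ H @ W)

n = 32
A = np.zeros((n, n))
for i in range(n):                          # ring: each vertex has 2 neighbors
    A[i, (i - 1) % n] = A[i, (i + 1) % n] = 1.0
A_tilde = A + np.eye(n)
d = A_tilde.sum(axis=1)
A_hat = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)

rng = np.random.default_rng(0)
H = rng.normal(size=(n, 1))                 # one scalar feature per vertex
W1 = rng.normal(size=(1, 3))                # expand
W2 = rng.normal(size=(3, 3))                # refine
W3 = rng.normal(size=(3, 1))                # compress

out = gcn_layer(A_hat, gcn_layer(A_hat, gcn_layer(A_hat, H, W1), W2), W3)
print(out.shape)                            # (32, 1): one prediction per vertex
```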
Feature Smoothing and the Heat Diffusion Connection
Each application of $\hat{\mathbf{A}}$ averages a vertex's features with its neighbors' — this is literally a diffusion step. Let's visualize what happens to the feature distribution across vertices as we stack layers:
The oversmoothing problem: If we stacked 10 layers on this 32-node graph, every vertex would see the entire graph multiple times over. All representations would converge toward the graph-wide average — vertices in cluster A would become indistinguishable from vertices in cluster B. This is the oversmoothing problem, and it's why most practical GCNs use only 2–4 layers. The learnable weights $\mathbf{W}$ can slow the convergence but cannot fully prevent it when the number of layers far exceeds the graph diameter.
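Oversmoothing is easy to demonstrate with pure aggregation (no weights, no activation): repeatedly applying $\hat{\mathbf{A}}$ to distinct features collapses them toward a common value. A 6-vertex cycle is used here purely for illustration:

```python
import numpy as np

# Oversmoothing in miniature on a 6-vertex cycle.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i - 1) % n] = A[i, (i + 1) % n] = 1.0
A_tilde = A + np.eye(n)
A_hat = A_tilde / 3.0            # 3-regular with self-loops: normalization = divide by 3

h = np.linspace(0.0, 1.0, n)     # six distinct starting features
spread0 = h.std()
for _ in range(20):              # twenty "layers" of pure aggregation
    h = A_hat @ h

print(spread0, h.std())          # the spread collapses toward zero
```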
Summary: Small Graph vs. Large Graph
| Aspect | 4-node example (§6) | 32-node example (§7) |
|---|---|---|
| Graph diameter | 2 | ≈ 7 |
| Layers to saturate receptive field | 2 | 7+ |
| Layers used | 1 | 3 |
| Coverage after all layers | 100% (global) | ≈ 50% (still local) |
| Feature dimensions | 2 → 3 | 1 → 3 → 3 → 1 |
| Key lesson | GCN mechanics | Depth vs. receptive field trade-off |
§8 The GCN–Laplacian Connection
There's a deep connection between the GCN layer and the Laplacian from Chapter 1. The adjacency acts as a diffusion step, and the GCN normalization is a specific choice of diffusion rate: defining the normalized Laplacian with self-loops as $\hat{\mathbf{L}}_0 = \mathbf{I} - \hat{\mathbf{A}}$, we have
$$\hat{\mathbf{A}} = \mathbf{I} - \hat{\mathbf{L}}_0$$
So $\hat{\mathbf{A}}\mathbf{H} = (\mathbf{I} - \hat{\mathbf{L}}_0)\mathbf{H} = \mathbf{H} - \hat{\mathbf{L}}_0\mathbf{H}$. This is exactly one step of heat diffusion: the output is the input minus a Laplacian-weighted correction. High-frequency components (large eigenvalues of $\hat{\mathbf{L}}_0$) get attenuated more — the GCN layer is a low-pass filter.
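Both claims are cheap to verify numerically on the 4-vertex example (edge set inferred from earlier sections): $\hat{\mathbf{A}}$ and $\hat{\mathbf{L}}_0$ sum to the identity, and $\hat{\mathbf{A}}$'s eigenvalues are $1 - \lambda_i$, so high-frequency eigenvectors get the smallest gain:

```python
import numpy as np

# Verify A_hat = I - L_hat and inspect the spectrum.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
d = A_tilde.sum(axis=1)
A_hat = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)
L_hat = np.eye(4) - A_hat

lam = np.linalg.eigvalsh(L_hat)   # Laplacian frequencies, ascending
print(np.round(1.0 - lam, 3))     # eigenvalues of A_hat: 1 - lambda_i
# The lowest frequency (lambda = 0) passes through with gain 1; higher
# frequencies are multiplied by smaller factors: a low-pass filter.
```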
The bridge to TDL: A GCN layer uses $\hat{\mathbf{A}}$ (derived from the graph Laplacian $\mathbf{L}_0$) to pass messages between vertices along edges. In Chapters 3–5, we'll build boundary operators $\mathbf{B}_{1,2}$, $\mathbf{B}_{2,3}$, ... that connect edges to faces to volumes. The higher-order message passing framework replaces $\hat{\mathbf{A}}$ with Hodge Laplacians and incidence matrices, enabling messages to flow between objects of any rank — not just between adjacent vertices. That's the core innovation of topological deep learning.