Chapter 02

Message Passing on Graphs

From feedforward networks to graph neural networks — the GCN layer deconstructed with a full numerical walkthrough on our running example.

In this chapter
  1. Quick Review: Feedforward Neural Networks
  2. From Feedforward to Graph
  3. Building the Normalized Adjacency
  4. Full Numerical Walkthrough
  5. Stacking Layers and the Receptive Field
  6. The GCN-Laplacian Connection

Quick Review: Feedforward Neural Networks

Before diving into GNNs, let's recall the simplest neural network architecture — the feedforward network (multilayer perceptron). If this is already familiar, skim through; the purpose is to establish the notation and identify exactly what changes when we move to graphs.

A Single Feedforward Layer

Given an input vector $\mathbf{x} \in \mathbb{R}^{d_{\text{in}}}$, one layer computes:

$$\mathbf{y} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$$

where $\mathbf{W} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is a learnable weight matrix, $\mathbf{b} \in \mathbb{R}^{d_{\text{out}}}$ is a learnable bias, and $\sigma$ is a nonlinear activation function (ReLU, tanh, etc.).
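As a concrete reference point, here is a minimal NumPy sketch of such a layer (the shapes and values are illustrative, not taken from the running example):

```python
import numpy as np

def feedforward_layer(x, W, b):
    """One feedforward layer: y = sigma(W x + b), with sigma = ReLU."""
    return np.maximum(0.0, W @ x + b)

rng = np.random.default_rng(0)
d_in, d_out = 3, 2
W = rng.standard_normal((d_out, d_in))  # learnable weights
b = np.zeros(d_out)                     # learnable bias
x = np.array([1.0, 0.5, -0.3])          # one input vector

y = feedforward_layer(x, W, b)
print(y.shape)  # (2,)
```

Note that the function sees one vector at a time: nothing in it could ever mix information between two different inputs.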

[Figure 3.1: a feedforward layer. Input x = (x₁, x₂, x₃), weights W and bias b, then σ(·) (ReLU), giving output y = (y₁, y₂). The key ingredients: 1. linear transform (W, b); 2. nonlinearity (σ). What's missing? Each input is processed independently; there is no communication between nodes.]
Figure 3.1. A single feedforward layer transforms each input independently: linear map + nonlinearity. If we applied this to each vertex of a graph, vertex 0 would have no idea what vertex 1's features look like. A GNN adds exactly one ingredient: aggregation from neighbors.

The critical observation: a feedforward network applied to graph vertices would process each vertex in isolation. Vertex $v_0$ at 50°C would have no idea that its neighbor $v_1$ is at 20°C. This is useless for tasks like heat conduction or social influence — where the whole point is that neighbors affect each other.


From Feedforward to Graph: Adding Neighborhood Aggregation

A graph neural network adds one fundamental operation before the linear transform: aggregate features from neighbors. Here is the general message passing framework:

GNN Message Passing — One Layer (General Form)
$$\mathbf{h}_v^{(\ell+1)} = \phi\!\left(\mathbf{h}_v^{(\ell)},\; \bigoplus_{w \in \mathcal{N}(v)} \psi\!\left(\mathbf{h}_v^{(\ell)}, \mathbf{h}_w^{(\ell)}\right)\right)$$

Three components:

$\psi$ = message function: computes a "message" from neighbor $w$ to node $v$
$\bigoplus$ = aggregation: combines all incoming messages (sum, mean, max, ...)
$\phi$ = update function: combines old features with aggregated messages to produce new features
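A minimal sketch of this framework in NumPy, using sum as $\bigoplus$ and deliberately simple choices for $\psi$ and $\phi$ (the functions are illustrative; in a trained GNN they would contain learnable parameters):

```python
import numpy as np

# The running 4-node graph as an adjacency list, with 2-D features.
neighbors = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
h = {0: np.array([1.0, 0.2]), 1: np.array([0.4, 0.8]),
     2: np.array([0.7, 0.3]), 3: np.array([0.2, 0.9])}

def psi(h_v, h_w):
    """Message function: here, simply forward the neighbor's features."""
    return h_w

def phi(h_v, m):
    """Update function: here, average old features with the messages."""
    return 0.5 * (h_v + m)

# One layer of message passing; the sum over neighbors is the aggregation.
h_next = {v: phi(h[v], sum(psi(h[v], h[w]) for w in neighbors[v]))
          for v in neighbors}
```

For vertex 1, for instance, this computes $\phi(h_1, h_0 + h_2)$: old features blended with the summed messages from its two neighbors.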

This looks abstract, so let's see the most concrete and widely-used instantiation: the Graph Convolutional Network (GCN) of Kipf & Welling (2017).

The GCN Layer — Making it Concrete

In a GCN, the message, aggregation, and update functions are collapsed into a single elegant matrix operation:

GCN Layer (Kipf & Welling, 2017)
$$\mathbf{H}^{(\ell+1)} = \sigma\!\left(\hat{\mathbf{A}} \, \mathbf{H}^{(\ell)} \, \mathbf{W}^{(\ell)}\right)$$

where:

$\mathbf{H}^{(\ell)} \in \mathbb{R}^{n \times d_\ell}$ = feature matrix (row $v$ = feature vector of vertex $v$ at layer $\ell$)
$\mathbf{W}^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell+1}}$ = learnable weight matrix (shared across all vertices)
$\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$ = normalized adjacency with self-loops
$\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ = adjacency + self-loops,   $\tilde{\mathbf{D}}_{ii} = \sum_j \tilde{A}_{ij}$
$\sigma$ = activation function (typically ReLU)

Connecting to feedforward intuition: Think of the feature matrix $\mathbf{H}$ as a batch of $n$ samples, one per vertex. Without any graph structure ($\mathbf{A} = 0$, so $\hat{\mathbf{A}} = \mathbf{I}$), the GCN equation reduces to $\sigma(\mathbf{I} \cdot \mathbf{H} \cdot \mathbf{W}) = \sigma(\mathbf{H}\mathbf{W})$ — each row of $\mathbf{H}$ gets multiplied by $\mathbf{W}$ independently. That's just a standard feedforward layer applied to each vertex as if it were a separate sample in a batch. No vertex knows any other vertex exists.

Now add edges. The matrix $\hat{\mathbf{A}}$ is no longer the identity — it has off-diagonal entries wherever edges exist. Multiplying $\hat{\mathbf{A}}\mathbf{H}$ mixes the rows of $\mathbf{H}$ together before the linear transform. It's as if the samples in your batch started talking to each other: "Hey, I'm at 50°C, what about you?" The graph structure determines who talks to whom, and the normalization determines how much each neighbor's voice counts.

In short: a GCN is a feedforward network where the batch samples are allowed to exchange information along edges before each layer.
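This reduction is easy to verify. A sketch (random values, illustrative shapes) showing that with $\hat{\mathbf{A}} = \mathbf{I}$ the GCN layer is exactly a per-vertex feedforward layer:

```python
import numpy as np

def gcn_layer(A_hat, H, W):
    """One GCN layer: ReLU(A_hat @ H @ W)."""
    return np.maximum(0.0, A_hat @ H @ W)

rng = np.random.default_rng(1)
n, d_in, d_out = 4, 2, 3
H = rng.standard_normal((n, d_in))
W = rng.standard_normal((d_in, d_out))

# No edges: A_hat = I, so the layer ignores the graph entirely and each
# row of H is transformed on its own, like a sample in a batch.
out = gcn_layer(np.eye(n), H, W)
assert np.allclose(out, np.maximum(0.0, H @ W))
```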

The other extreme — a fully connected graph: What if every node is connected to every other? With self-loops added, every $\tilde{d}_i = n$ (degree $n-1$ plus the self-loop), so the symmetric normalization gives $\hat{A}_{ij} = 1/n$ for every entry. Now $\hat{\mathbf{A}}\mathbf{H}$ replaces each vertex's features with the uniform average over all vertices. Every vertex gets the same aggregated representation — all local structure is erased.

Physicists will recognize this as the mean field approximation. In the Ising model, each spin interacts with specific neighbors through the lattice. Mean field theory replaces those local interactions with a single effective field: the average over all spins. This is mathematically equivalent to putting the system on a fully connected graph — every spin talks to every other spin equally. The approximation becomes exact in the limit of infinite neighbors (infinite dimensions), precisely because the true local field converges to the global average.

This gives us three regimes to reason about what $\hat{\mathbf{A}}$ does:

No edges ($\hat{\mathbf{A}} = \mathbf{I}$): independent particles — each vertex ignores all others
Sparse graph ($\hat{\mathbf{A}}$ has structure): real interactions — local neighborhood determines each vertex's update
Fully connected ($\hat{\mathbf{A}} = \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$): mean field — every vertex sees the same global average, local structure is lost

The power of GNNs lies in the middle regime: the graph encodes which interactions matter, preserving local structure that mean field theory deliberately throws away.
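The mean-field regime can be checked directly; a small sketch using the feature matrix from Figure 6.1:

```python
import numpy as np

n = 4
A_hat_mf = np.full((n, n), 1.0 / n)   # fully connected: every entry 1/n
H = np.array([[1.0, 0.2], [0.4, 0.8], [0.7, 0.3], [0.2, 0.9]])

mixed = A_hat_mf @ H
# Every row is now the same global average, (0.575, 0.55):
# local structure is erased in a single step.
assert np.allclose(mixed, H.mean(axis=0))
```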

Let's unpack what each piece does. Reading the equation left to right:

[Figure 4.1: reading σ(Â · H · W) left to right. ① Â·H: mix each vertex with its neighbors via a normalized weighted average; this is the graph part, structure-aware aggregation. ② (result)·W: transform each vertex's aggregated features, d_in → d_out; same as feedforward, a linear map per vertex. ③ σ(·): element-wise nonlinearity (ReLU keeps positives, zeros out negatives); same as feedforward.]
Figure 4.1. The three steps of a GCN layer. Step ① is the only step that uses the graph structure — it mixes each vertex's features with its neighbors'. Step ② is a standard feedforward transform (same weights for every vertex). Step ③ is a standard nonlinearity. Note: because matrix multiplication is associative, $\hat{\mathbf{A}}(\mathbf{H}\mathbf{W}) = (\hat{\mathbf{A}}\mathbf{H})\mathbf{W}$ — you could equally transform first and aggregate second with the same result. The aggregate-first order shown here is the standard convention.

Building $\hat{\mathbf{A}}$: The Normalized Adjacency

Before we can run a GCN layer, we need to construct $\hat{\mathbf{A}}$. Let's build it step by step for our Figure 2.1 graph.

Step 1: Add Self-Loops

Without self-loops, the aggregation would only look at neighbors, forgetting the vertex's own features. Adding $\mathbf{I}$ ensures each vertex also "sends a message to itself":

$$\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I} = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix} + \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \end{pmatrix}$$

Row sums give $\tilde{\mathbf{d}} = (4, 3, 4, 3)$: the original degree plus 1 for the self-loop.
Figure 5.1. Adding self-loops. The diagonal of $\mathbf{A}$ was zero (a vertex isn't its own neighbor); adding $\mathbf{I}$ puts 1s on the diagonal. Now each row sums to $\deg(v) + 1$.

Step 2: Symmetric Normalization

Without normalization, high-degree vertices would dominate (they sum more neighbors). The symmetric normalization $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$ produces a weighted average where each edge is downweighted by both endpoint degrees:

Normalization coefficient
$$\hat{A}_{ij} = \frac{\tilde{A}_{ij}}{\sqrt{\tilde{d}_i}\sqrt{\tilde{d}_j}}$$

What does $\mathbf{D}^{-1/2}$ actually mean? This notation looks intimidating, but for a diagonal matrix it's trivial. A diagonal matrix $\mathbf{D}$ only has nonzero entries on the diagonal: $D_{ii} = d_i$. Every matrix operation reduces to a scalar operation on each diagonal entry independently:

$\mathbf{D}^{1/2}$: take the square root → $D^{1/2}_{ii} = \sqrt{d_i}$
$\mathbf{D}^{-1}$: take the reciprocal → $D^{-1}_{ii} = 1/d_i$
$\mathbf{D}^{-1/2}$: combine both → $D^{-1/2}_{ii} = 1/\sqrt{d_i}$

For our graph with $\tilde{d} = (4, 3, 4, 3)$:

$\tilde{\mathbf{D}}^{-1/2} = \text{diag}\!\left(\frac{1}{\sqrt{4}},\; \frac{1}{\sqrt{3}},\; \frac{1}{\sqrt{4}},\; \frac{1}{\sqrt{3}}\right) = \text{diag}(0.500,\; 0.577,\; 0.500,\; 0.577)$

This only works because $\mathbf{D}$ is diagonal. For a general matrix, $\mathbf{M}^{-1/2}$ requires eigendecomposition — but degree matrices are always diagonal, so we never need that here.
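Putting steps 1 and 2 together, a short sketch that builds $\hat{\mathbf{A}}$ for the running 4-node graph:

```python
import numpy as np

A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

A_tilde = A + np.eye(4)                       # step 1: add self-loops
d_tilde = A_tilde.sum(axis=1)                 # [4, 3, 4, 3]
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))  # diagonal: just 1/sqrt(d_i)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # step 2: normalize

print(np.round(A_hat, 3))
```

The printed matrix matches Figure 5.2: 0.250 on the degree-4 diagonal entries, 0.333 on the degree-3 ones, 0.289 on the cross terms.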

Normalization coefficients: $\hat{A}_{ij} = 1/(\sqrt{\tilde{d}_i}\,\sqrt{\tilde{d}_j})$ wherever $\tilde{A}_{ij} = 1$. For row 0:

$\hat{A}_{00} = 1/(\sqrt{4}\cdot\sqrt{4}) = 1/4 = 0.250$ (self)
$\hat{A}_{01} = 1/(\sqrt{4}\cdot\sqrt{3}) = 1/(2\sqrt{3}) \approx 0.289$ ($v_0 \leftrightarrow v_1$)
$\hat{A}_{02} = 1/(\sqrt{4}\cdot\sqrt{4}) = 1/4 = 0.250$ ($v_0 \leftrightarrow v_2$)
$\hat{A}_{03} = 1/(\sqrt{4}\cdot\sqrt{3}) \approx 0.289$ ($v_0 \leftrightarrow v_3$)

The row sum for $v_0$ is $0.250 + 0.289 + 0.250 + 0.289 = 1.077 \approx 1$ (an approximate average). The full matrix:

$$\hat{\mathbf{A}} = \begin{pmatrix} 0.250 & 0.289 & 0.250 & 0.289 \\ 0.289 & 0.333 & 0.289 & 0 \\ 0.250 & 0.289 & 0.250 & 0.289 \\ 0.289 & 0 & 0.289 & 0.333 \end{pmatrix}$$
Figure 5.2. The normalized adjacency matrix $\hat{\mathbf{A}}$. Each entry is $1/\sqrt{\tilde{d}_i \tilde{d}_j}$ where a connection exists, 0 otherwise. Row sums are approximately 1, so $\hat{\mathbf{A}}$ acts as a soft averaging operator. The degree-3 vertices (v1, v3) get slightly higher self-weight (0.333 vs 0.250) because they have fewer neighbors to dilute over.

Why symmetric? Left-multiplying by $\tilde{\mathbf{D}}^{-1}$ alone (i.e., simple row-normalization) would give an exact row-stochastic average, but it breaks the symmetry of the operator. The symmetric form $\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$ keeps $\hat{\mathbf{A}}$ symmetric, which preserves the spectral theory from Chapter 1: the eigenvalues are real, the eigenvectors are orthogonal, and the connection to the normalized Laplacian ($\mathbf{L}_0 = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$ in the self-loop-free case) is maintained.


Full Numerical Walkthrough: One GCN Layer

Now let's run a complete GCN layer on our graph, tracking every number. We'll use 2-dimensional input features (for readability) and map to 3-dimensional outputs.

Setup: Initial Features

Assign each vertex a 2D feature vector — think of these as (normalized temperature, thermal conductivity):

$$\mathbf{H}^{(0)} = \begin{pmatrix} 1.0 & 0.2 \\ 0.4 & 0.8 \\ 0.7 & 0.3 \\ 0.2 & 0.9 \end{pmatrix} \in \mathbb{R}^{4 \times 2}$$

Row $v$ holds vertex $v$'s features $(f_1, f_2)$, from $v_0$ at the top to $v_3$ at the bottom.
Figure 6.1. The initial features. Each vertex carries a 2D vector. The feature matrix $\mathbf{H}^{(0)}$ stacks all vertex features into a $4 \times 2$ matrix.

Step 1: Aggregate — $\hat{\mathbf{A}} \cdot \mathbf{H}^{(0)}$

Multiplying $\hat{\mathbf{A}} \cdot \mathbf{H}^{(0)}$ computes a normalized weighted average of each vertex's own features and its neighbors' features. Let's trace vertex $v_0$ in full detail:

The aggregated feature for $v_0$ is row 0 of $\hat{\mathbf{A}} \cdot \mathbf{H}^{(0)}$; $v_0$ receives from all 3 neighbors plus itself:

$$\hat{\mathbf{h}}(v_0) = 0.250\,\mathbf{h}(v_0) + 0.289\,\mathbf{h}(v_1) + 0.250\,\mathbf{h}(v_2) + 0.289\,\mathbf{h}(v_3)$$

self: $0.250 \times [1.0, 0.2] = [0.250, 0.050]$
from $v_1$: $0.289 \times [0.4, 0.8] = [0.115, 0.231]$
from $v_2$: $0.250 \times [0.7, 0.3] = [0.175, 0.075]$
from $v_3$: $0.289 \times [0.2, 0.9] = [0.058, 0.260]$
sum: $[0.598, 0.616]$

The input feature for $v_0$ was $[1.0, 0.2]$; after aggregation it is $[0.598, 0.616]$. Feature 1 dropped (diluted by lower-valued neighbors) while feature 2 rose (pulled up by higher-valued neighbors).
Figure 6.2. Tracing vertex $v_0$ through the aggregation step. Each neighbor contributes its feature vector, weighted by the normalization coefficient. The result is a smoothed feature that blends $v_0$'s own values with its neighborhood — exactly the kind of local averaging that makes GNNs powerful for tasks like heat prediction or community detection.

Repeating for all vertices, the full aggregated matrix is:

Aggregated features: $\hat{\mathbf{A}} \cdot \mathbf{H}^{(0)}$
$$\hat{\mathbf{A}} \cdot \mathbf{H}^{(0)} = \begin{pmatrix} 0.598 & 0.616 \\ 0.624 & 0.411 \\ 0.598 & 0.616 \\ 0.557 & 0.444 \end{pmatrix}$$

Notice that vertices $v_0$ and $v_2$ get identical aggregated features. This makes sense: both have degree 3 and the same closed neighborhood $\{v_0, v_1, v_2, v_3\}$ (each is the other's neighbor, and both also connect to $v_1$ and $v_3$), so the weighted averages coincide. This is a general property of message passing with symmetric normalization: structurally equivalent vertices get identical representations.

Step 2: Linear Transform — $(\hat{\mathbf{A}} \mathbf{H}^{(0)}) \cdot \mathbf{W}$

The weight matrix $\mathbf{W}^{(0)} \in \mathbb{R}^{2 \times 3}$ maps from 2 features to 3 features. These are the learnable parameters — the only part of the GCN that training adjusts:

The weight matrix, learned during training:

$$\mathbf{W}^{(0)} = \begin{pmatrix} 0.5 & -0.3 & 0.8 \\ -0.2 & 0.7 & 0.4 \end{pmatrix}$$

(row 1 weights input feature 1, row 2 weights input feature 2). For vertex $v_0$, $\hat{\mathbf{h}}(v_0) \cdot \mathbf{W} = [0.598, 0.616] \cdot \mathbf{W}$:

out₁ = 0.598·(0.5) + 0.616·(−0.2) = 0.299 − 0.123 = 0.176
out₂ = 0.598·(−0.3) + 0.616·(0.7) = −0.179 + 0.431 = 0.252
out₃ = 0.598·(0.8) + 0.616·(0.4) = 0.478 + 0.246 = 0.725
Figure 6.3. The linear transform projects the aggregated 2D features into a 3D output space. The same weight matrix is applied at every vertex — this is the parameter sharing that makes GNNs efficient and equivariant.

Step 3: Activation — $\sigma(\cdot)$

Apply ReLU element-wise: $\text{ReLU}(x) = \max(0, x)$. In our example, all pre-activation values happen to be positive, so nothing gets zeroed:

Final output: $\mathbf{H}^{(1)} = \text{ReLU}(\hat{\mathbf{A}} \mathbf{H}^{(0)} \mathbf{W}^{(0)})$
$$\mathbf{H}^{(1)} = \text{ReLU}\begin{pmatrix} 0.176 & 0.252 & 0.725 \\ 0.230 & 0.101 & 0.664 \\ 0.176 & 0.252 & 0.725 \\ 0.190 & 0.144 & 0.624 \end{pmatrix} = \begin{pmatrix} 0.176 & 0.252 & 0.725 \\ 0.230 & 0.101 & 0.664 \\ 0.176 & 0.252 & 0.725 \\ 0.190 & 0.144 & 0.624 \end{pmatrix}$$

Each vertex now has a 3-dimensional feature vector that encodes both its own original features and information from its neighbors. Vertex $v_0$ "knows" something about $v_1$, $v_2$, and $v_3$, even though it never saw their features directly — it received them through the aggregation step.
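The entire walkthrough fits in a few lines of NumPy; a sketch reproducing the numbers above (matrices from Figures 5.2, 6.1, and 6.3):

```python
import numpy as np

# Normalized adjacency of the running graph, built as in Section 5.
A = np.array([[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 1], [1, 0, 1, 0]], float)
A_tilde = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

H0 = np.array([[1.0, 0.2], [0.4, 0.8], [0.7, 0.3], [0.2, 0.9]])
W0 = np.array([[0.5, -0.3, 0.8],
               [-0.2, 0.7, 0.4]])

agg = A_hat @ H0                 # step 1: aggregate neighbors
H1 = np.maximum(0.0, agg @ W0)   # steps 2 and 3: transform, then ReLU

print(np.round(agg, 3))  # row 0: [0.598 0.616]
print(np.round(H1, 3))   # row 0: [0.176 0.252 0.725]
```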

The Complete Pipeline — One Picture

[Figure 6.4 pipeline: H⁰ (4×2, each vertex isolated) → Â· → Â·H⁰ (4×2, neighbors mixed in) → ×W → (Â·H⁰)·W (4×3, projected to new dimension) → ReLU → H¹ (4×3, nonlinearity applied). After 1 layer, each vertex knows about its immediate neighbors (1-hop); after 2 layers, 2-hop; after k layers, k-hop. This is the "receptive field."]
Figure 6.4. The complete data flow of one GCN layer. The 4×2 input is aggregated (mixed with neighbors), projected (2D → 3D by the learned weight matrix), and passed through ReLU. After one layer, each vertex's representation encodes its 1-hop neighborhood.

Stacking Layers and the Receptive Field

A single layer gives each vertex access to its immediate neighbors. What happens with multiple layers?

[Figure 7.1: receptive field growth for v₀. Layer 0 (input): v₀ sees only itself. After 1 layer: 1-hop (v₁, v₂, v₃). After 2 layers: the entire graph. This graph has diameter 2, so 2 layers suffice for global information; too many layers lead to "oversmoothing".]
Figure 7.1. Receptive field growth. After $k$ GCN layers, each vertex has aggregated information from all vertices within $k$ hops. On our small graph (diameter 2), two layers already give every vertex a global view. On large graphs, this creates the classic GNN trade-off: more layers = wider receptive field, but also more risk of "oversmoothing" where all vertex representations converge.
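The k-hop claim can be checked mechanically: with self-loops, the $(i, j)$ entry of $\tilde{\mathbf{A}}^k$ is nonzero exactly when $j$ is within $k$ hops of $i$. A sketch on the running graph:

```python
import numpy as np

# A + I for the running 4-node graph (Figure 5.1).
A_tilde = np.array([[1, 1, 1, 1],
                    [1, 1, 1, 0],
                    [1, 1, 1, 1],
                    [1, 0, 1, 1]], dtype=float)

reach1 = A_tilde > 0                             # 1-hop receptive fields
reach2 = np.linalg.matrix_power(A_tilde, 2) > 0  # 2-hop receptive fields

# v1 and v3 are not adjacent, so after one layer v1 does not yet see v3...
assert not reach1[1, 3]
# ...but the diameter is 2, so two layers cover the whole graph.
assert reach2.all()
```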

A Larger Example: The Two-Community Graph

Our 4-node graph has diameter 2, so the receptive field saturates after just 2 layers — every vertex already sees everything. Let's see what happens on a larger graph where depth genuinely matters: a 32-node "two-community" graph with diameter ≈ 7.

[Figure 7.2: two-community graph, 32 nodes, ~60 edges. Cluster A (v₀–v₁₅) and cluster B (v₁₆–v₃₁), 16 nodes per cluster, average degree ≈ 4 within a cluster, 2 bridge edges between clusters. Diameter ≈ 7: a path must traverse one cluster, cross a bridge, then traverse the other.]
Figure 7.2. A two-community graph with 32 vertices. Each cluster has dense internal connections (avg degree ~4), but only two bridge edges connect the clusters. Target vertex $v_0$ (highlighted, deep inside cluster A) would need 7+ hops to reach the far side of cluster B. This makes the receptive field concept non-trivial: 3 layers is not enough for $v_0$ to see the whole graph.

Architecture: A 3-Layer GCN (1 → 3 → 3 → 1)

Each vertex starts with a single scalar feature (e.g., temperature). We stack three GCN layers in an "expand → refine → compress" pattern:

[Figure 7.3: tensor shapes through the 3-layer pipeline. Input H⁰ (32×1) → layer 1 "expand" (Â, then W⁰: 1→3; output 32×3) → layer 2 "refine" (Â, then W¹: 3→3; output 32×3) → layer 3 "compress" (Â, then W²: 3→1; output 32×1), with σ after each layer.]
Figure 7.3. Data flow through 3 GCN layers. Each layer applies the same $\hat{\mathbf{A}}$ (the 32×32 normalized adjacency — fixed, not learned) followed by a different $\mathbf{W}^{(\ell)}$ (learned). The weight matrices control the dimension changes: $\mathbf{W}^{(0)} \in \mathbb{R}^{1 \times 3}$ expands from scalar to 3D, $\mathbf{W}^{(1)} \in \mathbb{R}^{3 \times 3}$ refines within 3D, and $\mathbf{W}^{(2)} \in \mathbb{R}^{3 \times 1}$ compresses back to a scalar prediction.

Tracking $v_0$'s Receptive Field Across 3 Layers

Let's follow vertex $v_0$ (deep inside cluster A) and track exactly which vertices contribute to its representation after each layer:

[Figure 7.4: v₀'s receptive field after each layer. Input (layer 0): 1 vertex, v₀ knows only itself. After layer 1: ~5 vertices (1-hop, all in cluster A). After layer 2: ~12–15 vertices (2-hop, still cluster A only). After layer 3: ~18–20 vertices (3-hop, just crossing the bridge v₇ → v₁₆ into cluster B). With diameter ≈ 7, v₀ would need 7+ layers to see the entire graph; 3 layers give ≈ 50% coverage. Contrast with the 4-node graph, where diameter 2 meant 2 layers gave 100% coverage. Partial coverage is often a feature, not a bug: for tasks like "predict v₀'s community label," local information may be all you need.]
Figure 7.4. Receptive field of $v_0$ after each GCN layer on the two-community graph. Layer 1 reaches direct neighbors (~5 vertices). Layer 2 extends to 2-hop (~12–15 vertices, still entirely within cluster A). Layer 3 finally reaches the bridge vertex $v_7$, crossing into cluster B for the first time. With diameter ≈ 7, three layers give only partial coverage — and for many tasks, this local view is sufficient and even preferable to a global one.

What the hidden dimension does: The 1 → 3 → 3 → 1 funnel isn't arbitrary. Layer 1 expands each scalar to a 3D vector so the network can learn three different "views" of the aggregated neighborhood — perhaps one channel capturing magnitude, another gradient, another curvature. Layer 2 mixes these views with the next ring of neighbors' 3D representations, allowing richer feature interactions. Layer 3 compresses back to a scalar prediction. The hidden dimension (3 here) controls the network's capacity — its ability to represent different mixing patterns. Too small and it can't distinguish neighborhoods; too large and it overfits.
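The shape bookkeeping of the 1 → 3 → 3 → 1 pipeline can be sketched as follows. The chapter's two-community edge list is not reproduced here, so a simple 32-node ring stands in as a placeholder graph; only the tensor shapes are the point:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32

# Placeholder graph: a 32-node ring (the two-community edge list from
# Figure 7.2 is not reproduced here).
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

A_tilde = A + np.eye(n)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # fixed, not learned

# Expand -> refine -> compress: W shapes (1x3), (3x3), (3x1), all learned.
weights = [rng.standard_normal((1, 3)),
           rng.standard_normal((3, 3)),
           rng.standard_normal((3, 1))]

H = rng.standard_normal((n, 1))              # one scalar feature per vertex
for W in weights:
    H = np.maximum(0.0, A_hat @ H @ W)       # one GCN layer per W

print(H.shape)  # (32, 1)
```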

Feature Smoothing and the Heat Diffusion Connection

Each application of $\hat{\mathbf{A}}$ averages a vertex's features with its neighbors' — this is literally a diffusion step. Let's visualize what happens to the feature distribution across vertices as we stack layers:

[Figure 7.5: feature smoothing across layers, one bar per vertex. Layer 0 (input): high variance, features are diverse. After layer 1: reduced variance, local neighborhoods smoothed. After layer 2: low variance, cluster-level patterns. After layer 3: very low variance, differences washing out. Physics connection: this IS heat diffusion. The aggregation Â·H diffuses signals across the graph; each layer is one time step. The learnable weights W choose what to amplify or suppress at each step, but Â always smooths. Without W, stacking k layers would give Â^k·H, literally the heat kernel at time k on the graph.]
Figure 7.5. Feature smoothing across layers. Each bar shows one vertex's feature value. Layer 0: features are diverse (original signal). Layer 1: local averaging reduces variance within neighborhoods. Layer 2: broader smoothing produces cluster-level patterns. Layer 3: inter-cluster differences begin to erode. The learnable weights $\mathbf{W}$ partially counteract this smoothing by selectively amplifying useful patterns, but the trend toward uniformity is inherent in repeated averaging.

The oversmoothing problem: If we stacked 10 layers on this 32-node graph, every vertex would see the entire graph multiple times over. All representations would converge toward the graph-wide average — vertices in cluster A would become indistinguishable from vertices in cluster B. This is the oversmoothing problem, and it's why most practical GCNs use only 2–4 layers. The learnable weights $\mathbf{W}$ can slow the convergence but cannot fully prevent it when the number of layers far exceeds the graph diameter.
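Repeatedly applying $\hat{\mathbf{A}}$ alone (no weights, no nonlinearity) makes the convergence concrete. One nuance worth knowing: the limit of $\hat{\mathbf{A}}^k \mathbf{H}$ is proportional to $\sqrt{\tilde{d}}$, a degree-weighted version of the global average, so representations end up determined by degree alone rather than by the initial features. A sketch on the 4-node graph:

```python
import numpy as np

A = np.array([[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 1], [1, 0, 1, 0]], float)
A_tilde = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

H = np.array([[1.0], [0.4], [0.7], [0.2]])   # initial scalar features
diffs = []
for _ in range(50):                          # 50 "layers" of pure smoothing
    H_next = A_hat @ H
    diffs.append(np.linalg.norm(H_next - H))
    H = H_next

# Updates die out, and H aligns with sqrt(d_tilde): the initial features
# are forgotten -- oversmoothing in its purest form.
limit_dir = np.sqrt(A_tilde.sum(axis=1)).reshape(-1, 1)
cos = (H.ravel() @ limit_dir.ravel()) / (
    np.linalg.norm(H) * np.linalg.norm(limit_dir))
print(diffs[-1], cos)
```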

Summary: Small Graph vs. Large Graph

Aspect                             | 4-node example (§6) | 32-node example (§7)
-----------------------------------|---------------------|---------------------
Graph diameter                     | 2                   | ≈ 7
Layers to saturate receptive field | 2                   | 7+
Layers used                        | 1                   | 3
Coverage after all layers          | 100% (global)       | ≈ 50% (still local)
Feature dimensions                 | 2 → 3               | 1 → 3 → 3 → 1
Key lesson                         | GCN mechanics       | Depth vs. receptive field trade-off

The GCN–Laplacian Connection

There's a deep connection between the GCN layer and the Laplacian from Chapter 1. The original (unnormalized) adjacency acts as a diffusion step, and the GCN normalization is a specific choice of diffusion rate:

GCN as a spectral filter
$$\hat{\mathbf{A}} = \mathbf{I} - \hat{\mathbf{L}}_0 \qquad \text{where}\qquad \hat{\mathbf{L}}_0 = \mathbf{I} - \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$$

So $\hat{\mathbf{A}}\mathbf{H} = (\mathbf{I} - \hat{\mathbf{L}}_0)\mathbf{H} = \mathbf{H} - \hat{\mathbf{L}}_0\mathbf{H}$. This is exactly one step of heat diffusion: the output is the input minus a Laplacian-weighted correction. High-frequency components (large eigenvalues of $\hat{\mathbf{L}}_0$) get attenuated more — the GCN layer is a low-pass filter.
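This spectral claim can be verified numerically. For the running 4-node graph, the eigenvalues of $\hat{\mathbf{L}}_0$ work out to $\{0,\ 2/3,\ 1,\ 7/6\}$, and applying $\hat{\mathbf{A}} = \mathbf{I} - \hat{\mathbf{L}}_0$ scales each eigencomponent by $1 - \lambda$, shrinking the high-frequency components the most:

```python
import numpy as np

A = np.array([[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 1], [1, 0, 1, 0]], float)
A_tilde = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

L_hat = np.eye(4) - A_hat                 # normalized Laplacian w/ self-loops
evals = np.sort(np.linalg.eigvalsh(L_hat))
print(np.round(evals, 3))                 # eigenvalues 0, 2/3, 1, 7/6

# Applying A_hat multiplies the eigencomponent for eigenvalue lam by
# (1 - lam): larger lam (higher frequency) means a smaller gain.
gains = 1.0 - evals
print(np.round(gains, 3))
```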

The bridge to TDL: A GCN layer uses $\hat{\mathbf{A}}$ (derived from the graph Laplacian $\mathbf{L}_0$) to pass messages between vertices along edges. In Chapters 3–5, we'll build boundary operators $\mathbf{B}_{1,2}$, $\mathbf{B}_{2,3}$, ... that connect edges to faces to volumes. The higher-order message passing framework replaces $\hat{\mathbf{A}}$ with Hodge Laplacians and incidence matrices, enabling messages to flow between objects of any rank — not just between adjacent vertices. That's the core innovation of topological deep learning.