Chapter 2

From NMF to Variational Autoregressive Networks

Building the one-layer VAN from the NMF foundation — architecture, bias question, pairwise correlations, and training

In this chapter
  §5  The Autoregressive Decomposition
  §6  Bernoulli Conditionals and the One-Layer Architecture
  §7  The Lower-Triangular Weight Matrix
  §8  Why One Layer Captures Pairwise Correlations
  §9  Worked Numerical Example: N = 3 Ising Chain
  §10 The VAN Training Loop

§5 The Autoregressive Decomposition

In Chapter 1 we built the NMF ansatz by assuming that the variational distribution factorizes into independent single-site distributions: $q(s) = \prod_i q_i(s_i)$. This was convenient—it made every expectation tractable—but it killed all correlations between spins. Now we ask: can we write $q(s)$ in a form that is still tractable but allows correlations?

The answer comes from a basic identity in probability theory. By the chain rule of probability, any joint distribution over $N$ variables can be decomposed as a product of conditionals:

Autoregressive Factorization

Any joint distribution over $N$ binary variables can be written exactly as $$q(s_1, \ldots, s_N) = \prod_{i=1}^{N} q(s_i \mid s_1, \ldots, s_{i-1})$$ where $s_{<i} = (s_1, \ldots, s_{i-1})$ and $s_{<1} = \emptyset$.

This is not an approximation. It is an exact identity, valid for any probability distribution whatsoever. The product of conditionals reconstructs the full joint distribution by construction. So where does the approximation enter?

Where the approximation lives. All approximation enters through the parameterization of the conditionals $q_\theta(s_i \mid s_{<i})$, not through the factorization itself. If we could represent each conditional exactly, the autoregressive form would reproduce any target distribution. In practice, we parameterize the conditionals with a finite neural network, and that parameterization limits the family of distributions we can express.

The contrast with NMF is now clear. NMF makes each conditional independent of its predecessors:

$$\underbrace{q(s_i \mid s_{<i}) = q_i(s_i)}_{\text{NMF: ignore all predecessors}} \qquad \text{vs.} \qquad \underbrace{q_\theta(s_i \mid s_{<i}) = f_\theta(s_1, \ldots, s_{i-1})}_{\text{VAN: keep the dependence}}$$

NMF corresponds to the special case where the conditioning is discarded. VAN keeps the conditioning and parameterizes it with learnable weights. The autoregressive structure—each variable conditioned on all predecessors—is depicted in the causal dependency graph below.
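The identity is easy to verify by brute force for small $N$. The sketch below (the distribution and helper names are ours, purely illustrative) builds a random joint over three binary variables, forms each conditional by exact marginalization, and confirms that the product of conditionals reconstructs the joint:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# A random normalized joint distribution over N = 3 spins s_i in {+1, -1}.
N = 3
configs = list(itertools.product([+1, -1], repeat=N))
p = rng.random(len(configs))
p /= p.sum()
joint = dict(zip(configs, p))

def marginal(prefix):
    """Probability that the first len(prefix) spins equal `prefix`."""
    k = len(prefix)
    return sum(pi for s, pi in joint.items() if s[:k] == prefix)

# q(s) = prod_i q(s_i | s_<i), with q(s_i | s_<i) = marginal(s_<=i) / marginal(s_<i).
max_err = max(
    abs(np.prod([marginal(s[: i + 1]) / marginal(s[:i]) for i in range(N)]) - joint[s])
    for s in configs
)
print(f"max |product of conditionals - joint| = {max_err:.2e}")
```

The error is at machine precision, as it must be: no approximation has been made yet.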

Figure 2.1. Causal dependency structure for $N = 4$. Each variable can depend on all predecessors. The number of incoming arrows grows as $0, 1, 2, \ldots, N{-}1$, giving $N(N{-}1)/2$ directed edges in total.

§6 Bernoulli Conditionals and the One-Layer Architecture

Since each spin takes values $s_i \in \{+1, -1\}$, the conditional distribution $q_\theta(s_i \mid s_{<i})$ is a Bernoulli distribution characterized by a single number—the conditional magnetization:

$$\mu_i \equiv \avg{s_i}_{q(\cdot \mid s_{<i})} = \sum_{s_i = \pm 1} s_i\, q_\theta(s_i \mid s_{<i}).$$

The simplest nontrivial parameterization applies a hyperbolic tangent (tanh) activation to a linear function of the conditioning variables:

One-Layer Parameterization
$$\mu_i = \tanh\!\left(b_i + \sum_{j=1}^{i-1} W_{ij}\, s_j\right) \tag{32}$$

with $q_\theta(s_i \mid s_{<i}) = (1 + \mu_i\, s_i) / 2$. Here $b_i$ is a bias and $W_{ij}$ (for $j < i$) is the weight connecting the input $s_j$ to the output $\mu_i$. The autoregressive constraint is enforced structurally: $W_{ij} = 0$ for $j \geq i$ (site $i$ cannot see itself or future sites).

The bias question: paper vs. notes conventions. This is a subtle but important point. The VAN paper (Eq. 8) and our notes (Eq. 32) use different conventions that are related but not identical:

The VAN paper uses: $\hat{s}_i = \sigma\!\left(\sum_{j<i} W_{ij}\, s_j\right)$ with no biases, sigmoid activation $\sigma(x) = 1/(1+e^{-x})$, and spin encoding $s \in \{0, 1\}$.

Our notes use: $\mu_i = \tanh\!\left(b_i + \sum_{j<i} W_{ij}\, s_j\right)$ with biases, tanh activation, and spin encoding $s \in \{+1, -1\}$.

These are related by $s_i = 2\hat{s}_i - 1$ and $\tanh(x) = 2\sigma(2x) - 1$. The key behavioral difference is what happens when we set $W = 0$:

  • Setting $W = 0$ with biases gives $\mu_i = \tanh(b_i)$, which is a general NMF ansatz with arbitrary magnetizations $m_i = \tanh(b_i)$.
  • Setting $W = 0$ without biases (the paper) gives $\hat{s}_i = \sigma(0) = 0.5$, i.e., a uniform distribution ($m_i = 0$) for every spin.

Why the paper omits biases: For the symmetric Ising model ($h = 0$), the optimal NMF solution above $T_c$ has $m_i = 0$ anyway. Below $T_c$, the symmetry-broken solutions have $m_i = \pm m_0$, but the choice of sign is arbitrary. During gradient-based optimization, the network discovers the right baseline through the weights themselves. For models with external fields ($h \neq 0$), biases would matter and should be included.

The paper uses $N(N{-}1)/2$ parameters (weights only); with biases the count is $N(N{+}1)/2$.

| Property | Paper (Eq. 8) | Notes (Eq. 32) |
|---|---|---|
| Spin encoding | $s_i \in \{0, 1\}$ | $s_i \in \{+1, -1\}$ |
| Activation | Sigmoid $\sigma(x)$ | Tanh |
| Biases | No | Yes ($b_i$) |
| $W = 0$ limit | Uniform ($p = 0.5$) | General NMF |
| Parameters | $N(N{-}1)/2$ | $N(N{+}1)/2$ |
| Output | $\hat{s}_i = \sigma(\cdot) \in (0,1)$ | $\mu_i = \tanh(\cdot) \in (-1,1)$ |
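The two conventions can be checked numerically. This quick sketch verifies the activation identity $\tanh(x) = 2\sigma(2x) - 1$ and the equivalent probability statement $P(s_i = +1) = (1 + \mu_i)/2 = \sigma(2a_i)$ for a pre-activation $a_i$:

```python
import numpy as np

def sigma(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

a = np.linspace(-5, 5, 1001)

# Activation identity relating the paper's sigmoid to the notes' tanh.
err_act = np.max(np.abs(np.tanh(a) - (2 * sigma(2 * a) - 1)))

# Probability identity: P(s = +1) = (1 + tanh(a)) / 2 = sigma(2a).
err_prob = np.max(np.abs((1 + np.tanh(a)) / 2 - sigma(2 * a)))
print(err_act, err_prob)
```

Both errors vanish to machine precision, so the two parameterizations describe the same family of conditionals up to the affine reparameterization of weights and biases.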

§7 The Lower-Triangular Weight Matrix

The autoregressive constraint—site $i$ sees only sites $1, \ldots, i{-}1$—is enforced by requiring the weight matrix $W$ to be strictly lower-triangular. Explicitly:

Strictly Lower-Triangular Weight Matrix
$$W = \begin{pmatrix} 0 & 0 & 0 & \cdots & 0 \\ W_{21} & 0 & 0 & \cdots & 0 \\ W_{31} & W_{32} & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ W_{N1} & W_{N2} & \cdots & W_{N,N-1} & 0 \end{pmatrix}$$

Every entry on or above the diagonal is zero, leaving $N(N{-}1)/2$ free weights below the diagonal. Adding the $N$ biases gives the total parameter count:

Parameter Count
$$\underbrace{N}_{\text{biases}} + \underbrace{\frac{N(N-1)}{2}}_{\text{weights}} = \frac{N(N+1)}{2} \quad \text{total parameters} \tag{33}$$
Figure 2.2. The strictly lower-triangular weight matrix for $N = 5$. Upper triangle and diagonal cells are masked to zero, enforcing the autoregressive constraint: $\mu_i$ depends only on $s_1, \ldots, s_{i-1}$.

To build concrete intuition, here is what the parameterization looks like for $N = 4$:

$$\begin{aligned} \mu_1 &= \tanh(b_1) &&\text{depends on nothing (independent Bernoulli)} \\[4pt] \mu_2 &= \tanh(b_2 + W_{21}\, s_1) &&\text{depends on } s_1 \\[4pt] \mu_3 &= \tanh(b_3 + W_{31}\, s_1 + W_{32}\, s_2) &&\text{depends on } s_1, s_2 \\[4pt] \mu_4 &= \tanh(b_4 + W_{41}\, s_1 + W_{42}\, s_2 + W_{43}\, s_3) &&\text{depends on } s_1, s_2, s_3 \end{aligned}$$

The first spin is always sampled independently. Each subsequent spin gets one more input than the last. The total number of weights is $0 + 1 + 2 + 3 = 6 = N(N{-}1)/2$ for $N = 4$, plus 4 biases, giving $4(5)/2 = 10$ parameters total.
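A quick way to see the constraint in code (a sketch with our own helper names, not from the paper) is to build $W$ with NumPy's strictly-lower-triangular mask and confirm that flipping $s_j$ never changes $\mu_i$ for $i \leq j$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4

# Strictly lower-triangular weights (k=-1 also zeroes the diagonal) plus biases.
W = np.tril(rng.normal(size=(N, N)), k=-1)
b = rng.normal(size=N)

def mu(s):
    """Conditional magnetizations mu_i = tanh(b_i + sum_{j<i} W_ij s_j)."""
    return np.tanh(b + W @ s)

s = np.array([+1.0, -1.0, +1.0, -1.0])
m = mu(s)

# Flipping s_j leaves mu_1, ..., mu_j unchanged: site j feeds only later sites.
no_lookahead = all(
    np.allclose(mu(np.where(np.arange(N) == j, -s, s))[: j + 1], m[: j + 1])
    for j in range(N)
)
print("autoregressive constraint holds:", no_lookahead)
```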

NMF as the origin of VAN parameter space. Setting all $W_{ij} = 0$ gives $\mu_i = \tanh(b_i)$, independent of $s_{<i}$. Then $q_\theta(s) = \prod_i q_i(s_i)$ with $m_i = \tanh(b_i)$, recovering exactly the NMF ansatz of Chapter 1. The biases are related to the NMF magnetizations by $b_i = \tanh^{-1}(m_i)$. NMF sits at the origin of the VAN parameter space: the simplest, cheapest, but least expressive member of the family.


§8 Why One Layer Captures Pairwise Correlations

With the parameterization of Eq. (32), the log-probability of a configuration is:

$$\ln q_\theta(s) = \sum_{i=1}^{N} \ln q_\theta(s_i \mid s_{<i}) = \sum_{i=1}^{N} \ln \frac{1 + \mu_i\, s_i}{2}.$$

What kind of distribution does this represent? To find out, expand the tanh inside $\mu_i$ to leading order in the weights $W_{ij}$. This expansion is controlled for small $W$ and gives the right qualitative picture even for moderate $W$.

  1. Expand tanh to first order in $W$. For small $W$, $$\mu_i \approx \tanh(b_i) + \bigl(1 - \tanh^2(b_i)\bigr) \sum_{j < i} W_{ij}\, s_j.$$ Define $m_i^{(0)} = \tanh(b_i)$ as the NMF magnetization. The correction is proportional to $(1 - (m_i^{(0)})^2)$, which is largest when $m_i^{(0)} \approx 0$ (maximum uncertainty) and vanishes as $|m_i^{(0)}| \to 1$ (fully polarized spin).
  2. Substitute into $\ln q_\theta(s_i \mid s_{<i})$. Using $q_\theta(s_i \mid s_{<i}) = (1 + \mu_i\, s_i) / 2$ and expanding the logarithm for small perturbations around the NMF baseline: $$\ln q_\theta(s_i \mid s_{<i}) \approx \text{const}(b_i) + \alpha_i\, s_i + \sum_{j < i} \tilde{W}_{ij}\, s_i\, s_j + \cdots$$ where $\alpha_i$ depends on $b_i$ and $\tilde{W}_{ij}$ is proportional to $W_{ij}$.
  3. Sum over $i$ to get $\ln q_\theta(s)$. $$\ln q_\theta(s) = \sum_i \ln q_\theta(s_i \mid s_{<i}) \approx \text{const} + \sum_i \alpha_i\, s_i + \sum_{i > j} \tilde{W}_{ij}\, s_i\, s_j + \cdots$$
  4. Identify the structure. The result has three types of terms: a constant (normalization), linear terms $\alpha_i\, s_i$ (effective local fields), and pairwise terms $\tilde{W}_{ij}\, s_i\, s_j$ (effective couplings). This is exactly the form of a general pairwise Ising model.
One-Layer VAN as Pairwise Model

To leading order in $W$, the one-layer VAN log-probability is a general pairwise Ising model: $$\ln q_\theta(s) \approx \text{const} + \sum_i \alpha_i\, s_i + \sum_{i > j} \tilde{W}_{ij}\, s_i\, s_j. \tag{34}$$ The $N(N{-}1)/2$ weights map onto $N(N{-}1)/2$ independent pairwise couplings—one for each spin pair.
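The claim can be tested numerically: evaluate $\ln q_\theta(s)$ on all $2^N$ configurations for small random weights and fit it by least squares to a basis of constant, linear, and pairwise terms. Any deviation from Eq. (34) shows up as a fit residual, which should be of higher order in $W$. A sketch with illustrative parameter scales:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, eps = 4, 0.05                       # eps sets the small-weight scale

W = np.tril(rng.normal(size=(N, N)), k=-1) * eps
b = rng.normal(size=N) * 0.3

configs = np.array(list(itertools.product([+1.0, -1.0], repeat=N)))

def log_q(s):
    mu = np.tanh(b + W @ s)
    return np.sum(np.log((1 + mu * s) / 2))

y = np.array([log_q(s) for s in configs])

# Basis: constant, N linear terms s_i, N(N-1)/2 pairwise products s_i s_j.
cols = [np.ones(len(configs))]
cols += [configs[:, i] for i in range(N)]
cols += [configs[:, i] * configs[:, j] for i in range(N) for j in range(i)]
X = np.stack(cols, axis=1)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = np.max(np.abs(X @ coef - y))   # higher order in eps
print(f"max deviation from pairwise form: {resid:.1e}")
```

Shrinking `eps` shrinks the residual quadratically, consistent with the leading-order analysis above.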

Connections to other models. This connects to two important ideas:

(1) Boltzmann machines—an undirected graphical model with the same pairwise structure $\ln p(s) \propto \sum_{ij} J_{ij}\, s_i\, s_j + \sum_i h_i\, s_i$, but with an intractable normalization constant (the partition function $Z$ requires summing over all $2^N$ configurations). VAN achieves a similar expressiveness with tractable $\ln q_\theta(s)$—the autoregressive structure gives us the normalization for free.

(2) The inverse Ising problem—given observed correlations $\avg{s_i\, s_j}$, find the couplings $J_{ij}$ that reproduce them. The one-layer VAN solves a variational version of this problem: it finds the pairwise couplings that minimize $\KL(q_\theta \| p)$.


§9 Worked Numerical Example: $N = 3$ Ising Chain

To make all of this concrete, we work through a complete example: a 3-spin Ising chain with periodic boundary conditions (a ring), coupling $J > 0$, no external field ($h = 0$), at inverse temperature $\beta J = 0.5$.

Step 1 — Hamiltonian and Configurations

The Hamiltonian for the 3-site ring is:

$$E(s) = -J(s_1\, s_2 + s_2\, s_3 + s_3\, s_1). \tag{35}$$

All three pairs are nearest neighbors on the ring. There are $2^3 = 8$ configurations:

| $s_1$ | $s_2$ | $s_3$ | $s_1 s_2 + s_2 s_3 + s_3 s_1$ | $E/J$ | $e^{-\beta E}$ ($\beta J = 0.5$) |
|---|---|---|---|---|---|
| $+1$ | $+1$ | $+1$ | $+3$ | $-3$ | $e^{1.5} \approx 4.482$ |
| $+1$ | $+1$ | $-1$ | $-1$ | $+1$ | $e^{-0.5} \approx 0.607$ |
| $+1$ | $-1$ | $+1$ | $-1$ | $+1$ | $e^{-0.5} \approx 0.607$ |
| $-1$ | $+1$ | $+1$ | $-1$ | $+1$ | $e^{-0.5} \approx 0.607$ |
| $+1$ | $-1$ | $-1$ | $-1$ | $+1$ | $e^{-0.5} \approx 0.607$ |
| $-1$ | $+1$ | $-1$ | $-1$ | $+1$ | $e^{-0.5} \approx 0.607$ |
| $-1$ | $-1$ | $+1$ | $-1$ | $+1$ | $e^{-0.5} \approx 0.607$ |
| $-1$ | $-1$ | $-1$ | $+3$ | $-3$ | $e^{1.5} \approx 4.482$ |

There are two all-aligned configurations (energy $-3J$) and six partially disordered configurations (energy $+J$).

Step 2 — Exact Quantities

Partition function:

$$Z = 2\, e^{1.5} + 6\, e^{-0.5} = 2(4.4817) + 6(0.6065) = 8.9634 + 3.6392 = 12.6026. \tag{36}$$

Free energy: $$F_{\text{exact}} = -\frac{1}{\beta} \ln Z = -2J \ln 12.6026 \approx -5.068J, \qquad f_{\text{exact}} = F_{\text{exact}} / 3 \approx -1.689J.$$

Magnetization: By the $s \to -s$ symmetry of the Hamiltonian (no external field), $\avg{s_i} = 0$ for all $i$.

Nearest-neighbor correlation: We compute $\avg{s_1\, s_2}$ by summing $s_1\, s_2 \times e^{-\beta E}$ over all configurations. Of the 8 configurations, 4 have $s_1 = s_2$ (giving $s_1 s_2 = +1$) and 4 have $s_1 \neq s_2$ (giving $s_1 s_2 = -1$):

$$\begin{aligned} \sum_{s:\, s_1 s_2 = +1} e^{-\beta E} &= 2\, e^{1.5} + 2\, e^{-0.5} = 8.9634 + 1.2131 = 10.1765 \\[4pt] \sum_{s:\, s_1 s_2 = -1} e^{-\beta E} &= 4\, e^{-0.5} = 2.4261 \end{aligned}$$

Therefore: $$\avg{s_1\, s_2}_{\text{exact}} = \frac{10.1765 - 2.4261}{12.6026} = \frac{7.7504}{12.6026} \approx 0.615. \tag{37}$$

By the translational symmetry of the ring, $\avg{s_2\, s_3}_{\text{exact}} = \avg{s_3\, s_1}_{\text{exact}} = 0.615$ as well.
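These exact numbers are easy to reproduce by direct enumeration. The sketch below (with our own helper names) sums over all $2^3 = 8$ configurations:

```python
import itertools
import numpy as np

J, beta, N = 1.0, 0.5, 3   # beta * J = 0.5

def energy(s):
    """3-site ring: every pair is a nearest-neighbor bond."""
    return -J * (s[0] * s[1] + s[1] * s[2] + s[2] * s[0])

configs = list(itertools.product([+1, -1], repeat=N))
boltz = np.array([np.exp(-beta * energy(s)) for s in configs])

Z = boltz.sum()
f = -np.log(Z) / (beta * N)                        # free energy per site
corr = sum(s[0] * s[1] * w for s, w in zip(configs, boltz)) / Z

print(f"Z = {Z:.4f}, f = {f:.4f} J, <s1 s2> = {corr:.4f}")
```

The printed values match Eqs. (36)-(37): $Z \approx 12.60$, $f \approx -1.689J$, and $\avg{s_1 s_2} \approx 0.615$.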

Step 3 — NMF Limit ($W = 0$, Biases Only)

With $h = 0$ and the NMF ansatz, the self-consistency equation from Chapter 1 gives $m = \tanh(z\,\beta J\, m)$ where $z = 2$ is the coordination number. At $\beta J = 0.5$:

$$m = \tanh(2 \times 0.5 \times m) = \tanh(m).$$

The slope of $\tanh(m)$ at $m = 0$ is exactly 1, so we are precisely at the NMF critical point ($k_B T_c^{\text{NMF}} = zJ = 2J$, i.e., $\beta_c J = 0.5$). The only solution is $m = 0$—a uniform distribution.

NMF free energy per site: With $m = 0$ (equal probability for $\pm 1$), the energy contribution is $\avg{E}_q / N = -J\, m^2 = 0$ and the entropy is $S / N = k_B \ln 2$. So:

$$f_q^{\text{NMF}} = 0 - k_B T \ln 2 = -2J \ln 2 \approx -1.386J.$$

NMF correlation: Since the NMF ansatz factorizes, $\avg{s_1\, s_2}_{\text{NMF}} = \avg{s_1}\avg{s_2} = m^2 = 0$.

Step 4 — One-Layer VAN Improvement

With $W \neq 0$, the VAN can introduce correlations. The parameters are $b_1, b_2, b_3, W_{21}, W_{31}, W_{32}$—six parameters in total.

For the symmetric $h = 0$ Ising model, the $s \to -s$ symmetry implies the optimal biases are $b_i = 0$. By the translational symmetry of the 3-site ring (all pairs are nearest neighbors), all weights should be equal: $W_{21} = W_{31} = W_{32} \equiv w$. This reduces the problem to a single parameter $w$.

Any positive $w$ introduces positive correlations between spins. To leading order in $w$, the energy expectation $\avg{E}_q$ decreases (spins become more aligned, lowering energy), while the entropy penalty $-T\avg{\ln q_\theta}_q$ grows only as $w^2$. The optimal $w^*$ balances these effects, giving a free energy that is strictly better (lower) than the NMF value.
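Because $N = 3$ is tiny, the variational free energy can be evaluated exactly by enumeration (no sampling needed) and scanned over the single parameter $w$. This sketch, with our own function names and an illustrative grid, shows any reasonable $w > 0$ beating the NMF value:

```python
import itertools
import numpy as np

J, beta, N = 1.0, 0.5, 3

def energy(s):
    return -J * (s[0] * s[1] + s[1] * s[2] + s[2] * s[0])

def f_var(w):
    """Exact variational free energy per site for b = 0, all weights equal to w."""
    F = 0.0
    for s in itertools.product([+1, -1], repeat=N):
        # mu_1 = tanh(0), mu_2 = tanh(w s_1), mu_3 = tanh(w (s_1 + s_2)).
        mu = np.array([0.0, np.tanh(w * s[0]), np.tanh(w * (s[0] + s[1]))])
        log_q = np.sum(np.log((1 + mu * np.array(s)) / 2))
        F += np.exp(log_q) * (energy(s) + log_q / beta)
    return F / N

ws = np.linspace(0.0, 1.0, 201)
fs = np.array([f_var(w) for w in ws])
w_star, f_star = ws[fs.argmin()], fs.min()

print(f"NMF limit:  f = {f_var(0.0):.3f} J")
print(f"best w: f = {f_star:.3f} J  (exact: -1.689 J)")
```

The scan recovers $f = -1.386J$ at $w = 0$ and a strictly lower value at the optimum, sitting between the NMF and exact free energies as the variational bound requires.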

| Quantity | NMF ($W = 0$) | 1-Layer VAN | Exact |
|---|---|---|---|
| $f / J$ | $-1.386$ | improves toward exact | $-1.689$ |
| $\avg{s_i}$ | $0$ | $0$ | $0$ |
| $\avg{s_i\, s_{i+1}}$ | $0$ | $> 0$ | $0.615$ |
| Parameters | 3 (biases) | 6 ($b_i + W_{ij}$) | — |

Three parameters make all the difference. Even with just 3 additional parameters (the weights $W_{21}, W_{31}, W_{32}$), the one-layer VAN can capture nearest-neighbor correlations that NMF entirely misses. The improvement in free energy directly measures the reduction in $\KL(q_\theta \| p)$.


§10 The VAN Training Loop

We have built the VAN ansatz. Now: how do we train it? The loss function is the variational free energy itself:

$$\Fq = \avg{E(s) + \frac{1}{\beta}\ln q_\theta(s)}_{q_\theta}. \tag{38}$$

This is the same functional we minimized analytically for NMF in Chapter 1, but now we optimize it by gradient descent over the VAN parameters $\theta = \{b_i, W_{ij}\}$.

The gradient of $\Fq$ with respect to $\theta$ is computed using the REINFORCE (score function) estimator. Since the expectation is over $q_\theta$ which itself depends on $\theta$, we use the log-derivative trick $\nabla_\theta q_\theta = q_\theta \nabla_\theta \ln q_\theta$ to move the gradient inside the expectation:

REINFORCE Gradient (VAN Paper Eq. 5)
$$\nabla_\theta \Fq = \left\langle \left(\beta E(s) + \ln q_\theta(s)\right) \nabla_\theta \ln q_\theta(s) \right\rangle_{q_\theta} \tag{39}$$

In practice, this expectation is estimated by Monte Carlo: draw $K$ samples $s^{(1)}, \ldots, s^{(K)}$ from $q_\theta$, compute the quantity in brackets for each, and average.

The autoregressive structure is what makes this entire procedure practical. It gives us two properties that most variational ansatzes lack:

1. Tractable log-likelihood. The log-probability of any configuration is: $$\ln q_\theta(s) = \sum_{i=1}^{N} \ln q_\theta(s_i \mid s_{<i}) = \sum_{i=1}^{N} \ln \frac{1 + \mu_i\, s_i}{2}.$$ This is a sum of $N$ terms, each computable in $O(i)$ time. No intractable partition function appears.

2. Trivial sequential sampling. To draw a sample from $q_\theta$, simply sample $s_1$ from $q_\theta(s_1)$, then $s_2$ from $q_\theta(s_2 \mid s_1)$, then $s_3$ from $q_\theta(s_3 \mid s_1, s_2)$, and so on. Each step is a single Bernoulli draw with parameter $(1 + \mu_i) / 2$. No MCMC, no Markov chains, no burn-in, no autocorrelation.
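The sampling procedure is only a few lines of code. This is a sketch in the notes' convention (Eq. 32), with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5

# Illustrative VAN parameters with the strictly lower-triangular mask.
W = np.tril(rng.normal(size=(N, N)), k=-1) * 0.5
b = np.zeros(N)

def sample_one():
    """One configuration via N sequential Bernoulli draws -- no Markov chain."""
    s = np.zeros(N)
    for i in range(N):
        mu_i = np.tanh(b[i] + W[i, :i] @ s[:i])          # uses only s_1..s_{i-1}
        s[i] = +1.0 if rng.random() < (1 + mu_i) / 2 else -1.0
    return s

samples = np.stack([sample_one() for _ in range(2000)])
print("per-site magnetizations:", np.round(samples.mean(axis=0), 2))
```

Every sample is exact and independent of every other sample, which is precisely what MCMC cannot offer.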

The three-property comparison. NMF has tractable likelihood and sampling but no correlations. Boltzmann machines have correlations but intractable likelihood (the partition function is unknown) and require MCMC for sampling. VAN has all three: tractable likelihood, trivial sampling, and tunable correlations through the autoregressive parameterization. This combination is what makes VAN a practical tool for variational statistical mechanics.

Variance Reduction

The REINFORCE gradient (Eq. 39) is an unbiased estimator, but it can have high variance—a well-known problem in reinforcement learning. The VAN paper (Eqs. 14–15) uses baseline subtraction to reduce variance: replace $\beta E(s) + \ln q_\theta(s)$ with $\beta E(s) + \ln q_\theta(s) - c$ where $c = \avg{\beta E + \ln q_\theta}_{q_\theta}$ is the running mean. Since $c$ does not depend on the sample $s$, the gradient $\avg{c \,\nabla_\theta \ln q_\theta}_{q_\theta} = c \,\nabla_\theta \sum_s q_\theta(s) = c \,\nabla_\theta 1 = 0$, so the baseline does not bias the gradient. In practice, $c$ is estimated as an exponential moving average over training batches.
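Putting the pieces together, here is a compact sketch of the full training loop for the $N = 3$ ring of §9: sample a batch, form the baseline-subtracted REINFORCE gradient of Eq. (39), and descend. The gradient of $\ln q_\theta$ uses the identity $\partial \ln q_\theta / \partial a_i = s_i - \mu_i$ for the pre-activation $a_i = b_i + \sum_{j<i} W_{ij}\, s_j$ (valid since $s_i^2 = 1$). All hyperparameters below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
J, beta, N = 1.0, 0.5, 3
K, lr, steps = 500, 0.05, 300        # batch size, learning rate, SGD steps

mask = np.tril(np.ones((N, N)), k=-1)
W = np.zeros((N, N))
b = np.zeros(N)

def energies(s):
    """Batch of 3-site ring energies; s has shape [K, N]."""
    return -J * (s[:, 0] * s[:, 1] + s[:, 1] * s[:, 2] + s[:, 2] * s[:, 0])

def sample_batch(K):
    s = np.zeros((K, N))
    for i in range(N):               # sequential autoregressive sampling
        mu = np.tanh(b[i] + s[:, :i] @ W[i, :i])
        s[:, i] = np.where(rng.random(K) < (1 + mu) / 2, 1.0, -1.0)
    return s

for _ in range(steps):
    s = sample_batch(K)
    mu = np.tanh(b + s @ W.T)                        # mu[k, i]
    log_q = np.log((1 + mu * s) / 2).sum(axis=1)
    R = beta * energies(s) + log_q                   # bracket of Eq. (39)
    Rc = R - R.mean()                                # baseline subtraction
    d_a = s - mu                                     # d ln q / d a_i = s_i - mu_i
    b -= lr * (Rc[:, None] * d_a).mean(axis=0)
    W -= lr * np.einsum("k,ki,kj->ij", Rc, d_a, s) / K * mask

# Estimate the trained variational free energy per site.
s = sample_batch(4000)
mu = np.tanh(b + s @ W.T)
log_q = np.log((1 + mu * s) / 2).sum(axis=1)
f_q = (energies(s) + log_q / beta).mean() / N
print(f"trained f_q ~ {f_q:.3f} J   (NMF: -1.386 J, exact: -1.689 J)")
```

After a few hundred steps the estimated $f_q$ sits well below the NMF value and close to (but, by the variational bound, not below) the exact $-1.689J$.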

Beyond One Layer: Deep VAN

Throughout this chapter we have focused on the one-layer parameterization (Eq. 32), where $\mu_i$ is a linear function of $s_{<i}$ passed through tanh. Adding hidden layers to the network parameterizing $\mu_i$ allows it to be a nonlinear function of $s_{<i}$. In the small-weight expansion of §8, this generates effective multi-spin interactions—three-body ($s_i\, s_j\, s_k$), four-body, and higher—in $\ln q_\theta$. This is how deep VAN captures the higher-order correlations needed near critical points, and why the VAN paper reports dramatic improvements over both NMF and one-layer VAN for the 2D Ising model near $T_c$.


References

[VAN] D. Wu, L. Wang, and P. Zhang, “Solving Statistical Mechanics Using Variational Autoregressive Networks,” Phys. Rev. Lett. 122, 080602 (2019). arXiv:1809.10606.

[Pandey–Xu] M. Pandey and M. Xu, “Mean Field Theory and its application to deep learning” (2019). mimee.xyz/meanfield.pdf.

[Arovas] D. Arovas, “Lecture Notes on Thermodynamics and Statistical Mechanics,” Ch. 7, UCSD Physics 210a.