1Variational Free Energy
We follow the notation of Wu, Wang & Zhang (arXiv:1809.10606), referred to throughout as the VAN paper. The key idea is: we cannot compute the partition function $Z$ exactly for interesting models, so we introduce a tractable approximate distribution $q_\theta(s)$ and minimize a variational upper bound on the true free energy.
The Boltzmann distribution and the partition function
Consider a system of $N$ binary spins $s = (s_1, s_2, \ldots, s_N)$ where each $s_i \in \{+1, -1\}$. The joint probability follows the Boltzmann distribution:
where $\beta = 1/(k_B T)$ is the inverse temperature, $E(s)$ is the energy of configuration $s$, and $Z$ is the partition function. The sum $\sum_s$ runs over all $2^N$ configurations.
The free energy is:
Computing $F$ exactly requires summing over exponentially many configurations — a $\#$P-hard problem in general.
The variational ansatz $q_\theta(s)$
We introduce a variational distribution $q_\theta(s)$, parameterized by some set of variational parameters $\theta$. The idea: we will choose $q_\theta$ from a tractable family of distributions, and adjust $\theta$ so that $q_\theta$ is as close as possible to the true Boltzmann distribution $p$.
The closeness is measured by the Kullback–Leibler (KL) divergence:
$\KL = 0$ if and only if $q_\theta(s) = p(s)$ for all $s$.
KL divergence and the variational free energy $\Fq$
Substituting $p(s) = e^{-\beta E(s)}/Z$ into (3):
- $$ \KL(q_\theta \| p) = \sum_s q_\theta(s) \left[\ln q_\theta(s) - \ln p(s)\right] = \sum_s q_\theta(s) \left[\ln q_\theta(s) + \beta E(s) + \ln Z\right]. $$
- Since $\sum_s q_\theta(s) = 1$: $$ \KL(q_\theta \| p) = \sum_s q_\theta(s)\left[\beta E(s) + \ln q_\theta(s)\right] + \ln Z = \beta\!\left(\underbrace{\frac{1}{\beta}\sum_s q_\theta(s)\left[\beta E(s) + \ln q_\theta(s)\right]}_{\displaystyle \Fq}\right) - \beta F\,. $$
This gives us the fundamental variational relation (VAN paper, Eq. 2):
where the variational free energy is (VAN paper, Eq. 3):
We can write this more transparently as:
That is,
where we have introduced the Shannon entropy (dimensionless, measured in nats):
and the thermodynamic (Gibbs) entropy $S_q = k_B \,\mathcal{H}_q$, which carries the dimensions of $k_B$ (energy/temperature).
Shannon entropy vs. physics entropy. Throughout these notes we use $\mathcal{H}$ for the dimensionless Shannon entropy and $S = k_B \mathcal{H}$ for the physical entropy. The relation is simply:
$$ S_{\text{physics}} = k_B \times \mathcal{H}_{\text{Shannon}}\,. $$Many machine-learning references (including the VAN paper) set $k_B = 1$, so that $S$ and $\mathcal{H}$ coincide and $\beta = 1/T$. Most physics textbooks keep $k_B$ explicit. We carry it here so you can compare directly with both conventions. When you see a physics reference writing $S = -k_B \sum p \ln p$, or an ML reference writing $H = -\sum p \ln p$ with $\beta = 1/T$, they are saying the same thing.
Key point: Since $\KL \geq 0$, we have $\Fq \geq F$. The variational free energy is an upper bound on the true free energy. Minimizing $\Fq$ over $\theta$ tightens this bound, making $q_\theta$ approximate $p$ as closely as possible within the chosen family.
2The NMF Ansatz
The naive mean field (NMF) ansatz
The simplest variational family is a fully factorized distribution:
where each $q_i(s_i)$ is a Bernoulli distribution over $s_i \in \{+1, -1\}$. Since each $q_i$ must be normalized, it is characterized by a single parameter. Let us parameterize it via the local magnetization:
Combined with normalization $q_i(+1) + q_i(-1) = 1$, we get:
So the variational parameters are $\theta = (m_1, m_2, \ldots, m_N)$.
Physical meaning. The factorized ansatz says: the spins are statistically independent. Each spin $s_i$ has its own probability of being up or down, but there are no correlations between different spins. This is the defining approximation of naive mean field theory. The VAN paper generalizes this by replacing the product of independent marginals with a product of conditional probabilities — the autoregressive decomposition — which can capture correlations.
Evaluating $\Fq$ under the NMF ansatz
We need to compute the two pieces of $\Fq$: the average energy and the entropy.
Entropy
Since $q_\theta(s) = \prod_i q_i(s_i)$ and the spins are independent under $q$, the Shannon entropy factorizes:
- $$ \mathcal{H}_q = -\sum_s q_\theta(s) \ln q_\theta(s) = -\sum_s \prod_j q_j(s_j) \sum_i \ln q_i(s_i)\,. $$
- In the sum over all configurations, the product factorizes. For each term $i$, the sums over all $s_j$ with $j \neq i$ just give 1 (normalization), leaving only the sum over $s_i$: $$ \mathcal{H}_q = -\sum_{i=1}^{N} \sum_{s_i=\pm 1} q_i(s_i) \ln q_i(s_i) \;\equiv\; \sum_{i=1}^{N} \mathcal{H}(q_i)\,, $$
- where $\mathcal{H}(q_i)$ is the binary Shannon entropy of the $i$-th marginal. Substituting $q_i(s_i) = (1+m_i s_i)/2$:
and the thermodynamic entropy is $S_q = k_B\,\mathcal{H}_q = k_B \sum_i \mathcal{H}(m_i)$.
Note that $\mathcal{H}(0) = \ln 2$ (maximum uncertainty, $\sim 0.693$ nats) and $\mathcal{H}(\pm 1) = 0$ (fully determined spin).
Average energy
The average energy depends on the specific model. We keep things general here and specialize in Sections 3–4. For any energy that is a sum of pairwise interactions and local fields,
$$ E(s) = -\sum_{\langle i,j\rangle} J_{ij}\, s_i s_j - \sum_i h_i\, s_i\,, $$the factorized ansatz gives (since $\avg{s_i s_j}_q = \avg{s_i}_q \avg{s_j}_q = m_i m_j$ for $i \neq j$):
This is the key simplification. Under the factorized ansatz, the two-point correlation $\avg{s_i s_j}_q$ factorizes into $m_i m_j$. This is the mathematical content of "replacing interactions by an effective mean field." It is exact within the variational distribution $q$ (the spins really are independent under $q$), but it makes $q$ a poor approximation to $p$ whenever the true correlations matter.
Putting it together
The variational free energy for the general pairwise model is:
Minimizing $\Fq$: the NMF self-consistency equations
We minimize $\Fq$ with respect to each $m_i$. Setting $\partial \Fq / \partial m_k = 0$:
- Energy derivative: $$ \frac{\partial}{\partial m_k}\left(-\sum_{\langle i,j\rangle} J_{ij} m_i m_j - \sum_i h_i m_i\right) = -\sum_{j \in \partial k} J_{kj}\, m_j - h_k\,, $$ where $\partial k$ denotes the neighbors of site $k$.
- Entropy derivative: $$ \frac{\partial}{\partial m_k}\bigl(-k_B T\, \mathcal{H}(m_k)\bigr) = -k_B T \cdot \frac{\partial \mathcal{H}}{\partial m_k} = -k_B T \cdot \left(-\frac{1}{2}\ln\frac{1+m_k}{2} + \frac{1}{2}\ln\frac{1-m_k}{2}\right)\,. $$ This simplifies using $\tanh^{-1}$: $$ \frac{\partial \mathcal{H}}{\partial m_k} = -\frac{1}{2}\ln\frac{1+m_k}{1-m_k} = -\tanh^{-1}(m_k)\,. $$ So the entropy contribution is $+k_B T\,\tanh^{-1}(m_k)$.
- Setting to zero: $$ -\sum_{j\in\partial k} J_{kj}\, m_j - h_k + k_B T\,\tanh^{-1}(m_k) = 0\,. $$
Solving for $m_k$:
These are the naive mean field (NMF) self-consistency equations. There is one equation for each site $k$. The term in brackets is the effective field experienced by spin $k$:
Each spin sees its neighbors only through their mean magnetization $m_j$ — hence the name "mean field."
Comparison with the VAN framework. In the VAN paper, the NMF ansatz $q_\theta(s) = \prod_i q_i(s_i)$ is the simplest special case of the autoregressive ansatz $q_\theta(s) = \prod_i q_\theta(s_i | s_1,\ldots,s_{i-1})$. In NMF, each conditional $q_\theta(s_i | s_{<i})$ is independent of $s_{<i}$; it collapses to $q_i(s_i)$. The VAN paper's contribution is to parameterize the conditionals with neural networks, allowing correlations while retaining a tractable, normalized $q_\theta$.
31D Ising Application
Setup and energy function
Consider a ring of $N$ spins with nearest-neighbor ferromagnetic coupling $J > 0$ and no external field ($h = 0$). The energy is:
Each site has exactly $z = 2$ neighbors (left and right). There are $N$ bonds in total (the periodic bond $s_N s_1$ gives the $N$-th bond).
NMF variational free energy for the 1D chain
We assume a uniform (translation-invariant) solution: $m_i = m$ for all $i$. This is natural given the periodic boundary conditions and the translation symmetry of the Hamiltonian. Then:
Average energy
From Eq. (12) with $J_{ij} = J$ for neighboring pairs and $h_i = 0$:
(There are $N$ bonds, each contributing $-J\, m^2$.)
Entropy
From Eq. (11), the Shannon entropy is $\mathcal{H}_q = N\,\mathcal{H}(m)$, so the thermodynamic entropy is:
Variational free energy
Combining via $\Fq = \avg{E}_q - T S_q$, and dividing by $N$ to get the free energy per site $f_q \equiv \Fq / N$:
Structure of $f_q(m)$. The first term ($-Jm^2$) favors large $|m|$ — it is the energy gain from alignment. The second term (the entropy cost, with coefficient $+k_BT$) favors $m = 0$ — it is maximized when the spin is equally likely up and down. The competition between energy and entropy is the essence of the phase transition physics.
Deriving the NMF equations
We minimize $f_q(m)$ by setting $df_q/dm = 0$:
- Differentiate: $$ \frac{df_q}{dm} = -2Jm + k_B T \cdot \frac{1}{2}\ln\frac{1+m}{1-m} = -2Jm + k_B T\,\tanh^{-1}(m)\,. $$
- Set to zero and solve: $$ k_B T\,\tanh^{-1}(m) = 2Jm \qquad\Longrightarrow\qquad m = \tanh\!\left(\frac{2J}{k_B T}\, m\right) = \tanh(2\beta J\, m)\,. $$
This is the special case of the general NMF equation (14) for a uniform solution on the 1D chain. Each spin has $z = 2$ neighbors, each with magnetization $m$, so the effective field is $h^{\mathrm{eff}} = 2Jm$:
Note on the coordination number. The factor of 2 in $2\beta J m$ comes from the coordination number $z = 2$ (each spin has 2 neighbors on the periodic chain). In general dimension $d$ on a hypercubic lattice, $z = 2d$, and the NMF equation becomes $m = \tanh(z \beta J m)$ in the uniform case.
Solving for the phase transition
Equation (20) is a transcendental equation for $m$. We solve it graphically/analytically.
The graphical construction makes the bifurcation structure clear: below $T_c$, the sigmoid $\tanh(2\beta J m)$ is steeper than the diagonal at the origin, producing two additional fixed points at $\pm m_0(T)$. At $T = T_c$ the sigmoid is tangent to $y = m$ at the origin (slope $= 1$). Above $T_c$ only $m = 0$ survives.
$m = 0$ is always a solution (paramagnetic state).
Non-zero solutions exist when the slope of the RHS at $m = 0$ exceeds 1. Since $\frac{d}{dm}\tanh(2\beta J m)\big|_{m=0} = 2\beta J$, a bifurcation occurs when:
For $T < T_c^{\text{NMF}}$, there are two additional solutions $m = \pm m_0(T)$ with $m_0 > 0$, corresponding to a ferromagnetically ordered state. This is a continuous (second-order) phase transition within NMF.
Critical behavior near $T_c$
Expanding $\tanh(x) \approx x - x^3/3$ for small $m$ near $T_c$:
- $$ m \approx 2\beta J\, m - \tfrac{1}{3}(2\beta J)^3 m^3 \qquad\Longrightarrow\qquad m^2 \approx \frac{3(2\beta J - 1)}{(2\beta J)^3}\,. $$
- Near $T_c$, with $t \equiv (T_c - T)/T_c$ the reduced temperature: $$ m \sim |t|^{1/2}\,, $$ which gives the mean-field critical exponent $\beta_{\text{MF}} = 1/2$.
Comparison with the exact solution
NMF is qualitatively wrong in 1D! The 1D Ising model with short-range interactions has no phase transition at any finite temperature. This was shown by Ising himself (1925) and can be proved exactly using the transfer matrix method. The exact free energy per site is:
$$ f_{\text{exact}} = -k_B T \ln\!\left(2\cosh\beta J\right) + \text{(boundary corrections that vanish as } N\to\infty\text{)}\,. $$The magnetization is $m = 0$ for all $T > 0$ (in the absence of an external field). The correlation length is $\xi = -1/\ln(\tanh \beta J)$, which is finite for all $T > 0$.
NMF predicts a spurious phase transition at $k_B T_c = 2J$. This is because the factorized ansatz completely ignores fluctuations, which in 1D are strong enough to destroy long-range order at any nonzero temperature.
Why does NMF fail in 1D? The factorized ansatz $q(s) = \prod_i q_i(s_i)$ sets all correlations to zero: $\avg{s_i s_j}_q - m^2 = 0$. But in 1D, thermal fluctuations create domain walls (boundaries between up and down regions) at a free-energy cost of only $2J$ per wall, offset by an entropy gain of $\sim T \ln N$ from placing the wall anywhere on the chain. Since the entropy always wins for $T > 0$, the chain disorders. NMF misses this because it does not see domain walls — those are inherently correlated structures.
This is a well-known limitation discussed in many textbooks. The VAN paper demonstrates that more expressive ansatze (autoregressive networks) can correctly capture such correlations and avoid the spurious transition.
The NMF variational free energy $\mathcal{F}_q$ always lies above the exact free energy $F$ (it is less negative), as guaranteed by $D_{\text{KL}} \geq 0$. The two curves agree at low $T$ (where both approach $-J$, the ground state energy) and diverge most around $T \sim T_c^{\text{NMF}} = 2J$.
Where does the phase transition show up in $f_q$? Since the NMF transition is second-order, $f_q(T)$ itself is continuous and its first derivative $-\partial f_q/\partial T = S$ (the entropy) is also continuous at $T_c$. The transition is invisible in the free energy curve! It only appears in the second derivative: the specific heat $C = -T\,\partial^2 f_q / \partial T^2$ has a jump discontinuity at $T_c$. Above $T_c$ the NMF gives $f_q = -k_BT\ln 2$ (a straight line, zero specific heat), while just below $T_c$ the ordered solution bends the curve, giving a finite $C$. This specific-heat jump $\Delta C = \frac{3}{2}k_B$ per spin is the classic mean-field result.
By contrast, the exact 1D free energy $f = -k_BT\ln(2\cosh\beta J)$ is analytic for all $T > 0$ — smooth to all orders, with no transition at all.
42D Ising Application
Setup and energy function
Consider an $L \times L$ square lattice with $N = L^2$ spins and periodic boundary conditions (a torus). The energy is:
where $\langle i,j \rangle$ runs over all nearest-neighbor pairs. On the square lattice, each site has coordination number $z = 4$. With periodic BCs, there are $N_{\text{bonds}} = 2N$ bonds in total (each site contributes 2 bonds: one to the right and one downward).
NMF variational free energy for the 2D lattice
Again assuming a uniform solution $m_i = m$ for all $i$ (which respects the lattice translation symmetry):
Average energy
(There are $2N$ bonds on the periodic square lattice, each contributing $-Jm^2$. Equivalently, each of the $N$ sites contributes $\frac{1}{2} z \cdot (-Jm^2) = -2Jm^2$ per site.)
Entropy
Exactly as before:
Variational free energy per site
Comparison with 1D. The only difference from the 1D case (Eq. 19) is the coefficient of $m^2$ in the energy term: $-Jm^2$ (1D, $z=2$) vs. $-2Jm^2$ (2D, $z=4$). In general, for coordination number $z$, the energy per site is $-\tfrac{z}{2}Jm^2$. The entropy per site is the same: $-k_B\mathcal{H}(m)$.
Deriving the NMF equations
Minimizing $f_q(m)$ proceeds identically to the 1D case.
- Differentiate: $$ \frac{df_q}{dm} = -4Jm + k_BT\,\tanh^{-1}(m) = 0\,. $$
- Solve: $$ m = \tanh\!\left(\frac{4J}{k_BT}\, m\right) = \tanh(4\beta J\, m)\,. $$
This is the general result (14) specialized to the 2D square lattice with $z = 4$ neighbors:
General pattern. For any lattice with coordination number $z$ and uniform coupling $J$, the NMF equation in zero field is always:
$$ m = \tanh(z\beta J\, m)\,, $$with $\beta = 1/(k_BT)$. This follows directly from Eq. (14): each site has $z$ neighbors, each contributing $Jm$ to the effective field, giving $h^{\mathrm{eff}} = zJm$. The NMF critical temperature is $k_BT_c = zJ$.
| Lattice | $d$ | $z$ | NMF equation |
|---|---|---|---|
| 1D chain | 1 | 2 | $m = \tanh(2\beta J m)$ |
| 2D square | 2 | 4 | $m = \tanh(4\beta J m)$ |
| 2D triangular | 2 | 6 | $m = \tanh(6\beta J m)$ |
| 3D cubic | 3 | 6 | $m = \tanh(6\beta J m)$ |
| $d$-dim hypercubic | $d$ | $2d$ | $m = \tanh(2d\,\beta J m)$ |
Solving for the phase transition
The analysis is identical to Section 3 but with $z = 4$ instead of $z = 2$.
Bifurcation condition: $\frac{d}{dm}\tanh(4\beta J m)\big|_{m=0} = 4\beta J = 1$, giving:
Below $T_c$: Two non-trivial solutions $m = \pm m_0(T)$ appear continuously from $m = 0$, indicating a second-order (continuous) phase transition to a ferromagnetic state.
Critical exponents: Near $T_c$, expanding as before gives $m \sim |t|^{1/2}$ with $t = (T_c - T)/T_c$. All NMF critical exponents take the classical mean-field values regardless of lattice or dimension:
Landau expansion of $f_q(m)$
For small $m$, we can expand the Shannon entropy to get a Landau-type free energy. Using $\mathcal{H}(m) = \ln 2 - \frac{m^2}{2} - \frac{m^4}{12} - \cdots$:
- $$ f_q(m) = -2Jm^2 - k_BT\left(\ln 2 - \frac{m^2}{2} - \frac{m^4}{12} - \cdots\right) $$ $$ = \text{const} + \frac{1}{2}(k_BT - 4J)m^2 + \frac{k_BT}{12} m^4 + \cdots $$
- $$ = f_0 + \frac{a}{2}\, m^2 + \frac{b}{4}\, m^4 + \cdots \tag{29} $$ with $a = k_BT - k_BT_c = k_B(T - T_c)$ where $k_BT_c = 4J$, and $b = k_BT/3 > 0$.
This is precisely the Landau free energy. For $a > 0$ ($T > T_c$), the minimum is at $m = 0$. For $a < 0$ ($T < T_c$), minima appear at $m = \pm\sqrt{-a/b} = \pm\sqrt{3(4J - k_BT)/(k_BT)}$. The transition is continuous with $m \sim |a|^{1/2}$.
Comparison with the exact Onsager solution
The 2D Ising model on the square lattice was solved exactly by Onsager (1944). The exact results are:
Critical temperature:
Critical exponent for magnetization: $\beta_{\text{exact}} = 1/8$ (Yang, 1952).
NMF overestimates $T_c$ and gets the wrong exponents.
| Quantity | NMF | Exact |
|---|---|---|
| $k_BT_c / J$ | $4.000$ | $2.269$ |
| $\beta$ (magn. exponent) | $1/2$ | $1/8$ |
| $\gamma$ (suscept. exponent) | $1$ | $7/4$ |
NMF overestimates $k_BT_c$ by a factor of $\sim 1.76$. This is because by ignoring correlations, NMF overestimates the tendency toward ordering (every spin "thinks" it has $z = 4$ perfectly aligned neighbors, rather than fluctuating ones). As the VAN paper demonstrates (Fig. 2a), more expressive variational ansatze dramatically reduce this error.
However, NMF is qualitatively correct in 2D: it correctly predicts the existence of a phase transition and the existence of spontaneous magnetization below $T_c$. This is a significant improvement over the 1D case, where NMF gives a qualitatively wrong prediction. The general rule is:
- In 1D: NMF is qualitatively wrong (predicts a transition where none exists).
- In 2D: NMF is qualitatively correct but quantitatively poor.
- In $d \geq 4$ (the upper critical dimension): NMF becomes exact for the critical exponents.
References and Further Reading
[VAN] D. Wu, L. Wang, and P. Zhang, "Solving Statistical Mechanics Using Variational Autoregressive Networks," Phys. Rev. Lett. 122, 080602 (2019). arXiv:1809.10606. — The primary reference for the variational framework and notation used here.
[Pandey–Xu] M. Pandey and M. Xu, "Mean Field Theory and its application to deep learning" (2019). mimee.xyz/meanfield.pdf. — Appendix 8.2 gives a full derivation of the NMF equations from the factorized $q(\sigma) = \prod q_i(\sigma_i)$ ansatz via KL divergence minimization, applied to the Ising model.
[Arovas] D. Arovas, "Lecture Notes on Thermodynamics and Statistical Mechanics," Ch. 7: Mean Field Theory of Phase Transitions, UCSD Physics 210a. PDF. — Comprehensive treatment using the variational density matrix $\varrho_N = \prod_i \varrho_i(\sigma_i)$.
[Sakthivadivel] D. Sakthivadivel, "Magnetisation and Mean Field Theory in the Ising Model," SciPost Phys. Lect. Notes 35 (2022). PDF. — Careful pedagogical derivation via the Bogoliubov inequality.
[Sudderth] E. Sudderth, "Lecture 9A: Variational Methods and Naive Mean Field," Brown CS242 (2016). PDF. — Clean derivation of naive MF for general pairwise MRFs from the information-theoretic/KL perspective.
[Kappen–Wiegerinck] H. J. Kappen and W. Wiegerinck, "Mean Field Theory for Graphical Models," Ch. in Advanced Mean Field Methods, MIT Press (2001). Link. — Derives NMF and TAP equations from the KL divergence / Taylor expansion viewpoint.
[Wainwright–Jordan] M. J. Wainwright and M. I. Jordan, "Graphical Models, Exponential Families, and Variational Inference," Found. Trends Mach. Learn. 1(1–2), 1–305 (2008). — The definitive reference on variational inference for graphical models. Chapter 5 covers mean field methods.
[Huang] Y.-P. Huang, "Mean-field theory," Lecture notes, NTHU Statistical Mechanics (2022). Link. — Works through the variational MFT for the Ising model using a factorizable trial Hamiltonian.
[Onsager] L. Onsager, "Crystal Statistics. I. A Two-Dimensional Model with an Order-Disorder Transition," Phys. Rev. 65, 117 (1944). — The exact solution of the 2D Ising model.