Definition 5.5 (KL Divergence) and Proposition 5.6 (Basic Properties) #
From: Rigollet, "High Dimensional Statistics", Chapter 5
Definition 5.5. The KL divergence between P and Q (both absolutely continuous w.r.t. some dominating measure μ) is: KL(P‖Q) = ∫ log(dP/dQ) dP = ∫ (dP/dμ) log(dP/dμ / dQ/dμ) dμ
Proposition 5.6. Properties: 1. KL(P‖Q) ≥ 0 (Gibbs' inequality) 2. KL(P‖Q) = 0 iff P = Q 3. KL is NOT symmetric 4. KL(P^⊗n‖Q^⊗n) = n · KL(P‖Q) (tensorization)
Definition 5.5: KL Divergence #
The KL divergence is formalized in Mathlib as InformationTheory.klDiv, which
maps two measures to ENNReal. For finite measures P, Q with P absolutely
continuous w.r.t. Q and integrable log-likelihood ratio:
klDiv P Q = ENNReal.ofReal (integral x, llr P Q x partial P + Q.real univ - P.real univ)
where llr P Q x = log ((P.rnDeriv Q x).toReal) is the log-likelihood ratio
(i.e., log(dP/dQ)).
For probability measures P, Q (where P.real univ = Q.real univ = 1), this simplifies to:
klDiv P Q = ENNReal.ofReal (integral x, log(dP/dQ(x)) dP)
which is exactly Definition 5.5 from the text.
Definition 5.5 (KL Divergence). The Kullback-Leibler divergence between
measures P and Q. This is an alias for Mathlib's InformationTheory.klDiv,
which computes KL(P, Q) as the integral of the log-likelihood ratio for
probability measures P absolutely continuous w.r.t. Q, and returns infinity
when P is not absolutely continuous w.r.t. Q.
Instances For
For probability measures, the toReal of KL divergence equals the integral
of the log-likelihood ratio, which is the formula in Definition 5.5.
Proposition 5.6: Basic Properties of KL Divergence #
Proposition 5.6(1): Gibbs' inequality (KL(P,Q) >= 0) #
Since Mathlib's klDiv takes values in ENNReal, it is nonneg by type.
The substantive content is that the real-valued integrand is itself nonneg
for finite measures with P absolutely continuous w.r.t. Q.
Proposition 5.6(1) (Gibbs' inequality, extended nonneg real form):
KL(P, Q) >= 0. Since klDiv takes values in ENNReal, this is immediate.
Proposition 5.6(1') (Gibbs' inequality, real-valued form): For finite measures P absolutely continuous w.r.t. Q with integrable log-likelihood ratio, the real-valued expression defining KL divergence is nonneg. This is the substantive mathematical content of Gibbs' inequality.
Proposition 5.6(2): KL(P,Q) = 0 iff P = Q #
Proposition 5.6(2): KL(P, Q) = 0 if and only if P = Q. The forward direction is sometimes called the "converse Gibbs' inequality".
Proposition 5.6(2a): KL(P, P) = 0 for any sigma-finite measure P.
Proposition 5.6(3): KL divergence is NOT symmetric #
We exhibit a concrete counterexample: P = 0 (zero measure) and Q = Dirac at (). Then KL(P, Q) = Q(Omega) (finite) but KL(Q, P) = infinity (since Q is not absolutely continuous w.r.t. 0).
Proposition 5.6(3): KL divergence is NOT symmetric. There exist measures P, Q such that KL(P, Q) is not equal to KL(Q, P).
Proposition 5.6(4): Tensorization / Additivity for product measures #
For independent product measures: KL(P1 x P2, Q1 x Q2) = KL(P1, Q1) + KL(P2, Q2)
The n-fold tensorization KL(P^n, Q^n) = n * KL(P, Q) follows by induction.
Book proof outline (Prop 5.6 part 2):
- Product Radon-Nikodym derivative: d(P₁⊗P₂)/d(Q₁⊗Q₂)(x₁,x₂) = (dP₁/dQ₁)(x₁) · (dP₂/dQ₂)(x₂).
- Taking logarithms: log(d(P₁⊗P₂)/d(Q₁⊗Q₂)) = log(dP₁/dQ₁) + log(dP₂/dQ₂).
- Integrating against P₁⊗P₂ and applying Fubini: KL(P₁⊗P₂ ‖ Q₁⊗Q₂) = ∫∫ [log(dP₁/dQ₁)(x₁) + log(dP₂/dQ₂)(x₂)] dP₁dP₂ = ∫ log(dP₁/dQ₁) dP₁ · (∫dP₂) + (∫dP₁) · ∫ log(dP₂/dQ₂) dP₂ = KL(P₁‖Q₁) · 1 + 1 · KL(P₂‖Q₂) [since Pᵢ are probability measures] = KL(P₁‖Q₁) + KL(P₂‖Q₂).
Formalization blockers:
- The factorization of
Measure.rnDerivfor product measures (d(μ₁.prod μ₂)/d(ν₁.prod ν₂) = (dμ₁/dν₁) ⊗ (dμ₂/dν₂)) is not yet in Mathlib. Kernel-based versions exist inProbability.Kernel.Composition, but not for plainMeasure.prod. - Without this, the integral decomposition via Fubini cannot be completed.
Auxiliary: The Radon-Nikodym derivative of a product measure factors as a product of Radon-Nikodym derivatives. This is the key missing Mathlib lemma. d(P₁.prod P₂)/d(Q₁.prod Q₂)(x₁, x₂) = (dP₁/dQ₁)(x₁) · (dP₂/dQ₂)(x₂) a.e. w.r.t. Q₁.prod Q₂.
Proposition 5.6(4) (Tensorization / Additivity): For probability measures, KL(P1 × P2, Q1 × Q2) = KL(P1, Q1) + KL(P2, Q2).
The proof proceeds in three steps following the book:
- Factor the product RN derivative (rnDeriv_prod_eq)
- Apply log to convert product → sum: log(f·g) = log(f) + log(g)
- Apply Fubini to separate the integral into two independent integrals, using ∫dPᵢ = 1 for probability measures.
The sorry blocks correspond to:
- rnDeriv_prod_eq: product RN derivative factorization (not in Mathlib)
- Fubini decomposition + log-product identity in the KL integral
Proposition 5.6 (Properties of KL divergence). (i) KL(P, Q) ≥ 0 (Gibbs' inequality). (ii) If P = P₁ ⊗ P₂ and Q = Q₁ ⊗ Q₂ are product probability measures, then KL(P,Q) = KL(P₁,Q₁) + KL(P₂,Q₂).