1. Total Variation and Pinsker's Inequality #
Pinsker's inequality relates the total variation distance to the KL divergence: TV(P,Q) ≤ √(KL(P‖Q)/2).
The direct proof (Rigollet, Lemma 5.8) proceeds via: Step 1: KL ≥ 2 − 2·BC (log inequality, where BC = Bhattacharyya coefficient) Step 2: BC² ≤ 1 − TV² (Cauchy-Schwarz) Step 3: 2 − 2√(1−TV²) ≥ TV² (pure real analysis, proved in Lemma_5_3_5_4_5_8.lean)
Step 1 (kl_ge_two_sub_two_bc) is proved from Mathlib's klFun_ge_sq_sqrt_sub_one
and integral_toReal_rnDeriv. Step 2 (bc_le_sqrt_one_sub_tv_sq) is proved from
the squared form bc_sq_le_one_sub_tv_sq (uses cauchy_schwarz_integral_sq (proved)
and TV integral representation). Both are combined
into kl_ge_two_sub_two_sqrt_one_sub_tv_sq by linarith.
Step 3 is proved directly in Lemma_5_3_5_4_5_8.lean, eliminating the
need for any chi-squared divergence infrastructure.
References:
- Rigollet, "High Dimensional Statistics", Lemma 5.8
- Cover & Thomas, "Elements of Information Theory", Lemma 11.6.1
Total variation distance between two finite measures. TV(P₀, P₁) = sup_{A measurable} |P₀(A) − P₁(A)|
Instances For
Real-valued KL divergence: (klDiv P Q).toReal.
Instances For
Bhattacharyya coefficient BC(P,Q) = ∫ √(dP/dQ) dQ.
When P ≪ Q, this equals ∫ √(pq) dν for any common dominating measure ν (where p = dP/dν, q = dQ/dν). It satisfies 0 ≤ BC ≤ 1 for probability measures.
Instances For
Step 1 (log inequality): KL(P‖Q) ≥ 2 − 2·BC(P,Q).
Proof sketch (Rigollet, Lemma 5.8, lines 3409–3420): KL(P‖Q) = ∫ log(dP/dQ) dP = −2 ∫ log(√(dQ/dP)) dP ≥ −2 ∫ (√(dQ/dP) − 1) dP (using log(x) ≤ x − 1) = 2 − 2 ∫ √(dQ/dP) dP = 2 − 2·BC(P,Q)
The formal proof below avoids the measure-change identity by using
Mathlib's toReal_klDiv_eq_integral_klFun to express KL as ∫ klFun(dP/dQ) dQ,
then applying the pointwise bound klFun(t) ≥ (√t − 1)² (proved in
klFun_ge_sq_sqrt_sub_one) under the integral sign, and expanding
∫(√f − 1)² dQ = ∫f dQ + 1 − 2·∫√f dQ = 2 − 2·BC.
The Bhattacharyya coefficient is nonneg (integral of nonneg functions).
Cauchy-Schwarz for Bochner integrals: For nonneg measurable f, g, (∫ f·g dμ)² ≤ (∫ f² dμ)(∫ g² dμ).
Proved via the classical discriminant argument: the quadratic
∫ (f + tg)² dμ = A + 2tB + t²C ≥ 0 for all t implies (2B)² - 4AC ≤ 0.
The textbook invokes this as "by Cauchy-Schwarz" without proof. Reference: Standard functional analysis (e.g., Rudin, Real and Complex Analysis).
TV integral representation (product form): For probability measures P ≪ Q with f = dP/dQ: (∫ min(f,1) dQ) · (∫ max(f,1) dQ) = 1 - TV(P,Q)².
This follows from TV = ½∫|f-1|dQ (not in Mathlib), which gives ∫ min(f,1) dQ = 1 - TV and ∫ max(f,1) dQ = 1 + TV. The textbook uses this without proof in the Cauchy-Schwarz step. Reference: Standard measure theory; Rigollet, Lemma 5.8 Step 2.
BC² + TV² ≤ 1: Core measure-theoretic content of Step 2.
Proof (Rigollet, Lemma 5.8, lines 3413–3423): Let ν = P + Q, p = dP/dν, q = dQ/dν. Then: BC² = (∫ √(pq) dν)² = (∫ √(max(p,q)·min(p,q)) dν)² ≤ (∫ max(p,q) dν) · (∫ min(p,q) dν) (by Cauchy-Schwarz) = (1 + TV(P,Q)) · (1 − TV(P,Q)) (TV representation) = 1 − TV(P,Q)²
The formal proof below works with f = dP/dQ and decomposes
√f = √min(f,1) · √max(f,1), then applies Cauchy-Schwarz
(cauchy_schwarz_integral_sq, proved via discriminant argument) and the TV representation
(integral_min_max_rnDeriv_eq, proved). The Cauchy-Schwarz sub-result is
invoked but not proved in the textbook.
Reference: Rigollet, Lemma 5.8 (Step 2, "by Cauchy-Schwarz").
Theorem (Cauchy-Schwarz for Bhattacharyya coefficient): BC(P,Q) ≤ √(1 − TV(P,Q)²).
Proved from bc_sq_le_one_sub_tv_sq (the squared form, axiomatized
because it requires L²(ν) Cauchy-Schwarz and TV integral representation
not available in Mathlib) and bhattacharyyaCoeff_nonneg.
Reference: Rigollet, Lemma 5.8 (Step 2).
Theorem (Steps 1+2 of Pinsker, Lemma 5.8): KL(P‖Q) ≥ 2 − 2√(1 − TV(P,Q)²).
Combines the two measure-theoretic steps of the direct Pinsker proof (Rigollet, Lemma 5.8, lines 3409–3429):
From kl_ge_two_sub_two_bc (Step 1): KL ≥ 2 − 2·BC
From bc_le_sqrt_one_sub_tv_sq (Step 2): BC ≤ √(1 − TV²)
Combining: −2·BC ≥ −2·√(1−TV²), so KL ≥ 2 − 2·BC ≥ 2 − 2·√(1−TV²).
Reference: Rigollet, Lemma 5.8 (Steps 1–2).
Theorem (Tight Pinsker bound): KL(P‖Q) ≥ 2·TV(P,Q)².
The standard tight Pinsker inequality. The proof via the Bhattacharyya coefficient (Steps 1–3 above) only yields the weaker KL ≥ TV²; the tight constant 2 requires the binary KL reduction:
- Data processing: For any measurable A, KL(P‖Q) ≥ kl(P(A),Q(A)) where kl(p,q) = p·log(p/q)+(1−p)·log((1−p)/(1−q)) is binary KL.
- Binary Pinsker: kl(p,q) ≥ 2·(p−q)² for all p,q ∈ [0,1]. Proof: ∂²kl/∂p² = 1/(p(1−p)) ≥ 4 (since p(1−p) ≤ 1/4), so kl is 4-strongly convex in p, giving kl(p,q) ≥ 2·(p−q)².
- Taking sup over A: KL ≥ 2·(sup_A|P(A)−Q(A)|)² = 2·TV².
Why sorry'd: Data processing for KL divergence (monotonicity under pushforward) is not yet available in Mathlib.
References: Tsybakov, "Introduction to Nonparametric Estimation", Lemma 2.5; Cover & Thomas, "Elements of Information Theory", Lemma 11.6.1.
2. Product Radon-Nikodym Derivative #
The factorization d(P₁⊗P₂)/d(Q₁⊗Q₂) = (dP₁/dQ₁)⊗(dP₂/dQ₂) is a fundamental measure-theoretic result needed for KL tensorization.
Why not in Mathlib: rnDeriv_compProd in Probability.Kernel.Composition.RadonNikodym
factors ∂(μ ⊗ₘ κ)/∂(ν ⊗ₘ η) as (∂μ/∂ν) · ∂(μ ⊗ₘ κ)/∂(μ ⊗ₘ η), and
compProd_const bridges μ.prod ν = μ ⊗ₘ Kernel.const α ν. However, the
second factor requires ∂(μ ⊗ₘ κ)/∂(μ ⊗ₘ η) = ∂κ/∂η, which is listed as
a TODO in that file and is not yet available. A uniqueness-based proof via
ae_eq_of_forall_setLIntegral_eq_of_sigmaFinite + Fubini would also work
mathematically but requires significant Fubini infrastructure on restricted
product integrals that is not easily composable in current Mathlib.
Reference: Billingsley, "Probability and Measure", Theorem 32.2.
KL Tensorization: KL(P₁⊗P₂ ‖ Q₁⊗Q₂) = KL(P₁‖Q₁) + KL(P₂‖Q₂).
Reference: Cover & Thomas, Theorem 2.7.3.
3. Fano's Inequality Components #
Fano's inequality is the main tool for minimax lower bounds via multiple hypothesis testing. The proof requires Shannon entropy, conditional entropy, mutual information, and the data processing inequality — none of which are in Mathlib.
Reference: Cover & Thomas, "Elements of Information Theory", Theorem 2.10.1; Rigollet, Theorem 5.10.
The mixture (average) measure P̄ = (1/M) Σⱼ Pⱼ.
Instances For
Fano pointwise entropy bound (discrete Fano inequality).
For a probability distribution p on Fin M with M ≥ 2, the Shannon entropy
∑ⱼ negMulLog(pⱼ) (= -∑ⱼ pⱼ · log pⱼ) is bounded by
log 2 + (1 - p i_max) · log(M - 1),
where i_max is any index (the bound is tightest when i_max achieves the maximum of p).
The proof uses Jensen's inequality for the concave function negMulLog applied
to the non-maximum terms, combined with the binary entropy bound binEntropy ≤ log 2.
Per-hypothesis posterior entropy ≤ error bound. For each hypothesis j, the Shannon entropy of the posterior distribution (P_j{ψ=k})_k is bounded by log 2 + P_j{ψ≠j} · log(M-1).
This combines three facts:
- The discrete Fano inequality
fano_pointwise_entropy_bound: H(p) ≤ log 2 + (1 - p_max) · log(M-1) - The partition property: (P_j{ψ=k})_k sums to 1 (measurable preimages partition Ω)
- The complement probability: 1 - P_j{ψ=j} = P_j{ψ≠j}
Steps 2-3 require the posterior (P_j{ψ=k})_k to be a probability distribution (sums to 1 and nonneg), which follows from the measurable preimages forming a partition of Ω.
Data processing inequality for KL divergence: deterministic coarsening (pushforward by a measurable function) can only decrease KL divergence.
For probability measures μ ≪ ν and a measurable function ψ : Ω → Fin M, KL(μ ‖ ν) ≥ ∑ₖ μ(ψ⁻¹{k}) · log(μ(ψ⁻¹{k}) / ν(ψ⁻¹{k}))
This follows from Jensen's inequality applied to the convex function t ↦ t · log(t) on each cell of the partition {ψ⁻¹(k)}.
Reference: Cover & Thomas, Theorem 2.8.1 (Data Processing Inequality).
For x ≥ 0, klFun x ≤ (x - 1)². This follows from log x ≤ x - 1.
If all pairwise KL divergences are finite, then KL to the mixture is also finite. This follows from KL(P_j, P̄) ≤ (1/M) ∑_k KL(P_j, P_k) (convexity of KL in the second argument) and the assumption that each KL(P_j, P_k) is finite.
The proof proceeds by bounding the Radon-Nikodym derivative dP_j/dP̄ ≤ M a.e.
(since P_j ≤ ∑_k P_k = M · P̄), which implies klFun(dP_j/dP̄) ≤ (M-1)² a.e.,
and therefore the lintegral of klFun against P̄ is bounded by (M-1)² < ∞.
Reference: Cover & Thomas, Theorem 2.7.2.
Algebraic identity relating KL partition sums to conditional entropy sums.
Partition sum equals 1 for a probability measure.
Data processing inequality for KL divergence (DPI, conditional entropy form).
For M probability measures P₁,...,P_M with mixture P̄ = (1/M)∑Pⱼ and a measurable function ψ : Ω → Fin M, the conditional entropy H(Z|ψ(X)) of the uniform index Z ∈ {1,...,M} given the test output ψ(X) satisfies: log M - (1/M) ∑ⱼ KL(Pⱼ‖P̄) ≤ H(Z|ψ(X))
where H(Z|ψ(X)) = ∑ₖ P̄{ψ=k} · ∑ⱼ negMulLog(Pⱼ{ψ=k} / (M · P̄{ψ=k})).
This is equivalent to the standard information-theoretic DPI: I(Z;X) ≥ I(Z;ψ(X)), which says processing cannot increase mutual information.
Proved using klDiv_ge_partition_sum (Jensen's inequality for KL on partition cells)
and algebraic_identity_for_dpi (the algebraic identity relating log M and negMulLog).
Reference: Cover & Thomas, Theorem 2.8.1 (DPI).
Weighted average of Fano bounds: if each conditional entropy term
h k ≤ log 2 + e k * log(M-1) and the weights w k ≥ 0 sum to 1,
then the weighted average ∑ w k * h k ≤ log 2 + (∑ w k * e k) * log(M-1).
Conditional Fano bound on H(Z|ψ(X)).
The conditional entropy H(Z|ψ(X)) = ∑ₖ P̄{ψ=k} · ∑ⱼ negMulLog(Pⱼ{ψ=k}/(M·P̄{ψ=k})) is bounded by log 2 + pₑ · log(M-1), where pₑ = (1/M)∑ⱼ Pⱼ{ψ≠j}.
The proof applies fano_pointwise_entropy_bound to the posterior distribution
πⱼₖ = Pⱼ{ψ=k}/(M·P̄{ψ=k}) for each outcome k, then takes a weighted average.
Fano conditional entropy bound (Steps 1-3 of Theorem 5.10 proof).
For M probability measures P₁,...,P_M with mixture P̄ = (1/M)∑Pⱼ and a measurable function ψ : Ω → Fin M: log M - (1/M)∑ⱼ KL(Pⱼ ‖ P̄) ≤ log 2 + p_e · log(M-1) where p_e = (1/M)∑ⱼ Pⱼ(ψ(ω) ≠ j) is the average error probability.
Proved by combining data_processing_kl_partition (DPI) with
cond_entropy_fano_bound (conditional Fano bound on H(Z|ψ(X))).
Fano average posterior bound (per-component form).
Algebraically equivalent to fano_conditional_entropy_bound; distributing
(1/M) over the sum gives (1/M) ∑ⱼ (log 2 + pⱼ log(M−1)) = log 2 + ((1/M) ∑ pⱼ) log(M−1).
Proved from fano_conditional_entropy_bound by this algebraic identity.
Fano entropy core bound: For any test ψ, the average error pe satisfies pe · log(M-1) + log 2 ≥ log M − (1/M) Σⱼ KL(Pⱼ ‖ P̄)
This is a direct consequence of fano_conditional_entropy_bound.
Binary Fano case (M = 2): When M = 2, log(M-1) = log 1 = 0 makes the Fano bound
degenerate. In standard mathematics this yields a vacuous bound (RHS = −∞), but Lean's
convention x / 0 = 0 produces RHS = 1, which is a non-trivial (and false) claim.
The correct bound for the binary case is simply that the average error is non-negative, which is trivially true. The textbook treats M = 2 as a degenerate case where the Neyman-Pearson approach applies instead of Fano.
Reference: Rigollet, Theorem 5.10 (M ≥ 2 with implicit degeneracy at M = 2).
Fano average error bound: The average (over j) testing error satisfies the Fano bound. This captures Steps 1–3 of the Fano proof (data processing inequality, chain rule for mutual information, binary entropy bound h(p) ≤ log 2).
Proved from fano_entropy_core_bound (the entropy-based core, axiomatized because it
requires Shannon entropy infrastructure not in Mathlib) by algebraic rearrangement.
Requires M ≥ 3 because the denominator log(M-1) must be positive: for M = 2,
log(M-1) = log 1 = 0, and in Lean x / 0 = 0, making the bound pe ≥ 1
which is false in general. All downstream applications use M ≥ 5.
Reference: Rigollet, proof of Theorem 5.10, Steps 1-3; Cover & Thomas, Theorem 2.10.1.
Fano's Lemma: The information-theoretic core of Fano's inequality.
For Z uniform on {1,...,M} and X|Z=j ~ Pⱼ, any test ψ satisfies: ∃ j, Pⱼ[ψ≠j] ≥ 1 - ((1/M)·Σⱼ KL(Pⱼ‖P̄) + log 2) / log(M-1)
Proved from fano_avg_error_bound (the average bound, proved from the axiomatized helper
fano_entropy_core_bound) by a pigeonhole argument: if the average of M values ≥ bound,
then some value ≥ bound. Requires M ≥ 3 (see fano_avg_error_bound).
Reference: Rigollet, proof of Theorem 5.10, Steps 1-3.
Log-sum inequality: for weights w summing to 1 and positive reals x, log(Σ wₖ · xₖ) ≥ Σ wₖ · log(xₖ). This is Jensen's inequality for the concave function log.
Finite Jensen inequality for -log (log-sum inequality). For positive reals a₁,...,aₘ: -log((1/M) · ∑ k, aₖ) ≤ (1/M) · ∑ k, -log(aₖ) Equivalently: log((1/M) · ∑ k, aₖ) ≥ (1/M) · ∑ k, log(aₖ)
Radon-Nikodym derivative distributes over a finite sum of probability measures.
Log-sum inequality for KL divergence integrals.
For mutually absolutely continuous probability measures P_j and P_k, ∫ llr(P_j, P̄) dP_j ≤ (1/M) ∑_k ∫ llr(P_j, P_k) dP_j, where P̄ = (1/M)∑_k P_k is the mixture measure.
This follows from:
- The Radon-Nikodym derivative identity: dP̄/dP_j = (1/M) ∑_k dP_k/dP_j a.e. [P_j],
using
Measure.rnDeriv_smul_left_of_ne_topandMeasure.rnDeriv_add. - Concavity of log (Jensen's inequality): log((1/M) ∑_k a_k) ≥ (1/M) ∑_k log(a_k) for a_k ≥ 0.
- Taking negatives and integrating: ∫ -log(dP̄/dP_j) dP_j ≤ ∫ -(1/M) ∑_k log(dP_k/dP_j) dP_j = (1/M) ∑_k ∫ -log(dP_k/dP_j) dP_j = (1/M) ∑_k ∫ log(dP_j/dP_k) dP_j.
Reference: Cover & Thomas, Theorem 2.7.1 (Log-Sum Inequality).
KL convexity in second argument (log-sum inequality).
KL(Pⱼ ‖ (1/M)∑ₖ Pₖ) ≤ (1/M) ∑ₖ KL(Pⱼ ‖ Pₖ)
This is a consequence of the joint convexity of KL divergence, equivalently the log-sum inequality (Cover & Thomas, Theorem 2.7.2). The book (Rigollet, Theorem 5.10 Step 4) uses this as a known fact without detailed proof.
The proof reduces to integral_llr_mixture_le_avg (the integrated pointwise
log-sum inequality) by rewriting both sides as integrals of llr against P_j.
Verified: only standard axioms (propext, Classical.choice, Quot.sound).
KL Convexity Bound: Average KL to mixture ≤ average pairwise KL.
(1/M) Σⱼ KL(Pⱼ ‖ P̄) ≤ (1/M²) ΣⱼΣₖ KL(Pⱼ ‖ Pₖ)
Proved from kl_convex_in_second_arg (the pointwise convexity bound,
proved via the log-sum inequality / Jensen's inequality for -log)
by summing over j and multiplying by 1/M.
Reference: Cover & Thomas, Theorem 2.7.2; Rigollet, Theorem 5.10 Step 4.
Pointwise KL bound: KL(Pⱼ ‖ P̄) ≤ log M for each j.
For each component measure Pⱼ and the mixture P̄ = (1/M)ΣₖPₖ, the KL divergence satisfies KL(Pⱼ ‖ P̄) ≤ log M. The proof uses the Radon-Nikodym derivative bound: since Pⱼ ≤ ΣₖPₖ, the RN derivative d(Pⱼ)/d(ΣₖPₖ) ≤ 1, hence d(Pⱼ)/d(P̄) = M · d(Pⱼ)/d(ΣₖPₖ) ≤ M, and log(·) ≤ log M.
Reference: Rigollet, Theorem 5.10 Step 4.
Average KL to mixture ≤ log M.
(1/M) Σⱼ KL(Pⱼ ‖ P̄) ≤ log M
The average KL divergence from components to their mixture is bounded by log
of the number of components. This is a direct consequence of the pointwise bound
klDiv_component_le_log_M. Used in Step 4 of the proof of Theorem 5.10.
Reference: Rigollet, Theorem 5.10 Step 4.
4. Reduction to Testing #
The key bridge between estimation and testing: if hypotheses are well-separated, any estimator induces a test, and estimation error implies testing error.
Reference: Rigollet, Proposition 5.2 / Section 5.2.
sqDist (θhat ·) c is measurable when θhat is measurable.
Measurability of the nearest-neighbor classifier.
For a finite set of centers θ₁,...,θ_M, the nearest-neighbor classifier ψ(Y) = argmin_j |Y - θ_j|² is measurable, since Voronoi cells are intersections of halfspaces and thus Borel measurable.
The classifier uses nnArgmin (deterministic tie-breaking via Nat.find)
rather than Exists.choose, making measurability provable.
Reference: Standard measure theory (Voronoi cells are Borel sets).
Reduction to Testing + Fano application.
Given M hypotheses θ₁,...,θ_M with:
- separation: sqDist(θⱼ,θₖ) ≥ 4ϕ for j ≠ k
- KL control: (1/M²) ΣΣ KL(Pⱼ,Pₖ) ≤ κ (average pairwise KL) Then for the probability-based minimax risk: inf_θ̂ max_j Pⱼ(|θ̂-θⱼ|² ≥ ϕ) ≥ 1 - (κ + log 2)/log(M-1)
This combines:
- Reduction to testing: if |θⱼ-θₖ|² ≥ 4ϕ, nearest-neighbor classification from any estimator θ̂ satisfies: θ̂ misclassifies ⟹ |θ̂-θⱼ|² ≥ ϕ
- Fano's inequality applied to the induced testing problem
Why not in Mathlib: The reduction to testing requires an argmin construction (nearest neighbor) on a finite metric space, plus the triangle inequality argument. Fano's inequality is also missing.
Reference: Rigollet, Theorem 5.10 + Proposition 5.2.
5. Gaussian KL Divergence #
The KL divergence between two Gaussians N(θ₁, σ²I) and N(θ₂, σ²I) with the same covariance is KL = |θ₁-θ₂|²/(2σ²).
For the Gaussian sequence model with n observations, this becomes KL(Pⱼ‖Pₖ) = n·|θⱼ-θₖ|²/(2σ²).
Reference: Rigollet, Example 5.7.
Gaussian density ratio (axiom of the GSM): For the Gaussian sequence model N(θ, (σ²/n)·I_d), the density ratio between P_θ₁ and P_θ₂ is: P_θ₁ = (P_θ₂).withDensity (fun Y => exp((n/(2σ²)) · Σᵢ [(Yᵢ-θ₂ᵢ)²-(Yᵢ-θ₁ᵢ)²]))
This is the fundamental Gaussian density ratio property from which absolute continuity and the Radon–Nikodym derivative formula follow.
Why sorry'd: The hypothesis hGSM (second-moment matching) does not uniquely
determine P_θ as a Gaussian distribution — many probability measures share the same
second moments. Proving this requires the multivariate Gaussian density formula,
which is not in Mathlib. The book does not prove this result (it is Problem 5.1(a),
an exercise). Per the formalization policy, statements whose proofs are not given
in the book may use sorry.
Reference: Standard multivariate Gaussian density; Rigollet, Problem 5.1(a).
Helper: Gaussian measures with the same covariance are mutually absolutely
continuous. Proved from gaussian_family_withDensity: since P_θ₁ is a withDensity
of P_θ₂, we have P_θ₁ ≪ P_θ₂ by withDensity_absolutelyContinuous.
Helper: The Gaussian coordinate mean property: E_θ[Y_i] = θ_i. In the GSM Y_i = θ_i + ε_i with ε_i ~ N(0, σ²/n), so E_θ[Y_i] = θ_i + E[ε_i] = θ_i. This is the fundamental mean property of the Gaussian distribution. The mean condition is taken as an explicit hypothesis since the Gaussian mean formula is not available in Mathlib for the multivariate case.
Helper: The centered mean condition for Gaussian measures:
E_θ[Y_i - θ_i] = 0. Proved from gaussian_family_coord_mean (E_θ[Y_i] = θ_i)
and gaussian_family_coord_integrable (integrability) by linearity of integrals.
Helper: The Radon-Nikodym derivative of Gaussian measures
N(θ₁, (σ²/n)·I_d) with respect to N(θ₂, (σ²/n)·I_d) is
dP_θ₁/dP_θ₂(y) = exp((n/(2σ²)) · Σᵢ [(yᵢ - θ₂ᵢ)² - (yᵢ - θ₁ᵢ)²])
almost everywhere. Proved from gaussian_family_withDensity using
Measure.rnDeriv_withDensity and absolute continuity.
Reference: Standard multivariate Gaussian density; Rigollet, Problem 5.1(a).
The pointwise log-likelihood ratio formula for Gaussians: llr(P_θ₁, P_θ₂)(Y) = (n/(2σ²)) · Σᵢ [(Y_i - θ₂_i)² - (Y_i - θ₁_i)²] a.e. This is the key Gaussian density computation from Example 5.7.
Proof: Follows from gaussian_family_rnDeriv_ae (the Gaussian density ratio
formula, proved from gaussian_family_withDensity).
Given that the Radon-Nikodym derivative is exp(f(Y)) a.e., we compute
llr = log(rnDeriv.toReal) = log(exp(f(Y))) = f(Y)
using ENNReal.toReal_ofReal and Real.log_exp.
Reference: Rigollet, Example 5.7 / Problem 5.1(a).
Helper: Coordinate centering functions are integrable under Gaussian measures. Proved: finite second moment E[(Y_i-θ_i)²]=σ²/n > 0 implies the square is integrable (since ∫ f = c > 0 forces integrability), then MemLp 2 → Integrable on a probability measure.
Helper: Per-coordinate quadratic difference is integrable. Proved: expand (Y_i-θ₂_i)² = ((Y_i-θ₁_i)+(θ₁_i-θ₂_i))² to show it's integrable under P_θ₁ (using coord_integrable and the second moment), then take the difference.
Helper: The log-likelihood ratio of two Gaussians is integrable.
Proof: By gaussian_family_llr_ae, the llr is a.e. equal to the quadratic function
(n/(2σ²)) * Σᵢ [(Y_i - θ₂_i)² - (Y_i - θ₁_i)²]. Each coordinate difference
(Y_i - θ₂_i)² - (Y_i - θ₁_i)² is integrable under P_θ₁ (by gaussian_family_quad_integrable),
so the sum (times a constant) is integrable. The llr, being a.e. equal to an integrable function,
is itself integrable (by Integrable.congr).
Gaussian LLR integral: For P_θ = N(θ, (σ²/n)·I_d), the log-likelihood ratio satisfies: (1) P_θ₁ ≪ P_θ₂, (2) llr is integrable, and (3) ∫ llr(P_θ₁, P_θ₂) dP_θ₁ = n·|θ₁-θ₂|²/(2σ²).
The proof uses sorry'd helper lemmas for Gaussian-specific properties (absolute continuity, pointwise llr formula, mean condition, integrability) which require the multivariate Gaussian density formula not in Mathlib. Given these, the proof carries out the integration computation:
- Substitute the a.e. pointwise llr formula
- Pull out the constant n/(2σ²) via
integral_const_mul - Distribute integral over the coordinate sum via
integral_finset_sum - For each coordinate i, expand (Y_i-θ₂_i)²-(Y_i-θ₁_i)² algebraically and use the mean condition E[Y_i-θ₁_i]=0 to compute the integral
- Connect the result to
sqDistviaring
Reference: Rigollet, Example 5.7 / Problem 5.1(a).
Gaussian KL: For the Gaussian sequence model, KL(P_{θ₁} ‖ P_{θ₂}) = n·|θ₁-θ₂|²/(2σ²).
Proved from gaussian_family_llr_integral (axiomatized because it requires
multivariate Gaussian density not in Mathlib) and the definition of KL divergence
(klDiv_of_ac_of_integrable): for probability measures P ≪ Q with integrable llr,
klDiv P Q = ofReal(∫ llr P Q ∂P)
so (klDiv P Q).toReal = ∫ llr P Q ∂P (since KL ≥ 0).
Reference: Rigollet, Example 5.7.
6. Varshamov-Gilbert Lemma #
The probabilistic method construction for Hamming-separated binary vectors.
Reference: Rigollet, Lemma 5.12; Varshamov (1957), Gilbert (1952).
Hamming distance is symmetric.
Hoeffding pair-level bound infrastructure #
We prove the pair-level Hoeffding bound by:
- Setting up the uniform product measure on
Fin d → Bool - Applying
hoeffding_sum_lower_tailto centered coordinate indicators - Converting the probability bound back to a counting bound
- Reducing the pair-space problem to the single-string weight problem via XOR
The uniform probability measure on Bool.
Instances For
The centered indicator: maps true ↦ 1/2, false ↦ -1/2.
Instances For
The mean of boolCenter under the uniform measure on Bool is zero.
Under the pi measure, the integral of a coordinate function reduces to the single-variable integral.
Coordinate centered indicators are independent under the pi measure.
Hoeffding's inequality (Theorem 1.9, Rigollet) applied to the pair space. For two independent uniform d-bit vectors, the probability that their Hamming distance is less than (1/2 - γ)d is at most exp(-2γ²d).
Reference: Rigollet, Theorem 1.9 (Hoeffding's inequality).
Hoeffding's inequality (Theorem 1.9, Rigollet) in counting form for a single pair. For independent uniform random d-bit vectors, the probability that their Hamming distance is less than (1/2 - γ)d is at most exp(-2γ²d). In counting terms, for a fixed pair (j,k), the number of M-tuples where pair (j,k) is "bad" is at most exp(-2γ²d) × total.
The proof uses a fiber decomposition: an M-tuple ω is decomposed into the pair (ω j, ω k) and the remaining M-2 coordinates. The bad-pair count factors as (bad pairs on pair space) × (size of fiber), and similarly for the total count. The ratio equals the pair-level ratio, which is bounded by Hoeffding.
Reference: Rigollet, Theorem 1.9 (Hoeffding's inequality).
Hoeffding + union bound for M-tuples of binary vectors (Theorem 1.9, Rigollet).
By Hoeffding's inequality (Theorem 1.9, Rigollet), each pair of independent uniform random binary vectors has P(Hamming distance < (1/2-γ)d) ≤ exp(-2γ²d). The union bound over M(M-1)/2 unordered pairs gives: P(∃ bad pair) ≤ M(M-1)/2 · exp(-2γ²d) < 1 when M(M-1) < 2·exp(2γ²d). In the finite counting model on (Fin M → Fin d → Bool), this means the number of M-tuples with at least one bad pair is strictly less than the total number of M-tuples.
The proof uses the union bound over strictly ordered pairs (j < k) combined
with hoeffding_pair_count_bound (Theorem 1.9, sorry'd as it requires
Hoeffding's inequality which is not yet in Mathlib).
Reference: Rigollet, Theorem 1.9 (Hoeffding's inequality) + union bound.
The probabilistic method via Hoeffding's inequality (Theorem 1.9, Rigollet): for M i.i.d. uniform binary vectors of length d, if M(M-1) < 2·exp(2γ²d), then the union bound + Hoeffding show P(all pairs separated) > 0, so separated vectors exist.
The proof proceeds as follows (Rigollet, Lemma 5.12):
- Sample ω_{j,i} i.i.d. Bernoulli(1/2) for 1 ≤ i ≤ d, 1 ≤ j ≤ M.
- For any pair j ≠ k, d - ρ(ω_j, ω_k) ~ Bin(d, 1/2).
- By Hoeffding's inequality (Theorem 1.9): P(ρ(ω_j,ω_k) < (1/2-γ)d) = P(X - d/2 > γd) ≤ exp(-2γ²d).
- Union bound: P(∃ j≠k bad) ≤ M(M-1)/2 · exp(-2γ²d) < 1 when M(M-1) < 2exp(2γ²d).
- Hence P(all pairs good) > 0, and the desired vectors exist.
Proved from hoeffding_union_bound_count (which axiomatizes steps 1-4 via
Hoeffding's inequality, Theorem 1.9) by extracting the existential (step 5).
Reference: Rigollet, Lemma 5.12 proof; applies Theorem 1.9 (Hoeffding).
Varshamov-Gilbert: For γ ∈ (0,1/2), ∃ M ≥ exp(γ²d/2) binary vectors with pairwise Hamming distance ≥ (1/2-γ)d.
Proof: Choose M = ⌈exp(γ²d/2)⌉₊ and verify M(M-1) < 2·exp(2γ²d),
then apply probabilistic_method_separated_vectors.
Reference: Rigollet, Lemma 5.12.
7. Sparse Varshamov-Gilbert Lemma #
A variant producing weight-k binary vectors with Hamming separation.
Reference: Rigollet, Lemma 5.14.
Sparse Varshamov-Gilbert: For 1 ≤ k ≤ d/8, ∃ M binary vectors with weight k, pairwise Hamming distance ≥ k/2, and log M ≥ (k/8)·log(1+d/(2k)).
Proof: Greedy volume-counting on weight-k binary vectors. The excluded zone per vector has size bounded by Σ C(k,r)·C(d-k,r), and the total count C(d,k) forces enough well-separated vectors.
Why not in Mathlib: Binomial coefficient bounds C(d,k) ≥ (d/k)^k, the volume bound on restricted Hamming balls, and well-founded recursion for the greedy construction are not available.
Reference: Rigollet, Lemma 5.14.
8. Markov Bridge and Markov Step #
Converts probability-based minimax lower bounds to expectation-based lower bounds. If P(X ≥ ϕ) ≥ p, then E[X] ≥ p·ϕ (for nonneg X).
Reference: Standard (Markov's inequality converse).
Theorem (Markov step): For a nonneg integrable function f, if P(f ≥ ϕ) ≥ p, then E[f] ≥ p·ϕ.
Proof: By Markov's inequality (mul_meas_ge_le_integral_of_nonneg),
ϕ · μ.real{f ≥ ϕ} ≤ ∫ f dμ.
Since p ≤ μ.real{f ≥ ϕ} (from hprob) and ϕ > 0:
p·ϕ ≤ μ.real{f ≥ ϕ}·ϕ = ϕ·μ.real{f ≥ ϕ} ≤ ∫ f dμ.
Theorem (Markov bridge, strong form): If for every estimator, there exists θ such that P_θ(|θ̂-θ|² ≥ ϕ) ≥ p, then inf_{θ̂} sup_θ E_θ[|θ̂-θ|²] ≥ p·ϕ.
The proof structure is:
le_ciInf: reduce to showing inequality for arbitrary θ̂- Extract witness θ₀ from
hprob - Markov step: p·ϕ ≤ ∫ sqDist via
markov_step(combines ϕ·P(f≥ϕ) ≤ ∫f with p ≤ P(f≥ϕ)) - Sup step: lift to
⨆ θ ∈ Θviale_ciSup_of_le
Theorem (Markov bridge): If for every estimator, there exists θ such that P_θ(|θ̂-θ|² ≥ ϕ) ≥ p, then inf_{θ̂} sup_θ E_θ[|θ̂-θ|²] ≥ p·ϕ.
Proved from markov_bridge_strong.
Hamming ball volume × code size bound (Chernoff/entropy bound) #
For d ≥ 2 and any center ω₀ : Fin d → Bool, the Hamming ball of radius < d/4 has cardinality satisfying: ball_volume × max(⌈exp(d/8)⌉₊, 3) < 2^d
This encapsulates the Chernoff/entropy bound from Lemma 5.12: the binary entropy H(1/4) ≈ 0.81 satisfies H(1/4) + 1/(8·ln 2) < 1, so ball_volume ≤ 2^{d·H(1/4)} and M ≤ 2^{d/(8·ln 2)}, giving ball_volume × M ≤ 2^{d·(H(1/4) + 1/(8·ln 2))} < 2^d.
Reference: Rigollet, Lemma 5.12 proof (Chernoff/entropy bound on ball volume).
XOR bijection for Hamming balls: the XOR map ω ↦ ω ⊕ ω₀ is a bijection
from the Hamming ball around ω₀ to the Hamming ball around the zero vector.
This shows ball cardinality is center-independent.
For d ≥ 2, the numerical bound max(⌈exp(d/8)⌉₊, 3) < 2^d.
Since 1/8 < ln 2, we have exp(d/8) < 2^d. The proof chains:
⌈exp(d/8)⌉₊ < exp(d/8) + 1 ≤ 2 · exp(d/8) = exp(ln 2 + d/8) ≤ exp(d · ln 2) = 2^d,
where the last step uses ln 2 + d/8 ≤ d · ln 2 (equivalently ln 2 ≤ d·(ln 2 − 1/8),
which holds for d ≥ 2 since ln 2 > 1/4).
Chernoff/tilting bound on Hamming ball cardinality.
For the binary Hamming ball of radius < d/4 around the zero center, the cardinality
satisfies ball_card * max(⌈exp(d/8)⌉₊, 3) < 2^d.
The proof uses the tilting method (Chernoff bound): the ball volume is bounded via
the binomial theorem, giving ball_card * 3^(d-d/4) ≤ 4^d.
Combined with the real inequality 2^d * (exp(d/8) + 1) < 3^(d-d/4) for d ≥ 19
(proved by induction with step 4), this yields the result for large d.
Small d (2 ≤ d ≤ 18) is verified computationally via native_decide.
The numerical bound max(⌈exp(d/8)⌉₊, 3) < 2^d is proved in
max_ceil_exp_div8_lt_two_pow. The ball volume bound requires establishing
the entropy bound on partial binomial coefficient sums, which in turn needs
Stirling-type estimates or the Chernoff method for binomial tails. These tools
are not currently available in Mathlib.
Reference: Cover & Thomas, Elements of Information Theory, Theorem 11.1.2; Rigollet, Lemma 5.12 proof.
Hamming ball exclusion bound (Hoeffding + ball volume counting).
For any set S of binary vectors of length d with |S| + 1 ≤ max(⌈exp(d/8)⌉₊ + 1, 4), the union of Hamming balls of radius < (1/2 - 1/4) * d around elements of S does not cover the full space {0,1}^d.
The proof reduces to the ball volume bound hamming_ball_volume_times_M_lt via the
union bound: |⋃ ball(ω₀)| ≤ ∑ |ball(ω₀)| ≤ |S| × max_ball ≤ M × max_ball < 2^d.
Reference: Rigollet, Lemma 5.12 proof (Hoeffding bound on ball volume).
Verified: No sorry; axioms = {propext, Classical.choice, Quot.sound}.
Greedy step: given a set S of binary vectors with |S| + 1 ≤ M, there exists a
vector far from all elements of S (Hamming distance ≥ (1/2 - 1/4) * d).
Proved from hamming_greedy_excluded_bound.
Greedy construction of n pairwise-separated binary vectors.
For n ≤ max(⌈exp(d/8)⌉₊ + 1, 4), there exist n binary vectors of length d
with pairwise Hamming distance ≥ (1/2 - 1/4) * d.
Proved by induction using greedy_step.
Greedy construction of well-separated binary vectors (Hoeffding + ball volume).
For M = max(⌈exp(d/8)⌉₊ + 1, 4), there exist M binary vectors of length d with pairwise Hamming distance ≥ (1/2 - 1/4) * d.
Mathematical argument (greedy/iterative Varshamov-Gilbert):
- At step k of the greedy construction, each of the k-1 existing vectors excludes at most exp(-d/8) · 2^d vectors from {0,1}^d (Hoeffding ball volume bound).
- The construction succeeds as long as (k-1) · exp(-d/8) · 2^d < 2^d, i.e., k < exp(d/8) + 1.
- This yields M ≥ exp(d/8) + 1 well-separated vectors.
- The max with 4 ensures the downstream 4 ≤ M requirement for Corollary 5.13.
Proved from hamming_greedy_excluded_bound (via hamming_ball_volume_times_M_lt axiom,
Chernoff/entropy bound) via greedy_step and exists_separated_vectors_greedy.
Reference: Rigollet, Lemma 5.12 (greedy variant) + Corollary 5.13 proof.
Helper (sorry'd): Strengthened probabilistic method for VG construction.
The standard probabilistic method (probabilistic_method_separated_vectors)
constructs M binary vectors with M ≈ exp(γ²d/2). For Corollary 5.13 we need
a construction with M large enough to satisfy both 4 ≤ M and
log(M-1) ≥ d/8, while maintaining pairwise Hamming distance ≥ d/4.
This combines the Hoeffding-based probabilistic method with a more careful choice of M. The mathematical justification is the same probabilistic method (Hoeffding's inequality + counting), but with a tighter analysis.
Why sorry: Same foundation as hoeffding_pair_count_bound (Theorem 1.9,
Hoeffding's inequality, not in Mathlib).
Reference: Rigollet, Lemma 5.12 + Corollary 5.13 proof.
Varshamov-Gilbert for Corollary 5.13: Specialized VG with γ = 1/4 providing
4 ≤ M, log(M-1) ≥ d/8, and pairwise Hamming distance ≥ d/4.
Proved from vg_strong_construction (sorry'd, same Hoeffding foundation).
Reference: Rigollet, Lemma 5.12 + proof of Corollary 5.13.