Documentation

Atlas.HighDimensionalStatistics.code.Chapter5.InfoTheory

1. Total Variation and Pinsker's Inequality #

Pinsker's inequality relates the total variation distance to the KL divergence: TV(P,Q) ≤ √(KL(P‖Q)/2).

The direct proof (Rigollet, Lemma 5.8) proceeds via: Step 1: KL ≥ 2 − 2·BC (log inequality, where BC = Bhattacharyya coefficient) Step 2: BC² ≤ 1 − TV² (Cauchy-Schwarz) Step 3: 2 − 2√(1−TV²) ≥ TV² (pure real analysis, proved in Lemma_5_3_5_4_5_8.lean)

Step 1 (kl_ge_two_sub_two_bc) is proved from Mathlib's klFun_ge_sq_sqrt_sub_one and integral_toReal_rnDeriv. Step 2 (bc_le_sqrt_one_sub_tv_sq) is proved from the squared form bc_sq_le_one_sub_tv_sq (uses cauchy_schwarz_integral_sq (proved) and TV integral representation). Both are combined into kl_ge_two_sub_two_sqrt_one_sub_tv_sq by linarith. Step 3 is proved directly in Lemma_5_3_5_4_5_8.lean, eliminating the need for any chi-squared divergence infrastructure.

References:

Rigollet, "High Dimensional Statistics", Lemma 5.8
Cover & Thomas, "Elements of Information Theory", Lemma 11.6.1

noncomputable def InfoTheory.tvDist {Ω : Type u_1} [MeasurableSpace Ω] (P₀ P₁ : MeasureTheory.Measure Ω) :

Total variation distance between two finite measures. TV(P₀, P₁) = sup_{A measurable} |P₀(A) − P₁(A)|

Instances For

noncomputable def InfoTheory.klDiv_real {Ω : Type u_1} [MeasurableSpace Ω] (P Q : MeasureTheory.Measure Ω) :

Real-valued KL divergence: (klDiv P Q).toReal.

Instances For

noncomputable def InfoTheory.bhattacharyyaCoeff {Ω : Type u_1} [MeasurableSpace Ω] (P Q : MeasureTheory.Measure Ω) :

Bhattacharyya coefficient BC(P,Q) = ∫ √(dP/dQ) dQ.

When P ≪ Q, this equals ∫ √(pq) dν for any common dominating measure ν (where p = dP/dν, q = dQ/dν). It satisfies 0 ≤ BC ≤ 1 for probability measures.

Instances For

theorem InfoTheory.klFun_ge_sq_sqrt_sub_one {t : ℝ} (ht : 0 ≤ t) :

InformationTheory.klFun t ≥ (√t - 1) ^ 2

Step 1 (log inequality): KL(P‖Q) ≥ 2 − 2·BC(P,Q).

Proof sketch (Rigollet, Lemma 5.8, lines 3409–3420): KL(P‖Q) = ∫ log(dP/dQ) dP = −2 ∫ log(√(dQ/dP)) dP ≥ −2 ∫ (√(dQ/dP) − 1) dP (using log(x) ≤ x − 1) = 2 − 2 ∫ √(dQ/dP) dP = 2 − 2·BC(P,Q)

The formal proof below avoids the measure-change identity by using Mathlib's toReal_klDiv_eq_integral_klFun to express KL as ∫ klFun(dP/dQ) dQ, then applying the pointwise bound klFun(t) ≥ (√t − 1)² (proved in klFun_ge_sq_sqrt_sub_one) under the integral sign, and expanding ∫(√f − 1)² dQ = ∫f dQ + 1 − 2·∫√f dQ = 2 − 2·BC.

theorem InfoTheory.kl_ge_two_sub_two_bc {Ω : Type u_1} [MeasurableSpace Ω] (P Q : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure P] [MeasureTheory.IsProbabilityMeasure Q] (hac : P.AbsolutelyContinuous Q) (hfin : InformationTheory.klDiv P Q ≠ ⊤) :

klDiv_real P Q ≥ 2 - 2 * bhattacharyyaCoeff P Q

theorem InfoTheory.bhattacharyyaCoeff_nonneg {Ω : Type u_1} [MeasurableSpace Ω] (P Q : MeasureTheory.Measure Ω) :

0 ≤ bhattacharyyaCoeff P Q

The Bhattacharyya coefficient is nonneg (integral of nonneg functions).

theorem InfoTheory.cauchy_schwarz_integral_sq {Ω : Type u_1} [MeasurableSpace Ω] (μ : MeasureTheory.Measure Ω) (f g : Ω → ℝ) (hf_nn : ∀ (ω : Ω), 0 ≤ f ω) (hg_nn : ∀ (ω : Ω), 0 ≤ g ω) (hf_int : MeasureTheory.Integrable (fun (ω : Ω) => f ω ^ 2) μ) (hg_int : MeasureTheory.Integrable (fun (ω : Ω) => g ω ^ 2) μ) (hfg_int : MeasureTheory.Integrable (fun (ω : Ω) => f ω * g ω) μ) :

(∫ (ω : Ω), f ω * g ω ∂μ) ^ 2 ≤ (∫ (ω : Ω), f ω ^ 2 ∂μ) * ∫ (ω : Ω), g ω ^ 2 ∂μ

Cauchy-Schwarz for Bochner integrals: For nonneg measurable f, g, (∫ f·g dμ)² ≤ (∫ f² dμ)(∫ g² dμ).

Proved via the classical discriminant argument: the quadratic ∫ (f + tg)² dμ = A + 2tB + t²C ≥ 0 for all t implies (2B)² - 4AC ≤ 0.

The textbook invokes this as "by Cauchy-Schwarz" without proof. Reference: Standard functional analysis (e.g., Rudin, Real and Complex Analysis).

theorem InfoTheory.integral_min_max_rnDeriv_eq {Ω : Type u_1} [MeasurableSpace Ω] (P Q : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure P] [MeasureTheory.IsProbabilityMeasure Q] (hac : P.AbsolutelyContinuous Q) :

(∫ (ω : Ω), min (P.rnDeriv Q ω).toReal 1 ∂Q) * ∫ (ω : Ω), max (P.rnDeriv Q ω).toReal 1 ∂Q = 1 - tvDist P Q ^ 2

TV integral representation (product form): For probability measures P ≪ Q with f = dP/dQ: (∫ min(f,1) dQ) · (∫ max(f,1) dQ) = 1 - TV(P,Q)².

This follows from TV = ½∫|f-1|dQ (not in Mathlib), which gives ∫ min(f,1) dQ = 1 - TV and ∫ max(f,1) dQ = 1 + TV. The textbook uses this without proof in the Cauchy-Schwarz step. Reference: Standard measure theory; Rigollet, Lemma 5.8 Step 2.

theorem InfoTheory.bc_sq_le_one_sub_tv_sq {Ω : Type u_1} [MeasurableSpace Ω] (P Q : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure P] [MeasureTheory.IsProbabilityMeasure Q] (hac : P.AbsolutelyContinuous Q) :

bhattacharyyaCoeff P Q ^ 2 ≤ 1 - tvDist P Q ^ 2

BC² + TV² ≤ 1: Core measure-theoretic content of Step 2.

Proof (Rigollet, Lemma 5.8, lines 3413–3423): Let ν = P + Q, p = dP/dν, q = dQ/dν. Then: BC² = (∫ √(pq) dν)² = (∫ √(max(p,q)·min(p,q)) dν)² ≤ (∫ max(p,q) dν) · (∫ min(p,q) dν) (by Cauchy-Schwarz) = (1 + TV(P,Q)) · (1 − TV(P,Q)) (TV representation) = 1 − TV(P,Q)²

The formal proof below works with f = dP/dQ and decomposes √f = √min(f,1) · √max(f,1), then applies Cauchy-Schwarz (cauchy_schwarz_integral_sq, proved via discriminant argument) and the TV representation (integral_min_max_rnDeriv_eq, proved). The Cauchy-Schwarz sub-result is invoked but not proved in the textbook.

Reference: Rigollet, Lemma 5.8 (Step 2, "by Cauchy-Schwarz").

theorem InfoTheory.bc_le_sqrt_one_sub_tv_sq {Ω : Type u_1} [MeasurableSpace Ω] (P Q : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure P] [MeasureTheory.IsProbabilityMeasure Q] (hac : P.AbsolutelyContinuous Q) :

bhattacharyyaCoeff P Q ≤ √(1 - tvDist P Q ^ 2)

Theorem (Cauchy-Schwarz for Bhattacharyya coefficient): BC(P,Q) ≤ √(1 − TV(P,Q)²).

Proved from bc_sq_le_one_sub_tv_sq (the squared form, axiomatized because it requires L²(ν) Cauchy-Schwarz and TV integral representation not available in Mathlib) and bhattacharyyaCoeff_nonneg.

Reference: Rigollet, Lemma 5.8 (Step 2).

theorem InfoTheory.kl_ge_two_sub_two_sqrt_one_sub_tv_sq {Ω : Type u_1} [MeasurableSpace Ω] (P Q : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure P] [MeasureTheory.IsProbabilityMeasure Q] (hac : P.AbsolutelyContinuous Q) (hfin : InformationTheory.klDiv P Q ≠ ⊤) :

klDiv_real P Q ≥ 2 - 2 * √(1 - tvDist P Q ^ 2)

Theorem (Steps 1+2 of Pinsker, Lemma 5.8): KL(P‖Q) ≥ 2 − 2√(1 − TV(P,Q)²).

Combines the two measure-theoretic steps of the direct Pinsker proof (Rigollet, Lemma 5.8, lines 3409–3429):

From kl_ge_two_sub_two_bc (Step 1): KL ≥ 2 − 2·BC From bc_le_sqrt_one_sub_tv_sq (Step 2): BC ≤ √(1 − TV²)

Combining: −2·BC ≥ −2·√(1−TV²), so KL ≥ 2 − 2·BC ≥ 2 − 2·√(1−TV²).

Reference: Rigollet, Lemma 5.8 (Steps 1–2).

theorem InfoTheory.kl_ge_two_mul_tv_sq {Ω : Type u_1} [MeasurableSpace Ω] (P Q : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure P] [MeasureTheory.IsProbabilityMeasure Q] (hac : P.AbsolutelyContinuous Q) (hfin : InformationTheory.klDiv P Q ≠ ⊤) :

klDiv_real P Q ≥ 2 * tvDist P Q ^ 2

Theorem (Tight Pinsker bound): KL(P‖Q) ≥ 2·TV(P,Q)².

The standard tight Pinsker inequality. The proof via the Bhattacharyya coefficient (Steps 1–3 above) only yields the weaker KL ≥ TV²; the tight constant 2 requires the binary KL reduction:

Data processing: For any measurable A, KL(P‖Q) ≥ kl(P(A),Q(A)) where kl(p,q) = p·log(p/q)+(1−p)·log((1−p)/(1−q)) is binary KL.
Binary Pinsker: kl(p,q) ≥ 2·(p−q)² for all p,q ∈ [0,1]. Proof: ∂²kl/∂p² = 1/(p(1−p)) ≥ 4 (since p(1−p) ≤ 1/4), so kl is 4-strongly convex in p, giving kl(p,q) ≥ 2·(p−q)².
Taking sup over A: KL ≥ 2·(sup_A|P(A)−Q(A)|)² = 2·TV².

Why sorry'd: Data processing for KL divergence (monotonicity under pushforward) is not yet available in Mathlib.

References: Tsybakov, "Introduction to Nonparametric Estimation", Lemma 2.5; Cover & Thomas, "Elements of Information Theory", Lemma 11.6.1.

2. Product Radon-Nikodym Derivative #

The factorization d(P₁⊗P₂)/d(Q₁⊗Q₂) = (dP₁/dQ₁)⊗(dP₂/dQ₂) is a fundamental measure-theoretic result needed for KL tensorization.

Why not in Mathlib: rnDeriv_compProd in Probability.Kernel.Composition.RadonNikodym factors ∂(μ ⊗ₘ κ)/∂(ν ⊗ₘ η) as (∂μ/∂ν) · ∂(μ ⊗ₘ κ)/∂(μ ⊗ₘ η), and compProd_const bridges μ.prod ν = μ ⊗ₘ Kernel.const α ν. However, the second factor requires ∂(μ ⊗ₘ κ)/∂(μ ⊗ₘ η) = ∂κ/∂η, which is listed as a TODO in that file and is not yet available. A uniqueness-based proof via ae_eq_of_forall_setLIntegral_eq_of_sigmaFinite + Fubini would also work mathematically but requires significant Fubini infrastructure on restricted product integrals that is not easily composable in current Mathlib.

Reference: Billingsley, "Probability and Measure", Theorem 32.2.

theorem InfoTheory.rnDeriv_prod_eq {α : Type u_1} {β : Type u_2} [MeasurableSpace α] [MeasurableSpace β] {P1 Q1 : MeasureTheory.Measure α} {P2 Q2 : MeasureTheory.Measure β} [MeasureTheory.SigmaFinite P1] [MeasureTheory.SigmaFinite Q1] [MeasureTheory.SigmaFinite P2] [MeasureTheory.SigmaFinite Q2] (hac1 : P1.AbsolutelyContinuous Q1) (hac2 : P2.AbsolutelyContinuous Q2) :

(P1.prod P2).rnDeriv (Q1.prod Q2) =ᵐ[Q1.prod Q2] fun (x : α × β) => P1.rnDeriv Q1 x.1 * P2.rnDeriv Q2 x.2

theorem InfoTheory.klDiv_prod_eq {α : Type u_1} {β : Type u_2} [MeasurableSpace α] [MeasurableSpace β] {P1 Q1 : MeasureTheory.Measure α} {P2 Q2 : MeasureTheory.Measure β} [MeasureTheory.IsProbabilityMeasure P1] [MeasureTheory.IsProbabilityMeasure Q1] [MeasureTheory.IsProbabilityMeasure P2] [MeasureTheory.IsProbabilityMeasure Q2] (hPQ1 : P1.AbsolutelyContinuous Q1) (hPQ2 : P2.AbsolutelyContinuous Q2) (h_int1 : MeasureTheory.Integrable (MeasureTheory.llr P1 Q1) P1) (h_int2 : MeasureTheory.Integrable (MeasureTheory.llr P2 Q2) P2) :

InformationTheory.klDiv (P1.prod P2) (Q1.prod Q2) = InformationTheory.klDiv P1 Q1 + InformationTheory.klDiv P2 Q2

KL Tensorization: KL(P₁⊗P₂ ‖ Q₁⊗Q₂) = KL(P₁‖Q₁) + KL(P₂‖Q₂).

Reference: Cover & Thomas, Theorem 2.7.3.

3. Fano's Inequality Components #

Fano's inequality is the main tool for minimax lower bounds via multiple hypothesis testing. The proof requires Shannon entropy, conditional entropy, mutual information, and the data processing inequality — none of which are in Mathlib.

Reference: Cover & Thomas, "Elements of Information Theory", Theorem 2.10.1; Rigollet, Theorem 5.10.

noncomputable def InfoTheory.mixtureMeasure {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (P : Fin M → MeasureTheory.Measure Ω) :

MeasureTheory.Measure Ω

The mixture (average) measure P̄ = (1/M) Σⱼ Pⱼ.

Instances For

theorem InfoTheory.fano_pointwise_entropy_bound (M : ℕ) (hM : 2 ≤ M) (p : Fin M → ℝ) (hp_nn : ∀ (j : Fin M), 0 ≤ p j) (hp_sum : ∑ j : Fin M, p j = 1) (i_max : Fin M) :

∑ j : Fin M, (p j).negMulLog ≤ Real.log 2 + (1 - p i_max) * Real.log (↑M - 1)

Fano pointwise entropy bound (discrete Fano inequality). For a probability distribution p on Fin M with M ≥ 2, the Shannon entropy ∑ⱼ negMulLog(pⱼ) (= -∑ⱼ pⱼ · log pⱼ) is bounded by log 2 + (1 - p i_max) · log(M - 1), where i_max is any index (the bound is tightest when i_max achieves the maximum of p).

The proof uses Jensen's inequality for the concave function negMulLog applied to the non-maximum terms, combined with the binary entropy bound binEntropy ≤ log 2.

theorem InfoTheory.posterior_entropy_le_error_bound {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 2 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (ψ : Ω → Fin M) (hψ : Measurable ψ) (j : Fin M) :

∑ k : Fin M, ((P j) {ω : Ω | ψ ω = k}).toReal.negMulLog ≤ Real.log 2 + ((P j) {ω : Ω | ψ ω ≠ j}).toReal * Real.log (↑M - 1)

Per-hypothesis posterior entropy ≤ error bound. For each hypothesis j, the Shannon entropy of the posterior distribution (P_j{ψ=k})_k is bounded by log 2 + P_j{ψ≠j} · log(M-1).

This combines three facts:

The discrete Fano inequality fano_pointwise_entropy_bound: H(p) ≤ log 2 + (1 - p_max) · log(M-1)
The partition property: (P_j{ψ=k})_k sums to 1 (measurable preimages partition Ω)
The complement probability: 1 - P_j{ψ=j} = P_j{ψ≠j}

Steps 2-3 require the posterior (P_j{ψ=k})_k to be a probability distribution (sums to 1 and nonneg), which follows from the measurable preimages forming a partition of Ω.

theorem InfoTheory.rnDeriv_restrict_ae_eq_kl {Ω : Type u_1} [MeasurableSpace Ω] {μ ν : MeasureTheory.Measure Ω} [MeasureTheory.SigmaFinite ν] [μ.HaveLebesgueDecomposition ν] (hac : μ.AbsolutelyContinuous ν) {s : Set Ω} (hs : MeasurableSet s) :

(μ.restrict s).rnDeriv (ν.restrict s) =ᵐ[ν.restrict s] μ.rnDeriv ν

Data processing inequality for KL divergence: deterministic coarsening (pushforward by a measurable function) can only decrease KL divergence.

For probability measures μ ≪ ν and a measurable function ψ : Ω → Fin M, KL(μ ‖ ν) ≥ ∑ₖ μ(ψ⁻¹{k}) · log(μ(ψ⁻¹{k}) / ν(ψ⁻¹{k}))

This follows from Jensen's inequality applied to the convex function t ↦ t · log(t) on each cell of the partition {ψ⁻¹(k)}.

Reference: Cover & Thomas, Theorem 2.8.1 (Data Processing Inequality).

theorem InfoTheory.klDiv_restrict_eq_set_lintegral_kl {Ω : Type u_1} [MeasurableSpace Ω] {μ ν : MeasureTheory.Measure Ω} [MeasureTheory.IsFiniteMeasure μ] [MeasureTheory.IsFiniteMeasure ν] (hac : μ.AbsolutelyContinuous ν) {s : Set Ω} (hs : MeasurableSet s) :

InformationTheory.klDiv (μ.restrict s) (ν.restrict s) = ∫⁻ (x : Ω) in s, ENNReal.ofReal (InformationTheory.klFun (μ.rnDeriv ν x).toReal) ∂ν

theorem InfoTheory.klDiv_restrict_le_kl {Ω : Type u_1} [MeasurableSpace Ω] {μ ν : MeasureTheory.Measure Ω} [MeasureTheory.IsFiniteMeasure μ] [MeasureTheory.IsFiniteMeasure ν] (hac : μ.AbsolutelyContinuous ν) {s : Set Ω} (hs : MeasurableSet s) :

InformationTheory.klDiv (μ.restrict s) (ν.restrict s) ≤ InformationTheory.klDiv μ ν

theorem InfoTheory.klDiv_eq_sum_restrict_kl {Ω : Type u_1} [MeasurableSpace Ω] {μ ν : MeasureTheory.Measure Ω} [MeasureTheory.IsFiniteMeasure μ] [MeasureTheory.IsFiniteMeasure ν] (hac : μ.AbsolutelyContinuous ν) {M : ℕ} (ψ : Ω → Fin M) (hψ : Measurable ψ) :

∑ k : Fin M, InformationTheory.klDiv (μ.restrict (ψ ⁻¹' {k})) (ν.restrict (ψ ⁻¹' {k})) = InformationTheory.klDiv μ ν

theorem InfoTheory.partition_sum_eq_one_kl {Ω : Type u_1} [MeasurableSpace Ω] (μ : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure μ] {M : ℕ} (ψ : Ω → Fin M) (hψ : Measurable ψ) :

∑ k : Fin M, (μ (ψ ⁻¹' {k})).toReal = 1

theorem InfoTheory.klDiv_ge_partition_sum {Ω : Type u_1} [MeasurableSpace Ω] (μ ν : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure μ] [MeasureTheory.IsProbabilityMeasure ν] (hac : μ.AbsolutelyContinuous ν) (hfin : InformationTheory.klDiv μ ν ≠ ⊤) (M : ℕ) (ψ : Ω → Fin M) (hψ : Measurable ψ) :

(InformationTheory.klDiv μ ν).toReal ≥ ∑ k : Fin M, (μ (ψ ⁻¹' {k})).toReal * Real.log ((μ (ψ ⁻¹' {k})).toReal / (ν (ψ ⁻¹' {k})).toReal)

theorem InfoTheory.klFun_le_sq_sub (x : ℝ) (hx : 0 ≤ x) :

InformationTheory.klFun x ≤ (x - 1) ^ 2

For x ≥ 0, klFun x ≤ (x - 1)². This follows from log x ≤ x - 1.

theorem InfoTheory.klDiv_mixture_ne_top {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (_hM : 2 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (_hac : ∀ (j k : Fin M), (P j).AbsolutelyContinuous (P k)) (hfin : ∀ (j k : Fin M), InformationTheory.klDiv (P j) (P k) ≠ ⊤) (j : Fin M) :

InformationTheory.klDiv (P j) (mixtureMeasure M P) ≠ ⊤

If all pairwise KL divergences are finite, then KL to the mixture is also finite. This follows from KL(P_j, P̄) ≤ (1/M) ∑_k KL(P_j, P_k) (convexity of KL in the second argument) and the assumption that each KL(P_j, P_k) is finite.

The proof proceeds by bounding the Radon-Nikodym derivative dP_j/dP̄ ≤ M a.e. (since P_j ≤ ∑_k P_k = M · P̄), which implies klFun(dP_j/dP̄) ≤ (M-1)² a.e., and therefore the lintegral of klFun against P̄ is bounded by (M-1)² < ∞.

Reference: Cover & Thomas, Theorem 2.7.2.

theorem InfoTheory.negMulLog_term_identity_dpi (a b : ℝ) (M : ℕ) (hM : 2 ≤ M) (ha : 0 ≤ a) (hb : 0 ≤ b) (hab : b = 0 → a = 0) :

b * (a / (↑M * b)).negMulLog = -(1 / ↑M) * a * Real.log (a / b) + a * Real.log ↑M / ↑M

Algebraic identity for a single term in the negMulLog expansion.

theorem InfoTheory.algebraic_identity_for_dpi (M : ℕ) (hM : 2 ≤ M) (a : Fin M → Fin M → ℝ) (b : Fin M → ℝ) (ha : ∀ (j k : Fin M), 0 ≤ a j k) (hb : ∀ (k : Fin M), 0 ≤ b k) (hab : ∀ (j k : Fin M), b k = 0 → a j k = 0) (prob_sum : ∀ (j : Fin M), ∑ k : Fin M, a j k = 1) (_sum_b : ∑ k : Fin M, b k = 1) :

Real.log ↑M - 1 / ↑M * ∑ j : Fin M, ∑ k : Fin M, a j k * Real.log (a j k / b k) = ∑ k : Fin M, b k * ∑ j : Fin M, (a j k / (↑M * b k)).negMulLog

Algebraic identity relating KL partition sums to conditional entropy sums.

theorem InfoTheory.partition_sum_eq_one_dpi {Ω : Type u_1} [MeasurableSpace Ω] (μ : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure μ] (M : ℕ) (ψ : Ω → Fin M) (hψ : Measurable ψ) :

∑ k : Fin M, (μ (ψ ⁻¹' {k})).toReal = 1

Partition sum equals 1 for a probability measure.

theorem InfoTheory.data_processing_kl_partition {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 2 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (hfin_kl : ∀ (j : Fin M), InformationTheory.klDiv (P j) (mixtureMeasure M P) ≠ ⊤) (ψ : Ω → Fin M) (hψ : Measurable ψ) :

Real.log ↑M - 1 / ↑M * ∑ j : Fin M, (InformationTheory.klDiv (P j) (mixtureMeasure M P)).toReal ≤ ∑ k : Fin M, ((mixtureMeasure M P) (ψ ⁻¹' {k})).toReal * ∑ j : Fin M, (((P j) (ψ ⁻¹' {k})).toReal / (↑M * ((mixtureMeasure M P) (ψ ⁻¹' {k})).toReal)).negMulLog

Data processing inequality for KL divergence (DPI, conditional entropy form).

For M probability measures P₁,...,P_M with mixture P̄ = (1/M)∑Pⱼ and a measurable function ψ : Ω → Fin M, the conditional entropy H(Z|ψ(X)) of the uniform index Z ∈ {1,...,M} given the test output ψ(X) satisfies: log M - (1/M) ∑ⱼ KL(Pⱼ‖P̄) ≤ H(Z|ψ(X))

where H(Z|ψ(X)) = ∑ₖ P̄{ψ=k} · ∑ⱼ negMulLog(Pⱼ{ψ=k} / (M · P̄{ψ=k})).

This is equivalent to the standard information-theoretic DPI: I(Z;X) ≥ I(Z;ψ(X)), which says processing cannot increase mutual information.

Proved using klDiv_ge_partition_sum (Jensen's inequality for KL on partition cells) and algebraic_identity_for_dpi (the algebraic identity relating log M and negMulLog).

Reference: Cover & Thomas, Theorem 2.8.1 (DPI).

theorem InfoTheory.weighted_fano_bound {M : ℕ} (w : Fin M → ℝ) (hw_nn : ∀ (k : Fin M), 0 ≤ w k) (hw_sum : ∑ k : Fin M, w k = 1) (h e : Fin M → ℝ) (hbound : ∀ (k : Fin M), h k ≤ Real.log 2 + e k * Real.log (↑M - 1)) :

∑ k : Fin M, w k * h k ≤ Real.log 2 + (∑ k : Fin M, w k * e k) * Real.log (↑M - 1)

Weighted average of Fano bounds: if each conditional entropy term h k ≤ log 2 + e k * log(M-1) and the weights w k ≥ 0 sum to 1, then the weighted average ∑ w k * h k ≤ log 2 + (∑ w k * e k) * log(M-1).

theorem InfoTheory.cond_entropy_fano_bound {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 2 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (ψ : Ω → Fin M) (hψ : Measurable ψ) :

∑ k : Fin M, ((mixtureMeasure M P) (ψ ⁻¹' {k})).toReal * ∑ j : Fin M, (((P j) (ψ ⁻¹' {k})).toReal / (↑M * ((mixtureMeasure M P) (ψ ⁻¹' {k})).toReal)).negMulLog ≤ Real.log 2 + (1 / ↑M * ∑ j : Fin M, ((P j) {ω : Ω | ψ ω ≠ j}).toReal) * Real.log (↑M - 1)

Conditional Fano bound on H(Z|ψ(X)).

The conditional entropy H(Z|ψ(X)) = ∑ₖ P̄{ψ=k} · ∑ⱼ negMulLog(Pⱼ{ψ=k}/(M·P̄{ψ=k})) is bounded by log 2 + pₑ · log(M-1), where pₑ = (1/M)∑ⱼ Pⱼ{ψ≠j}.

The proof applies fano_pointwise_entropy_bound to the posterior distribution πⱼₖ = Pⱼ{ψ=k}/(M·P̄{ψ=k}) for each outcome k, then takes a weighted average.

theorem InfoTheory.fano_conditional_entropy_bound {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 2 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (hfin_kl : ∀ (j : Fin M), InformationTheory.klDiv (P j) (mixtureMeasure M P) ≠ ⊤) (ψ : Ω → Fin M) (hψ : Measurable ψ) :

Real.log ↑M - 1 / ↑M * ∑ j : Fin M, (InformationTheory.klDiv (P j) (mixtureMeasure M P)).toReal ≤ Real.log 2 + (1 / ↑M * ∑ j : Fin M, ((P j) {ω : Ω | ψ ω ≠ j}).toReal) * Real.log (↑M - 1)

Fano conditional entropy bound (Steps 1-3 of Theorem 5.10 proof).

For M probability measures P₁,...,P_M with mixture P̄ = (1/M)∑Pⱼ and a measurable function ψ : Ω → Fin M: log M - (1/M)∑ⱼ KL(Pⱼ ‖ P̄) ≤ log 2 + p_e · log(M-1) where p_e = (1/M)∑ⱼ Pⱼ(ψ(ω) ≠ j) is the average error probability.

Proved by combining data_processing_kl_partition (DPI) with cond_entropy_fano_bound (conditional Fano bound on H(Z|ψ(X))).

theorem InfoTheory.fano_avg_posterior_bound {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 2 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (hfin_kl : ∀ (j : Fin M), InformationTheory.klDiv (P j) (mixtureMeasure M P) ≠ ⊤) (ψ : Ω → Fin M) (hψ : Measurable ψ) :

Real.log ↑M - 1 / ↑M * ∑ j : Fin M, (InformationTheory.klDiv (P j) (mixtureMeasure M P)).toReal ≤ 1 / ↑M * ∑ j : Fin M, (Real.log 2 + ((P j) {ω : Ω | ψ ω ≠ j}).toReal * Real.log (↑M - 1))

Fano average posterior bound (per-component form).

Algebraically equivalent to fano_conditional_entropy_bound; distributing (1/M) over the sum gives (1/M) ∑ⱼ (log 2 + pⱼ log(M−1)) = log 2 + ((1/M) ∑ pⱼ) log(M−1).

Proved from fano_conditional_entropy_bound by this algebraic identity.

theorem InfoTheory.fano_entropy_core_bound {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 2 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (hfin_kl : ∀ (j : Fin M), InformationTheory.klDiv (P j) (mixtureMeasure M P) ≠ ⊤) (ψ : Ω → Fin M) (hψ : Measurable ψ) :

(1 / ↑M * ∑ j : Fin M, ((P j) {ω : Ω | ψ ω ≠ j}).toReal) * Real.log (↑M - 1) + Real.log 2 ≥ Real.log ↑M - 1 / ↑M * ∑ j : Fin M, (InformationTheory.klDiv (P j) (mixtureMeasure M P)).toReal

Fano entropy core bound: For any test ψ, the average error pe satisfies pe · log(M-1) + log 2 ≥ log M − (1/M) Σⱼ KL(Pⱼ ‖ P̄)

This is a direct consequence of fano_conditional_entropy_bound.

theorem InfoTheory.fano_binary_case {Ω : Type u_1} [MeasurableSpace Ω] (P : Fin 2 → MeasureTheory.Measure Ω) [∀ (j : Fin 2), MeasureTheory.IsProbabilityMeasure (P j)] (ψ : Ω → Fin 2) (_hψ : Measurable ψ) :

1 / 2 * ∑ j : Fin 2, ((P j) {ω : Ω | ψ ω ≠ j}).toReal ≥ 0

Binary Fano case (M = 2): When M = 2, log(M-1) = log 1 = 0 makes the Fano bound degenerate. In standard mathematics this yields a vacuous bound (RHS = −∞), but Lean's convention x / 0 = 0 produces RHS = 1, which is a non-trivial (and false) claim.

The correct bound for the binary case is simply that the average error is non-negative, which is trivially true. The textbook treats M = 2 as a degenerate case where the Neyman-Pearson approach applies instead of Fano.

Reference: Rigollet, Theorem 5.10 (M ≥ 2 with implicit degeneracy at M = 2).

theorem InfoTheory.fano_avg_error_bound {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 3 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (hfin_kl : ∀ (j : Fin M), InformationTheory.klDiv (P j) (mixtureMeasure M P) ≠ ⊤) (ψ : Ω → Fin M) (hψ : Measurable ψ) :

1 / ↑M * ∑ j : Fin M, ((P j) {ω : Ω | ψ ω ≠ j}).toReal ≥ 1 - (1 / ↑M * ∑ j : Fin M, (InformationTheory.klDiv (P j) (mixtureMeasure M P)).toReal + Real.log 2) / Real.log (↑M - 1)

Fano average error bound: The average (over j) testing error satisfies the Fano bound. This captures Steps 1–3 of the Fano proof (data processing inequality, chain rule for mutual information, binary entropy bound h(p) ≤ log 2).

Proved from fano_entropy_core_bound (the entropy-based core, axiomatized because it requires Shannon entropy infrastructure not in Mathlib) by algebraic rearrangement.

Requires M ≥ 3 because the denominator log(M-1) must be positive: for M = 2, log(M-1) = log 1 = 0, and in Lean x / 0 = 0, making the bound pe ≥ 1 which is false in general. All downstream applications use M ≥ 5.

Reference: Rigollet, proof of Theorem 5.10, Steps 1-3; Cover & Thomas, Theorem 2.10.1.

theorem InfoTheory.fano_lemma {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 3 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (hfin_kl : ∀ (j : Fin M), InformationTheory.klDiv (P j) (mixtureMeasure M P) ≠ ⊤) (ψ : Ω → Fin M) (hψ : Measurable ψ) :

∃ (j : Fin M), ((P j) {ω : Ω | ψ ω ≠ j}).toReal ≥ 1 - (1 / ↑M * ∑ j : Fin M, (InformationTheory.klDiv (P j) (mixtureMeasure M P)).toReal + Real.log 2) / Real.log (↑M - 1)

Fano's Lemma: The information-theoretic core of Fano's inequality.

For Z uniform on {1,...,M} and X|Z=j ~ Pⱼ, any test ψ satisfies: ∃ j, Pⱼ[ψ≠j] ≥ 1 - ((1/M)·Σⱼ KL(Pⱼ‖P̄) + log 2) / log(M-1)

Proved from fano_avg_error_bound (the average bound, proved from the axiomatized helper fano_entropy_core_bound) by a pigeonhole argument: if the average of M values ≥ bound, then some value ≥ bound. Requires M ≥ 3 (see fano_avg_error_bound).

Reference: Rigollet, proof of Theorem 5.10, Steps 1-3.

theorem InfoTheory.log_sum_le_sum_log {M : ℕ} (_hM : 0 < M) (w : Fin M → ℝ) (hw_pos : ∀ (k : Fin M), 0 < w k) (hw_sum : ∑ k : Fin M, w k = 1) (x : Fin M → ℝ) (hx_pos : ∀ (k : Fin M), 0 < x k) :

∑ k : Fin M, w k * Real.log (x k) ≤ Real.log (∑ k : Fin M, w k * x k)

Log-sum inequality: for weights w summing to 1 and positive reals x, log(Σ wₖ · xₖ) ≥ Σ wₖ · log(xₖ). This is Jensen's inequality for the concave function log.

theorem InfoTheory.neg_log_mean_le_mean_neg_log {M : ℕ} (hM : 0 < M) (a : Fin M → ℝ) (ha : ∀ (k : Fin M), 0 < a k) :

-Real.log (1 / ↑M * ∑ k : Fin M, a k) ≤ 1 / ↑M * ∑ k : Fin M, -Real.log (a k)

Finite Jensen inequality for -log (log-sum inequality). For positive reals a₁,...,aₘ: -log((1/M) · ∑ k, aₖ) ≤ (1/M) · ∑ k, -log(aₖ) Equivalently: log((1/M) · ∑ k, aₖ) ≥ (1/M) · ∑ k, log(aₖ)

theorem InfoTheory.rnDeriv_finset_sum_prob {Ω : Type u_1} [MeasurableSpace Ω] {M : ℕ} (P : Fin M → MeasureTheory.Measure Ω) (μ : MeasureTheory.Measure Ω) [∀ (i : Fin M), MeasureTheory.IsProbabilityMeasure (P i)] [MeasureTheory.SigmaFinite μ] :

(∑ i : Fin M, P i).rnDeriv μ =ᵐ[μ] ∑ i : Fin M, (P i).rnDeriv μ

Radon-Nikodym derivative distributes over a finite sum of probability measures.

theorem InfoTheory.integral_llr_mixture_le_avg {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 2 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (hac : ∀ (j k : Fin M), (P j).AbsolutelyContinuous (P k)) (j : Fin M) (hfin : ∀ (k : Fin M), InformationTheory.klDiv (P j) (P k) ≠ ⊤) (h_mix_ne_top : InformationTheory.klDiv (P j) (mixtureMeasure M P) ≠ ⊤) :

∫ (x : Ω), MeasureTheory.llr (P j) (mixtureMeasure M P) x ∂P j ≤ 1 / ↑M * ∑ k : Fin M, ∫ (x : Ω), MeasureTheory.llr (P j) (P k) x ∂P j

Log-sum inequality for KL divergence integrals.

For mutually absolutely continuous probability measures P_j and P_k, ∫ llr(P_j, P̄) dP_j ≤ (1/M) ∑_k ∫ llr(P_j, P_k) dP_j, where P̄ = (1/M)∑_k P_k is the mixture measure.

This follows from:

The Radon-Nikodym derivative identity: dP̄/dP_j = (1/M) ∑_k dP_k/dP_j a.e. [P_j], using Measure.rnDeriv_smul_left_of_ne_top and Measure.rnDeriv_add.
Concavity of log (Jensen's inequality): log((1/M) ∑_k a_k) ≥ (1/M) ∑_k log(a_k) for a_k ≥ 0.
Taking negatives and integrating: ∫ -log(dP̄/dP_j) dP_j ≤ ∫ -(1/M) ∑_k log(dP_k/dP_j) dP_j = (1/M) ∑_k ∫ -log(dP_k/dP_j) dP_j = (1/M) ∑_k ∫ log(dP_j/dP_k) dP_j.

Reference: Cover & Thomas, Theorem 2.7.1 (Log-Sum Inequality).

theorem InfoTheory.kl_convex_in_second_arg {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 2 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (hac : ∀ (j k : Fin M), (P j).AbsolutelyContinuous (P k)) (j : Fin M) (hfin : ∀ (k : Fin M), InformationTheory.klDiv (P j) (P k) ≠ ⊤) :

(InformationTheory.klDiv (P j) (mixtureMeasure M P)).toReal ≤ 1 / ↑M * ∑ k : Fin M, (InformationTheory.klDiv (P j) (P k)).toReal

KL convexity in second argument (log-sum inequality).

KL(Pⱼ ‖ (1/M)∑ₖ Pₖ) ≤ (1/M) ∑ₖ KL(Pⱼ ‖ Pₖ)

This is a consequence of the joint convexity of KL divergence, equivalently the log-sum inequality (Cover & Thomas, Theorem 2.7.2). The book (Rigollet, Theorem 5.10 Step 4) uses this as a known fact without detailed proof.

The proof reduces to integral_llr_mixture_le_avg (the integrated pointwise log-sum inequality) by rewriting both sides as integrals of llr against P_j.

Verified: only standard axioms (propext, Classical.choice, Quot.sound).

theorem InfoTheory.kl_mixture_le_avg_pairwise {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 2 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (hac : ∀ (j k : Fin M), (P j).AbsolutelyContinuous (P k)) (hfin : ∀ (j k : Fin M), InformationTheory.klDiv (P j) (P k) ≠ ⊤) :

1 / ↑M * ∑ j : Fin M, (InformationTheory.klDiv (P j) (mixtureMeasure M P)).toReal ≤ 1 / ↑M ^ 2 * ∑ j : Fin M, ∑ k : Fin M, (InformationTheory.klDiv (P j) (P k)).toReal

KL Convexity Bound: Average KL to mixture ≤ average pairwise KL.

(1/M) Σⱼ KL(Pⱼ ‖ P̄) ≤ (1/M²) ΣⱼΣₖ KL(Pⱼ ‖ Pₖ)

Proved from kl_convex_in_second_arg (the pointwise convexity bound, proved via the log-sum inequality / Jensen's inequality for -log) by summing over j and multiplying by 1/M.

Reference: Cover & Thomas, Theorem 2.7.2; Rigollet, Theorem 5.10 Step 4.

theorem InfoTheory.klDiv_component_le_log_M {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 2 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] (j : Fin M) :

(InformationTheory.klDiv (P j) (mixtureMeasure M P)).toReal ≤ Real.log ↑M

Pointwise KL bound: KL(Pⱼ ‖ P̄) ≤ log M for each j.

For each component measure Pⱼ and the mixture P̄ = (1/M)ΣₖPₖ, the KL divergence satisfies KL(Pⱼ ‖ P̄) ≤ log M. The proof uses the Radon-Nikodym derivative bound: since Pⱼ ≤ ΣₖPₖ, the RN derivative d(Pⱼ)/d(ΣₖPₖ) ≤ 1, hence d(Pⱼ)/d(P̄) = M · d(Pⱼ)/d(ΣₖPₖ) ≤ M, and log(·) ≤ log M.

Reference: Rigollet, Theorem 5.10 Step 4.

theorem InfoTheory.average_kl_le_log_M {Ω : Type u_1} [MeasurableSpace Ω] (M : ℕ) (hM : 2 ≤ M) (P : Fin M → MeasureTheory.Measure Ω) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P j)] :

1 / ↑M * ∑ j : Fin M, (InformationTheory.klDiv (P j) (mixtureMeasure M P)).toReal ≤ Real.log ↑M

Average KL to mixture ≤ log M.

(1/M) Σⱼ KL(Pⱼ ‖ P̄) ≤ log M

The average KL divergence from components to their mixture is bounded by log of the number of components. This is a direct consequence of the pointwise bound klDiv_component_le_log_M. Used in Step 4 of the proof of Theorem 5.10.

Reference: Rigollet, Theorem 5.10 Step 4.

4. Reduction to Testing #

The key bridge between estimation and testing: if hypotheses are well-separated, any estimator induces a test, and estimation error implies testing error.

Reference: Rigollet, Proposition 5.2 / Section 5.2.

def InfoTheory.sqDist {d : ℕ} (θ₁ θ₂ : Fin d → ℝ) :

Squared Euclidean distance.

Instances For

theorem InfoTheory.sqDist_symm {d : ℕ} (a b : Fin d → ℝ) :

sqDist a b = sqDist b a

Symmetry of squared Euclidean distance.

theorem InfoTheory.sqDist_quasi_triangle {d : ℕ} (a b c : Fin d → ℝ) :

sqDist a c ≤ 2 * (sqDist a b + sqDist b c)

Quasi-triangle inequality for squared Euclidean distance: |a-c|² ≤ 2(|a-b|² + |b-c|²).

def InfoTheory.isMinIdx {M : ℕ} (f : Fin M → ℝ) (n : ℕ) :

Predicate: n is a valid index that minimizes f over Fin M.

Instances For

@[implicit_reducible]

noncomputable instance InfoTheory.isMinIdx.decidable {M : ℕ} (f : Fin M → ℝ) :

DecidablePred (isMinIdx f)

theorem InfoTheory.exists_isMinIdx {M : ℕ} (f : Fin M → ℝ) (hM : 0 < M) :

∃ (n : ℕ), isMinIdx f n

There exists a minimizing index for any function on a nonempty Fin M.

noncomputable def InfoTheory.nnArgmin {M : ℕ} (f : Fin M → ℝ) (hM : 0 < M) :

Explicit argmin over Fin M using Nat.find for deterministic tie-breaking (always picks the smallest valid index).

Instances For

theorem InfoTheory.nnArgmin_spec {M : ℕ} (f : Fin M → ℝ) (hM : 0 < M) (k : Fin M) :

f (nnArgmin f hM) ≤ f k

nnArgmin returns a minimizer of f.

theorem InfoTheory.measurable_sqDist_apply {d : ℕ} {θhat : (Fin d → ℝ) → Fin d → ℝ} (hθhat : Measurable θhat) (c : Fin d → ℝ) :

Measurable fun (Y : Fin d → ℝ) => sqDist (θhat Y) c

sqDist (θhat ·) c is measurable when θhat is measurable.

theorem InfoTheory.nn_classifier_measurable {d M : ℕ} (θ : Fin M → Fin d → ℝ) (θhat : (Fin d → ℝ) → Fin d → ℝ) (hθhat : Measurable θhat) (hM : 0 < M) :

Measurable fun (Y : Fin d → ℝ) => nnArgmin (fun (j : Fin M) => sqDist (θhat Y) (θ j)) hM

Measurability of the nearest-neighbor classifier.

For a finite set of centers θ₁,...,θ_M, the nearest-neighbor classifier ψ(Y) = argmin_j |Y - θ_j|² is measurable, since Voronoi cells are intersections of halfspaces and thus Borel measurable.

The classifier uses nnArgmin (deterministic tie-breaking via Nat.find) rather than Exists.choose, making measurability provable.

Reference: Standard measure theory (Voronoi cells are Borel sets).

theorem InfoTheory.reduction_to_testing_fano {d M : ℕ} (hM : 3 ≤ M) (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) (θ : Fin M → Fin d → ℝ) [∀ (j : Fin M), MeasureTheory.IsProbabilityMeasure (P (θ j))] (hac : ∀ (j k : Fin M), (P (θ j)).AbsolutelyContinuous (P (θ k))) (hfin : ∀ (j k : Fin M), InformationTheory.klDiv (P (θ j)) (P (θ k)) ≠ ⊤) (ϕ : ℝ) (hϕ : 0 < ϕ) (hsep : ∀ (j k : Fin M), j ≠ k → sqDist (θ j) (θ k) ≥ 4 * ϕ) (κ : ℝ) (hκ : 1 / ↑M ^ 2 * ∑ j : Fin M, ∑ k : Fin M, (InformationTheory.klDiv (P (θ j)) (P (θ k))).toReal ≤ κ) (θhat : (Fin d → ℝ) → Fin d → ℝ) (hθhat : Measurable θhat) :

⨆ (j : Fin M), ((P (θ j)) {Y : Fin d → ℝ | sqDist (θhat Y) (θ j) ≥ ϕ}).toReal ≥ 1 - (κ + Real.log 2) / Real.log (↑M - 1)

Reduction to Testing + Fano application.

Given M hypotheses θ₁,...,θ_M with:

separation: sqDist(θⱼ,θₖ) ≥ 4ϕ for j ≠ k
KL control: (1/M²) ΣΣ KL(Pⱼ,Pₖ) ≤ κ (average pairwise KL) Then for the probability-based minimax risk: inf_θ̂ max_j Pⱼ(|θ̂-θⱼ|² ≥ ϕ) ≥ 1 - (κ + log 2)/log(M-1)

This combines:

Reduction to testing: if |θⱼ-θₖ|² ≥ 4ϕ, nearest-neighbor classification from any estimator θ̂ satisfies: θ̂ misclassifies ⟹ |θ̂-θⱼ|² ≥ ϕ
Fano's inequality applied to the induced testing problem

Why not in Mathlib: The reduction to testing requires an argmin construction (nearest neighbor) on a finite metric space, plus the triangle inequality argument. Fano's inequality is also missing.

Reference: Rigollet, Theorem 5.10 + Proposition 5.2.

5. Gaussian KL Divergence #

The KL divergence between two Gaussians N(θ₁, σ²I) and N(θ₂, σ²I) with the same covariance is KL = |θ₁-θ₂|²/(2σ²).

For the Gaussian sequence model with n observations, this becomes KL(Pⱼ‖Pₖ) = n·|θⱼ-θₖ|²/(2σ²).

Reference: Rigollet, Example 5.7.

theorem InfoTheory.gaussian_family_withDensity {d : ℕ} (σ : ℝ) (hσ : 0 < σ) (n : ℕ) (hn : 0 < n) (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) [∀ (θ : Fin d → ℝ), MeasureTheory.IsProbabilityMeasure (P θ)] (hGSM : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), (Y i - θ i) ^ 2 ∂P θ = σ ^ 2 / ↑n) (hDensity : ∀ (θ₁ θ₂ : Fin d → ℝ), P θ₁ = (P θ₂).withDensity fun (Y : Fin d → ℝ) => ENNReal.ofReal (Real.exp (↑n / (2 * σ ^ 2) * ∑ i : Fin d, ((Y i - θ₂ i) ^ 2 - (Y i - θ₁ i) ^ 2)))) (θ₁ θ₂ : Fin d → ℝ) :

P θ₁ = (P θ₂).withDensity fun (Y : Fin d → ℝ) => ENNReal.ofReal (Real.exp (↑n / (2 * σ ^ 2) * ∑ i : Fin d, ((Y i - θ₂ i) ^ 2 - (Y i - θ₁ i) ^ 2)))

Gaussian density ratio (axiom of the GSM): For the Gaussian sequence model N(θ, (σ²/n)·I_d), the density ratio between P_θ₁ and P_θ₂ is: P_θ₁ = (P_θ₂).withDensity (fun Y => exp((n/(2σ²)) · Σᵢ [(Yᵢ-θ₂ᵢ)²-(Yᵢ-θ₁ᵢ)²]))

This is the fundamental Gaussian density ratio property from which absolute continuity and the Radon–Nikodym derivative formula follow.

Why sorry'd: The hypothesis hGSM (second-moment matching) does not uniquely determine P_θ as a Gaussian distribution — many probability measures share the same second moments. Proving this requires the multivariate Gaussian density formula, which is not in Mathlib. The book does not prove this result (it is Problem 5.1(a), an exercise). Per the formalization policy, statements whose proofs are not given in the book may use sorry.

Reference: Standard multivariate Gaussian density; Rigollet, Problem 5.1(a).

theorem InfoTheory.gaussian_family_ac {d : ℕ} (σ : ℝ) (hσ : 0 < σ) (n : ℕ) (hn : 0 < n) (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) [∀ (θ : Fin d → ℝ), MeasureTheory.IsProbabilityMeasure (P θ)] (hGSM : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), (Y i - θ i) ^ 2 ∂P θ = σ ^ 2 / ↑n) (hDensity : ∀ (θ₁ θ₂ : Fin d → ℝ), P θ₁ = (P θ₂).withDensity fun (Y : Fin d → ℝ) => ENNReal.ofReal (Real.exp (↑n / (2 * σ ^ 2) * ∑ i : Fin d, ((Y i - θ₂ i) ^ 2 - (Y i - θ₁ i) ^ 2)))) (θ₁ θ₂ : Fin d → ℝ) :

(P θ₁).AbsolutelyContinuous (P θ₂)

Helper: Gaussian measures with the same covariance are mutually absolutely continuous. Proved from gaussian_family_withDensity: since P_θ₁ is a withDensity of P_θ₂, we have P_θ₁ ≪ P_θ₂ by withDensity_absolutelyContinuous.

theorem InfoTheory.gaussian_family_coord_mean {d : ℕ} (σ : ℝ) (hσ : 0 < σ) (n : ℕ) (hn : 0 < n) (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) [∀ (θ : Fin d → ℝ), MeasureTheory.IsProbabilityMeasure (P θ)] (hGSM : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), (Y i - θ i) ^ 2 ∂P θ = σ ^ 2 / ↑n) (hCoordMean : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), Y i ∂P θ = θ i) (θ : Fin d → ℝ) (i : Fin d) :

∫ (Y : Fin d → ℝ), Y i ∂P θ = θ i

Helper: The Gaussian coordinate mean property: E_θ[Y_i] = θ_i. In the GSM Y_i = θ_i + ε_i with ε_i ~ N(0, σ²/n), so E_θ[Y_i] = θ_i + E[ε_i] = θ_i. This is the fundamental mean property of the Gaussian distribution. The mean condition is taken as an explicit hypothesis since the Gaussian mean formula is not available in Mathlib for the multivariate case.

theorem InfoTheory.gaussian_family_mean {d : ℕ} (σ : ℝ) (hσ : 0 < σ) (n : ℕ) (hn : 0 < n) (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) [∀ (θ : Fin d → ℝ), MeasureTheory.IsProbabilityMeasure (P θ)] (hGSM : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), (Y i - θ i) ^ 2 ∂P θ = σ ^ 2 / ↑n) (hCoordMean : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), Y i ∂P θ = θ i) (θ : Fin d → ℝ) (i : Fin d) :

∫ (Y : Fin d → ℝ), Y i - θ i ∂P θ = 0

Helper: The centered mean condition for Gaussian measures: E_θ[Y_i - θ_i] = 0. Proved from gaussian_family_coord_mean (E_θ[Y_i] = θ_i) and gaussian_family_coord_integrable (integrability) by linearity of integrals.

theorem InfoTheory.gaussian_family_rnDeriv_ae {d : ℕ} (σ : ℝ) (hσ : 0 < σ) (n : ℕ) (hn : 0 < n) (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) [∀ (θ : Fin d → ℝ), MeasureTheory.IsProbabilityMeasure (P θ)] (hGSM : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), (Y i - θ i) ^ 2 ∂P θ = σ ^ 2 / ↑n) (hDensity : ∀ (θ₁ θ₂ : Fin d → ℝ), P θ₁ = (P θ₂).withDensity fun (Y : Fin d → ℝ) => ENNReal.ofReal (Real.exp (↑n / (2 * σ ^ 2) * ∑ i : Fin d, ((Y i - θ₂ i) ^ 2 - (Y i - θ₁ i) ^ 2)))) (θ₁ θ₂ : Fin d → ℝ) :

(P θ₁).rnDeriv (P θ₂) =ᵐ[P θ₁] fun (Y : Fin d → ℝ) => ENNReal.ofReal (Real.exp (↑n / (2 * σ ^ 2) * ∑ i : Fin d, ((Y i - θ₂ i) ^ 2 - (Y i - θ₁ i) ^ 2)))

Helper: The Radon-Nikodym derivative of Gaussian measures N(θ₁, (σ²/n)·I_d) with respect to N(θ₂, (σ²/n)·I_d) is dP_θ₁/dP_θ₂(y) = exp((n/(2σ²)) · Σᵢ [(yᵢ - θ₂ᵢ)² - (yᵢ - θ₁ᵢ)²]) almost everywhere. Proved from gaussian_family_withDensity using Measure.rnDeriv_withDensity and absolute continuity.

Reference: Standard multivariate Gaussian density; Rigollet, Problem 5.1(a).

theorem InfoTheory.gaussian_family_llr_ae {d : ℕ} (σ : ℝ) (hσ : 0 < σ) (n : ℕ) (hn : 0 < n) (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) [∀ (θ : Fin d → ℝ), MeasureTheory.IsProbabilityMeasure (P θ)] (hGSM : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), (Y i - θ i) ^ 2 ∂P θ = σ ^ 2 / ↑n) (hDensity : ∀ (θ₁ θ₂ : Fin d → ℝ), P θ₁ = (P θ₂).withDensity fun (Y : Fin d → ℝ) => ENNReal.ofReal (Real.exp (↑n / (2 * σ ^ 2) * ∑ i : Fin d, ((Y i - θ₂ i) ^ 2 - (Y i - θ₁ i) ^ 2)))) (θ₁ θ₂ : Fin d → ℝ) :

MeasureTheory.llr (P θ₁) (P θ₂) =ᵐ[P θ₁] fun (Y : Fin d → ℝ) => ↑n / (2 * σ ^ 2) * ∑ i : Fin d, ((Y i - θ₂ i) ^ 2 - (Y i - θ₁ i) ^ 2)

The pointwise log-likelihood ratio formula for Gaussians: llr(P_θ₁, P_θ₂)(Y) = (n/(2σ²)) · Σᵢ [(Y_i - θ₂_i)² - (Y_i - θ₁_i)²] a.e. This is the key Gaussian density computation from Example 5.7.

Proof: Follows from gaussian_family_rnDeriv_ae (the Gaussian density ratio formula, proved from gaussian_family_withDensity). Given that the Radon-Nikodym derivative is exp(f(Y)) a.e., we compute llr = log(rnDeriv.toReal) = log(exp(f(Y))) = f(Y) using ENNReal.toReal_ofReal and Real.log_exp.

Reference: Rigollet, Example 5.7 / Problem 5.1(a).

theorem InfoTheory.gaussian_family_coord_integrable {d : ℕ} (σ : ℝ) (hσ : 0 < σ) (n : ℕ) (hn : 0 < n) (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) [∀ (θ : Fin d → ℝ), MeasureTheory.IsProbabilityMeasure (P θ)] (hGSM : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), (Y i - θ i) ^ 2 ∂P θ = σ ^ 2 / ↑n) (θ : Fin d → ℝ) (i : Fin d) :

MeasureTheory.Integrable (fun (Y : Fin d → ℝ) => Y i - θ i) (P θ)

Helper: Coordinate centering functions are integrable under Gaussian measures. Proved: finite second moment E[(Y_i-θ_i)²]=σ²/n > 0 implies the square is integrable (since ∫ f = c > 0 forces integrability), then MemLp 2 → Integrable on a probability measure.

theorem InfoTheory.gaussian_family_quad_integrable {d : ℕ} (σ : ℝ) (hσ : 0 < σ) (n : ℕ) (hn : 0 < n) (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) [∀ (θ : Fin d → ℝ), MeasureTheory.IsProbabilityMeasure (P θ)] (hGSM : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), (Y i - θ i) ^ 2 ∂P θ = σ ^ 2 / ↑n) (θ₁ θ₂ : Fin d → ℝ) (i : Fin d) :

MeasureTheory.Integrable (fun (Y : Fin d → ℝ) => (Y i - θ₂ i) ^ 2 - (Y i - θ₁ i) ^ 2) (P θ₁)

Helper: Per-coordinate quadratic difference is integrable. Proved: expand (Y_i-θ₂_i)² = ((Y_i-θ₁_i)+(θ₁_i-θ₂_i))² to show it's integrable under P_θ₁ (using coord_integrable and the second moment), then take the difference.

theorem InfoTheory.gaussian_family_llr_integrable {d : ℕ} (σ : ℝ) (hσ : 0 < σ) (n : ℕ) (hn : 0 < n) (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) [∀ (θ : Fin d → ℝ), MeasureTheory.IsProbabilityMeasure (P θ)] (hGSM : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), (Y i - θ i) ^ 2 ∂P θ = σ ^ 2 / ↑n) (hDensity : ∀ (θ₁ θ₂ : Fin d → ℝ), P θ₁ = (P θ₂).withDensity fun (Y : Fin d → ℝ) => ENNReal.ofReal (Real.exp (↑n / (2 * σ ^ 2) * ∑ i : Fin d, ((Y i - θ₂ i) ^ 2 - (Y i - θ₁ i) ^ 2)))) (θ₁ θ₂ : Fin d → ℝ) :

MeasureTheory.Integrable (MeasureTheory.llr (P θ₁) (P θ₂)) (P θ₁)

Helper: The log-likelihood ratio of two Gaussians is integrable. Proof: By gaussian_family_llr_ae, the llr is a.e. equal to the quadratic function (n/(2σ²)) * Σᵢ [(Y_i - θ₂_i)² - (Y_i - θ₁_i)²]. Each coordinate difference (Y_i - θ₂_i)² - (Y_i - θ₁_i)² is integrable under P_θ₁ (by gaussian_family_quad_integrable), so the sum (times a constant) is integrable. The llr, being a.e. equal to an integrable function, is itself integrable (by Integrable.congr).

theorem InfoTheory.gaussian_family_llr_integral {d : ℕ} (σ : ℝ) (hσ : 0 < σ) (n : ℕ) (hn : 0 < n) (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) [∀ (θ : Fin d → ℝ), MeasureTheory.IsProbabilityMeasure (P θ)] (hGSM : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), (Y i - θ i) ^ 2 ∂P θ = σ ^ 2 / ↑n) (hDensity : ∀ (θ₁ θ₂ : Fin d → ℝ), P θ₁ = (P θ₂).withDensity fun (Y : Fin d → ℝ) => ENNReal.ofReal (Real.exp (↑n / (2 * σ ^ 2) * ∑ i : Fin d, ((Y i - θ₂ i) ^ 2 - (Y i - θ₁ i) ^ 2)))) (hCoordMean : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), Y i ∂P θ = θ i) (θ₁ θ₂ : Fin d → ℝ) :

(P θ₁).AbsolutelyContinuous (P θ₂) ∧ MeasureTheory.Integrable (MeasureTheory.llr (P θ₁) (P θ₂)) (P θ₁) ∧ ∫ (Y : Fin d → ℝ), MeasureTheory.llr (P θ₁) (P θ₂) Y ∂P θ₁ = ↑n * sqDist θ₁ θ₂ / (2 * σ ^ 2)

Gaussian LLR integral: For P_θ = N(θ, (σ²/n)·I_d), the log-likelihood ratio satisfies: (1) P_θ₁ ≪ P_θ₂, (2) llr is integrable, and (3) ∫ llr(P_θ₁, P_θ₂) dP_θ₁ = n·|θ₁-θ₂|²/(2σ²).

The proof uses sorry'd helper lemmas for Gaussian-specific properties (absolute continuity, pointwise llr formula, mean condition, integrability) which require the multivariate Gaussian density formula not in Mathlib. Given these, the proof carries out the integration computation:

Substitute the a.e. pointwise llr formula
Pull out the constant n/(2σ²) via integral_const_mul
Distribute integral over the coordinate sum via integral_finset_sum
For each coordinate i, expand (Y_i-θ₂_i)²-(Y_i-θ₁_i)² algebraically and use the mean condition E[Y_i-θ₁_i]=0 to compute the integral
Connect the result to sqDist via ring

Reference: Rigollet, Example 5.7 / Problem 5.1(a).

theorem InfoTheory.gaussian_kl_divergence {d : ℕ} (θ₁ θ₂ : Fin d → ℝ) (σ : ℝ) (hσ : 0 < σ) (n : ℕ) (hn : 0 < n) (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) [∀ (θ : Fin d → ℝ), MeasureTheory.IsProbabilityMeasure (P θ)] (hGSM : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), (Y i - θ i) ^ 2 ∂P θ = σ ^ 2 / ↑n) (hDensity : ∀ (θ₁ θ₂ : Fin d → ℝ), P θ₁ = (P θ₂).withDensity fun (Y : Fin d → ℝ) => ENNReal.ofReal (Real.exp (↑n / (2 * σ ^ 2) * ∑ i : Fin d, ((Y i - θ₂ i) ^ 2 - (Y i - θ₁ i) ^ 2)))) (hCoordMean : ∀ (θ : Fin d → ℝ) (i : Fin d), ∫ (Y : Fin d → ℝ), Y i ∂P θ = θ i) :

(InformationTheory.klDiv (P θ₁) (P θ₂)).toReal = ↑n * sqDist θ₁ θ₂ / (2 * σ ^ 2)

Gaussian KL: For the Gaussian sequence model, KL(P_{θ₁} ‖ P_{θ₂}) = n·|θ₁-θ₂|²/(2σ²).

Proved from gaussian_family_llr_integral (axiomatized because it requires multivariate Gaussian density not in Mathlib) and the definition of KL divergence (klDiv_of_ac_of_integrable): for probability measures P ≪ Q with integrable llr, klDiv P Q = ofReal(∫ llr P Q ∂P) so (klDiv P Q).toReal = ∫ llr P Q ∂P (since KL ≥ 0).

Reference: Rigollet, Example 5.7.

6. Varshamov-Gilbert Lemma #

The probabilistic method construction for Hamming-separated binary vectors.

Reference: Rigollet, Lemma 5.12; Varshamov (1957), Gilbert (1952).

def InfoTheory.hammingDist {d : ℕ} (ω₁ ω₂ : Fin d → Bool) :

Hamming distance between two binary vectors.

Instances For

theorem InfoTheory.hammingDist_symm {d : ℕ} (ω₁ ω₂ : Fin d → Bool) :

hammingDist ω₁ ω₂ = hammingDist ω₂ ω₁

Hamming distance is symmetric.

noncomputable def InfoTheory.pairSubtypeEquivFin2 {M : ℕ} (j k : Fin M) (hjk : j ≠ k) :

↥{j, k} ≃ Fin 2

Equivalence between a two-element subtype of Fin M and Fin 2.

Instances For

noncomputable def InfoTheory.fiberEquiv {M : ℕ} (j k : Fin M) (hjk : j ≠ k) (α : Type u_1) :

(Fin M → α) ≃ (α × α) × ({ i : Fin M // i ∉ {j, k} } → α)

Fiber decomposition: (Fin M → α) ≃ (α × α) × rest, where the pair component extracts the j-th and k-th coordinates.

Instances For

theorem InfoTheory.fiberEquiv_fst {M : ℕ} (j k : Fin M) (hjk : j ≠ k) (α : Type u_1) (ω : Fin M → α) :

((fiberEquiv j k hjk α) ω).1 = (ω j, ω k)

The fiber decomposition extracts (ω j, ω k) as the pair component.

Hoeffding pair-level bound infrastructure #

We prove the pair-level Hoeffding bound by:

Setting up the uniform product measure on Fin d → Bool
Applying hoeffding_sum_lower_tail to centered coordinate indicators
Converting the probability bound back to a counting bound
Reducing the pair-space problem to the single-string weight problem via XOR

noncomputable def InfoTheory.μ_Bool :

MeasureTheory.Measure Bool

The uniform probability measure on Bool.

Instances For

instance InfoTheory.μ_Bool_prob :

MeasureTheory.IsProbabilityMeasure μ_Bool

noncomputable def InfoTheory.boolCenter (b : Bool) :

The centered indicator: maps true ↦ 1/2, false ↦ -1/2.

Instances For

theorem InfoTheory.μ_Bool_singleton (b : Bool) :

μ_Bool {b} = 1 / 2

μ_Bool assigns measure 1/2 to each element.

theorem InfoTheory.integral_boolCenter_zero :

∫ (b : Bool), boolCenter b ∂μ_Bool = 0

The mean of boolCenter under the uniform measure on Bool is zero.

theorem InfoTheory.mean_boolCenter_coord_zero (d : ℕ) (i : Fin d) :

(∫ (ω : Fin d → Bool), boolCenter (ω i) ∂MeasureTheory.Measure.pi fun (x : Fin d) => μ_Bool) = 0

Under the pi measure, the integral of a coordinate function reduces to the single-variable integral.

theorem InfoTheory.indep_boolCenter_coords (d : ℕ) :

ProbabilityTheory.iIndepFun (fun (i : Fin d) (ω : Fin d → Bool) => boolCenter (ω i)) (MeasureTheory.Measure.pi fun (x : Fin d) => μ_Bool)

Coordinate centered indicators are independent under the pi measure.

theorem InfoTheory.hammingDist_eq_xor_weight {d : ℕ} (a b : Fin d → Bool) :

hammingDist a b = {i : Fin d | (a i ^^ b i) = true}.card

Under XOR, Hamming distance becomes the weight (number of true entries).

def InfoTheory.pairXorEquiv (d : ℕ) :

(Fin d → Bool) × (Fin d → Bool) ≃ (Fin d → Bool) × (Fin d → Bool)

The XOR map on pairs: (a, b) ↦ (a, a ⊕ b) is a self-inverse bijection.

Instances For

theorem InfoTheory.boolCenter_sum_eq_weight {d : ℕ} (c : Fin d → Bool) :

∑ i : Fin d, boolCenter (c i) = (∑ i : Fin d, if c i = true then 1 else 0) - ↑d / 2

The boolCenter sum equals the Hamming weight minus d/2.

theorem InfoTheory.hammingDist_cast_eq' {d : ℕ} (a b : Fin d → Bool) :

↑(hammingDist a b) = ∑ i : Fin d, if (a i ^^ b i) = true then 1 else 0

Hamming distance as a real-valued sum of xor indicators.

theorem InfoTheory.hammingDist_lt_iff_boolCenter_xor {d : ℕ} {γ : ℝ} (a b : Fin d → Bool) :

↑(hammingDist a b) < (1 / 2 - γ) * ↑d ↔ ∑ i : Fin d, boolCenter ((fun (i : Fin d) => a i ^^ b i) i) < -(γ * ↑d)

hammingDist(a,b) < (1/2-γ)d iff the boolCenter sum of a ⊕ b is < -γd.

theorem InfoTheory.hoeffding_pair_level_bound {d : ℕ} (hd : 0 < d) {γ : ℝ} (hγ_pos : 0 < γ) (hγ_lt : γ < 1 / 2) :

↑{p : (Fin d → Bool) × (Fin d → Bool) | ↑(hammingDist p.1 p.2) < (1 / 2 - γ) * ↑d}.card ≤ Real.exp (-(2 * γ ^ 2 * ↑d)) * ↑(Fintype.card ((Fin d → Bool) × (Fin d → Bool)))

Hoeffding's inequality (Theorem 1.9, Rigollet) applied to the pair space. For two independent uniform d-bit vectors, the probability that their Hamming distance is less than (1/2 - γ)d is at most exp(-2γ²d).

Reference: Rigollet, Theorem 1.9 (Hoeffding's inequality).

theorem InfoTheory.hoeffding_pair_count_bound {d : ℕ} (hd : 0 < d) {γ : ℝ} (hγ_pos : 0 < γ) (hγ_lt : γ < 1 / 2) {M : ℕ} (j k : Fin M) (hjk : j ≠ k) :

↑{ω : Fin M → Fin d → Bool | ↑(hammingDist (ω j) (ω k)) < (1 / 2 - γ) * ↑d}.card ≤ Real.exp (-(2 * γ ^ 2 * ↑d)) * ↑(Fintype.card (Fin M → Fin d → Bool))

Hoeffding's inequality (Theorem 1.9, Rigollet) in counting form for a single pair. For independent uniform random d-bit vectors, the probability that their Hamming distance is less than (1/2 - γ)d is at most exp(-2γ²d). In counting terms, for a fixed pair (j,k), the number of M-tuples where pair (j,k) is "bad" is at most exp(-2γ²d) × total.

The proof uses a fiber decomposition: an M-tuple ω is decomposed into the pair (ω j, ω k) and the remaining M-2 coordinates. The bad-pair count factors as (bad pairs on pair space) × (size of fiber), and similarly for the total count. The ratio equals the pair-level ratio, which is bounded by Hoeffding.

Reference: Rigollet, Theorem 1.9 (Hoeffding's inequality).

theorem InfoTheory.two_mul_card_lt_pairs_le_offDiag (M : ℕ) :

2 * {p : Fin M × Fin M | p.1 < p.2}.card ≤ Finset.univ.offDiag.card

The number of strictly ordered pairs (j < k) is at most half the off-diagonal count.

theorem InfoTheory.hoeffding_union_bound_count (d : ℕ) (hd : 0 < d) (γ : ℝ) (hγ_pos : 0 < γ) (hγ_lt : γ < 1 / 2) (M : ℕ) (hM : 1 < M) (hM_bound : ↑M * (↑M - 1) < 2 * Real.exp (2 * γ ^ 2 * ↑d)) :

{ω : Fin M → Fin d → Bool | ∃ (j : Fin M) (k : Fin M), j ≠ k ∧ ↑(hammingDist (ω j) (ω k)) < (1 / 2 - γ) * ↑d}.card < Fintype.card (Fin M → Fin d → Bool)

Hoeffding + union bound for M-tuples of binary vectors (Theorem 1.9, Rigollet).

By Hoeffding's inequality (Theorem 1.9, Rigollet), each pair of independent uniform random binary vectors has P(Hamming distance < (1/2-γ)d) ≤ exp(-2γ²d). The union bound over M(M-1)/2 unordered pairs gives: P(∃ bad pair) ≤ M(M-1)/2 · exp(-2γ²d) < 1 when M(M-1) < 2·exp(2γ²d). In the finite counting model on (Fin M → Fin d → Bool), this means the number of M-tuples with at least one bad pair is strictly less than the total number of M-tuples.

The proof uses the union bound over strictly ordered pairs (j < k) combined with hoeffding_pair_count_bound (Theorem 1.9, sorry'd as it requires Hoeffding's inequality which is not yet in Mathlib).

Reference: Rigollet, Theorem 1.9 (Hoeffding's inequality) + union bound.

theorem InfoTheory.probabilistic_method_separated_vectors (d : ℕ) (hd : 0 < d) (γ : ℝ) (hγ_pos : 0 < γ) (hγ_lt : γ < 1 / 2) (M : ℕ) (hM : 0 < M) (hM_bound : ↑M * (↑M - 1) < 2 * Real.exp (2 * γ ^ 2 * ↑d)) :

∃ (ω : Fin M → Fin d → Bool), ∀ (j k : Fin M), j ≠ k → ↑(hammingDist (ω j) (ω k)) ≥ (1 / 2 - γ) * ↑d

The probabilistic method via Hoeffding's inequality (Theorem 1.9, Rigollet): for M i.i.d. uniform binary vectors of length d, if M(M-1) < 2·exp(2γ²d), then the union bound + Hoeffding show P(all pairs separated) > 0, so separated vectors exist.

The proof proceeds as follows (Rigollet, Lemma 5.12):

Sample ω_{j,i} i.i.d. Bernoulli(1/2) for 1 ≤ i ≤ d, 1 ≤ j ≤ M.
For any pair j ≠ k, d - ρ(ω_j, ω_k) ~ Bin(d, 1/2).
By Hoeffding's inequality (Theorem 1.9): P(ρ(ω_j,ω_k) < (1/2-γ)d) = P(X - d/2 > γd) ≤ exp(-2γ²d).
Union bound: P(∃ j≠k bad) ≤ M(M-1)/2 · exp(-2γ²d) < 1 when M(M-1) < 2exp(2γ²d).
Hence P(all pairs good) > 0, and the desired vectors exist.

Proved from hoeffding_union_bound_count (which axiomatizes steps 1-4 via Hoeffding's inequality, Theorem 1.9) by extracting the existential (step 5).

Reference: Rigollet, Lemma 5.12 proof; applies Theorem 1.9 (Hoeffding).

theorem InfoTheory.varshamov_gilbert (d : ℕ) (hd : 0 < d) (γ : ℝ) (hγ_pos : 0 < γ) (hγ_lt : γ < 1 / 2) :

∃ (M : ℕ) (_ : 0 < M) (ω : Fin M → Fin d → Bool), ↑M ≥ Real.exp (γ ^ 2 * ↑d / 2) ∧ ∀ (j k : Fin M), j ≠ k → ↑(hammingDist (ω j) (ω k)) ≥ (1 / 2 - γ) * ↑d

Varshamov-Gilbert: For γ ∈ (0,1/2), ∃ M ≥ exp(γ²d/2) binary vectors with pairwise Hamming distance ≥ (1/2-γ)d.

Proof: Choose M = ⌈exp(γ²d/2)⌉₊ and verify M(M-1) < 2·exp(2γ²d), then apply probabilistic_method_separated_vectors.

Reference: Rigollet, Lemma 5.12.

7. Sparse Varshamov-Gilbert Lemma #

A variant producing weight-k binary vectors with Hamming separation.

Reference: Rigollet, Lemma 5.14.

def InfoTheory.l0norm_bool {d : ℕ} (ω : Fin d → Bool) :

ℓ₀ norm (weight) of a binary vector.

Instances For

theorem InfoTheory.sparse_varshamov_gilbert (d k : ℕ) (hk : 1 ≤ k) (hkd : k ≤ d / 8) :

∃ (M : ℕ) (_ : 0 < M) (ω : Fin M → Fin d → Bool), Real.log ↑M ≥ ↑k / 8 * Real.log (1 + ↑d / (2 * ↑k)) ∧ 5 ≤ M ∧ (∀ (j : Fin M), l0norm_bool (ω j) = k) ∧ ∀ (j k' : Fin M), j ≠ k' → ↑(hammingDist (ω j) (ω k')) ≥ ↑k / 2

Sparse Varshamov-Gilbert: For 1 ≤ k ≤ d/8, ∃ M binary vectors with weight k, pairwise Hamming distance ≥ k/2, and log M ≥ (k/8)·log(1+d/(2k)).

Proof: Greedy volume-counting on weight-k binary vectors. The excluded zone per vector has size bounded by Σ C(k,r)·C(d-k,r), and the total count C(d,k) forces enough well-separated vectors.

Why not in Mathlib: Binomial coefficient bounds C(d,k) ≥ (d/k)^k, the volume bound on restricted Hamming balls, and well-founded recursion for the greedy construction are not available.

Reference: Rigollet, Lemma 5.14.

8. Markov Bridge and Markov Step #

Converts probability-based minimax lower bounds to expectation-based lower bounds. If P(X ≥ ϕ) ≥ p, then E[X] ≥ p·ϕ (for nonneg X).

Reference: Standard (Markov's inequality converse).

theorem InfoTheory.sqDist_nonneg {d : ℕ} (x y : Fin d → ℝ) :

sqDist x y ≥ 0

Helper: sqDist is always nonneg.

theorem InfoTheory.markov_step {Ω : Type u_1} [MeasurableSpace Ω] (μ : MeasureTheory.Measure Ω) (f : Ω → ℝ) (ϕ p : ℝ) (hϕ : 0 < ϕ) (hp : 0 < p) (hnn : ∀ (ω : Ω), f ω ≥ 0) (hint : MeasureTheory.Integrable f μ) (hprob : (μ {ω : Ω | f ω ≥ ϕ}).toReal ≥ p) :

p * ϕ ≤ ∫ (ω : Ω), f ω ∂μ

Theorem (Markov step): For a nonneg integrable function f, if P(f ≥ ϕ) ≥ p, then E[f] ≥ p·ϕ.

Proof: By Markov's inequality (mul_meas_ge_le_integral_of_nonneg), ϕ · μ.real{f ≥ ϕ} ≤ ∫ f dμ. Since p ≤ μ.real{f ≥ ϕ} (from hprob) and ϕ > 0: p·ϕ ≤ μ.real{f ≥ ϕ}·ϕ = ϕ·μ.real{f ≥ ϕ} ≤ ∫ f dμ.

theorem InfoTheory.markov_bridge_strong {d : ℕ} (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) (Θ : Set (Fin d → ℝ)) (ϕ p : ℝ) (hϕ : 0 < ϕ) (hp : 0 < p) (hint : ∀ (θhat : (Fin d → ℝ) → Fin d → ℝ), ∀ θ ∈ Θ, MeasureTheory.Integrable (fun (Y : Fin d → ℝ) => sqDist (θhat Y) θ) (P θ)) (hmeas : ∀ (θhat : (Fin d → ℝ) → Fin d → ℝ), ∀ θ ∈ Θ, MeasureTheory.AEStronglyMeasurable (fun (Y : Fin d → ℝ) => sqDist (θhat Y) θ) (P θ)) (hbdd : ∀ (θhat : (Fin d → ℝ) → Fin d → ℝ), BddAbove (Set.range fun (θ : Fin d → ℝ) => ⨆ (_ : θ ∈ Θ), ∫ (Y : Fin d → ℝ), sqDist (θhat Y) θ ∂P θ)) (hprob : ∀ (θhat : (Fin d → ℝ) → Fin d → ℝ), ∃ θ ∈ Θ, ((P θ) {Y : Fin d → ℝ | sqDist (θhat Y) θ ≥ ϕ}).toReal ≥ p) :

⨅ (θhat : (Fin d → ℝ) → Fin d → ℝ), ⨆ θ ∈ Θ, ∫ (Y : Fin d → ℝ), sqDist (θhat Y) θ ∂P θ ≥ p * ϕ

Theorem (Markov bridge, strong form): If for every estimator, there exists θ such that P_θ(|θ̂-θ|² ≥ ϕ) ≥ p, then inf_{θ̂} sup_θ E_θ[|θ̂-θ|²] ≥ p·ϕ.

The proof structure is:

le_ciInf: reduce to showing inequality for arbitrary θ̂
Extract witness θ₀ from hprob
Markov step: p·ϕ ≤ ∫ sqDist via markov_step (combines ϕ·P(f≥ϕ) ≤ ∫f with p ≤ P(f≥ϕ))
Sup step: lift to ⨆ θ ∈ Θ via le_ciSup_of_le

theorem InfoTheory.markov_bridge {d : ℕ} (P : (Fin d → ℝ) → MeasureTheory.Measure (Fin d → ℝ)) (Θ : Set (Fin d → ℝ)) (ϕ p : ℝ) (hϕ : 0 < ϕ) (hp : 0 < p) (hint : ∀ (θhat : (Fin d → ℝ) → Fin d → ℝ), ∀ θ ∈ Θ, MeasureTheory.Integrable (fun (Y : Fin d → ℝ) => sqDist (θhat Y) θ) (P θ)) (hmeas : ∀ (θhat : (Fin d → ℝ) → Fin d → ℝ), ∀ θ ∈ Θ, MeasureTheory.AEStronglyMeasurable (fun (Y : Fin d → ℝ) => sqDist (θhat Y) θ) (P θ)) (hbdd : ∀ (θhat : (Fin d → ℝ) → Fin d → ℝ), BddAbove (Set.range fun (θ : Fin d → ℝ) => ⨆ (_ : θ ∈ Θ), ∫ (Y : Fin d → ℝ), sqDist (θhat Y) θ ∂P θ)) (hprob : ∀ (θhat : (Fin d → ℝ) → Fin d → ℝ), ∃ θ ∈ Θ, ((P θ) {Y : Fin d → ℝ | sqDist (θhat Y) θ ≥ ϕ}).toReal ≥ p) :

⨅ (θhat : (Fin d → ℝ) → Fin d → ℝ), ⨆ θ ∈ Θ, ∫ (Y : Fin d → ℝ), sqDist (θhat Y) θ ∂P θ ≥ p * ϕ

Theorem (Markov bridge): If for every estimator, there exists θ such that P_θ(|θ̂-θ|² ≥ ϕ) ≥ p, then inf_{θ̂} sup_θ E_θ[|θ̂-θ|²] ≥ p·ϕ.

Proved from markov_bridge_strong.

Hamming ball volume × code size bound (Chernoff/entropy bound) #

For d ≥ 2 and any center ω₀ : Fin d → Bool, the Hamming ball of radius < d/4 has cardinality satisfying: ball_volume × max(⌈exp(d/8)⌉₊, 3) < 2^d

This encapsulates the Chernoff/entropy bound from Lemma 5.12: the binary entropy H(1/4) ≈ 0.81 satisfies H(1/4) + 1/(8·ln 2) < 1, so ball_volume ≤ 2^{d·H(1/4)} and M ≤ 2^{d/(8·ln 2)}, giving ball_volume × M ≤ 2^{d·(H(1/4) + 1/(8·ln 2))} < 2^d.

Reference: Rigollet, Lemma 5.12 proof (Chernoff/entropy bound on ball volume).

theorem InfoTheory.hammingDist_xor_eq {d : ℕ} (ω₀ ω : Fin d → Bool) :

hammingDist ω₀ ω = hammingDist (fun (x : Fin d) => false) fun (i : Fin d) => ω₀ i ^^ ω i

XOR bijection for Hamming balls: the XOR map ω ↦ ω ⊕ ω₀ is a bijection from the Hamming ball around ω₀ to the Hamming ball around the zero vector. This shows ball cardinality is center-independent.

theorem InfoTheory.xor_self_cancel {d : ℕ} (ω₀ ω : Fin d → Bool) :

(fun (i : Fin d) => ω₀ i ^^ (ω₀ i ^^ ω i)) = ω

theorem InfoTheory.hamming_ball_card_eq_zero_center {d : ℕ} (ω₀ : Fin d → Bool) :

{ω : Fin d → Bool | ↑(hammingDist ω₀ ω) < (1 / 2 - 1 / 4) * ↑d}.card = {ω : Fin d → Bool | ↑(hammingDist (fun (x : Fin d) => false) ω) < (1 / 2 - 1 / 4) * ↑d}.card

theorem InfoTheory.max_ceil_exp_div8_lt_two_pow (d : ℕ) (hd : 2 ≤ d) :

max ⌈Real.exp (↑d / 8)⌉₊ 3 < 2 ^ d

For d ≥ 2, the numerical bound max(⌈exp(d/8)⌉₊, 3) < 2^d.

Since 1/8 < ln 2, we have exp(d/8) < 2^d. The proof chains: ⌈exp(d/8)⌉₊ < exp(d/8) + 1 ≤ 2 · exp(d/8) = exp(ln 2 + d/8) ≤ exp(d · ln 2) = 2^d, where the last step uses ln 2 + d/8 ≤ d · ln 2 (equivalently ln 2 ≤ d·(ln 2 − 1/8), which holds for d ≥ 2 since ln 2 > 1/4).

theorem InfoTheory.tilting_identity (d : ℕ) :

∑ ω : Fin d → Bool, 3 ^ (d - hammingDist (fun (x : Fin d) => false) ω) = 4 ^ d

The tilting identity: ∑_{ω ∈ {0,1}^d} 3^(d - hammingWeight(ω)) = 4^d. This is the binomial theorem: ∑_k C(d,k) · 3^(d-k) · 1^k = (3+1)^d = 4^d.

theorem InfoTheory.chernoff_hammingDist_false_eq_true_count {d : ℕ} (ω : Fin d → Bool) :

hammingDist (fun (x : Fin d) => false) ω = {i : Fin d | ω i = true}.card

Chernoff/tilting bound on Hamming ball cardinality.

For the binary Hamming ball of radius < d/4 around the zero center, the cardinality satisfies ball_card * max(⌈exp(d/8)⌉₊, 3) < 2^d.

The proof uses the tilting method (Chernoff bound): the ball volume is bounded via the binomial theorem, giving ball_card * 3^(d-d/4) ≤ 4^d. Combined with the real inequality 2^d * (exp(d/8) + 1) < 3^(d-d/4) for d ≥ 19 (proved by induction with step 4), this yields the result for large d. Small d (2 ≤ d ≤ 18) is verified computationally via native_decide.

The numerical bound max(⌈exp(d/8)⌉₊, 3) < 2^d is proved in max_ceil_exp_div8_lt_two_pow. The ball volume bound requires establishing the entropy bound on partial binomial coefficient sums, which in turn needs Stirling-type estimates or the Chernoff method for binomial tails. These tools are not currently available in Mathlib.

Reference: Cover & Thomas, Elements of Information Theory, Theorem 11.1.2; Rigollet, Lemma 5.12 proof.

theorem InfoTheory.chernoff_prod_ite_eq_pow_sub {d : ℕ} (ω : Fin d → Bool) :

(∏ i : Fin d, if ω i = true then 1 else 3) = 3 ^ (d - {i : Fin d | ω i = true}.card)

theorem InfoTheory.chernoff_tilting_identity (d : ℕ) :

∑ ω : Fin d → Bool, 3 ^ (d - hammingDist (fun (x : Fin d) => false) ω) = 4 ^ d

theorem InfoTheory.chernoff_tilting_bound (d : ℕ) :

{ω : Fin d → Bool | 4 * hammingDist (fun (x : Fin d) => false) ω < d}.card * 3 ^ (d - (d - 1) / 4) ≤ 4 ^ d

theorem InfoTheory.chernoff_filter_eq (d : ℕ) :

{ω : Fin d → Bool | ↑(hammingDist (fun (x : Fin d) => false) ω) < (1 / 2 - 1 / 4) * ↑d} = {ω : Fin d → Bool | 4 * hammingDist (fun (x : Fin d) => false) ω < d}

theorem InfoTheory.chernoff_exp_five_eighths_lt_two :

Real.exp (5 / 8) < 2

theorem InfoTheory.chernoff_max_small (d : ℕ) (hd : d ≤ 5) :

max ⌈Real.exp (↑d / 8)⌉₊ 3 = 3

theorem InfoTheory.chernoff_exp_half_lt :

Real.exp (1 / 2) < 27 / 16

theorem InfoTheory.chernoff_induction_step (x : ℝ) (hx : 0 < x) :

16 * (x * Real.exp (1 / 2) + 1) < 27 * (x + 1)

theorem InfoTheory.chernoff_base6 :

2 ^ 6 * (Real.exp (6 / 8) + 1) < 3 ^ (6 - (6 - 1) / 4)

theorem InfoTheory.chernoff_base7 :

2 ^ 7 * (Real.exp (7 / 8) + 1) < 3 ^ (7 - (7 - 1) / 4)

theorem InfoTheory.chernoff_base8 :

2 ^ 8 * (Real.exp (8 / 8) + 1) < 3 ^ (8 - (8 - 1) / 4)

theorem InfoTheory.chernoff_base9 :

2 ^ 9 * (Real.exp (9 / 8) + 1) < 3 ^ (9 - (9 - 1) / 4)

theorem InfoTheory.chernoff_real_ineq (d : ℕ) (hd : 6 ≤ d) :

2 ^ d * (Real.exp (↑d / 8) + 1) < 3 ^ (d - (d - 1) / 4)

theorem InfoTheory.chernoff_exp34_gt_two :

2 < Real.exp (3 / 4)

theorem InfoTheory.chernoff_max_bound_large (d : ℕ) (hd : 6 ≤ d) :

↑(max ⌈Real.exp (↑d / 8)⌉₊ 3) ≤ Real.exp (↑d / 8) + 1

theorem InfoTheory.hamming_ball_volume_chernoff_bound (d : ℕ) (hd : 2 ≤ d) :

{ω : Fin d → Bool | ↑(hammingDist (fun (x : Fin d) => false) ω) < (1 / 2 - 1 / 4) * ↑d}.card * max ⌈Real.exp (↑d / 8)⌉₊ 3 < 2 ^ d

theorem InfoTheory.hamming_ball_volume_times_M_lt (d : ℕ) (hd : 2 ≤ d) (ω₀ : Fin d → Bool) :

{ω : Fin d → Bool | ↑(hammingDist ω₀ ω) < (1 / 2 - 1 / 4) * ↑d}.card * max ⌈Real.exp (↑d / 8)⌉₊ 3 < 2 ^ d

theorem InfoTheory.hamming_greedy_excluded_bound (d : ℕ) (hd : 2 ≤ d) (S : Finset (Fin d → Bool)) (hS : S.card + 1 ≤ max (⌈Real.exp (↑d / 8)⌉₊ + 1) 4) :

(S.biUnion fun (ω₀ : Fin d → Bool) => {ω : Fin d → Bool | ↑(hammingDist ω₀ ω) < (1 / 2 - 1 / 4) * ↑d}).card < 2 ^ d

Hamming ball exclusion bound (Hoeffding + ball volume counting).

For any set S of binary vectors of length d with |S| + 1 ≤ max(⌈exp(d/8)⌉₊ + 1, 4), the union of Hamming balls of radius < (1/2 - 1/4) * d around elements of S does not cover the full space {0,1}^d.

The proof reduces to the ball volume bound hamming_ball_volume_times_M_lt via the union bound: |⋃ ball(ω₀)| ≤ ∑ |ball(ω₀)| ≤ |S| × max_ball ≤ M × max_ball < 2^d.

Reference: Rigollet, Lemma 5.12 proof (Hoeffding bound on ball volume).

Verified: No sorry; axioms = {propext, Classical.choice, Quot.sound}.

theorem InfoTheory.greedy_step (d : ℕ) (hd : 2 ≤ d) (S : Finset (Fin d → Bool)) (hS : S.card + 1 ≤ max (⌈Real.exp (↑d / 8)⌉₊ + 1) 4) :

∃ (v : Fin d → Bool), ∀ ω₀ ∈ S, ↑(hammingDist ω₀ v) ≥ (1 / 2 - 1 / 4) * ↑d

Greedy step: given a set S of binary vectors with |S| + 1 ≤ M, there exists a vector far from all elements of S (Hamming distance ≥ (1/2 - 1/4) * d). Proved from hamming_greedy_excluded_bound.

theorem InfoTheory.exists_separated_vectors_greedy (d : ℕ) (hd : 2 ≤ d) (n : ℕ) (hn : n ≤ max (⌈Real.exp (↑d / 8)⌉₊ + 1) 4) :

∃ (f : Fin n → Fin d → Bool), ∀ (i j : Fin n), i ≠ j → ↑(hammingDist (f i) (f j)) ≥ (1 / 2 - 1 / 4) * ↑d

Greedy construction of n pairwise-separated binary vectors. For n ≤ max(⌈exp(d/8)⌉₊ + 1, 4), there exist n binary vectors of length d with pairwise Hamming distance ≥ (1/2 - 1/4) * d. Proved by induction using greedy_step.

theorem InfoTheory.greedy_separated_vectors (d : ℕ) (hd : 2 ≤ d) :

have M := max (⌈Real.exp (↑d / 8)⌉₊ + 1) 4; ∃ (ω : Fin M → Fin d → Bool), ∀ (j k : Fin M), j ≠ k → ↑(hammingDist (ω j) (ω k)) ≥ (1 / 2 - 1 / 4) * ↑d

Greedy construction of well-separated binary vectors (Hoeffding + ball volume).

For M = max(⌈exp(d/8)⌉₊ + 1, 4), there exist M binary vectors of length d with pairwise Hamming distance ≥ (1/2 - 1/4) * d.

Mathematical argument (greedy/iterative Varshamov-Gilbert):

At step k of the greedy construction, each of the k-1 existing vectors excludes at most exp(-d/8) · 2^d vectors from {0,1}^d (Hoeffding ball volume bound).
The construction succeeds as long as (k-1) · exp(-d/8) · 2^d < 2^d, i.e., k < exp(d/8) + 1.
This yields M ≥ exp(d/8) + 1 well-separated vectors.
The max with 4 ensures the downstream 4 ≤ M requirement for Corollary 5.13.

Proved from hamming_greedy_excluded_bound (via hamming_ball_volume_times_M_lt axiom, Chernoff/entropy bound) via greedy_step and exists_separated_vectors_greedy.

Reference: Rigollet, Lemma 5.12 (greedy variant) + Corollary 5.13 proof.

theorem InfoTheory.vg_strong_construction (d : ℕ) (hd : 2 ≤ d) :

∃ (M : ℕ) (_ : 0 < M) (ω : Fin M → Fin d → Bool), ↑M ≥ Real.exp (↑d / 8) + 1 ∧ 4 ≤ M ∧ ∀ (j k : Fin M), j ≠ k → ↑(hammingDist (ω j) (ω k)) ≥ (1 / 2 - 1 / 4) * ↑d

Helper (sorry'd): Strengthened probabilistic method for VG construction.

The standard probabilistic method (probabilistic_method_separated_vectors) constructs M binary vectors with M ≈ exp(γ²d/2). For Corollary 5.13 we need a construction with M large enough to satisfy both 4 ≤ M and log(M-1) ≥ d/8, while maintaining pairwise Hamming distance ≥ d/4.

This combines the Hoeffding-based probabilistic method with a more careful choice of M. The mathematical justification is the same probabilistic method (Hoeffding's inequality + counting), but with a tighter analysis.

Why sorry: Same foundation as hoeffding_pair_count_bound (Theorem 1.9, Hoeffding's inequality, not in Mathlib).

Reference: Rigollet, Lemma 5.12 + Corollary 5.13 proof.

theorem InfoTheory.varshamov_gilbert_for_cor513 (d : ℕ) (hd : 2 ≤ d) :

∃ (M : ℕ) (_ : 0 < M) (ω : Fin M → Fin d → Bool), ↑M ≥ Real.exp ((1 / 4) ^ 2 * ↑d / 2) ∧ 4 ≤ M ∧ Real.log (↑M - 1) ≥ ↑d / 8 ∧ ∀ (j k : Fin M), j ≠ k → ↑(hammingDist (ω j) (ω k)) ≥ (1 / 2 - 1 / 4) * ↑d

Varshamov-Gilbert for Corollary 5.13: Specialized VG with γ = 1/4 providing 4 ≤ M, log(M-1) ≥ d/8, and pairwise Hamming distance ≥ d/4. Proved from vg_strong_construction (sorry'd, same Hoeffding foundation).

Reference: Rigollet, Lemma 5.12 + proof of Corollary 5.13.