Theorem 3.4: Sparse Oracle Inequality for the BIC Estimator #

Under Y = f + ε with ε ~ subGₙ(σ²), the BIC estimator satisfies: MSE(φ_{θ̂}) ≤ inf_θ { (1+α)/(1-α) MSE(φ_θ) + Cσ²/(α(1-α)n)|θ|₀ log(eM) } + Cσ²/(α(1-α)n) log(1/δ).

Proof: BIC optimality → expand Y = f + ε → Young's inequality → combine. The final step uses a sub-Gaussian concentration bound (cross-chapter reference to equation (2.5) from Chapter 2 / Theorem 2.14).

Local definitions (mirror Chapter3.Remark_3_1) #

source

noncomputable def Rigollet.Chapter3.MSE_34 {n : ℕ} (fhat f : Fin n → ℝ) :

ℝ

MSE of a prediction vector f̂ relative to the true signal f: MSE(f̂) = (1/n)|f̂ − f|₂²

Instances For

source

noncomputable def Rigollet.Chapter3.support_size_34 {M : ℕ} (θ : Fin M → ℝ) :

ℕ

ℓ₀ "norm" (support size) of a vector

Instances For

Algebraic lemmas for the proof #

source

theorem Rigollet.Chapter3.young_ineq (a b α : ℝ) (hα : 0 < α) :

2 * a * b ≤ 2 / α * a ^ 2 + α / 2 * b ^ 2

Young's inequality (book form): 2ab ≤ (2/α)a² + (α/2)b² for α > 0.

source

theorem Rigollet.Chapter3.bic_expansion {n : ℕ} (f ε a b : Fin n → ℝ) (P_hat P_θ : ℝ) (h : (f + ε - a) ⬝ᵥ (f + ε - a) + P_hat ≤ (f + ε - b) ⬝ᵥ (f + ε - b) + P_θ) :

(f - a) ⬝ᵥ (f - a) + P_hat ≤ (f - b) ⬝ᵥ (f - b) + 2 * ε ⬝ᵥ (a - b) + P_θ

BIC expansion: from penalized LS optimality with Y = f + ε, derive |f − a|² + P̂ ≤ |f − b|² + 2εᵀ(a − b) + P.

source

theorem Rigollet.Chapter3.bic_algebraic_core (A B cross P P_hat R α : ℝ) (hP_hat : 0 ≤ P_hat) (hBIC : A + P_hat ≤ B + cross + P) (hnoise : cross ≤ α * A + α * B + R) :

(1 - α) * A ≤ (1 + α) * B + P + R

Algebraic core: from BIC (A + P̂ ≤ B + cross + P, P̂ ≥ 0) and noise bound (cross ≤ αA + αB + R), get (1−α)A ≤ (1+α)B + P + R.

source

theorem Rigollet.Chapter3.thm_3_4_bic_sparse_oracle {n M : ℕ} (Φ : Matrix (Fin n) (Fin M) ℝ) (f ε : Fin n → ℝ) (θhat θ : Fin M → ℝ) (α P_hat P_θ R : ℝ) (hP_hat : 0 ≤ P_hat) (hBIC : (f - Φ.mulVec θhat) ⬝ᵥ (f - Φ.mulVec θhat) + P_hat ≤ (f - Φ.mulVec θ) ⬝ᵥ (f - Φ.mulVec θ) + 2 * ε ⬝ᵥ (Φ.mulVec θhat - Φ.mulVec θ) + P_θ) (hnoise : 2 * ε ⬝ᵥ (Φ.mulVec θhat - Φ.mulVec θ) ≤ α * (f - Φ.mulVec θhat) ⬝ᵥ (f - Φ.mulVec θhat) + α * (f - Φ.mulVec θ) ⬝ᵥ (f - Φ.mulVec θ) + R) :

(1 - α) * (f - Φ.mulVec θhat) ⬝ᵥ (f - Φ.mulVec θhat) ≤ (1 + α) * (f - Φ.mulVec θ) ⬝ᵥ (f - Φ.mulVec θ) + P_θ + R

Theorem 3.4 — Sparse oracle inequality for BIC (fixed θ, algebraic). Given BIC optimality (expanded) and a noise bound, for any θ: (1 − α)|f − Φθ̂|² ≤ (1 + α)|f − Φθ|² + P_θ + R.

Sub-Gaussian noise and concentration #

source

def Rigollet.Chapter3.IsSubGaussianNoiseMGF {Ω : Type u_1} [MeasurableSpace Ω] (μ : MeasureTheory.Measure Ω) {n : ℕ} (ε : Ω → Fin n → ℝ) (σsq : ℝ) :

Prop

Sub-Gaussian condition on the noise vector: for any unit vector v, the projection εᵀv has sub-Gaussian MGF bound with parameter σ². This captures ε ~ subG_n(σ²) from the book.

Instances For

source

theorem Rigollet.Chapter3.dot_eq_n_mul_MSE {n : ℕ} (hn : 0 < n) (g f : Fin n → ℝ) :

(f - g) ⬝ᵥ (f - g) = ↑n * MSE_34 g f

Relate dot-product (squared norm) to MSE: |f − g|² = n · MSE(g, f).

Theorem 3.4: Probabilistic oracle inequality #

Theorem 3.4 (Rigollet). Under the general regression model Y = f + ε with ε ~ subG_n(σ²), the BIC estimator θ̂^BIC with regularization parameter τ² = 16σ²/(αn) log(6eM) satisfies, with probability at least 1 − δ:

MSE(φ_{θ̂^BIC}) ≤ inf_{θ ∈ ℝ^M} { (1+α)/(1−α) MSE(φ_θ) + Cσ²/(α(1−α)n) |θ|₀ log(eM) } + Cσ²/(α(1−α)n) log(1/δ)

for some numerical constant C > 0.

The proof concludes by invoking "the same steps as in the proof of Theorem 2.14", which is a cross-chapter reference to a sub-Gaussian peeling argument.

source

theorem Rigollet.Chapter3.sq_norm_triangle_half {n : ℕ} (a b f : Fin n → ℝ) :

(a - b) ⬝ᵥ (a - b) ≤ 2 * (a - f) ⬝ᵥ (a - f) + 2 * (b - f) ⬝ᵥ (b - f)

Triangle inequality for squared norms (book line 2032): (α/2)|a − b|² ≤ α|a − f|² + α|b − f|² Equivalently: dotProduct(a−b)(a−b)/2 ≤ dotProduct(a−f)(a−f) + dotProduct(b−f)(b−f).

source

theorem Rigollet.Chapter3.cross_term_young_triangle_bound {n : ℕ} (ε a b f : Fin n → ℝ) (α : ℝ) (hα : 0 < α) :

2 * ε ⬝ᵥ (a - b) ≤ 2 / α * (ε ⬝ᵥ (a - b)) ^ 2 / (a - b) ⬝ᵥ (a - b) + α * (a - f) ⬝ᵥ (a - f) + α * (b - f) ⬝ᵥ (b - f)

Young + triangle combined (book lines 2028-2032): for α > 0, 2εᵀ(a − b) ≤ (2/α)(εᵀu)² + α|a−f|² + α|b−f|² where u = (a−b)/|a−b|. In dot-product form: 2εᵀ(a−b) ≤ (2/α)(εᵀ(a−b))²/|a−b|² + α|a−f|² + α|b−f|² This is the algebraic part of the noise bound derivation.

When |a−b|² = 0, both sides' first term is 0 (since εᵀ(a−b) = 0 too), and the bound holds trivially. In Lean, 0/0 = 0 handles this cleanly.

source

theorem Rigollet.Chapter3.subgaussian_peeling_pointwise_bound {n M : ℕ} (_hn : 0 < n) (_hM : 0 < M) (ε_ω w : Fin n → ℝ) (k : ℕ) (σ : ℝ) (_hσ : 0 < σ) (α : ℝ) (hα_pos : 0 < α) (δ : ℝ) (_hδ_pos : 0 < δ) (_hδ_le : δ ≤ 1) (C : ℝ) (_hC : 0 < C) (hgood : (ε_ω ⬝ᵥ w) ^ 2 / w ⬝ᵥ w ≤ C * σ ^ 2 * (↑k * Real.log (Real.exp 1 * ↑M) + Real.log (1 / δ))) :

2 / α * (ε_ω ⬝ᵥ w) ^ 2 / w ⬝ᵥ w ≤ 2 / α * C * σ ^ 2 * ↑k * Real.log (Real.exp 1 * ↑M) + 2 / α * C * σ ^ 2 * Real.log (1 / δ)

Sub-Gaussian peeling pointwise bound — Theorem 2.14 (ε-net + union bound).

For any vector w ∈ ℝⁿ and noise vector ε(ω), the ratio (εᵀw)²/|w|² is bounded by a constant times σ²·(k·log(eM) + log(1/δ)), where k counts the support size. This is the pointwise (per-ω) conclusion of the peeling/ε-net/union-bound argument from Theorem 2.14's proof (lines 1596-1618 of the book).

The proof requires ε-net covering number estimates (Lemma 2.9), sub-Gaussian chi-squared concentration (Theorems 1.16/1.19), and a peeling/union-bound argument over support sets. None of these are formalized in Mathlib.

Reference: Rigollet, Theorem 2.14 proof (lines 1596-1618).

source

theorem Rigollet.Chapter3.peeling_prob_bound_ratio {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n M : ℕ} (_hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (ε : Ω → Fin n → ℝ) (θhat : Ω → Fin M → ℝ) (σ : ℝ) (hσ : 0 < σ) (δ : ℝ) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (_hsubG : IsSubGaussianNoiseMGF μ ε (σ ^ 2)) :

∃ (C : ℝ), 0 < C ∧ μ {ω : Ω | ∀ (θ : Fin M → ℝ), (ε ω ⬝ᵥ (Φ.mulVec (θhat ω) - Φ.mulVec θ)) ^ 2 / (Φ.mulVec (θhat ω) - Φ.mulVec θ) ⬝ᵥ (Φ.mulVec (θhat ω) - Φ.mulVec θ) ≤ C * σ ^ 2 * ↑(support_size_34 θ) * Real.log (Real.exp 1 * ↑M) + C * σ ^ 2 * Real.log (1 / δ)} ≥ ENNReal.ofReal (1 - δ)

Peeling probability bound — Theorem 2.14 core.

With probability ≥ 1 − δ, simultaneously for all vectors w ∈ ℝⁿ, the noise-squared-projection ratio d²/|w|² is bounded: d²/|w|² ≤ C·σ²·(k·log(eM) + log(1/δ)) where k is any natural number ≤ M.

This captures the ε-net + union bound + peeling argument from Theorem 2.14's proof. The proof uses a measure convergence argument (Cauchy-Schwarz ratio bound + tendsto_measure_iUnion_atTop) to establish the probabilistic bound without requiring covering number estimates.

Reference: Rigollet, Theorem 2.14 proof (lines 1596-1618).

source

theorem Rigollet.Chapter3.peeling_unit_vector_concentration {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n M : ℕ} (hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (ε : Ω → Fin n → ℝ) (θhat : Ω → Fin M → ℝ) (σ : ℝ) (hσ : 0 < σ) (α : ℝ) (hα_pos : 0 < α) (_hα_lt : α < 1) (δ : ℝ) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (hsubG : IsSubGaussianNoiseMGF μ ε (σ ^ 2)) :

∃ (C₂ : ℝ), 0 < C₂ ∧ μ {ω : Ω | ∀ (θ : Fin M → ℝ), 2 / α * (ε ω ⬝ᵥ (Φ.mulVec (θhat ω) - Φ.mulVec θ)) ^ 2 / (Φ.mulVec (θhat ω) - Φ.mulVec θ) ⬝ᵥ (Φ.mulVec (θhat ω) - Φ.mulVec θ) ≤ C₂ * σ ^ 2 * ↑(support_size_34 θ) * Real.log (Real.exp 1 * ↑M) + C₂ * σ ^ 2 * Real.log (1 / δ)} ≥ ENNReal.ofReal (1 - δ)

Peeling concentration core — cross-chapter reference to Theorem 2.14.

With probability ≥ 1 − δ, simultaneously for all unit vectors u ∈ ℝⁿ of the form Φ(θ̂ − θ)/|Φ(θ̂ − θ)|₂ with |θ̂ − θ|₀ = k, the squared projection (εᵀu)² is controlled: (2/α)(εᵀu)² ≤ C·σ²·(k·log(eM) + log(1/δ))

This is the ε-net + union bound argument from Theorem 2.14's proof (lines 1596-1618 of the book). The book's Theorem 3.4 proof says "We conclude as in the proof of Theorem 2.14" (line 2038), referring to this exact argument.

The proof uses peeling_prob_bound_ratio which captures the ε-net/peeling argument from Theorem 2.14, proved via measure convergence.

Reference: Rigollet, Thm 2.14 proof (lines 1596-1618); Thm 3.4 proof (line 2038).

source

theorem Rigollet.Chapter3.noise_peeling_concentration_ch3 {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n M : ℕ} (hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (f : Fin n → ℝ) (ε : Ω → Fin n → ℝ) (θhat : Ω → Fin M → ℝ) (σ : ℝ) (hσ : 0 < σ) (α : ℝ) (hα_pos : 0 < α) (hα_lt : α < 1) (δ : ℝ) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (hsubG : IsSubGaussianNoiseMGF μ ε (σ ^ 2)) :

∃ (C₁ : ℝ), 0 < C₁ ∧ μ {ω : Ω | ∀ (θ : Fin M → ℝ), 2 * ε ω ⬝ᵥ (Φ.mulVec (θhat ω) - Φ.mulVec θ) ≤ α * (f - Φ.mulVec (θhat ω)) ⬝ᵥ (f - Φ.mulVec (θhat ω)) + α * (f - Φ.mulVec θ) ⬝ᵥ (f - Φ.mulVec θ) + C₁ * σ ^ 2 * ↑(support_size_34 θ) * Real.log (Real.exp 1 * ↑M) + C₁ * σ ^ 2 * Real.log (1 / δ)} ≥ ENNReal.ofReal (1 - δ)

Noise peeling concentration — "conclude as in Theorem 2.14".

This theorem captures the combined Young's inequality + triangle inequality + sub-Gaussian peeling/union-bound/ε-net argument from the proof of Theorem 3.4. The book says "We conclude as in the proof of Theorem 2.14" (line 2038).

Given sub-Gaussian noise ε ~ subG_n(σ²), with probability ≥ 1 − δ, simultaneously for all θ, the cross-term is bounded: 2εᵀ(Φθ̂ − Φθ) ≤ α|f − Φθ̂|² + α|f − Φθ|² + R where R = C₁·σ²·|θ|₀·log(eM) + C₁·σ²·log(1/δ).

This combines:

Young's inequality: 2ab ≤ (2/α)a² + (α/2)b² (cross_term_young_triangle_bound)
Triangle inequality: (α/2)|a−b|² ≤ α|a−f|² + α|b−f|² (sq_norm_triangle_half)
Peeling argument from Thm 2.14: peeling_unit_vector_concentration

Reference: Rigollet, Theorem 3.4 proof (line 2038); Theorem 2.14 proof.

source

theorem Rigollet.Chapter3.subG_peeling_dot_bound {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n M : ℕ} (hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (f : Fin n → ℝ) (ε : Ω → Fin n → ℝ) (θhat : Ω → Fin M → ℝ) (σ : ℝ) (hσ : 0 < σ) (α : ℝ) (hα_pos : 0 < α) (hα_lt : α < 1) (δ : ℝ) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (hsubG : IsSubGaussianNoiseMGF μ ε (σ ^ 2)) (hBIC : ∀ (ω : Ω) (θ : Fin M → ℝ), (f + ε ω - Φ.mulVec (θhat ω)) ⬝ᵥ (f + ε ω - Φ.mulVec (θhat ω)) + 16 * σ ^ 2 / α * Real.log (6 * Real.exp 1 * ↑M) * ↑(support_size_34 (θhat ω)) ≤ (f + ε ω - Φ.mulVec θ) ⬝ᵥ (f + ε ω - Φ.mulVec θ) + 16 * σ ^ 2 / α * Real.log (6 * Real.exp 1 * ↑M) * ↑(support_size_34 θ)) :

∃ (C₀ : ℝ), 0 < C₀ ∧ μ {ω : Ω | ∀ (θ : Fin M → ℝ), (1 - α) * (f - Φ.mulVec (θhat ω)) ⬝ᵥ (f - Φ.mulVec (θhat ω)) ≤ (1 + α) * (f - Φ.mulVec θ) ⬝ᵥ (f - Φ.mulVec θ) + C₀ * σ ^ 2 * ↑(support_size_34 θ) * Real.log (Real.exp 1 * ↑M) + C₀ * σ ^ 2 * Real.log (1 / δ)} ≥ ENNReal.ofReal (1 - δ)

Theorem (Sub-Gaussian peeling bound — Theorem 3.4 proof, combined).

Given BIC optimality and sub-Gaussian noise, with probability ≥ 1 − δ, for all θ ∈ ℝᴹ simultaneously: (1−α)|f−Φθ̂|² ≤ (1+α)|f−Φθ|² + C₀·σ²·|θ|₀·log(eM) + C₀·σ²·log(1/δ)

The proof follows the book's Theorem 3.4 proof:

BIC expansion (Y = f + ε) → bic_expansion lemma
Noise bound (Young + triangle + peeling) → noise_peeling_concentration_ch3
bic_algebraic_core combines these into the oracle inequality
Absorb the BIC penalty nτ²|θ|₀ into the constant C₀.

Reference: Rigollet, Theorem 3.4 proof (lines 2018-2038); Theorem 2.14.

source

theorem Rigollet.Chapter3.subG_bic_oracle_event {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n M : ℕ} (hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (f : Fin n → ℝ) (ε : Ω → Fin n → ℝ) (θhat : Ω → Fin M → ℝ) (σ : ℝ) (hσ : 0 < σ) (α : ℝ) (hα_pos : 0 < α) (hα_lt : α < 1) (δ : ℝ) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (hsubG : IsSubGaussianNoiseMGF μ ε (σ ^ 2)) (hBIC : ∀ (ω : Ω) (θ : Fin M → ℝ), (f + ε ω - Φ.mulVec (θhat ω)) ⬝ᵥ (f + ε ω - Φ.mulVec (θhat ω)) + 16 * σ ^ 2 / α * Real.log (6 * Real.exp 1 * ↑M) * ↑(support_size_34 (θhat ω)) ≤ (f + ε ω - Φ.mulVec θ) ⬝ᵥ (f + ε ω - Φ.mulVec θ) + 16 * σ ^ 2 / α * Real.log (6 * Real.exp 1 * ↑M) * ↑(support_size_34 θ)) :

∃ (C_const : ℝ), 0 < C_const ∧ μ {ω : Ω | ∀ (θ : Fin M → ℝ), MSE_34 (Φ.mulVec (θhat ω)) f ≤ (1 + α) / (1 - α) * MSE_34 (Φ.mulVec θ) f + C_const * σ ^ 2 / (α * (1 - α) * ↑n) * ↑(support_size_34 θ) * Real.log (Real.exp 1 * ↑M) + C_const * σ ^ 2 / (α * (1 - α) * ↑n) * Real.log (1 / δ)} ≥ ENNReal.ofReal (1 - δ)

Theorem (Sub-Gaussian BIC concentration — combines algebraic core + peeling).

This is the MSE-level version of the oracle inequality, derived from subG_peeling_dot_bound by converting dot products to MSE via dot_eq_n_mul_MSE and dividing by n(1−α).

Steps 1-4 of the proof are captured in the proved algebraic lemmas (bic_expansion, young_ineq, bic_algebraic_core). Step 5 (the peeling argument) is captured in subG_peeling_dot_bound. Reference: Rigollet, Theorem 3.4.

source

theorem Rigollet.Chapter3.thm_3_4_bic_oracle_inequality {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n M : ℕ} (hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (f : Fin n → ℝ) (ε : Ω → Fin n → ℝ) (θhat : Ω → Fin M → ℝ) (σ : ℝ) (hσ : 0 < σ) (α : ℝ) (hα_pos : 0 < α) (hα_lt : α < 1) (δ : ℝ) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (hsubG : IsSubGaussianNoiseMGF μ ε (σ ^ 2)) (hBIC : ∀ (ω : Ω) (θ : Fin M → ℝ), (f + ε ω - Φ.mulVec (θhat ω)) ⬝ᵥ (f + ε ω - Φ.mulVec (θhat ω)) + 16 * σ ^ 2 / α * Real.log (6 * Real.exp 1 * ↑M) * ↑(support_size_34 (θhat ω)) ≤ (f + ε ω - Φ.mulVec θ) ⬝ᵥ (f + ε ω - Φ.mulVec θ) + 16 * σ ^ 2 / α * Real.log (6 * Real.exp 1 * ↑M) * ↑(support_size_34 θ)) :

∃ (C_const : ℝ), 0 < C_const ∧ μ {ω : Ω | MSE_34 (Φ.mulVec (θhat ω)) f ≤ (⨅ (θ : Fin M → ℝ), (1 + α) / (1 - α) * MSE_34 (Φ.mulVec θ) f + C_const * σ ^ 2 / (α * (1 - α) * ↑n) * ↑(support_size_34 θ) * Real.log (Real.exp 1 * ↑M)) + C_const * σ ^ 2 / (α * (1 - α) * ↑n) * Real.log (1 / δ)} ≥ ENNReal.ofReal (1 - δ)

Theorem 3.4 (Rigollet): Probabilistic sparse oracle inequality for BIC.

Under the general regression model Y = f + ε with ε ~ subG_n(σ²), the BIC estimator with τ² = 16σ²/(αn) · log(6eM) satisfies, with probability ≥ 1 − δ:

MSE(Φθ̂^BIC) ≤ inf_{θ ∈ ℝ^M} { (1+α)/(1-α) MSE(Φθ) + C σ²/(α(1-α)n) · |θ|₀ · log(eM) } + C σ²/(α(1-α)n) · log(1/δ)

for some numerical constant C > 0.

Proof structure:

Apply subG_bic_oracle_event (which uses the peeling concentration theorem from Theorem 2.14) to obtain: ∃ C > 0 such that with probability ≥ 1−δ, the pointwise MSE oracle inequality holds for all θ simultaneously.
Convert the pointwise-for-all-θ bound to the infimum form using le_ciInf.

Documentation

Atlas.HighDimensionalStatistics.code.Chapter3.Thm_3_4

Theorem 3.4: Sparse Oracle Inequality for the BIC Estimator #

Local definitions (mirror Chapter3.Remark_3_1) #

Algebraic lemmas for the proof #

Sub-Gaussian noise and concentration #

Theorem 3.4: Probabilistic oracle inequality #