Documentation

Atlas.HighDimensionalStatistics.code.Chapter2.Thm_2_6

theorem dotProduct_sub_self {m : Type u_1} [Fintype m] (a b : m) :
(a - b) ⬝ᵥ (a - b) = a ⬝ᵥ a - 2 * a ⬝ᵥ b + b ⬝ᵥ b

Helper: expansion of dot product of differences. ⟨a - b, a - b⟩ = ⟨a, a⟩ - 2⟨a, b⟩ + ⟨b, b⟩

theorem thm_2_6_sparse_ls_bound {n d : } (X : Matrix (Fin n) (Fin d) ) (θstar : Fin d) (eps : Fin n) (k : ) (_hk : 1 k) (θhat : Fin d) (hmin : ∀ (θ : Fin d), {j : Fin d | θ j 0}.card k → (X.mulVec θstar + eps - X.mulVec θhat) ⬝ᵥ (X.mulVec θstar + eps - X.mulVec θhat) (X.mulVec θstar + eps - X.mulVec θ) ⬝ᵥ (X.mulVec θstar + eps - X.mulVec θ)) (_hhat_sparse : {j : Fin d | θhat j 0}.card k) (hstar_sparse : {j : Fin d | θstar j 0}.card k) (t : ) (_ht : 0 t) (hnoise : eps ⬝ᵥ X.mulVec (θhat - θstar) t) :
X.mulVec (θhat - θstar) ⬝ᵥ X.mulVec (θhat - θstar) 2 * t

Theorem 2.6 (ℓ₀-constrained LS tail bound, equation 2.6). If θ̂ minimizes |Y - Xθ|² over k-sparse θ and θ* is k-sparse, then P(|X(θ̂ - θ*)|² > 4t) ≤ C(d,2k) · 6^{2k} · exp(-t/(8σ²)) We state the deterministic version: given the fundamental inequality and a bound on the noise over all 2k-sparse supports.

theorem cor_2_8_mse_bound (n d k : ) (hn : 0 < n) (_hk : 1 k) (_hkd : 2 * k d) (σsq δ : ) (_hσ : 0 < σsq) (_hδ : 0 < δ) (sq_pred_err : ) (hbound : sq_pred_err 4 * (8 * σsq * (2 * k * Real.log (Real.exp 1 * d / (2 * k)) + 2 * k * Real.log 6 + Real.log (1 / δ)))) :
sq_pred_err / n 32 * σsq / n * (2 * k * Real.log (Real.exp 1 * d / (2 * k)) + 2 * k * Real.log 6 + Real.log (1 / δ))

Corollary 2.8 (ℓ₀-constrained LS MSE bound). Using Lemma 2.7, C(d,2k) ≤ (ed/(2k))^{2k}, the tail bound from Thm 2.6 yields the high-probability MSE bound: |X(θ̂ - θ*)|²/n ≤ (32σ²/n) · (2k · log(ed/(2k)) + 2k · log 6 + log(1/δ)) We state: if sq_pred_err satisfies the Thm 2.6 bound with threshold determined by σ², d, k, δ, then the MSE is controlled.

Probabilistic versions of Theorem 2.6 #

The book proof of Theorem 2.6 derives equation (2.6): P(|X(θ̂-θ*)|² > 4t) ≤ C(d,2k) · 6^{2k} · exp(-t/(8σ²)) by combining the fundamental inequality with sub-Gaussian concentration (Theorem 1.19) on each 2k-dimensional subspace and a union bound over C(d,2k) choices of support. Setting t = 8σ²·L where L = log C(d,2k) + 2k·log 6 + log(1/δ), the RHS becomes ≤ δ.

Applying Lemma 2.7: C(d,2k) ≤ (ed/(2k))^{2k}, so log C(d,2k) ≤ 2k·log(ed/(2k)), and the MSE bound becomes Corollary 2.8: MSE ≤ (32σ²/n)(2k·log(ed/(2k)) + 2k·log 6 + log(1/δ)).

Helper lemmas for the probabilistic proof #

theorem mul_exp_neg_le_of_ge_log {M δ L : } (hM : 0 < M) ( : 0 < δ) (hL : L Real.log (M / δ)) :
M * Real.exp (-L) δ

If M > 0, δ > 0, and L ≥ log(M/δ), then M · exp(-L) ≤ δ. This is the key algebraic step: the exponential decay of exp(-L) compensates the multiplicative factor M.

theorem tail_rhs_le_delta (d k : ) (δ : ) (hk : 1 k) (hkd : 2 * k d) (hδ_pos : 0 < δ) :
(d.choose (2 * k)) * 6 ^ (2 * k) * Real.exp (-(2 * k * Real.log (Real.exp 1 * d / (2 * k)) + 2 * k * Real.log 6 + Real.log (1 / δ))) δ

The tail bound RHS at the specific threshold t₀ = 8σ²·L evaluates to ≤ δ.

This uses Lemma 2.7 (C(d,2k) ≤ (ed/(2k))^{2k}) to show: C(d,2k) · 6^{2k} · exp(-L) ≤ δ where L = 2k·log(ed/(2k)) + 2k·log 6 + log(1/δ).

The chain:

  1. -t₀/(8σ²) = -L (by definition of t₀)
  2. log(C(d,2k) · 6^{2k} / δ) = log(C(d,2k)) + log(6^{2k}) + log(1/δ)
  3. By Lemma 2.7: log(C(d,2k)) ≤ 2k·log(ed/(2k))
  4. So L ≥ log(C(d,2k) · 6^{2k} / δ)
  5. Apply mul_exp_neg_le_of_ge_log
theorem L_pos (k d : ) (δ : ) (hk : 1 k) (hkd : 2 * k d) (hδ_pos : 0 < δ) (hδ_le : δ 1) :
0 < 2 * k * Real.log (Real.exp 1 * d / (2 * k)) + 2 * k * Real.log 6 + Real.log (1 / δ)

Positivity of the log-sum threshold L.

theorem prob_good_of_tail_bound {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {f : Ω} {b δ : } (hδ_pos : 0 < δ) (htail : μ {ω : Ω | f ω > b} ENNReal.ofReal δ) :
μ {ω : Ω | f ω b} ENNReal.ofReal (1 - δ)

From a tail bound μ({f > b}) ≤ ENNReal.ofReal δ, derive the high-probability bound μ({f ≤ b}) ≥ ENNReal.ofReal (1 - δ) using probability complement.

theorem exp_dotProduct_integrable_of_subG {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n : } {ε : ΩFin n} {σ : } (_hσ : 0 < σ) (hsubG : ∀ (v : Fin n), v ⬝ᵥ v 1∀ (s : ), MeasureTheory.Integrable (fun (ω : Ω) => Real.exp (s * ε ω ⬝ᵥ v)) μ (ω : Ω), Real.exp (s * ε ω ⬝ᵥ v) μ Real.exp (s ^ 2 * σ ^ 2 / 2)) (v : Fin n) (hv : v ⬝ᵥ v 1) (s : ) :
MeasureTheory.Integrable (fun (ω : Ω) => Real.exp (s * ε ω ⬝ᵥ v)) μ

Integrability bridge for the Chernoff method.

The sub-Gaussian MGF bound ∫ exp(s·⟨ε,v⟩) ≤ exp(s²σ²/2) implies that exp(s · ⟨ε(·),v⟩) is integrable. In the book (Definition 1.2), sub-Gaussian random variables are defined to have finite MGF, which implies integrability.

In our formalization, the hsubG hypothesis bundles both integrability and the MGF bound, following Mathlib's ProbabilityTheory.Kernel.HasSubgaussianMGF pattern where integrability is a separate field alongside the MGF bound. This matches the book's implicit convention that sub-Gaussian random variables have finite MGF (Definition 1.2).

theorem chernoff_tail_from_mgf {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n : } {ε : ΩFin n} {σ : } ( : 0 < σ) (hsubG : ∀ (v : Fin n), v ⬝ᵥ v 1∀ (s : ), MeasureTheory.Integrable (fun (ω : Ω) => Real.exp (s * ε ω ⬝ᵥ v)) μ (ω : Ω), Real.exp (s * ε ω ⬝ᵥ v) μ Real.exp (s ^ 2 * σ ^ 2 / 2)) (v : Fin n) (hv : v ⬝ᵥ v 1) (t' : ) (ht' : 0 < t') :
μ {ω : Ω | ε ω ⬝ᵥ v > t'} ENNReal.ofReal (Real.exp (-(t' ^ 2 / (2 * σ ^ 2))))

Chernoff tail bound from MGF hypothesis.

This bridges the raw MGF bound (as given in hsubG) to a tail probability bound. The proof follows exactly the Chernoff method used in Lemma 1.3 of the book: P(X > t) ≤ inf_{s>0} exp(-st) · E[exp(sX)] ≤ inf_{s>0} exp(s²σ²/2 - st) = exp(-t²/(2σ²))

The integrability of exp(s * ⟨ε,v⟩) is provided by exp_dotProduct_integrable_of_subG.

Rigollet, Chapter 1, Lemma 1.3.

theorem epsilon_net_covering_number {n : } (B : Finset (Fin n)) :
∃ (N : Finset (Fin n)), N.card 6 ^ B.card (∀ zN, z ⬝ᵥ z 1) (∀ zN, z Submodule.span B) ∀ (v : Fin n), v ⬝ᵥ v 1v Submodule.span BzN, (v - z) ⬝ᵥ (v - z) 1 / 4

ε-net covering number bound (Lemma 1.18 from Chapter 1).

For a subspace spanned by a finite set B of vectors in ℝ^n, the intersection of the unit ball with the subspace can be covered by at most 6^|B| balls of radius 1/2. That is, there exists a finite set N of unit vectors in span(B) with |N| ≤ 6^|B| such that every unit vector in span(B) has a net element within squared distance ≤ 1/4 (i.e., Euclidean distance ≤ 1/2).

The proof uses a greedy construction: iteratively pick points in B₂ ∩ span(B) that are at distance > 1/2 from all previously chosen points, then apply a volume comparison argument (balls of radius 1/4 around net points are disjoint and contained in (1+1/4)·B₂ ∩ span(B)) to bound the count.

Rigollet, Chapter 1, Lemma 1.18.

theorem epsilon_net_event_containment {n : } {N B : Finset (Fin n)} (_hN_unit : zN, z ⬝ᵥ z 1) (hN_span : zN, z Submodule.span B) (hN_cover : ∀ (v : Fin n), v ⬝ᵥ v 1v Submodule.span BzN, (v - z) ⬝ᵥ (v - z) 1 / 4) (V : Set (Fin n)) (hV_unit : vV, v ⬝ᵥ v 1) (hV_span : vV, v Submodule.span B) (ε_val : Fin n) (τ : ) (_hτ : 0 < τ) :
(∃ vV, ε_val ⬝ᵥ v > τ)zN, ε_val ⬝ᵥ z > τ / 2

Event containment from ε-net (part of Theorem 1.19, Chapter 1).

This is the iterative ε-net argument: if ε⬝v > τ for some unit vector v in the subspace span(B), then there exists a net element z with ε⬝z > τ/2.

The proof uses the geometric series argument: sup_{v ∈ V∩B₁} ε⬝v ≤ (1/(1-1/2)) · max_{z ∈ N} ε⬝z = 2 · max_{z ∈ N} ε⬝z where the contraction factor 1/2 comes from the 1/2-net property.

Rigollet, Chapter 1, Theorem 1.19.

theorem subG_subspace_sup_tail_bound {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n : } (ε : ΩFin n) (m : ) (σ : ) ( : 0 < σ) (hsubG : ∀ (v : Fin n), v ⬝ᵥ v 1∀ (s : ), MeasureTheory.Integrable (fun (ω : Ω) => Real.exp (s * ε ω ⬝ᵥ v)) μ (ω : Ω), Real.exp (s * ε ω ⬝ᵥ v) μ Real.exp (s ^ 2 * σ ^ 2 / 2)) (V : Set (Fin n)) (hV_unit : vV, v ⬝ᵥ v 1) (hV_dim : ∃ (B : Finset (Fin n)), B.card m vV, v Submodule.span B) (τ : ) ( : 0 < τ) :
μ {ω : Ω | vV, ε ω ⬝ᵥ v > τ} ENNReal.ofReal (6 ^ m * Real.exp (-(τ ^ 2 / (8 * σ ^ 2))))

Bridging lemma: Theorem 1.19's tail bound holds for sub-Gaussian ε (given in MGF form) projected onto any subspace of dimension ≤ m.

Specifically, for any finite set of unit vectors V = {v₁,...,vN} in ℝ^n with N ≤ 6^m, and any ε satisfying the MGF sub-Gaussian hypothesis, the union bound + Chernoff gives: P(∃ i, ε⬝vᵢ > τ) ≤ N · exp(-τ²/(2σ²)) ≤ 6^m · exp(-τ²/(2σ²))

The full ε-net argument from Theorem 1.19 combines this with event containment: if ∃ unit v in a ≤m-dim subspace with ε⬝v > t, then ∃ net element z with ε⬝z > t/2. So: P(sup_{unit v ∈ V_S} ε⬝v > t) ≤ 6^m · exp(-(t/2)²/(2σ²)) = 6^m · exp(-t²/(8σ²))

This combines the ε-net covering (Lemma 1.18), geometric event containment, per-element Chernoff bounds, and the union bound. The proof is given by Theorem 1.19 in Chapter 1.

Verified: no sorry, no axiom. Only standard Lean axioms (propext, Classical.choice, Quot.sound).

Proof structure:

  1. Obtain spanning set B and construct ε-net N via epsilon_net_covering_number
  2. Event containment: {∃ v ∈ V, ε⬝v > τ} ⊆ ⋃_{z ∈ N} {ε⬝z > τ/2} via epsilon_net_event_containment
  3. Per-element Chernoff bound via chernoff_tail_from_mgf
  4. Union bound: ∑_{z ∈ N} bound ≤ |N| · bound ≤ 6^m · exp(-τ²/(8σ²))
theorem per_support_concentration_bound {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n d : } (X : Matrix (Fin n) (Fin d) ) (ε : ΩFin n) (k : ) (_hk : 1 k) (_hkd : 2 * k d) (σ : ) ( : 0 < σ) (hsubG : ∀ (v : Fin n), v ⬝ᵥ v 1∀ (s : ), MeasureTheory.Integrable (fun (ω : Ω) => Real.exp (s * ε ω ⬝ᵥ v)) μ (ω : Ω), Real.exp (s * ε ω ⬝ᵥ v) μ Real.exp (s ^ 2 * σ ^ 2 / 2)) (S : Finset (Fin d)) (hS : S.card = 2 * k) (t : ) (ht : 0 < t) :
μ {ω : Ω | ∃ (w : Fin d), (∀ jS, w j = 0) X.mulVec w ⬝ᵥ X.mulVec w 1 ε ω ⬝ᵥ X.mulVec w > t} ENNReal.ofReal (6 ^ (2 * k) * Real.exp (-t / (8 * σ ^ 2)))

Per-support subspace concentration bound.

For a fixed support S ⊆ [d] with |S| = 2k and the corresponding column subspace V_S = colspan(X restricted to S), the probability that ε correlates with some unit vector in V_S beyond √t is bounded by 6^(2k) · exp(-t/(8σ²)).

The proof combines:

  1. A 1/2-net of the unit ball in V_S (dimension ≤ 2k), via Lemma 1.18
  2. Event containment: sup_{v ∈ V_S ∩ B₁} ε⬝v ≤ 2 · max_{z ∈ net} ε⬝z
  3. Per-net-element Chernoff bound from chernoff_tail_from_mgf
  4. Union bound over the ≤ 6^(2k) net elements

This is the per-subspace version of Theorem 1.19, specialized to 2k-dimensional column subspaces of X. The proof in the book (p. 33) applies Theorem 1.19 to each such subspace separately.

Rigollet, Chapter 2, equation (2.6), per-subspace step.

theorem sparse_ls_tail_bound {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n d : } (X : Matrix (Fin n) (Fin d) ) (θstar : Fin d) (ε : ΩFin n) (θhat : ΩFin d) (k : ) (hk : 1 k) (hkd : 2 * k d) (σ : ) ( : 0 < σ) (hhat_sparse : ∀ (ω : Ω), {j : Fin d | θhat ω j 0}.card k) (hstar_sparse : {j : Fin d | θstar j 0}.card k) (hLS : ∀ (ω : Ω) (θ : Fin d), {j : Fin d | θ j 0}.card k → (X.mulVec θstar + ε ω - X.mulVec (θhat ω)) ⬝ᵥ (X.mulVec θstar + ε ω - X.mulVec (θhat ω)) (X.mulVec θstar + ε ω - X.mulVec θ) ⬝ᵥ (X.mulVec θstar + ε ω - X.mulVec θ)) (hsubG : ∀ (v : Fin n), v ⬝ᵥ v 1∀ (s : ), MeasureTheory.Integrable (fun (ω : Ω) => Real.exp (s * ε ω ⬝ᵥ v)) μ (ω : Ω), Real.exp (s * ε ω ⬝ᵥ v) μ Real.exp (s ^ 2 * σ ^ 2 / 2)) (t : ) (ht : 0 < t) :
μ {ω : Ω | X.mulVec (θhat ω - θstar) ⬝ᵥ X.mulVec (θhat ω - θstar) > 4 * t} ENNReal.ofReal ((d.choose (2 * k)) * 6 ^ (2 * k) * Real.exp (-t / (8 * σ ^ 2)))

Equation (2.6): Tail bound for sparse LS.

Combining the fundamental inequality |X(θ̂-θ*)|² ≤ 2⟨ε, X(θ̂-θ*)⟩ with sparse support decomposition onto 2k-dimensional subspaces, sub-Gaussian concentration (Theorem 1.19 applied to ε̃_S = Φ_S^T ε on each subspace), and a union bound over C(d,2k) choices of support S, we obtain:

P(|X(θ̂-θ*)|² > 4t) ≤ C(d,2k) · 6^{2k} · exp(-t/(8σ²))

for any t > 0. The proof involves:

  1. Both θ̂ and θ* are k-sparse, so θ̂-θ* is 2k-sparse
  2. Project noise onto 2k-dimensional column subspaces X_S
  3. Per-subspace concentration: P(sup_{u∈B₂} (ε̃_S^T u)² > t) ≤ 6^{2k} · exp(-t/(8σ²))
  4. Union bound over C(d,2k) choices of S

Rigollet, Chapter 2, equation (2.6).

theorem sparse_ls_high_prob_bound {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n d : } (X : Matrix (Fin n) (Fin d) ) (θstar : Fin d) (ε : ΩFin n) (θhat : ΩFin d) (k : ) (hk : 1 k) (hkd : 2 * k d) (σ : ) ( : 0 < σ) (δ : ) (hδ_pos : 0 < δ) (hδ_le : δ 1) (hhat_sparse : ∀ (ω : Ω), {j : Fin d | θhat ω j 0}.card k) (hstar_sparse : {j : Fin d | θstar j 0}.card k) (hLS : ∀ (ω : Ω) (θ : Fin d), {j : Fin d | θ j 0}.card k → (X.mulVec θstar + ε ω - X.mulVec (θhat ω)) ⬝ᵥ (X.mulVec θstar + ε ω - X.mulVec (θhat ω)) (X.mulVec θstar + ε ω - X.mulVec θ) ⬝ᵥ (X.mulVec θstar + ε ω - X.mulVec θ)) (hsubG : ∀ (v : Fin n), v ⬝ᵥ v 1∀ (s : ), MeasureTheory.Integrable (fun (ω : Ω) => Real.exp (s * ε ω ⬝ᵥ v)) μ (ω : Ω), Real.exp (s * ε ω ⬝ᵥ v) μ Real.exp (s ^ 2 * σ ^ 2 / 2)) :
μ {ω : Ω | X.mulVec (θhat ω - θstar) ⬝ᵥ X.mulVec (θhat ω - θstar) 32 * σ ^ 2 * (2 * k * Real.log (Real.exp 1 * d / (2 * k)) + 2 * k * Real.log 6 + Real.log (1 / δ))} ENNReal.ofReal (1 - δ)

Theorem 2.6 high-probability bound (squared prediction error).

From the tail bound (2.6), setting t = 8σ² · (log C(d,2k) + 2k·log 6 + log(1/δ)) makes the RHS ≤ δ. Using Lemma 2.7 to bound C(d,2k) ≤ (ed/(2k))^{2k}, we get log C(d,2k) ≤ 2k·log(ed/(2k)). The resulting squared prediction error bound (without dividing by n) is:

|X(θ̂-θ*)|² ≤ 32σ² · (2k·log(ed/(2k)) + 2k·log 6 + log(1/δ))

with probability ≥ 1-δ.

The proof involves choosing the threshold t, verifying the tail bound RHS ≤ δ, and passing to the complement event via measure theory.

Rigollet, Chapter 2, Theorem 2.6 + Corollary 2.8.

theorem thm_2_6_sparse_ls_high_prob {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n d : } (hn : 0 < n) (X : Matrix (Fin n) (Fin d) ) (θstar : Fin d) (ε : ΩFin n) (θhat : ΩFin d) (k : ) (hk : 1 k) (hkd : 2 * k d) (σ : ) ( : 0 < σ) (δ : ) (hδ_pos : 0 < δ) (hδ_le : δ 1) (hhat_sparse : ∀ (ω : Ω), {j : Fin d | θhat ω j 0}.card k) (hstar_sparse : {j : Fin d | θstar j 0}.card k) (hLS : ∀ (ω : Ω) (θ : Fin d), {j : Fin d | θ j 0}.card k → (X.mulVec θstar + ε ω - X.mulVec (θhat ω)) ⬝ᵥ (X.mulVec θstar + ε ω - X.mulVec (θhat ω)) (X.mulVec θstar + ε ω - X.mulVec θ) ⬝ᵥ (X.mulVec θstar + ε ω - X.mulVec θ)) (hsubG : ∀ (v : Fin n), v ⬝ᵥ v 1∀ (s : ), MeasureTheory.Integrable (fun (ω : Ω) => Real.exp (s * ε ω ⬝ᵥ v)) μ (ω : Ω), Real.exp (s * ε ω ⬝ᵥ v) μ Real.exp (s ^ 2 * σ ^ 2 / 2)) :
μ {ω : Ω | 1 / n * X.mulVec (θhat ω - θstar) ⬝ᵥ X.mulVec (θhat ω - θstar) 32 * σ ^ 2 / n * (2 * k * Real.log (Real.exp 1 * d / (2 * k)) + 2 * k * Real.log 6 + Real.log (1 / δ))} ENNReal.ofReal (1 - δ)

Theorem 2.6 + Corollary 2.8: High-probability MSE bound for sparse LS.

Under the linear model Y = Xθ* + ε where ε ~ subG_n(σ²), with θ̂ the ℓ₀-constrained LS estimator over k-sparse vectors and θ* ∈ B₀(k), for any δ ∈ (0,1], with probability at least 1 - δ:

MSE(Xθ̂) = (1/n)|Xθ̂ - Xθ*|² ≤ (32σ²/n)(2k·log(ed/(2k)) + 2k·log 6 + log(1/δ))

This is the main result of Theorem 2.6 combined with Corollary 2.8 (Rigollet, Chapter 2). The proof derives the MSE form from the squared prediction error bound by dividing by n.

Comparing with Theorem 2.2 (where MSE ≲ σ²r/n with r = rank(XᵀX)), we see that the price for not knowing the support of θ* but only its sparsity k is a logarithmic factor log(d/k) in the dimension.