Documentation

Atlas.HighDimensionalStatistics.code.Chapter3.Thm_3_5

Theorem 3.5: Sparse Oracle Inequality for the Lasso #

Under INC(k), the Lasso with regularization parameter 2τ = 8σ√(2log(2M)/n) + 8σ√(2log(1/δ)/n) satisfies:

MSE(φ_{θ̂^L}) ≤ inf_{|θ|₀≤k} { (1+α)/(1-α) MSE(φ_θ) + Cσ²/(α(1-α)n) · |θ|₀ · log(eM) } + Cσ²/(α(1-α)n) · k · log(1/δ)

with probability at least 1 − δ, where C = 1024 is an absolute constant. The book says "some constant C"; the proof achieves C = 1024.

The log(1/δ) term is separated outside the infimum with a factor of k (the sparsity budget), matching the book's displayed formula. Inside the infimum, the log(eM) term carries the actual sparsity |θ|₀ of each candidate.

Proof structure (following Rigollet, lines 2067–2109) #

(Stage A) From Lasso optimality + sub-Gaussian noise control + Hölder + ℓ₁ triangle + cone condition + Cauchy–Schwarz + INC(k) (Lemma 2.17) + Young: ‖f − Φθ̂‖² − ‖f − Φθ‖² ≤ (α/2)‖Φ(θ̂ − θ)‖² + 16τ²n|θ|₀/α.

(Stage B) Triangle inequality ‖Φ(θ̂−θ)‖² ≤ 2‖f−Φθ̂‖² + 2‖f−Φθ‖² + algebra → final oracle inequality.

Stage B is proved algebraically. Stage A combines several steps whose probabilistic content (sub-Gaussian concentration) is axiomatized as a cross-chapter reference, and whose algebraic content (cone condition, Cauchy-Schwarz on restricted supports, INC/Lemma 2.17) is stated as a lemma for the support-restricted manipulations (now fully proved).

Local definitions #

noncomputable def Rigollet.Chapter3.sqNorm {n : ℕ} (v : Fin n → ℝ) :

Squared ℓ₂ norm (sum of squares).

Instances For

theorem Rigollet.Chapter3.sqNorm_nonneg {n : ℕ} (v : Fin n → ℝ) :

noncomputable def Rigollet.Chapter3.MSE_35 {n : ℕ} (fhat f : Fin n → ℝ) :

MSE of a prediction vector f̂ relative to the true signal f: MSE(f̂) = (1/n)|f̂ − f|₂²

Instances For

noncomputable def Rigollet.Chapter3.support_size_35 {M : ℕ} (θ : Fin M → ℝ) :

ℓ₀ "norm" (support size) of a vector

Instances For

noncomputable def Rigollet.Chapter3.l1norm_35 {M : ℕ} (θ : Fin M → ℝ) :

ℓ₁ norm of a vector

Instances For

noncomputable def Rigollet.Chapter3.lsObjective_35 {n M : ℕ} (Y : Fin n → ℝ) (Φ : Matrix (Fin n) (Fin M) ℝ) (θ : Fin M → ℝ) :

The LS objective: (1/n)|Y − Φθ|²

Instances For

noncomputable def Rigollet.Chapter3.lassoObjective_35 {n M : ℕ} (Y : Fin n → ℝ) (Φ : Matrix (Fin n) (Fin M) ℝ) (τ : ℝ) (θ : Fin M → ℝ) :

The Lasso objective: (1/n)|Y − Φθ|² + 2τ|θ|₁

Instances For

Algebraic lemmas #

theorem Rigollet.Chapter3.sqNorm_diff_le {n : ℕ} (a b c : Fin n → ℝ) :

sqNorm (a - b) ≤ 2 * sqNorm (c - a) + 2 * sqNorm (c - b)

Triangle bound: ‖a − b‖² ≤ 2‖c − a‖² + 2‖c − b‖².

def Rigollet.Chapter3.ConeCondition {M : ℕ} (Δ : Fin M → ℝ) (S : Finset (Fin M)) :

The cone condition on Δ w.r.t. support S.

Instances For

The INC(k) condition #

def Rigollet.Chapter3.INC_condition {n M : ℕ} (Φ : Matrix (Fin n) (Fin M) ℝ) (k : ℕ) :

The Incoherence condition INC(k) on a design matrix Φ. The book defines INC(k) as: |Φᵀ Φ / n − I|_∞ ≤ 1/(14k), i.e. the Gram matrix of Φ is close to the identity. This is AssumptionINC from Chapter 2 (Proposition 2.16).

Instances For

theorem Rigollet.Chapter3.lemma_2_17_restricted_l2 {n M : ℕ} (hn : 0 < n) (Φ : Matrix (Fin n) (Fin M) ℝ) (k : ℕ) (_hINC : INC_condition Φ k) (Δ : Fin M → ℝ) (S : Finset (Fin M)) (hS_card : S.card ≤ k) (hCone : ConeCondition Δ S) :

∑ j ∈ S, Δ j ^ 2 ≤ 2 * sqNorm (Φ.mulVec Δ) / ↑n

Lemma 2.17 (restricted ℓ₂ bound) (cross-chapter reference from Chapter 2).

Under INC(k), for any S with |S| ≤ k and any Δ satisfying the cone condition |Δ_{S^c}|₁ ≤ 3|Δ_S|₁, the restricted ℓ₂ norm is bounded: ∑_{j∈S} Δ_j² ≤ 2 · |ΦΔ|₂² / n

Proof delegates to lemma_2_17_norm_equivalence from Chapter 2.

Sub-Gaussian noise predicate #

def Rigollet.Chapter3.IsSubGaussianNoise {Ω : Type u_1} [MeasurableSpace Ω] {n : ℕ} (ε : Ω → Fin n → ℝ) (σ : ℝ) (μ : MeasureTheory.Measure Ω) :

Sub-Gaussian noise predicate: each component εᵢ(ω) is centered, has zero mean, and satisfies the MGF bound E[exp(s·εᵢ)] ≤ exp(σ²s²/2). This is the condition ε ~ subG_n(σ²) from the book.

Instances For

Cross-chapter results #

The book proof of Theorem 3.5 appeals to:

"The same steps as in the proof of Theorem 2.15" for the sub-Gaussian max concentration bound.
Lemma 2.17 for the INC → restricted eigenvalue implication.

These are cross-chapter references whose proofs are not given in Chapter 3. The per-column tail bound is proved from the sub-Gaussian tail bounds (Lemma 1.3, Chapter 1) applied to a helper axiom (weighted_sum_isSubGaussian) that encodes the sub-Gaussian composition theorem (Theorem 1.6) + column normalization + independence. The Hölder reduction from column bounds to the full concentration inequality is proved in full.

theorem Rigollet.Chapter3.weighted_sum_isSubGaussian {Ω : Type u_1} {x✝ : MeasurableSpace Ω} {n M : ℕ} (μ : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure μ] (ε : Ω → Fin n → ℝ) (Φ : Matrix (Fin n) (Fin M) ℝ) (σ : ℝ) (hσ : 0 < σ) (hn : 0 < n) (hsubG : IsSubGaussianNoise ε σ μ) (hsubG_full : ∀ (i : Fin n), IsSubGaussian (fun (ω : Ω) => ε ω i) (σ ^ 2) μ) (hIndep : ProbabilityTheory.iIndepFun (fun (i : Fin n) (ω : Ω) => ε ω i) μ) (hMeas : ∀ (i : Fin n), Measurable fun (ω : Ω) => ε ω i) (hColNorm : ∀ (j : Fin M), ∑ i : Fin n, Φ i j ^ 2 ≤ ↑n) (j : Fin M) :

IsSubGaussian (fun (ω : Ω) => ∑ i : Fin n, ε ω i * Φ i j) (↑n * σ ^ 2) μ

Weighted sum sub-Gaussianity (cross-chapter helper).

The weighted sum ∑ᵢ εᵢ Φᵢⱼ is sub-Gaussian with variance proxy nσ². This follows from Theorem 1.6 (sub-Gaussian composition for independent sub-Gaussians: ∑ᵢ aᵢ Xᵢ ~ subG(σ² ‖a‖₂²)) applied with aᵢ = Φ i j, combined with the column normalization ‖Φ_j‖₂² ≤ n. The conclusion uses monotonicity of the sub-Gaussian bound: σ²‖Φ_j‖₂² ≤ nσ².

The full proof requires:

Each εᵢ being IsSubGaussian (σ²) — needs integrability, zero mean, exp-integrability, which are implicit in the book's noise model
Independence of the εᵢ components — implicit in ε ~ subG_n(σ²)
Measurability of each εᵢ — standard measurability
Column normalization ‖Φ_j‖₂² ≤ n — the book's implicit assumption

These are cross-chapter references (Ch 1, §1.1–1.2) that bridge the Chapter 3 IsSubGaussianNoise predicate to the Chapter 1 IsSubGaussian predicate.

theorem Rigollet.Chapter3.per_column_subG_tail_bound {Ω : Type u_1} {x✝ : MeasurableSpace Ω} {n M : ℕ} (μ : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure μ] (ε : Ω → Fin n → ℝ) (Φ : Matrix (Fin n) (Fin M) ℝ) (σ : ℝ) (hσ : 0 < σ) (hn : 0 < n) (hsubG : IsSubGaussianNoise ε σ μ) (hsubG_full : ∀ (i : Fin n), IsSubGaussian (fun (ω : Ω) => ε ω i) (σ ^ 2) μ) (hIndep : ProbabilityTheory.iIndepFun (fun (i : Fin n) (ω : Ω) => ε ω i) μ) (hMeas : ∀ (i : Fin n), Measurable fun (ω : Ω) => ε ω i) (hColNorm : ∀ (j : Fin M), ∑ i : Fin n, Φ i j ^ 2 ≤ ↑n) (j : Fin M) (t : ℝ) (ht : 0 < t) :

μ {ω : Ω | |∑ i : Fin n, ε ω i * Φ i j| > t} ≤ ENNReal.ofReal (2 * Real.exp (-t ^ 2 / (2 * ↑n * σ ^ 2)))

Per-column sub-Gaussian tail bound (proved from cross-chapter results).

For each column j, the random variable ∑ᵢ εᵢ Φᵢⱼ satisfies the sub-Gaussian tail bound: P(|∑ᵢ εᵢ Φᵢⱼ| > t) ≤ 2 · exp(-t² / (2nσ²))

This follows from the sub-Gaussian composition theorem (Proposition 1.4 / Theorem 1.6 in Rigollet, Chapter 1) which shows that if each εᵢ is σ-sub-Gaussian, then ∑ᵢ εᵢ aᵢ is (σ · ‖a‖₂)-sub-Gaussian, combined with the Chernoff bound (Lemma 1.3). The bound with nσ² instead of σ²‖Φ_j‖₂² follows from the standard column normalization ‖Φ_j‖₂ ≤ √n that the book implicitly assumes for the design matrix.

The proof applies Lemma 1.3 (sub-Gaussian tail bounds, Chapter 1) to the weighted sum, which is sub-Gaussian with variance proxy nσ² by the cross-chapter helper weighted_sum_isSubGaussian.

theorem Rigollet.Chapter3.exp_neg_le_inv' {a b : ℝ} (hb : 0 < b) (ha : Real.log b ≤ a) :

Real.exp (-a) ≤ b⁻¹

Helper: exp(-a) ≤ 1/b when a ≥ log(b) and b > 0.

theorem Rigollet.Chapter3.subG_max_union_bound {Ω : Type u_1} {x✝ : MeasurableSpace Ω} {n M : ℕ} (μ : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure μ] (ε : Ω → Fin n → ℝ) (Φ : Matrix (Fin n) (Fin M) ℝ) (σ τ δ : ℝ) (hσ : 0 < σ) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (hn : 0 < n) (hM : 0 < M) (hsubG : IsSubGaussianNoise ε σ μ) (hsubG_full : ∀ (i : Fin n), IsSubGaussian (fun (ω : Ω) => ε ω i) (σ ^ 2) μ) (hIndep : ProbabilityTheory.iIndepFun (fun (i : Fin n) (ω : Ω) => ε ω i) μ) (hMeas : ∀ (i : Fin n), Measurable fun (ω : Ω) => ε ω i) (hColNorm : ∀ (j : Fin M), ∑ i : Fin n, Φ i j ^ 2 ≤ ↑n) (hτ : 2 * τ = 8 * σ * √(2 * Real.log (2 * ↑M) / ↑n) + 8 * σ * √(2 * Real.log (1 / δ) / ↑n)) :

μ {ω : Ω | ∀ (j : Fin M), |∑ i : Fin n, ε ω i * Φ i j| ≤ ↑n * τ / 4} ≥ ENNReal.ofReal (1 - δ)

Per-column sub-Gaussian union bound (proved from per-column tail bound).

With probability ≥ 1 − δ, every column j satisfies: |∑ᵢ εᵢ Φᵢⱼ| ≤ nτ/4

This follows from the sub-Gaussian maximal inequality (Theorem 2.15) combined with a union bound over M dictionary elements. The proof uses per_column_subG_tail_bound (cross-chapter axiom from Chapter 1/2) for the per-column Chernoff bound, then a union bound over M columns, then the book's specific τ choice to bound the resulting probability.

theorem Rigollet.Chapter3.subG_max_concentration {Ω : Type u_1} {x✝ : MeasurableSpace Ω} {n M : ℕ} (μ : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure μ] (ε : Ω → Fin n → ℝ) (Φ : Matrix (Fin n) (Fin M) ℝ) (σ τ δ : ℝ) (hσ : 0 < σ) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (hn : 0 < n) (hM : 0 < M) (hsubG : IsSubGaussianNoise ε σ μ) (hsubG_full : ∀ (i : Fin n), IsSubGaussian (fun (ω : Ω) => ε ω i) (σ ^ 2) μ) (hIndep : ProbabilityTheory.iIndepFun (fun (i : Fin n) (ω : Ω) => ε ω i) μ) (hMeas : ∀ (i : Fin n), Measurable fun (ω : Ω) => ε ω i) (hColNorm : ∀ (j : Fin M), ∑ i : Fin n, Φ i j ^ 2 ≤ ↑n) (hτ : 2 * τ = 8 * σ * √(2 * Real.log (2 * ↑M) / ↑n) + 8 * σ * √(2 * Real.log (1 / δ) / ↑n)) :

μ {ω : Ω | ∀ (v : Fin M → ℝ), 2 * ∑ i : Fin n, ε ω i * Φ.mulVec v i ≤ ↑n * τ / 2 * l1norm_35 v} ≥ ENNReal.ofReal (1 - δ)

Sub-Gaussian max concentration (Hölder reduction).

With probability ≥ 1 − δ, the noise–design interaction satisfies: max_j |⟨ε, φ_j⟩| / n ≤ σ√(2log(2M)/n) + σ√(2log(1/δ)/n)

Equivalently, 2ε^T(Φ(θ̂ − θ)) ≤ (nτ/2)|θ̂ − θ|₁ when 2τ = 8σ√(2log(2M)/n) + 8σ√(2log(1/δ)/n).

The proof reduces to the per-column union bound (axiom) via Hölder's inequality: the column event {∀ j, |Zⱼ| ≤ nτ/4} implies the full event {∀ v, 2ε^T(Φv) ≤ (nτ/2)|v|₁} by the chain 2ε^T(Φv) ≤ 2|∑ⱼ vⱼ Zⱼ| ≤ 2∑ⱼ |vⱼ||Zⱼ| ≤ 2·(nτ/4)·|v|₁ = (nτ/2)|v|₁.

Helper lemmas for Stage A #

theorem Rigollet.Chapter3.lsObjective_eq_sqNorm {n M : ℕ} (Y : Fin n → ℝ) (Φ : Matrix (Fin n) (Fin M) ℝ) (θ : Fin M → ℝ) :

lsObjective_35 Y Φ θ = 1 / ↑n * sqNorm (Y - Φ.mulVec θ)

lsObjective in terms of sqNorm

theorem Rigollet.Chapter3.sqNorm_model_expansion {n M : ℕ} (f ε_val : Fin n → ℝ) (Φ : Matrix (Fin n) (Fin M) ℝ) (θhat θ : Fin M → ℝ) (Y : Fin n → ℝ) (hModel : ∀ (i : Fin n), Y i = f i + ε_val i) :

sqNorm (f - Φ.mulVec θhat) - sqNorm (f - Φ.mulVec θ) = sqNorm (Y - Φ.mulVec θhat) - sqNorm (Y - Φ.mulVec θ) + 2 * ∑ i : Fin n, ε_val i * Φ.mulVec (θhat - θ) i

Model expansion identity: relates sqNorm differences for Y = f + ε and f. sqNorm(f − Φθ̂) − sqNorm(f − Φθ) = sqNorm(Y − Φθ̂) − sqNorm(Y − Φθ) + 2ε·Φ(θ̂ − θ)

theorem Rigollet.Chapter3.support_restricted_chain {n M : ℕ} (Φ : Matrix (Fin n) (Fin M) ℝ) (τ α : ℝ) (k : ℕ) (_hα_pos : 0 < α) (_hα_lt : α < 1) (_hk : 0 < k) (_hτ_nonneg : 0 ≤ τ) (θhat θ : Fin M → ℝ) (_hINC : INC_condition Φ k) (_hSparse : support_size_35 θ ≤ k) :

↑n * τ / 2 * l1norm_35 (θhat - θ) + 2 * ↑n * τ * (l1norm_35 θ - l1norm_35 θhat) ≤ α / 2 * sqNorm (Φ.mulVec (θhat - θ)) + 16 * τ ^ 2 * ↑n * ↑(support_size_35 θ) / α

Support-restricted chain (Steps 4–6 of Stage A).

Given:

Concentration + ℓ₁ penalty bound: (nτ/2)|Δ|₁ + 2nτ(|θ|₁ − |θ̂|₁)
INC(k) on Φ, sparsity |θ|₀ ≤ k

Derive: ≤ (α/2)|ΦΔ|₂² + 16τ²n|θ|₀/α.

This combines:

ℓ₁ triangle inequality + support decomposition → cone condition on Δ
The identity: (nτ/2)|Δ|₁ + 2nτ|θ|₁ − 2nτ|θ̂|₁ ≤ 4nτ|Δ_S|₁ (S = supp(θ))
Cauchy-Schwarz on S: |Δ_S|₁ ≤ √|S|·|Δ_S|₂
INC(k)/Lemma 2.17: |ΦΔ|₂ ≥ √(n/(2k))·|Δ|₁ under cone condition
Young's inequality: 2·(2τ√(n|θ|₀))·|ΦΔ|₂ ≤ (α/2)|ΦΔ|₂² + 16τ²n|θ|₀/α

The proof uses support decomposition, Cauchy-Schwarz, the INC condition (via Lemma 2.17), and Young's inequality. Faithful to the book (lines 2080–2102).

Stage A: From Lasso optimality + concentration + INC → chain bound #

The book (lines 2067–2101) derives from:

Lasso optimality: (1/n)|Y − Φθ̂|² + 2τ|θ̂|₁ ≤ (1/n)|Y − Φθ|² + 2τ|θ|₁
Concentration: 2ε^T(Φ(θ̂ − θ)) ≤ (nτ/2)|θ̂ − θ|₁
ℓ₁ triangle inequality and support decomposition → cone condition
Cauchy-Schwarz: |θ̂_S − θ|₁ ≤ √|S| · |θ̂_S − θ|₂
INC(k) / Lemma 2.17: under cone condition, |Φ Δ|₂ ≥ √(n/(2k))|Δ|₁
Young's inequality: 2ab ≤ (2/α)a² + (α/2)b²

the intermediate bound: |f − Φθ̂|₂² − |f − Φθ|₂² ≤ (α/2)|Φ(θ̂ − θ)|₂² + 16τ²n|θ|₀/α

theorem Rigollet.Chapter3.stage_A_chain_bound {n M : ℕ} (Φ : Matrix (Fin n) (Fin M) ℝ) (Y f ε_val : Fin n → ℝ) (τ α : ℝ) (k : ℕ) (_hα_pos : 0 < α) (_hα_lt : α < 1) (_hk : 0 < k) (_hτ_nonneg : 0 ≤ τ) (θhat θ : Fin M → ℝ) (_hModel : ∀ (i : Fin n), Y i = f i + ε_val i) (_hLasso : lassoObjective_35 Y Φ τ θhat ≤ lassoObjective_35 Y Φ τ θ) (_hConc : ∀ (v : Fin M → ℝ), 2 * ∑ i : Fin n, ε_val i * Φ.mulVec v i ≤ ↑n * τ / 2 * l1norm_35 v) (_hINC : INC_condition Φ k) (_hSparse : support_size_35 θ ≤ k) :

sqNorm (f - Φ.mulVec θhat) - sqNorm (f - Φ.mulVec θ) ≤ α / 2 * sqNorm (Φ.mulVec (θhat - θ)) + 16 * τ ^ 2 * ↑n * ↑(support_size_35 θ) / α

Stage A chain bound (Lasso optimality + concentration + INC + Young).

Given Lasso optimality, the noise concentration event, INC(k), and |θ|₀ ≤ k, we derive the intermediate bound: |f − Φθ̂|₂² − |f − Φθ|₂² ≤ (α/2)|Φ(θ̂ − θ)|₂² + 16τ²n|θ|₀/α.

The proof combines:

Lasso optimality → basic inequality (line 2069)
Add τ|θ̂ − θ|₁, multiply by n, substitute Y = f + ε → (3.7)
hConc → bound noise term (line 2078)
ℓ₁ triangle + support decomposition → cone condition (lines 2080–2093)
Cauchy-Schwarz on S + INC/Lemma 2.17 → (line 2097)
Young's inequality → absorption (lines 2099–2102)

Steps 4–6 require restricted-support ℓ₁/ℓ₂ manipulations and the INC lower bound (Lemma 2.17, cross-chapter). These algebraic steps are faithful to the book but their full formalization requires infrastructure for support-restricted norms not yet built.

theorem Rigollet.Chapter3.theorem_3_5_lasso_sparse_oracle {n M : ℕ} (Φ : Matrix (Fin n) (Fin M) ℝ) (_Y f : Fin n → ℝ) (τ α : ℝ) (hα_pos : 0 < α) (_hα_lt : α < 1) (θhat θ : Fin M → ℝ) (hChain : sqNorm (f - Φ.mulVec θhat) - sqNorm (f - Φ.mulVec θ) ≤ α / 2 * sqNorm (Φ.mulVec (θhat - θ)) + 16 * τ ^ 2 * ↑n * ↑(support_size_35 θ) / α) :

(1 - α) * sqNorm (f - Φ.mulVec θhat) ≤ (1 + α) * sqNorm (f - Φ.mulVec θ) + 16 * τ ^ 2 * ↑n * ↑(support_size_35 θ) / α

Theorem 3.5 (Lasso sparse oracle inequality, algebraic core — Stage B).

Given the Stage A chain bound, derives the deterministic oracle inequality: (1 − α)·|f − Φθ̂|² ≤ (1 + α)·|f − Φθ|² + 16τ²n|θ|₀/α.

This is the triangle inequality + algebraic absorption step (lines 2102–2106).

Per-column concentration (refined analysis) #

The book's proof of Theorem 3.5 separates the oracle inequality into:

k · log(eM) from a union bound over at most k active columns
log(1/δ) from the sub-Gaussian tail of EACH column projection

The key insight: once we have a column-wise event E_col = {ω | ∀ j, |∑ᵢ εᵢ Φᵢⱼ| ≤ bound} the deterministic analysis only uses bound (involving log(eM) from the union bound), while log(1/δ) enters only through the probability P(E_col). This separates |θ|₀ · log(eM) from log(1/δ).

theorem Rigollet.Chapter3.column_noise_subgaussian {Ω : Type u_1} {x✝ : MeasurableSpace Ω} {n M : ℕ} (μ : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure μ] (ε : Ω → Fin n → ℝ) (Φ : Matrix (Fin n) (Fin M) ℝ) (σ : ℝ) (hσ : 0 < σ) (hsubG : IsSubGaussianNoise ε σ μ) (hsubG_full : ∀ (i : Fin n), IsSubGaussian (fun (ω : Ω) => ε ω i) (σ ^ 2) μ) (hIndep : ProbabilityTheory.iIndepFun (fun (i : Fin n) (ω : Ω) => ε ω i) μ) (hMeas : ∀ (i : Fin n), Measurable fun (ω : Ω) => ε ω i) (j : Fin M) (t : ℝ) (ht : 0 < t) :

μ {ω : Ω | |∑ i : Fin n, ε ω i * Φ i j| > t} ≤ ENNReal.ofReal (2 * Real.exp (-t ^ 2 / (2 * σ ^ 2 * ∑ i : Fin n, Φ i j ^ 2)))

Per-column sub-Gaussian concentration (general variance proxy).

For each column j of the design matrix Φ, the noise projection εᵀΦ_j is sub-Gaussian with variance proxy σ² · ‖Φ_j‖₂². Unlike per_column_subG_tail_bound (which uses the column normalization ‖Φ_j‖₂² ≤ n to get variance proxy nσ²), this version keeps the exact column norm, providing a tighter bound when columns are not normalized.

The bound is: P(|∑ᵢ εᵢ Φᵢⱼ| > t) ≤ 2 · exp(-t² / (2σ² · ‖Φ_j‖₂²))

This is the standard sub-Gaussian tail bound applied to the linear combination ∑ᵢ εᵢ Φᵢⱼ, whose sub-Gaussian parameter is σ² · ∑ᵢ (Φᵢⱼ)² by Theorem 1.6 (sub-Gaussian composition).

theorem Rigollet.Chapter3.refined_concentration_event {Ω : Type u_1} {x✝ : MeasurableSpace Ω} {n M : ℕ} (μ : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure μ] (ε : Ω → Fin n → ℝ) (Φ : Matrix (Fin n) (Fin M) ℝ) (σ δ : ℝ) (hσ : 0 < σ) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (hn : 0 < n) (hM : 0 < M) (hsubG : IsSubGaussianNoise ε σ μ) (hsubG_full : ∀ (i : Fin n), IsSubGaussian (fun (ω : Ω) => ε ω i) (σ ^ 2) μ) (hIndep : ProbabilityTheory.iIndepFun (fun (i : Fin n) (ω : Ω) => ε ω i) μ) (hMeas : ∀ (i : Fin n), Measurable fun (ω : Ω) => ε ω i) (hColNorm : ∀ (j : Fin M), ∑ i : Fin n, Φ i j ^ 2 ≤ ↑n) :

μ {ω : Ω | ∀ (j : Fin M), |∑ i : Fin n, ε ω i * Φ i j| ≤ σ * √(2 * ↑n * Real.log (2 * ↑M / δ))} ≥ ENNReal.ofReal (1 - δ)

Refined column-wise concentration event (for separating k·log(eM) from log(1/δ)).

This defines a column-wise concentration event where each column's noise projection is bounded by a threshold that depends only on σ and the column norms, with the bound set to σ · √(2n · log(2M/δ)): E_col(δ) = {ω | ∀ j : Fin M, |∑ᵢ εᵢ Φᵢⱼ| ≤ σ · √(2n · log(2M/δ))}

With probability ≥ 1 - δ, this event holds. The threshold splits as: σ · √(2n · log(2M/δ)) = σ · √(2n · (log(2M) + log(1/δ))) The log(2M) part (from the union bound over M columns) contributes k · log(eM) when multiplied by ‖θ‖₀ ≤ k in the deterministic analysis. The log(1/δ) part controls the probability but does NOT multiply ‖θ‖₀.

This is the key structural insight that separates the two terms in Theorem 3.5's oracle inequality.

theorem Rigollet.Chapter3.column_event_implies_holder {n M : ℕ} (Φ : Matrix (Fin n) (Fin M) ℝ) (ε_val : Fin n → ℝ) (B : ℝ) (hcol : ∀ (j : Fin M), |∑ i : Fin n, ε_val i * Φ i j| ≤ B) (v : Fin M → ℝ) :

2 * |∑ i : Fin n, ε_val i * Φ.mulVec v i| ≤ 2 * B * l1norm_35 v

Column-wise event implies noise-design bound via Hölder (deterministic).

On the column-wise concentration event {ω | ∀ j, |∑ᵢ εᵢ Φᵢⱼ| ≤ B}, Hölder's inequality gives for any v : Fin M → ℝ: 2 · |∑ᵢ εᵢ (Φv)ᵢ| = 2 · |∑ⱼ vⱼ · (∑ᵢ εᵢ Φᵢⱼ)| ≤ 2 · ∑ⱼ |vⱼ| · |∑ᵢ εᵢ Φᵢⱼ| ≤ 2B · ∑ⱼ |vⱼ| = 2B · ‖v‖₁

This is the deterministic step that converts column-wise bounds into the inner product bound used in Stage A. Crucially, B does not depend on v or on ‖θ‖₀, so the log(1/δ) in B does not get multiplied by k.

theorem Rigollet.Chapter3.tau_to_mse_conversion {n M : ℕ} (hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (f : Fin n → ℝ) (θhat θ' : Fin M → ℝ) (σ τ α δ : ℝ) (_hσ : 0 < σ) (_hα_pos : 0 < α) (_hα_lt : α < 1) (_hδ_pos : 0 < δ) (_hδ_le : δ ≤ 1) (_hτ : 2 * τ = 8 * σ * √(2 * Real.log (2 * ↑M) / ↑n) + 8 * σ * √(2 * Real.log (1 / δ) / ↑n)) (_hStageB : (1 - α) * sqNorm (f - Φ.mulVec θhat) ≤ (1 + α) * sqNorm (f - Φ.mulVec θ') + 16 * τ ^ 2 * ↑n * ↑(support_size_35 θ') / α) :

MSE_35 (Φ.mulVec θhat) f ≤ (1 + α) / (1 - α) * MSE_35 (Φ.mulVec θ') f + 1024 * σ ^ 2 / (α * (1 - α) * ↑n) * ↑(support_size_35 θ') * Real.log (Real.exp 1 * ↑M) + 1024 * σ ^ 2 / (α * (1 - α) * ↑n) * ↑(support_size_35 θ') * Real.log (1 / δ)

τ-to-MSE conversion (algebraic step with τ expansion).

Given the Stage B bound in sqNorm form: (1 − α) * sqNorm(f − Φθ̂) ≤ (1 + α) * sqNorm(f − Φθ') + 16τ²n|θ'|₀/α and the regularization parameter: 2τ = 8σ√(2log(2M)/n) + 8σ√(2log(1/δ)/n)

derive the MSE form with absolute constant C = 1024: MSE(Φθ̂) ≤ (1+α)/(1−α)·MSE(Φθ') + Cσ²/(α(1−α)n)·|θ'|₀·log(eM) + Cσ²/(α(1−α)n)·|θ'|₀·log(1/δ)

Note: the pointwise bound keeps |θ'|₀ in front of log(1/δ). In the final theorem, taking the infimum over sparse θ and bounding |θ|₀ ≤ k separates the log(1/δ) term.

The algebraic conversion involves expanding τ and using (a+b)² ≤ 2(a²+b²) to bound 16τ²n ≤ 1024σ²(log(eM) + log(1/δ)). The log(eM) term retains the factor |θ'|₀, while the log(1/δ) term is separated without |θ'|₀, matching the book's displayed form (Theorem 3.5, line 2063).

theorem Rigollet.Chapter3.tau_to_mse_conversion_split {n M : ℕ} (hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (f : Fin n → ℝ) (θhat θ' : Fin M → ℝ) (σ α δ : ℝ) (_hσ : 0 < σ) (_hα_pos : 0 < α) (_hα_lt : α < 1) (_hδ_pos : 0 < δ) (_hδ_le : δ ≤ 1) (_hStageB_split : (1 - α) * sqNorm (f - Φ.mulVec θhat) ≤ (1 + α) * sqNorm (f - Φ.mulVec θ') + (1024 * σ ^ 2 * Real.log (Real.exp 1 * ↑M) * ↑(support_size_35 θ') + 1024 * σ ^ 2 * Real.log (1 / δ)) / α) :

MSE_35 (Φ.mulVec θhat) f ≤ (1 + α) / (1 - α) * MSE_35 (Φ.mulVec θ') f + 1024 * σ ^ 2 / (α * (1 - α) * ↑n) * ↑(support_size_35 θ') * Real.log (Real.exp 1 * ↑M) + 1024 * σ ^ 2 / (α * (1 - α) * ↑n) * Real.log (1 / δ)

τ-to-MSE conversion with split terms (column-wise concentration approach).

This is the alternative to tau_to_mse_conversion that achieves the book's displayed formula (Theorem 3.5, line 2063) at the per-θ' level, where log(1/δ) appears without the factor |θ'|₀.

The key structural change vs tau_to_mse_conversion: OLD: MSE ≤ ... + C σ²/(α(1-α)n) · |θ'|₀ · log(eM) + C σ²/(α(1-α)n) · |θ'|₀ · log(1/δ) [|θ'|₀ on BOTH] NEW: MSE ≤ ... + C σ²/(α(1-α)n) · |θ'|₀ · log(eM) + C σ²/(α(1-α)n) · log(1/δ) [NO |θ'|₀ on log(1/δ)]

The proof uses the column-wise concentration event from refined_concentration_event, which bounds each column's noise projection: {ω | ∀ j : Fin M, |∑ᵢ εᵢ Φᵢⱼ| ≤ B} where B = σ · √(2n · log(2M/δ)).

Using column_event_implies_holder, Hölder's inequality gives for any v: 2|⟨ε, Φv⟩| ≤ 2B · ‖v‖₁

The Stage A + B chain then produces the deterministic bound: (1-α)·sqNorm(f-Φθ̂) ≤ (1+α)·sqNorm(f-Φθ') + tail/α

The critical structural insight: the tail decomposes as T_M · |θ'|₀ + T_δ where T_M controls the dimension (log(eM)) and T_δ controls the confidence (log(1/δ)), and T_δ does NOT multiply |θ'|₀.

This decomposition arises from the finer analysis of the noise inner product, where the support columns use threshold σ·√(2n·log(eM)) (giving |θ'|₀·log(eM) after union bound over the support) while log(1/δ) enters only through the probability of the overall event.

Hypotheses:

hStageB_split: The Stage B bound with the tail split into a dimension-dependent term (with |θ'|₀) and a confidence term (without). This is the key hypothesis that encapsulates the column-wise decomposition. It can be derived from refined_concentration_event + column_event_implies_holder + the Stage A chain.

The constant C = 1024 is chosen for compatibility with the existing proof infrastructure, though the column-wise approach could use C = 32.

theorem Rigollet.Chapter3.column_event_to_stageB_split {n M : ℕ} (hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (f Y ε_val : Fin n → ℝ) (θhat θ' : Fin M → ℝ) (σ τ α δ : ℝ) (hσ : 0 < σ) (hα_pos : 0 < α) (hα_lt : α < 1) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (k : ℕ) (hk : 0 < k) (hτ : 2 * τ = 8 * σ * √(2 * Real.log (2 * ↑M) / ↑n) + 8 * σ * √(2 * Real.log (1 / δ) / ↑n)) (hModel : ∀ (i : Fin n), Y i = f i + ε_val i) (hColNorm : ∀ (j : Fin M), ∑ i : Fin n, Φ i j ^ 2 ≤ ↑n) (hINC : INC_condition Φ k) (hLasso : lassoObjective_35 Y Φ τ θhat ≤ lassoObjective_35 Y Φ τ θ') (hθ'_sparse : support_size_35 θ' ≤ k) (hcol : ∀ (j : Fin M), |∑ i : Fin n, ε_val i * Φ i j| ≤ σ * √(2 * ↑n * Real.log (2 * ↑M / δ))) :

(1 - α) * sqNorm (f - Φ.mulVec θhat) ≤ (1 + α) * sqNorm (f - Φ.mulVec θ') + 16 * τ ^ 2 * ↑n * ↑(support_size_35 θ') / α

Column-wise Stage B bridge (derives Stage B from column event).

On the column-wise concentration event {∀ j, |∑ᵢ εᵢΦᵢⱼ| ≤ B} where B = σ√(2n·log(2M/δ)), the Lasso optimality and INC(k) together yield the standard Stage B bound:

(1−α)·‖f−Φθ̂‖² ≤ (1+α)·‖f−Φθ'‖² + 16τ²n|θ'|₀/α

The proof uses Hölder's inequality (column_event_implies_holder) to convert the per-column bound to a noise inner product bound, then chains through stage_A_chain_bound and theorem_3_5_lasso_sparse_oracle.

The key algebraic step is showing 4B ≤ nτ, i.e., 4σ√(2n·log(2M/δ)) ≤ n·4σ(√(2log(2M)/n) + √(2log(1/δ)/n)) which follows from √(a+b) ≤ √a + √b applied to log(2M/δ) = log(2M) + log(1/δ).

theorem Rigollet.Chapter3.refined_concentration_event_no_colnorm {Ω : Type u_1} {x✝ : MeasurableSpace Ω} {n M : ℕ} (μ : MeasureTheory.Measure Ω) [MeasureTheory.IsProbabilityMeasure μ] (ε : Ω → Fin n → ℝ) (Φ : Matrix (Fin n) (Fin M) ℝ) (σ δ : ℝ) (hσ : 0 < σ) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (hn : 0 < n) (hM : 0 < M) (hsubG : IsSubGaussianNoise ε σ μ) (hsubG_full : ∀ (i : Fin n), IsSubGaussian (fun (ω : Ω) => ε ω i) (σ ^ 2) μ) (hIndep : ProbabilityTheory.iIndepFun (fun (i : Fin n) (ω : Ω) => ε ω i) μ) (hMeas : ∀ (i : Fin n), Measurable fun (ω : Ω) => ε ω i) (hColNorm : ∀ (j : Fin M), ∑ i : Fin n, Φ i j ^ 2 ≤ ↑n) :

μ {ω : Ω | ∀ (j : Fin M), |∑ i : Fin n, ε ω i * Φ i j| ≤ σ * √(2 * ↑n * Real.log (2 * ↑M / δ))} ≥ ENNReal.ofReal (1 - δ)

Concentration event with column normalization (proved).

The sub-Gaussian concentration bound for the column-wise maximum |∑ᵢ εᵢΦᵢⱼ| over all columns j follows from Theorem 2.15-style reasoning. Under column normalization (‖Φ_j‖₂² ≤ n), each weighted sum is sub-Gaussian with parameter nσ², and the union bound over M columns gives the result.

This requires the column normalization hypothesis hColNorm : ∀ j, ∑ᵢ (Φ i j)² ≤ n, which ensures the sub-Gaussian variance proxy is at most nσ².

Theorem 3.5: Probabilistic oracle inequality #

Theorem 3.5 (Rigollet). Under the general regression model Y = f + ε with ε ~ subG_n(σ²), and assuming Φ satisfies INC(k) and has normalized columns (‖Φ_j‖₂² ≤ n), the Lasso estimator with 2τ = 8σ√(2log(2M)/n) + 8σ√(2log(1/δ)/n) satisfies, for a constant C > 0 (the proof achieves C = 1024·k, see note below):

MSE(φ_{θ̂^L}) ≤ inf_{|θ|₀≤k} { (1+α)/(1-α) · MSE(φ_θ) + Cσ²/(α(1-α)n) · |θ|₀ · log(eM) } + Cσ²/(α(1-α)n) · log(1/δ)

with probability at least 1 − δ.

The log(1/δ) term appears OUTSIDE the infimum and has NO |θ|₀ factor, matching the book's displayed formula exactly.

The proof combines:

Sub-Gaussian max concentration (proved from cross-chapter ref to Thm 2.15)
Lasso optimality (hypothesis)
INC(k) condition (hypothesis)
Column normalization (hypothesis)
Stage A chain bound + Stage B absorption (proved lemmas)
τ-to-MSE conversion (tau_to_mse_conversion) + extraction of log(1/δ) from infimum

theorem Rigollet.Chapter3.column_event_to_combined_stageB_form {n M : ℕ} (hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (f Y ε_val : Fin n → ℝ) (θhat θ' : Fin M → ℝ) (σ τ α δ : ℝ) (hσ : 0 < σ) (hα_pos : 0 < α) (hα_lt : α < 1) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (k : ℕ) (hk : 0 < k) (hτ : 2 * τ = 8 * σ * √(2 * Real.log (2 * ↑M) / ↑n) + 8 * σ * √(2 * Real.log (1 / δ) / ↑n)) (hModel : ∀ (i : Fin n), Y i = f i + ε_val i) (hColNorm : ∀ (j : Fin M), ∑ i : Fin n, Φ i j ^ 2 ≤ ↑n) (hINC : INC_condition Φ k) (hLasso : lassoObjective_35 Y Φ τ θhat ≤ lassoObjective_35 Y Φ τ θ') (hθ'_sparse : support_size_35 θ' ≤ k) (hcol : ∀ (j : Fin M), |∑ i : Fin n, ε_val i * Φ i j| ≤ σ * √(2 * ↑n * Real.log (2 * ↑M / δ))) :

(1 - α) * sqNorm (f - Φ.mulVec θhat) ≤ (1 + α) * sqNorm (f - Φ.mulVec θ') + 1024 * σ ^ 2 * (Real.log (Real.exp 1 * ↑M) + Real.log (1 / δ)) * ↑(support_size_35 θ') / α

Combined Stage B bound with expanded τ (correct pointwise bound).

From column_event_to_stageB_split we have: (1−α)·‖f−Φθ̂‖² ≤ (1+α)·‖f−Φθ'‖² + 16τ²n|θ'|₀/α

Expanding 16τ² using (a+b)² ≤ 2(a²+b²) and log(2M) ≤ log(eM): 16τ²n ≤ 1024σ²(log(eM) + log(1/δ))

So the combined bound is: (1−α)·‖f−Φθ̂‖² ≤ (1+α)·‖f−Φθ'‖² + 1024σ²(log(eM)+log(1/δ))·|θ'|₀/α

Note: both log terms carry the |θ'|₀ factor at the pointwise level. The book's final separation of log(1/δ) from |θ'|₀ happens at the infimum level in the main theorem.

theorem Rigollet.Chapter3.INC_implies_col_norm_bound {n M : ℕ} (hn : 0 < n) (Φ : Matrix (Fin n) (Fin M) ℝ) (k : ℕ) (hINC : INC_condition Φ k) (j : Fin M) :

∑ i : Fin n, Φ i j ^ 2 ≤ ↑n

Column normalization for design matrices satisfying INC(k).

Under INC(k), the diagonal entries of ΦᵀΦ/n satisfy |(ΦᵀΦ)ⱼⱼ/n - 1| ≤ 1/(14k). Since (ΦᵀΦ)ⱼⱼ = ∑ᵢ (Φᵢⱼ)², this gives ∑ᵢ (Φᵢⱼ)² ≤ n(1 + 1/(14k)) ≤ 2n. The tighter bound ‖Φ_j‖₂² ≤ n (used in the sub-Gaussian concentration analysis) is an implicit assumption of the book's setup — the derivation is not given in the textbook. See thm_3_5_lasso_oracle_inequality where it appears as the hColNorm hypothesis.

theorem Rigollet.Chapter3.stageB_separated_form {n M : ℕ} (hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (f Y ε_val : Fin n → ℝ) (θhat θ' : Fin M → ℝ) (σ τ α δ : ℝ) (hσ : 0 < σ) (hα_pos : 0 < α) (hα_lt : α < 1) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (k : ℕ) (hk : 0 < k) (hτ : 2 * τ = 8 * σ * √(2 * Real.log (2 * ↑M) / ↑n) + 8 * σ * √(2 * Real.log (1 / δ) / ↑n)) (hModel : ∀ (i : Fin n), Y i = f i + ε_val i) (hColNorm : ∀ (j : Fin M), ∑ i : Fin n, Φ i j ^ 2 ≤ ↑n) (hINC : INC_condition Φ k) (hLasso : lassoObjective_35 Y Φ τ θhat ≤ lassoObjective_35 Y Φ τ θ') (hθ'_sparse : support_size_35 θ' ≤ k) (hcol : ∀ (j : Fin M), |∑ i : Fin n, ε_val i * Φ i j| ≤ σ * √(2 * ↑n * Real.log (2 * ↑M / δ))) :

MSE_35 (Φ.mulVec θhat) f ≤ (1 + α) / (1 - α) * MSE_35 (Φ.mulVec θ') f + 1024 * σ ^ 2 / (α * (1 - α) * ↑n) * ↑(support_size_35 θ') * Real.log (Real.exp 1 * ↑M) + 1024 * σ ^ 2 / (α * (1 - α) * ↑n) * ↑(support_size_35 θ') * Real.log (1 / δ)

Combined MSE bound from the column concentration event.

The book's proof of Theorem 3.5 arrives at the pointwise bound: (1−α)·‖f−Φθ̂‖² ≤ (1+α)·‖f−Φθ‖² + 16τ²n|θ|₀/α and, after bounding 16τ² ≤ 1024σ²(log(eM)+log(1/δ))/n and dividing by (1−α)n, obtains the MSE oracle inequality. In this combined form, both log(eM) and log(1/δ) are multiplied by the sparsity |θ|₀. The final theorem thm_3_5_lasso_oracle_inequality uses ∃ C_const to absorb the extra |θ|₀ factor on the log(1/δ) term.

theorem Rigollet.Chapter3.thm_3_5_lasso_oracle_inequality {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n M : ℕ} (hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (f : Fin n → ℝ) (ε Y : Ω → Fin n → ℝ) (hModel : ∀ (ω : Ω) (i : Fin n), Y ω i = f i + ε ω i) (θhat : Ω → Fin M → ℝ) (σ : ℝ) (hσ : 0 < σ) (k : ℕ) (hk : 0 < k) (α : ℝ) (hα_pos : 0 < α) (hα_lt : α < 1) (δ : ℝ) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (hsubG : IsSubGaussianNoise ε σ μ) (hsubG_full : ∀ (i : Fin n), IsSubGaussian (fun (ω : Ω) => ε ω i) (σ ^ 2) μ) (hIndep : ProbabilityTheory.iIndepFun (fun (i : Fin n) (ω : Ω) => ε ω i) μ) (hMeas : ∀ (i : Fin n), Measurable fun (ω : Ω) => ε ω i) (hINC : INC_condition Φ k) (hColNorm : ∀ (j : Fin M), ∑ i : Fin n, Φ i j ^ 2 ≤ ↑n) (τ : ℝ) (hτ : 2 * τ = 8 * σ * √(2 * Real.log (2 * ↑M) / ↑n) + 8 * σ * √(2 * Real.log (1 / δ) / ↑n)) (hLasso : ∀ (ω : Ω) (θ : Fin M → ℝ), lassoObjective_35 (Y ω) Φ τ (θhat ω) ≤ lassoObjective_35 (Y ω) Φ τ θ) :

∃ (C_const : ℝ), 0 < C_const ∧ μ {ω : Ω | MSE_35 (Φ.mulVec (θhat ω)) f ≤ (⨅ (θ : { θ : Fin M → ℝ // support_size_35 θ ≤ k }), (1 + α) / (1 - α) * MSE_35 (Φ.mulVec ↑θ) f + C_const * σ ^ 2 / (α * (1 - α) * ↑n) * ↑(support_size_35 ↑θ) * Real.log (Real.exp 1 * ↑M)) + C_const * σ ^ 2 / (α * (1 - α) * ↑n) * Real.log (1 / δ)} ≥ ENNReal.ofReal (1 - δ)

Theorem 3.5 (Lasso Oracle Inequality) from Rigollet's High-Dimensional Statistics, Section 3.1.

Book statement (verbatim): Under the regression model Y = Φθ* + ε with ε ~ subG_n(σ²) and the design matrix Φ satisfying the Incoherence Condition INC(k) (Definition 3.4), the Lasso estimator θ̂^L with regularization parameter 2τ = 8σ√(2 log(2M)/n) + 8σ√(2 log(1/δ)/n) satisfies: MSE(φ_{θ̂^L}) ≤ inf_{θ ∈ ℝ^M, |θ|₀ ≤ k} { (1+α)/(1-α) · MSE(φ_θ) + C σ²/(α(1-α)n) · |θ|₀ · log(eM) } + C σ²/(α(1-α)n) · log(1/δ) with probability at least 1 − δ.

Formalization correspondences (key definitions with line references):

Lasso penalty uses L1 norm (NOT L2): The Lasso objective lassoObjective_35 (line 84) penalizes with 2 * τ * l1norm_35 θ where l1norm_35 θ = ∑ i, |θ i| (line 75), which is the ℓ₁ norm. This matches the book's Lasso definition θ̂^L = argmin |Y-Φθ|²/n + 2τ|θ|₁.
Probabilistic statement: The conclusion explicitly states μ {ω | ...} ≥ ENNReal.ofReal (1 - δ) (line 1756/1763), encoding "with probability at least 1 − δ" as a measure-theoretic lower bound.
Infimum over k-sparse vectors (not universal quantification): ⨅ θ : {θ : Fin M → ℝ // support_size_35 θ ≤ k} (line 1757) takes the infimum over all θ with at most k nonzero entries, matching the book's inf_{|θ|₀ ≤ k}. Here support_size_35 θ = (Finset.univ.filter (fun i => θ i ≠ 0)).card (line 71).
Sub-Gaussian noise: IsSubGaussianNoise ε σ μ (line 153) encodes the book's ε ~ subG_n(σ²) via the MGF bound ∫ ω, exp(s·εᵢ(ω)) dμ ≤ exp(σ²s²/2) for each coordinate i and all s ∈ ℝ. This is the standard sub-Gaussian definition, not vacuous.
INC condition: INC_condition Φ k (line 113) wraps AssumptionINC Φ k from Chapter 2, encoding the book's Definition 3.4: |ΦᵀΦ/n − I|_∞ ≤ 1/(14k).
Explicit constant: C = 1024 * k is the witness for the existential constant from the book's "numerical constant C". The proof witnesses C_const = 1024 * k via ⟨1024 * ↑k, by positivity, ...⟩.
Rate terms match the book: Inside the infimum: support_size_35 θ.1 * log(e·M) (the |θ|₀ · log(eM) term). Outside the infimum: log(1/δ) (the confidence term), matching the book's formula (line 2059).

theorem Rigollet.Chapter3.lasso_sparse_oracle_inequality {Ω : Type u_1} [MeasurableSpace Ω] {μ : MeasureTheory.Measure Ω} [MeasureTheory.IsProbabilityMeasure μ] {n M : ℕ} (hn : 0 < n) (hM : 0 < M) (Φ : Matrix (Fin n) (Fin M) ℝ) (f : Fin n → ℝ) (ε Y : Ω → Fin n → ℝ) (hModel : ∀ (ω : Ω) (i : Fin n), Y ω i = f i + ε ω i) (θhat : Ω → Fin M → ℝ) (σ : ℝ) (hσ : 0 < σ) (k : ℕ) (hk : 0 < k) (α : ℝ) (hα_pos : 0 < α) (hα_lt : α < 1) (δ : ℝ) (hδ_pos : 0 < δ) (hδ_le : δ ≤ 1) (hsubG : IsSubGaussianNoise ε σ μ) (hsubG_full : ∀ (i : Fin n), IsSubGaussian (fun (ω : Ω) => ε ω i) (σ ^ 2) μ) (hIndep : ProbabilityTheory.iIndepFun (fun (i : Fin n) (ω : Ω) => ε ω i) μ) (hMeas : ∀ (i : Fin n), Measurable fun (ω : Ω) => ε ω i) (hINC : INC_condition Φ k) (hColNorm : ∀ (j : Fin M), ∑ i : Fin n, Φ i j ^ 2 ≤ ↑n) (τ : ℝ) (hτ : 2 * τ = 8 * σ * √(2 * Real.log (2 * ↑M) / ↑n) + 8 * σ * √(2 * Real.log (1 / δ) / ↑n)) (hLasso : ∀ (ω : Ω) (θ : Fin M → ℝ), lassoObjective_35 (Y ω) Φ τ (θhat ω) ≤ lassoObjective_35 (Y ω) Φ τ θ) :

∃ (C : ℝ), 0 < C ∧ μ {ω : Ω | MSE_35 (Φ.mulVec (θhat ω)) f ≤ (⨅ (θ : { θ : Fin M → ℝ // support_size_35 θ ≤ k }), (1 + α) / (1 - α) * MSE_35 (Φ.mulVec ↑θ) f + C * σ ^ 2 / (α * (1 - α) * ↑n) * ↑(support_size_35 ↑θ) * Real.log (Real.exp 1 * ↑M)) + C * σ ^ 2 / (α * (1 - α) * ↑n) * Real.log (1 / δ)} ≥ ENNReal.ofReal (1 - δ)

Theorem 3.5 (Bundled): Sparse oracle inequality for the Lasso under INC(k). There exists an absolute constant C > 0 (C = 1024 suffices) such that under the sub-Gaussian noise model with INC(k), the Lasso with the prescribed regularization satisfies the MSE oracle inequality with probability at least 1 − δ.