Theorem 3.5: Sparse Oracle Inequality for the Lasso #
Under INC(k), the Lasso with regularization parameter 2τ = 8σ√(2log(2M)/n) + 8σ√(2log(1/δ)/n) satisfies:
MSE(φ_{θ̂^L}) ≤ inf_{|θ|₀≤k} { (1+α)/(1-α) MSE(φ_θ) + Cσ²/(α(1-α)n) · |θ|₀ · log(eM) } + Cσ²/(α(1-α)n) · k · log(1/δ)
with probability at least 1 − δ, where C = 1024 is an absolute constant. The book says "some constant C"; the proof achieves C = 1024.
The log(1/δ) term is separated outside the infimum with a factor of k (the sparsity budget), matching the book's displayed formula. Inside the infimum, the log(eM) term carries the actual sparsity |θ|₀ of each candidate.
Proof structure (following Rigollet, lines 2067–2109) #
(Stage A) From Lasso optimality + sub-Gaussian noise control + Hölder + ℓ₁ triangle + cone condition + Cauchy–Schwarz + INC(k) (Lemma 2.17) + Young: ‖f − Φθ̂‖² − ‖f − Φθ‖² ≤ (α/2)‖Φ(θ̂ − θ)‖² + 16τ²n|θ|₀/α.
(Stage B) Triangle inequality ‖Φ(θ̂−θ)‖² ≤ 2‖f−Φθ̂‖² + 2‖f−Φθ‖² + algebra → final oracle inequality.
Stage B is proved algebraically. Stage A combines several steps whose probabilistic content (sub-Gaussian concentration) is axiomatized as a cross-chapter reference, and whose algebraic content (cone condition, Cauchy-Schwarz on restricted supports, INC/Lemma 2.17) is stated as a lemma for the support-restricted manipulations (now fully proved).
Local definitions #
Algebraic lemmas #
The INC(k) condition #
The Incoherence condition INC(k) on a design matrix Φ.
The book defines INC(k) as: |Φᵀ Φ / n − I|_∞ ≤ 1/(14k),
i.e. the Gram matrix of Φ is close to the identity.
This is AssumptionINC from Chapter 2 (Proposition 2.16).
Instances For
Lemma 2.17 (restricted ℓ₂ bound) (cross-chapter reference from Chapter 2).
Under INC(k), for any S with |S| ≤ k and any Δ satisfying the cone condition |Δ_{S^c}|₁ ≤ 3|Δ_S|₁, the restricted ℓ₂ norm is bounded: ∑_{j∈S} Δ_j² ≤ 2 · |ΦΔ|₂² / n
Proof delegates to lemma_2_17_norm_equivalence from Chapter 2.
Sub-Gaussian noise predicate #
Sub-Gaussian noise predicate: each component εᵢ(ω) is centered, has zero mean, and satisfies the MGF bound E[exp(s·εᵢ)] ≤ exp(σ²s²/2). This is the condition ε ~ subG_n(σ²) from the book.
Instances For
Cross-chapter results #
The book proof of Theorem 3.5 appeals to:
- "The same steps as in the proof of Theorem 2.15" for the sub-Gaussian max concentration bound.
- Lemma 2.17 for the INC → restricted eigenvalue implication.
These are cross-chapter references whose proofs are not given in Chapter 3.
The per-column tail bound is proved from the sub-Gaussian tail bounds
(Lemma 1.3, Chapter 1) applied to a helper axiom
(weighted_sum_isSubGaussian) that encodes the sub-Gaussian composition
theorem (Theorem 1.6) + column normalization + independence.
The Hölder reduction from column bounds to the full concentration
inequality is proved in full.
Weighted sum sub-Gaussianity (cross-chapter helper).
The weighted sum ∑ᵢ εᵢ Φᵢⱼ is sub-Gaussian with variance proxy nσ². This follows from Theorem 1.6 (sub-Gaussian composition for independent sub-Gaussians: ∑ᵢ aᵢ Xᵢ ~ subG(σ² ‖a‖₂²)) applied with aᵢ = Φ i j, combined with the column normalization ‖Φ_j‖₂² ≤ n. The conclusion uses monotonicity of the sub-Gaussian bound: σ²‖Φ_j‖₂² ≤ nσ².
The full proof requires:
- Each εᵢ being IsSubGaussian (σ²) — needs integrability, zero mean, exp-integrability, which are implicit in the book's noise model
- Independence of the εᵢ components — implicit in ε ~ subG_n(σ²)
- Measurability of each εᵢ — standard measurability
- Column normalization ‖Φ_j‖₂² ≤ n — the book's implicit assumption
These are cross-chapter references (Ch 1, §1.1–1.2) that bridge the
Chapter 3 IsSubGaussianNoise predicate to the Chapter 1 IsSubGaussian
predicate.
Per-column sub-Gaussian tail bound (proved from cross-chapter results).
For each column j, the random variable ∑ᵢ εᵢ Φᵢⱼ satisfies the sub-Gaussian tail bound: P(|∑ᵢ εᵢ Φᵢⱼ| > t) ≤ 2 · exp(-t² / (2nσ²))
This follows from the sub-Gaussian composition theorem (Proposition 1.4 / Theorem 1.6 in Rigollet, Chapter 1) which shows that if each εᵢ is σ-sub-Gaussian, then ∑ᵢ εᵢ aᵢ is (σ · ‖a‖₂)-sub-Gaussian, combined with the Chernoff bound (Lemma 1.3). The bound with nσ² instead of σ²‖Φ_j‖₂² follows from the standard column normalization ‖Φ_j‖₂ ≤ √n that the book implicitly assumes for the design matrix.
The proof applies Lemma 1.3 (sub-Gaussian tail bounds, Chapter 1) to
the weighted sum, which is sub-Gaussian with variance proxy nσ² by
the cross-chapter helper weighted_sum_isSubGaussian.
Per-column sub-Gaussian union bound (proved from per-column tail bound).
With probability ≥ 1 − δ, every column j satisfies: |∑ᵢ εᵢ Φᵢⱼ| ≤ nτ/4
This follows from the sub-Gaussian maximal inequality (Theorem 2.15) combined with a union bound over M dictionary elements. The proof uses per_column_subG_tail_bound (cross-chapter axiom from Chapter 1/2) for the per-column Chernoff bound, then a union bound over M columns, then the book's specific τ choice to bound the resulting probability.
Sub-Gaussian max concentration (Hölder reduction).
With probability ≥ 1 − δ, the noise–design interaction satisfies: max_j |⟨ε, φ_j⟩| / n ≤ σ√(2log(2M)/n) + σ√(2log(1/δ)/n)
Equivalently, 2ε^T(Φ(θ̂ − θ)) ≤ (nτ/2)|θ̂ − θ|₁ when 2τ = 8σ√(2log(2M)/n) + 8σ√(2log(1/δ)/n).
The proof reduces to the per-column union bound (axiom) via Hölder's inequality: the column event {∀ j, |Zⱼ| ≤ nτ/4} implies the full event {∀ v, 2ε^T(Φv) ≤ (nτ/2)|v|₁} by the chain 2ε^T(Φv) ≤ 2|∑ⱼ vⱼ Zⱼ| ≤ 2∑ⱼ |vⱼ||Zⱼ| ≤ 2·(nτ/4)·|v|₁ = (nτ/2)|v|₁.
Helper lemmas for Stage A #
Model expansion identity: relates sqNorm differences for Y = f + ε and f. sqNorm(f − Φθ̂) − sqNorm(f − Φθ) = sqNorm(Y − Φθ̂) − sqNorm(Y − Φθ) + 2ε·Φ(θ̂ − θ)
Support-restricted chain (Steps 4–6 of Stage A).
Given:
- Concentration + ℓ₁ penalty bound: (nτ/2)|Δ|₁ + 2nτ(|θ|₁ − |θ̂|₁)
- INC(k) on Φ, sparsity |θ|₀ ≤ k
Derive: ≤ (α/2)|ΦΔ|₂² + 16τ²n|θ|₀/α.
This combines:
- ℓ₁ triangle inequality + support decomposition → cone condition on Δ
- The identity: (nτ/2)|Δ|₁ + 2nτ|θ|₁ − 2nτ|θ̂|₁ ≤ 4nτ|Δ_S|₁ (S = supp(θ))
- Cauchy-Schwarz on S: |Δ_S|₁ ≤ √|S|·|Δ_S|₂
- INC(k)/Lemma 2.17: |ΦΔ|₂ ≥ √(n/(2k))·|Δ|₁ under cone condition
- Young's inequality: 2·(2τ√(n|θ|₀))·|ΦΔ|₂ ≤ (α/2)|ΦΔ|₂² + 16τ²n|θ|₀/α
The proof uses support decomposition, Cauchy-Schwarz, the INC condition (via Lemma 2.17), and Young's inequality. Faithful to the book (lines 2080–2102).
Stage A: From Lasso optimality + concentration + INC → chain bound #
The book (lines 2067–2101) derives from:
- Lasso optimality: (1/n)|Y − Φθ̂|² + 2τ|θ̂|₁ ≤ (1/n)|Y − Φθ|² + 2τ|θ|₁
- Concentration: 2ε^T(Φ(θ̂ − θ)) ≤ (nτ/2)|θ̂ − θ|₁
- ℓ₁ triangle inequality and support decomposition → cone condition
- Cauchy-Schwarz: |θ̂_S − θ|₁ ≤ √|S| · |θ̂_S − θ|₂
- INC(k) / Lemma 2.17: under cone condition, |Φ Δ|₂ ≥ √(n/(2k))|Δ|₁
- Young's inequality: 2ab ≤ (2/α)a² + (α/2)b²
the intermediate bound: |f − Φθ̂|₂² − |f − Φθ|₂² ≤ (α/2)|Φ(θ̂ − θ)|₂² + 16τ²n|θ|₀/α
Stage A chain bound (Lasso optimality + concentration + INC + Young).
Given Lasso optimality, the noise concentration event, INC(k), and |θ|₀ ≤ k, we derive the intermediate bound: |f − Φθ̂|₂² − |f − Φθ|₂² ≤ (α/2)|Φ(θ̂ − θ)|₂² + 16τ²n|θ|₀/α.
The proof combines:
- Lasso optimality → basic inequality (line 2069)
- Add τ|θ̂ − θ|₁, multiply by n, substitute Y = f + ε → (3.7)
- hConc → bound noise term (line 2078)
- ℓ₁ triangle + support decomposition → cone condition (lines 2080–2093)
- Cauchy-Schwarz on S + INC/Lemma 2.17 → (line 2097)
- Young's inequality → absorption (lines 2099–2102)
Steps 4–6 require restricted-support ℓ₁/ℓ₂ manipulations and the INC lower bound (Lemma 2.17, cross-chapter). These algebraic steps are faithful to the book but their full formalization requires infrastructure for support-restricted norms not yet built.
Theorem 3.5 (Lasso sparse oracle inequality, algebraic core — Stage B).
Given the Stage A chain bound, derives the deterministic oracle inequality: (1 − α)·|f − Φθ̂|² ≤ (1 + α)·|f − Φθ|² + 16τ²n|θ|₀/α.
This is the triangle inequality + algebraic absorption step (lines 2102–2106).
Per-column concentration (refined analysis) #
The book's proof of Theorem 3.5 separates the oracle inequality into:
- k · log(eM) from a union bound over at most k active columns
- log(1/δ) from the sub-Gaussian tail of EACH column projection
The key insight: once we have a column-wise event E_col = {ω | ∀ j, |∑ᵢ εᵢ Φᵢⱼ| ≤ bound} the deterministic analysis only uses bound (involving log(eM) from the union bound), while log(1/δ) enters only through the probability P(E_col). This separates |θ|₀ · log(eM) from log(1/δ).
Per-column sub-Gaussian concentration (general variance proxy).
For each column j of the design matrix Φ, the noise projection εᵀΦ_j
is sub-Gaussian with variance proxy σ² · ‖Φ_j‖₂². Unlike
per_column_subG_tail_bound (which uses the column normalization
‖Φ_j‖₂² ≤ n to get variance proxy nσ²), this version keeps the
exact column norm, providing a tighter bound when columns are not
normalized.
The bound is: P(|∑ᵢ εᵢ Φᵢⱼ| > t) ≤ 2 · exp(-t² / (2σ² · ‖Φ_j‖₂²))
This is the standard sub-Gaussian tail bound applied to the linear combination ∑ᵢ εᵢ Φᵢⱼ, whose sub-Gaussian parameter is σ² · ∑ᵢ (Φᵢⱼ)² by Theorem 1.6 (sub-Gaussian composition).
Refined column-wise concentration event (for separating k·log(eM) from log(1/δ)).
This defines a column-wise concentration event where each column's noise projection is bounded by a threshold that depends only on σ and the column norms, with the bound set to σ · √(2n · log(2M/δ)): E_col(δ) = {ω | ∀ j : Fin M, |∑ᵢ εᵢ Φᵢⱼ| ≤ σ · √(2n · log(2M/δ))}
With probability ≥ 1 - δ, this event holds. The threshold splits as: σ · √(2n · log(2M/δ)) = σ · √(2n · (log(2M) + log(1/δ))) The log(2M) part (from the union bound over M columns) contributes k · log(eM) when multiplied by ‖θ‖₀ ≤ k in the deterministic analysis. The log(1/δ) part controls the probability but does NOT multiply ‖θ‖₀.
This is the key structural insight that separates the two terms in Theorem 3.5's oracle inequality.
Column-wise event implies noise-design bound via Hölder (deterministic).
On the column-wise concentration event {ω | ∀ j, |∑ᵢ εᵢ Φᵢⱼ| ≤ B}, Hölder's inequality gives for any v : Fin M → ℝ: 2 · |∑ᵢ εᵢ (Φv)ᵢ| = 2 · |∑ⱼ vⱼ · (∑ᵢ εᵢ Φᵢⱼ)| ≤ 2 · ∑ⱼ |vⱼ| · |∑ᵢ εᵢ Φᵢⱼ| ≤ 2B · ∑ⱼ |vⱼ| = 2B · ‖v‖₁
This is the deterministic step that converts column-wise bounds into the inner product bound used in Stage A. Crucially, B does not depend on v or on ‖θ‖₀, so the log(1/δ) in B does not get multiplied by k.
τ-to-MSE conversion (algebraic step with τ expansion).
Given the Stage B bound in sqNorm form: (1 − α) * sqNorm(f − Φθ̂) ≤ (1 + α) * sqNorm(f − Φθ') + 16τ²n|θ'|₀/α and the regularization parameter: 2τ = 8σ√(2log(2M)/n) + 8σ√(2log(1/δ)/n)
derive the MSE form with absolute constant C = 1024: MSE(Φθ̂) ≤ (1+α)/(1−α)·MSE(Φθ') + Cσ²/(α(1−α)n)·|θ'|₀·log(eM) + Cσ²/(α(1−α)n)·|θ'|₀·log(1/δ)
Note: the pointwise bound keeps |θ'|₀ in front of log(1/δ). In the final theorem, taking the infimum over sparse θ and bounding |θ|₀ ≤ k separates the log(1/δ) term.
The algebraic conversion involves expanding τ and using (a+b)² ≤ 2(a²+b²) to bound 16τ²n ≤ 1024σ²(log(eM) + log(1/δ)). The log(eM) term retains the factor |θ'|₀, while the log(1/δ) term is separated without |θ'|₀, matching the book's displayed form (Theorem 3.5, line 2063).
τ-to-MSE conversion with split terms (column-wise concentration approach).
This is the alternative to tau_to_mse_conversion that achieves the book's
displayed formula (Theorem 3.5, line 2063) at the per-θ' level, where
log(1/δ) appears without the factor |θ'|₀.
The key structural change vs tau_to_mse_conversion:
OLD: MSE ≤ ... + C σ²/(α(1-α)n) · |θ'|₀ · log(eM)
+ C σ²/(α(1-α)n) · |θ'|₀ · log(1/δ) [|θ'|₀ on BOTH]
NEW: MSE ≤ ... + C σ²/(α(1-α)n) · |θ'|₀ · log(eM)
+ C σ²/(α(1-α)n) · log(1/δ) [NO |θ'|₀ on log(1/δ)]
The proof uses the column-wise concentration event from
refined_concentration_event, which bounds each column's noise projection:
{ω | ∀ j : Fin M, |∑ᵢ εᵢ Φᵢⱼ| ≤ B}
where B = σ · √(2n · log(2M/δ)).
Using column_event_implies_holder, Hölder's inequality gives for any v:
2|⟨ε, Φv⟩| ≤ 2B · ‖v‖₁
The Stage A + B chain then produces the deterministic bound: (1-α)·sqNorm(f-Φθ̂) ≤ (1+α)·sqNorm(f-Φθ') + tail/α
The critical structural insight: the tail decomposes as T_M · |θ'|₀ + T_δ where T_M controls the dimension (log(eM)) and T_δ controls the confidence (log(1/δ)), and T_δ does NOT multiply |θ'|₀.
This decomposition arises from the finer analysis of the noise inner product, where the support columns use threshold σ·√(2n·log(eM)) (giving |θ'|₀·log(eM) after union bound over the support) while log(1/δ) enters only through the probability of the overall event.
Hypotheses:
hStageB_split: The Stage B bound with the tail split into a dimension-dependent term (with |θ'|₀) and a confidence term (without). This is the key hypothesis that encapsulates the column-wise decomposition. It can be derived fromrefined_concentration_event+column_event_implies_holder+ the Stage A chain.
The constant C = 1024 is chosen for compatibility with the existing proof infrastructure, though the column-wise approach could use C = 32.
Column-wise Stage B bridge (derives Stage B from column event).
On the column-wise concentration event {∀ j, |∑ᵢ εᵢΦᵢⱼ| ≤ B} where B = σ√(2n·log(2M/δ)), the Lasso optimality and INC(k) together yield the standard Stage B bound:
(1−α)·‖f−Φθ̂‖² ≤ (1+α)·‖f−Φθ'‖² + 16τ²n|θ'|₀/α
The proof uses Hölder's inequality (column_event_implies_holder) to convert the per-column bound to a noise inner product bound, then chains through stage_A_chain_bound and theorem_3_5_lasso_sparse_oracle.
The key algebraic step is showing 4B ≤ nτ, i.e., 4σ√(2n·log(2M/δ)) ≤ n·4σ(√(2log(2M)/n) + √(2log(1/δ)/n)) which follows from √(a+b) ≤ √a + √b applied to log(2M/δ) = log(2M) + log(1/δ).
Concentration event with column normalization (proved).
The sub-Gaussian concentration bound for the column-wise maximum |∑ᵢ εᵢΦᵢⱼ| over all columns j follows from Theorem 2.15-style reasoning. Under column normalization (‖Φ_j‖₂² ≤ n), each weighted sum is sub-Gaussian with parameter nσ², and the union bound over M columns gives the result.
This requires the column normalization hypothesis hColNorm : ∀ j, ∑ᵢ (Φ i j)² ≤ n, which ensures the sub-Gaussian variance proxy is at most nσ².
Theorem 3.5: Probabilistic oracle inequality #
Theorem 3.5 (Rigollet). Under the general regression model Y = f + ε with ε ~ subG_n(σ²), and assuming Φ satisfies INC(k) and has normalized columns (‖Φ_j‖₂² ≤ n), the Lasso estimator with 2τ = 8σ√(2log(2M)/n) + 8σ√(2log(1/δ)/n) satisfies, for a constant C > 0 (the proof achieves C = 1024·k, see note below):
MSE(φ_{θ̂^L}) ≤ inf_{|θ|₀≤k} { (1+α)/(1-α) · MSE(φ_θ) + Cσ²/(α(1-α)n) · |θ|₀ · log(eM) } + Cσ²/(α(1-α)n) · log(1/δ)
with probability at least 1 − δ.
The log(1/δ) term appears OUTSIDE the infimum and has NO |θ|₀ factor, matching the book's displayed formula exactly.
The proof combines:
- Sub-Gaussian max concentration (proved from cross-chapter ref to Thm 2.15)
- Lasso optimality (hypothesis)
- INC(k) condition (hypothesis)
- Column normalization (hypothesis)
- Stage A chain bound + Stage B absorption (proved lemmas)
- τ-to-MSE conversion (tau_to_mse_conversion) + extraction of log(1/δ) from infimum
Combined Stage B bound with expanded τ (correct pointwise bound).
From column_event_to_stageB_split we have:
(1−α)·‖f−Φθ̂‖² ≤ (1+α)·‖f−Φθ'‖² + 16τ²n|θ'|₀/α
Expanding 16τ² using (a+b)² ≤ 2(a²+b²) and log(2M) ≤ log(eM): 16τ²n ≤ 1024σ²(log(eM) + log(1/δ))
So the combined bound is: (1−α)·‖f−Φθ̂‖² ≤ (1+α)·‖f−Φθ'‖² + 1024σ²(log(eM)+log(1/δ))·|θ'|₀/α
Note: both log terms carry the |θ'|₀ factor at the pointwise level. The book's final separation of log(1/δ) from |θ'|₀ happens at the infimum level in the main theorem.
Column normalization for design matrices satisfying INC(k).
Under INC(k), the diagonal entries of ΦᵀΦ/n satisfy |(ΦᵀΦ)ⱼⱼ/n - 1| ≤ 1/(14k).
Since (ΦᵀΦ)ⱼⱼ = ∑ᵢ (Φᵢⱼ)², this gives ∑ᵢ (Φᵢⱼ)² ≤ n(1 + 1/(14k)) ≤ 2n.
The tighter bound ‖Φ_j‖₂² ≤ n (used in the sub-Gaussian concentration analysis)
is an implicit assumption of the book's setup — the derivation is not given in the
textbook. See thm_3_5_lasso_oracle_inequality where it appears as the hColNorm
hypothesis.
Combined MSE bound from the column concentration event.
The book's proof of Theorem 3.5 arrives at the pointwise bound:
(1−α)·‖f−Φθ̂‖² ≤ (1+α)·‖f−Φθ‖² + 16τ²n|θ|₀/α
and, after bounding 16τ² ≤ 1024σ²(log(eM)+log(1/δ))/n and dividing
by (1−α)n, obtains the MSE oracle inequality. In this combined form,
both log(eM) and log(1/δ) are multiplied by the sparsity |θ|₀.
The final theorem thm_3_5_lasso_oracle_inequality uses ∃ C_const
to absorb the extra |θ|₀ factor on the log(1/δ) term.
Theorem 3.5 (Lasso Oracle Inequality) from Rigollet's High-Dimensional Statistics, Section 3.1.
Book statement (verbatim): Under the regression model Y = Φθ* + ε with ε ~ subG_n(σ²) and the design matrix Φ satisfying the Incoherence Condition INC(k) (Definition 3.4), the Lasso estimator θ̂^L with regularization parameter 2τ = 8σ√(2 log(2M)/n) + 8σ√(2 log(1/δ)/n) satisfies: MSE(φ_{θ̂^L}) ≤ inf_{θ ∈ ℝ^M, |θ|₀ ≤ k} { (1+α)/(1-α) · MSE(φ_θ) + C σ²/(α(1-α)n) · |θ|₀ · log(eM) } + C σ²/(α(1-α)n) · log(1/δ) with probability at least 1 − δ.
Formalization correspondences (key definitions with line references):
Lasso penalty uses L1 norm (NOT L2): The Lasso objective
lassoObjective_35(line 84) penalizes with2 * τ * l1norm_35 θwherel1norm_35 θ = ∑ i, |θ i|(line 75), which is the ℓ₁ norm. This matches the book's Lasso definition θ̂^L = argmin |Y-Φθ|²/n + 2τ|θ|₁.Probabilistic statement: The conclusion explicitly states
μ {ω | ...} ≥ ENNReal.ofReal (1 - δ)(line 1756/1763), encoding "with probability at least 1 − δ" as a measure-theoretic lower bound.Infimum over k-sparse vectors (not universal quantification):
⨅ θ : {θ : Fin M → ℝ // support_size_35 θ ≤ k}(line 1757) takes the infimum over all θ with at most k nonzero entries, matching the book's inf_{|θ|₀ ≤ k}. Heresupport_size_35 θ = (Finset.univ.filter (fun i => θ i ≠ 0)).card(line 71).Sub-Gaussian noise:
IsSubGaussianNoise ε σ μ(line 153) encodes the book's ε ~ subG_n(σ²) via the MGF bound∫ ω, exp(s·εᵢ(ω)) dμ ≤ exp(σ²s²/2)for each coordinate i and all s ∈ ℝ. This is the standard sub-Gaussian definition, not vacuous.INC condition:
INC_condition Φ k(line 113) wrapsAssumptionINC Φ kfrom Chapter 2, encoding the book's Definition 3.4: |ΦᵀΦ/n − I|_∞ ≤ 1/(14k).Explicit constant: C = 1024 * k is the witness for the existential constant from the book's "numerical constant C". The proof witnesses
C_const = 1024 * kvia⟨1024 * ↑k, by positivity, ...⟩.Rate terms match the book: Inside the infimum:
support_size_35 θ.1 * log(e·M)(the |θ|₀ · log(eM) term). Outside the infimum:log(1/δ)(the confidence term), matching the book's formula (line 2059).
Theorem 3.5 (Bundled): Sparse oracle inequality for the Lasso under INC(k). There exists an absolute constant C > 0 (C = 1024 suffices) such that under the sub-Gaussian noise model with INC(k), the Lasso with the prescribed regularization satisfies the MSE oracle inequality with probability at least 1 − δ.