Helper: sqL2norm equals dotProduct #
Helper: Cauchy–Schwarz for finite sums on a subset #
(∑ j ∈ S, |f j|)² ≤ |S| · ∑ j ∈ S, f j²
Equation (2.18): the core algebraic inequality from Lasso optimality. #
From the Lasso optimality condition at θ = θ*, expanding Y = Xθ* + ε: (1/n)|Xv|² ≤ (2/n)⟨ε, Xv⟩ + 2τ(|θ*|₁ − |θ̂|₁) Triangle inequality on supports gives |θ*|₁ − |θ̂|₁ ≤ |v_S|₁ − |v_{Sᶜ}|₁. Hölder + noise condition: (2/n)|⟨ε, Xv⟩| ≤ 2τ|v|₁. Combining: (1/n)|Xv|² ≤ 4τ|v_S|₁.
Now uses IsLassoEstimatorL2 so the sqL2norm expansion
sqL2norm(a - b) = sqL2norm a - 2⟨a,b⟩ + sqL2norm b
is valid (it fails for the sup norm).
Cone condition from Lasso optimality. #
The cone condition |v_{Sᶜ}|₁ ≤ 3|v_S|₁ is needed for Lemma 2.17. In the book's proof, it follows from the basic inequality combined with the non-negativity of (1/n)|Xv|². With the penalty 2τ and noise bound |(Xᵀε)_j/n| ≤ τ, the derivation requires care about the sign of the cross term.
Helper: A² ≤ c·A with A ≥ 0 implies A ≤ c #
Theorem 2.18 (Lasso fast rate — MSE form).
Under INC(k), |θ*|₀ ≤ k, and linear model Y = Xθ* + ε, the Lasso estimator satisfies (1/n)|X(θ̂ − θ*)|₂² ≤ 576·k·τ² provided |Xᵀε|_∞ ≤ nτ (deterministic condition).
Proof outline (Rigollet, Chapter 2).
- Eq (2.18): A ≤ 4τ · |v_S|₁ (Lasso optimality + Hölder + triangle ineq)
- Cauchy–Schwarz: |v_S|₁ ≤ √k · |v_S|₂
- Lemma 2.17: |v_S|₂² ≤ 2A (INC + cone condition)
- Chain: A ≤ 4τ√k · √(2A), square ⟹ A ≤ 32kτ² ≤ 576kτ²
Theorem 2.18 (Lasso fast rate — ℓ₂² parameter-error bound).
Under INC(k), |θ*|₀ ≤ k, and linear model Y = Xθ* + ε, the Lasso estimator satisfies |θ̂ − θ*|₂² ≤ 576·k·τ².
Note: This is an auxiliary bound on the ℓ₂² error. The book's
Theorem 2.18 states the second bound in terms of the ℓ₁ norm;
see thm_2_18_lasso_fast_rate_l1 below.
Theorem 2.18 (Lasso fast rate — ℓ₁ parameter-error bound).
Under INC(k), |θ*|₀ ≤ k, and linear model Y = Xθ* + ε, the Lasso estimator satisfies |θ̂ − θ*|₁ ≤ 32·k·τ.
Proof (from the book): From (2.18), |v|₁ ≤ 4|v_S|₁. By Cauchy–Schwarz: |v_S|₁ ≤ √k · |v_S|₂. By Lemma 2.17: |v_S|₂ ≤ √(2/n) · |Xv|₂. From the MSE bound: |Xv|₂² ≤ 32nkτ². Combining: |v|₁ ≤ 4√k · √(2/n) · √(32nkτ²) = 32kτ.
Deterministic combined version of Theorem 2.18. #
Given the deterministic noise condition |Xᵀε|∞ ≤ nτ/2, both the MSE and ℓ₁ bounds hold simultaneously.
Theorem 2.18 (Lasso fast rate — deterministic combined).
If the noise condition holds, both bounds follow simultaneously: (1) MSE: (1/n)|X(θ̂-θ*)|₂² ≤ 576·k·τ² (2) ℓ₁: |θ̂-θ*|₁ ≤ 32·k·τ
Probabilistic version of Theorem 2.18. #
The book states Theorem 2.18 with: 2τ = 8σ√(log(2d)/n) + 8σ√(log(1/δ)/n) and "with probability at least 1-δ" the MSE and ℓ₁ bounds hold.
The probabilistic wrapper follows the same pattern as theorem_2_15_probabilistic: on the good event (complement of {∃ j, |Xⱼᵀε| > nτ/2}), the deterministic bound applies, and the bad event has probability ≤ δ by sub-Gaussian concentration + union bound.
Theorem 2.18 (Lasso fast rate — measure-theoretic probabilistic version).
Under the linear model Y(ω) = Xθ* + ε(ω) where the Lasso is optimal for each realization ω, if the probability of the bad noise event {ω | ∃ j, |Xⱼᵀε(ω)| > nτ/2} is at most δ, then with probability ≥ 1-δ: (1) MSE: (1/n)|X(θ̂-θ*)|₂² ≤ 576·k·τ² (2) ℓ₁: |θ̂-θ*|₁ ≤ 32·k·τ
The proof reduces to the deterministic core on the good event.
Theorem 2.18 (Lasso fast rate — with the book's specific τ choice).
Fix n ≥ 2. Under sub-Gaussian noise with per-column tail bounds, with the specific choice 2τ = 8σ√(log(2d)/n) + 8σ√(log(1/δ)/n) (equivalently τ = 4σ√(log(2d)/n) + 4σ√(log(1/δ)/n)), the probability of the bad event is at most δ.
Note: The concentration step |Xᵀε|∞ ≤ nτ/2 w.h.p. follows from the union bound + sub-Gaussian tail bounds established in Theorem 2.15. The bridging from the τ choice to (nτ/2)²/(2nσ²) ≥ log(2d/δ) is axiomatized since it involves the same calculation as Prop 2.16.
Expectation bounds. #
The book states: E[MSE(X θ̂^L)] ≲ kσ² log(2d)/n E[|θ̂^L - θ*|₁] ≲ kσ √(log(2d)/n)
The proof says "The bound in expectation follows using the same argument as in the proof of Corollary 2.9." Since the expectation integration argument is not given explicitly, we state these as axioms.
Note: the expectation bounds follow from integrating the high-probability bounds over δ ∈ (0,1). The key idea is: E[Z] = ∫₀^∞ P(Z > t) dt ≤ bound + ∫_{bound}^∞ P(Z > t) dt which yields the stated rates.
Axiom: Tail integration for the Lasso MSE (Corollary 2.9 argument).
The book says "The bound in expectation follows using the same argument as in the proof of Corollary 2.9" (line 1823 of the textbook). Corollary 2.9's proof uses the layer cake formula E[Z] = ∫₀^∞ P(Z > t) dt together with the high-probability bound. Since the proof of Theorem 2.18 does not give the details (it only references Cor. 2.9), we axiomatize the tail integration sub-step.
Specifically: given • a non-negative random variable Z with the deterministic bound Z(ω) ≤ 576·k·τ² on the event where |Xᵀε|∞ ≤ nτ/2, • a concentration bound P(bad event) ≤ 2d·exp(-(nτ/2)²/(2nσ²)), • τ = 4σ√(log(2d)/n) so that P(bad event) ≤ 1/(2d), the expectation E[Z] is finite and bounded by C·k·σ²·log(2d)/n.
The full argument uses the sub-Gaussian tail structure to bound the contribution from the bad event (not just the good event).
Theorem 2.18 (Lasso — MSE expectation bound).
E[MSE(X θ̂^L)] ≤ C·k·σ²·log(2d)/n.
The book defers the proof to "the same argument as in the proof of
Corollary 2.9". The tail integration step is axiomatized via
tail_integration_lasso_mse; here we compose it with the
deterministic MSE bound from thm_2_18_lasso_fast_rate_mse.
Axiom: Parametric ℓ₁ tail bound for the Lasso.
Under the Lasso optimality condition (with regularization parameter τ₀) and sub-Gaussian noise, for every δ ∈ (0,1):
P(|θ̂-θ*|₁ > 32·k·(4σ√(log(2d)/n) + 4σ√(log(1/δ)/n))) ≤ δ
The proof combines:
- The concentration bridge (thm_2_18_concentration_bridge): P(∃j, |Xⱼᵀε| > nτ(δ)/2) ≤ δ where τ(δ) = τ₀ + 4σ√(log(1/δ)/n)
- The Lasso feasibility bound: the ℓ₁ error is a.s. bounded by a quantity involving ‖ε‖²/(nτ₀), giving Z ≤ 32kτ(δ) on the good event
- The monotonicity of the bad event: P(Z > 32kτ(δ)) ≤ P(bad(τ(δ))) ≤ δ
Note: This requires showing that on the good event for τ(δ) where ∀j, |Xⱼᵀε| ≤ nτ(δ)/2, the Lasso solved with τ₀ satisfies Z ≤ 32kτ(δ). This holds because τ(δ) ≥ τ₀ and the ℓ₁ bound from Thm 2.18 is monotone in the noise threshold (the Lasso optimality condition with τ₀ ≤ τ(δ) implies the basic inequality with the larger threshold). The full argument is not given in the textbook (which references Cor 2.9), so we axiomatize this tail bound.
Tail integration for the Lasso ℓ₁ bound (Corollary 2.9 argument).
The book says "The bound in expectation follows using the same argument as in the proof of Corollary 2.9" (line 1823 of the textbook). Corollary 2.9's proof uses the layer cake formula E[Z] = ∫₀^∞ P(Z > t) dt together with the high-probability bound.
The proof combines lasso_l1_parametric_tail_bound (axiomatized parametric
tail bound) with layer_cake_parametric_tail_bound (axiomatized layer-cake
integration), following the same pattern as layer_cake_lasso_expected_mse.
Theorem 2.18 (Lasso — ℓ₁ expectation bound).
E[|θ̂^L - θ*|₁] ≤ C·k·σ·√(log(2d)/n).
The book defers the proof to "the same argument as in the proof of
Corollary 2.9". The tail integration step (layer-cake formula
E[Z] = ∫₀^∞ P(Z > t) dt combined with the sub-Gaussian tail bound)
is axiomatized via tail_integration_lasso_l1; here we compose it
with the deterministic ℓ₁ bound from thm_2_18_lasso_probabilistic.