Theorem 2.2: Ridge/Least Squares Risk Bound #
Main results #
thm_2_2_ridge_risk_bound: The bundled theorem combining both parts: (1) E[MSE] ≤ 16σ²r/n, and (2) With probability ≥ 1−δ, MSE ≤ 64σ²(r+log(1/δ))/n.thm_2_2_fundamental_inequality: The algebraic core (equation 2.5): |X(θ̂ − θ*)|² ≤ 2 · εᵀ X(θ̂ − θ*).thm_2_2_ls_expected_mse: Part 1 — Expected MSE bound.thm_2_2_ls_high_prob_mse: Part 2 — High-probability MSE bound.
Proof strategy #
The proof is purely algebraic at its core and uses two ingredients:
Minimality: Since θ̂ minimizes |Y − Xθ|², we have |Y − Xθ̂|² ≤ |Y − Xθ*|² = |ε|².
Expansion: Setting v = X(θ̂ − θ*), we have |Y − Xθ̂|² = |ε − v|² = |ε|² − 2⟨ε, v⟩ + |v|².
Combining: |ε|² − 2⟨ε, v⟩ + |v|² ≤ |ε|², so |v|² ≤ 2⟨ε, v⟩.
The full Theorem 2.2 combines this with sub-Gaussian tail bounds (Theorem 1.19) applied to ε̃ = Φᵀε (where Φ is an orthonormal basis for col(X)) to obtain: E[MSE] ≲ σ²r/n where r = rank(XᵀX).
References #
- Rigollet, High-Dimensional Statistics, Theorem 2.2, equation (2.5)
Theorem 2.2 — Fundamental Inequality (algebraic core).
Under the linear model Y = Xθ* + ε, if θ̂ minimizes the squared residual |Y - Xθ|² = (Y - Xθ) ⬝ᵥ (Y - Xθ) over all θ, then
|X(θ̂ - θ*)|² ≤ 2 · εᵀ X(θ̂ - θ*)
where |v|² = v ⬝ᵥ v is the squared Euclidean (L₂) norm.
This is equation (2.5) in Rigollet, Chapter 2.
Theorem 2.2 — Fundamental Inequality (EuclideanSpace norm version).
The same result as thm_2_2_fundamental_inequality, but with the conclusion
expressed using the L₂ norm: ‖X(θ̂ - θ*)‖²_L2 ≤ 2 · εᵀ X(θ̂ - θ*).
Lemma 1.4 consequence: Sub-Gaussian variance bound.
For Z ~ subG(σ²) (i.e., E[exp(sZ)] ≤ exp(s²σ²/2) for all s), the second moment satisfies E[Z²] ≤ 4σ².
This is the k=2 case of Lemma 1.4 (Rigollet, Chapter 1): E[|Z|^k] ≤ (2σ²)^{k/2} · k · Γ(k/2) which gives E[Z²] ≤ (2σ²)^1 · 2 · Γ(1) = 2σ² · 2 · 1 = 4σ².
The full proof via the layer cake formula is in Rigollet.Chapter1.Lemma_1_4.
Note on the formalization gap: The textbook implicitly assumes
exp(sZ) is integrable (as in Definition 1.2 / IsSubGaussian). The
hypothesis below uses the Bochner integral, which returns 0 for
non-integrable functions, making the MGF bound vacuously true when
exp(sZ) is not integrable. In the downstream usage
(subG_expected_squared_norm_bound), the hypothesis always comes from
a full sub-Gaussian condition where integrability holds.
ONB existence with Parseval identity.
For any n×d real matrix X with r = rank(XᵀX), there exist r unit vectors φ₁,...,φᵣ ∈ ℝⁿ forming an orthonormal basis for the column space of X, satisfying the Parseval identity and the linear expansion property.
The construction uses stdOrthonormalBasis from Mathlib applied to the
column space of X transported to EuclideanSpace ℝ (Fin n).
Integrability from sub-Gaussian MGF bound.
If a random variable Z satisfies the sub-Gaussian MGF bound E[exp(s·Z)] ≤ exp(s²σ²/2) for all s, and exp(s·Z) is integrable for all s, then Z is integrable and Z² is integrable.
The integrability of exp(s·Z) is a standard part of the sub-Gaussian definition
(see IsSubGaussian in Rigollet.Chapter1.Def_1_2). The textbook implicitly
assumes this in the proof of Theorem 2.2 when it asserts that ε̃ = Φᵀε has
integrable components.
ONB decomposition with integrability.
For any n×d real matrix X with r = rank(XᵀX), there exist r unit vectors φ₁,...,φᵣ ∈ ℝⁿ forming an orthonormal basis for the column space of X, such that:
Pointwise Cauchy-Schwarz bound: For any δ ∈ ℝᵈ and noise v ∈ ℝⁿ, if |Xδ|² ≤ 2⟨v, Xδ⟩ (the fundamental inequality), then |Xδ|² ≤ 4 · Σᵢ (v ⬝ᵥ φᵢ)².
Proof (from Rigollet, Theorem 2.2): Write Xδ = Σ cᵢφᵢ (since Xδ ∈ col(X)). Then ⟨v,Xδ⟩ = Σ cᵢ⟨v,φᵢ⟩. By Cauchy-Schwarz, (Σ cᵢ⟨v,φᵢ⟩)² ≤ (Σ cᵢ²)(Σ⟨v,φᵢ⟩²) = |Xδ|²·(Σ⟨v,φᵢ⟩²). Combined with |Xδ|² ≤ 2⟨v,Xδ⟩, this gives |Xδ|² ≤ 4Σ⟨v,φᵢ⟩² by
nlinarith.Integrability: Each ε ⬝ᵥ φᵢ is integrable, has integrable exponential moments, and (ε ⬝ᵥ φᵢ)² is integrable.
Since dotProduct(φᵢ, φᵢ) = 1, the sub-Gaussian condition directly gives the MGF bound for ε ⬝ᵥ φᵢ, and
integrable_of_mgf_boundyields all integrability conditions.
Sup-out and ONB decomposition bound.
Combining the fundamental inequality with the "sup-out" technique and the orthonormal basis decomposition of col(X), the expected squared prediction error is bounded by the sum of projected variances:
E[|X(θ̂-θ*)|²] ≤ 4 · Σᵢ₌₁ʳ E[ε̃ᵢ²] ≤ 16σ²r
where ε̃ = Φᵀε and Φ is an ONB for col(X).
Proof (Rigollet, Theorem 2.2):
- Fundamental inequality: ∀ω, |X(θ̂-θ*)|² ≤ 2⟨ε, X(θ̂-θ*)⟩
- For v = X(θ̂-θ*) in col(X), write v = Φν, so ⟨ε,v⟩ = ε̃ᵀ(ν/|ν|)·|ν| ≤ |v|·sup_{‖u‖≤1} ε̃ᵀu
- Therefore |v|² ≤ 2|v|·sup ⟹ |v| ≤ 2·sup ⟹ |v|² ≤ 4·(sup)²
- sup_{‖u‖≤1} (ε̃ᵀu)² = |ε̃|² = Σᵢ ε̃ᵢ² (optimum at u = ε̃/|ε̃|)
- E[Σᵢ ε̃ᵢ²] = Σᵢ E[ε̃ᵢ²] ≤ Σᵢ 4σ² = 4σ²r (by
subG_variance_bound) - Combined: E[|v|²] ≤ 4·4σ²r = 16σ²r
The ONB construction and integrability conditions are provided by
exists_onb_pointwise_bound_with_integrability. The rest of the proof
(integration, sum-integral interchange, and application of hvar) is
formally verified below.
Sub-Gaussian MGF bound with σ=0 implies projection is zero a.e.
From ∫ exp(s · ε⬝ᵥv) ≤ 1 for all s ∈ ℝ (with v a unit vector), the random variable Z = ε⬝ᵥv satisfies Z = 0 a.s.
Proof (Markov's inequality): For t > 0 and s > 0, P(Z ≥ t) ≤ E[exp(sZ)] · exp(-st) ≤ exp(-st) → 0 as s → +∞. Similarly P(Z ≤ -t) → 0 as s → -∞. Hence P(Z ≠ 0) = 0.
This argument requires measurability of Z = ε⬝ᵥv, which is implicit in the
probability theory setting of the textbook. The formalization uses the Bochner
integral without explicit measurability hypotheses, creating a gap that is
bridged by sorry below.
Zero sub-Gaussian parameter implies zero prediction error.
When σ = 0, the sub-Gaussian condition E[exp(s·εᵀv)] ≤ 1 for all s and unit vectors v implies that εᵀv = 0 a.s. for every direction v. Combined with the fundamental inequality |X(θ̂-θ*)|² ≤ 2⟨ε, X(θ̂-θ*)⟩, this forces |X(θ̂-θ*)|² = 0 a.s., and therefore ∫ |X(θ̂-θ*)|² dμ ≤ 0.
The proof that E[exp(sZ)] ≤ 1 for all s implies Z = 0 a.s. uses Markov's
inequality: P(Z ≥ t) ≤ E[exp(sZ)]·exp(-st) ≤ exp(-st) → 0 as s → ∞.
This step is captured by the axiom subG_zero_dot_product_ae_zero.
Expected squared norm bound for sub-Gaussian noise.
For sub-Gaussian noise ε ~ subG_n(σ²), the expected squared prediction error satisfies E[|X(θ̂ - θ*)|²] ≤ 16 σ² r where r = rank(XᵀX).
The proof decomposes into:
thm_2_2_fundamental_inequality: |X(θ̂-θ*)|² ≤ 2⟨ε, X(θ̂-θ*)⟩ (proved)sup_out_onb_expectation_bound: sup-out + ONB decomposition + Lemma 1.4subG_variance_bound: E[Z²] ≤ 4σ² for Z ~ subG(σ²) (axiom from Ch1)
Theorem 2.2, Part 1 — Expected MSE bound for Least Squares.
Under the linear model Y = Xθ* + ε where ε ~ subG_n(σ²), the least squares estimator θ̂ satisfies
E[MSE(Xθ̂)] = (1/n) E[|Xθ̂ - Xθ*|²] ≤ 16 σ² r / n
where r = rank(XᵀX). The proof uses the fundamental inequality (equation 2.5) combined with the "sup-out" technique and Lemma 1.4 to bound 4 E[sup_{u ∈ B₂} (ε̃ᵀu)²] = 4 Σᵢ E[ε̃ᵢ²] ≤ 16 σ² r.
Rigollet, Chapter 2, Theorem 2.2, first display equation.
Derive Bochner MGF bound from Lebesgue integral MGF bound.
For nonneg measurable functions, the Lebesgue integral bound
∫⁻ ofReal(exp(s*X)) ≤ ofReal(exp(σ²s²/2)) implies the Bochner integral bound
∫ exp(s*X) ≤ exp(σ²s²/2), via integral_eq_lintegral_of_nonneg_ae.
Bridge theorem: Lebesgue MGF bound + measurability implies exp-integrability.
On a probability space, if X is measurable and the Lebesgue integral satisfies
∫⁻ ofReal(exp(sX)) ≤ ofReal(exp(σ²s²/2)) for all s, then exp(s*X) is
Bochner integrable for every s.
The key insight is that the Lebesgue integral ∫⁻ gives ⊤ (not 0) for
non-integrable nonneg functions, so the finite upper bound forces finiteness.
This is in contrast to the Bochner integral ∫ which gives 0 for non-integrable
functions, making the bound vacuous. The textbook's E[exp(sX)] for the nonneg
function exp(sX) corresponds to ∫⁻, not ∫.
If s * c ≤ exp(σsq * s² / 2) - 1 for all s : ℝ, then c = 0.
This is a pure-analysis helper for mean_zero_of_subG_mgf_bound. The proof uses
the fact that g(s) = exp(σsq * s² / 2) - 1 - s * c has a global minimum at s = 0
(where g(0) = 0), so its derivative there must vanish, giving -c = 0.
Bridge theorem: MGF bound implies mean zero.
Standard measure-theoretic fact: on a probability space,
if E[exp(sX)] ≤ exp(σ²s²/2) for all s, then E[X] = 0.
Proof: from exp(sX) ≥ 1 + sX (convexity) and the MGF bound, we get
s * E[X] ≤ exp(σ²s²/2) - 1 for all s. The function
g(s) = exp(σ²s²/2) - 1 - s * E[X] is non-negative with g(0) = 0,
so g'(0) = -E[X] = 0 by IsLocalMin.hasDerivAt_eq_zero.
MGF bound + orthonormality implies IsSubGaussian for projected noise.
The textbook delegates the sub-Gaussianity of projected noise to "the last step in the proof of Theorem 1.19" without reproducing the argument. The proof reduces the inner product ⟨a, Φᵀε⟩ to a dot product ε ⬝ᵥ w where w = Φᵀa has ‖w‖ ≤ 1 (by orthonormality), then applies the MGF hypothesis. The integrability/mean-zero/exp-integrability conditions are supplied by bridge axioms (standard measure theory not in the book).
ONB projected noise tail bound.
For an ONB Φ of the column space of X, the squared norm of the projected noise ε̃ = Φᵀε satisfies a covering-number tail bound. This follows from Theorem 1.19's ε-net argument applied to the r-dimensional sub-Gaussian vector ε̃ = (⟨ε,φ₁⟩,...,⟨ε,φᵣ⟩). The textbook delegates this to "the last step in the proof of Theorem 1.19" without reproducing the argument.
The bridge between dotProduct (Fin n → ℝ) and inner (EuclideanSpace ℝ
(Fin r)) is a representational gap; mathematically this is a direct
application of theorem_1_19_tail_bound to the projected noise vector.
6 < exp(2), needed for the covering number arithmetic.
6·exp(-2) ≤ 1, the key for the covering number bound decay.
Chi-squared tail bound from ONB decomposition + Theorem 1.19.
This captures the core probabilistic step in the high-probability bound for Theorem 2.2 (Rigollet). The textbook says (line 1179):
"Moreover, with probability 1 - δ, it follows from the last step in the proof of Theorem 1.19 that sup_{u∈B₂}(ε̃ᵀu)² ≤ 8log(6)σ²r + 8σ²log(1/δ)."
The proof combines:
- ONB construction: Let Φ be an ONB for col(X) (
exists_onb_with_parseval). - Fundamental inequality + sup-out: |X(θ̂-θ*)|² ≤ 4·Σᵢ⟨ε,φᵢ⟩².
- Covering number bound: P(Σᵢ⟨ε,φᵢ⟩² > t²) ≤ 6^r·exp(-t²/(8σ²))
(
onb_projected_noise_tail_bound, from Thm 1.19). - Arithmetic: Choose t so that 6^r·exp(-t²/(8σ²)) ≤ δ.
The constant 64 arises from: 4 (fundamental ineq) × 8 (covering number) × 2 (absorbing the log(6) factor), matching the textbook's ≲ notation.
High-probability squared norm bound for sub-Gaussian noise.
Combining the fundamental inequality with the "sup-out" technique, ONB decomposition, and sub-Gaussian concentration (Theorem 1.19): With probability ≥ 1 - δ, |X(θ̂-θ*)|² ≤ 64σ²(r + log(1/δ)).
Proof sketch (Rigollet, Theorem 2.2, Part 2):
- |X(θ̂-θ*)|² ≤ 4 sup_{u∈B₂} (ε̃ᵀu)² (from fundamental ineq + sup-out)
- With prob 1-δ: sup ≤ Cσ²(r + log(1/δ)) (from Thm 1.19 last step)
- Combined: |X(θ̂-θ*)|² ≤ 64σ²(r + log(1/δ)) w.p. 1-δ
Theorem 2.2, Part 2 — High-probability MSE bound.
Under the linear model Y = Xθ* + ε where ε ~ subG_n(σ²), for any δ > 0, with probability at least 1 - δ:
MSE(Xθ̂) = (1/n)|Xθ̂ - Xθ*|² ≤ 64 σ²(r + log(1/δ)) / n
where r = rank(XᵀX).
The proof uses the fundamental inequality (equation 2.5) combined with sub-Gaussian tail bounds (Theorem 1.19) applied to the rotated noise ε̃ = Φᵀε. The additional log(1/δ) term comes from the sub-Gaussian concentration: P(Z² ≥ t) ≤ 2exp(-t/(4σ²)) for Z ~ subG(σ²), applied with a union bound over the r coordinates of ε̃.
Rigollet, Chapter 2, Theorem 2.2, second display equation.
Theorem 2.2 — Ridge/LS risk bound (bundled).
Under the linear model Y = Xθ* + ε where ε ~ subG_n(σ²), the least squares estimator θ̂ satisfies both:
- Expected MSE: E[(1/n)|Xθ̂ - Xθ*|²] ≤ 16σ²r/n
- High-probability MSE: With probability ≥ 1-δ, (1/n)|Xθ̂ - Xθ*|² ≤ 64σ²(r + log(1/δ))/n
where r = rank(XᵀX).
Rigollet, Chapter 2, Theorem 2.2.