Theorem 2.14 (BIC estimator MSE bound). Under the linear model Y = Xθ* + ε with ε sub-Gaussian(σ²), the BIC estimator with τ² ≍ σ²log(ed)/n satisfies: |X(θ̂ᴮᴵᶜ - θ*)|² ≤ 2nτ²|θ*|₀ + t with probability 1 - δ (where t depends on σ²log(1/δ)).
Formalized here as the deterministic algebraic bound that follows from BIC optimality (using sqL2norm) plus the "sup-out" noise bound.
The noise hypothesis encodes the probabilistic bound from the proof: after applying Young's inequality 2ab ≤ 2a² + b²/2 and normalizing, one obtains 4(εᵀu)² ≤ 2nτ²|θ̂|₀ + t where u = Xδ/|Xδ|₂. In quadratic form (avoiding division by |Xδ|₂): 4⟨ε,Xδ⟩² ≤ |Xδ|₂² · (2nτ²|θ̂|₀ + t).
Probabilistic form of Theorem 2.14 #
Coordinate-wise sub-Gaussian implies dot-product MGF bound. This encapsulates the fact that if each εᵢ is sub-Gaussian(σ²), then for any unit vector v with ‖v‖₂² ≤ 1, the linear combination ⟨ε, v⟩ satisfies the MGF bound E[exp(s·⟨ε,v⟩)] ≤ exp(s²σ²/2).
In the book's model, the noise coordinates are independent, so this follows from Proposition 1.4 (Theorem 1.6 in Chapter 1). The independence is part of the statistical model definition, not a theorem to be proved. This helper bridges the formalization gap between coordinate-wise IsSubGaussian and the joint MGF bound.
Note: This theorem requires independence and measurability of the noise coordinates, which are part of the statistical model definition. The proof uses Theorem 1.6 (sub-Gaussian linear combinations).
Axiom (Theorem 1.19 application): Single-support-set concentration bound.
For a fixed support set S of size k, the book (lines 1604-1614) constructs an ONB for span{X_j : j ∈ S ∪ supp(θ*)} (dimension ≤ k + |θ*|₀), applies Theorem 1.19 to the sub-Gaussian maximum of ε projected onto each ONB vector, and bridges the quadratic form to inner product form.
This captures the "by reference to Theorem 1.19" step for a single support set, including:
- ONB construction for column subspaces (book line 1604),
- Bridging from Matrix.mulVec quadratic form to EuclideanSpace inner products,
- Application of theorem_1_19_tail_bound (book lines 1610-1614),
- The dimension bound dim(span) ≤ k + |θ*|₀ (book line 1604).
Proved via single_support_thm119_bound' in Thm_2_14_SubspaceBound.lean,
which composes subspace_thm119_bound (Theorem 1.19 for subspace projections)
with onb_event_containment (ONB construction axiom for column spans).
Raw per-support-size bound from Theorem 1.19 + union over supports.
For a fixed support size k, the book (lines 1600-1615) applies Theorem 1.19 to each support set S of size k, using the ONB of span{X_j : j ∈ S ∪ supp(θ*)}. The union bound over the (d choose k) supports of size k gives:
μ(bad event for l0norm ≤ k) ≤ (d choose k) · 6^(k + |θ*|₀) · exp(-t/(32σ²) - n·τsq·k/(16σ²))
The proof factors the argument into:
- single_support_thm119_bound (proved via SubspaceBound, by reference to Theorem 1.19): for each support set S of size k, the per-support concentration bound.
- Set containment: the bad event {∃θ, l0norm θ ≤ k ∧ bad(θ)} is contained in the union ⋃_{S ∈ powersetCard(k)} {∃θ, supp(θ) ⊆ S ∧ bad(θ)}.
- Union bound over the (d choose k) support sets.
The remaining Lemma 2.7 absorption step is proved explicitly in per_size_concentration_bound.
Per-support-size concentration bound (book lines 1596-1617).
This combines the raw Theorem 1.19 bound (per_size_thm119_raw_bound) with Lemma 2.7 (binom bound) and the τ² condition to obtain:
μ(bad event for l0norm ≤ k) ≤ exp(-t/(32σ²) - k·log(ed) + |θ*|₀·log(12))
The proof explicitly performs the Lemma 2.7 absorption step (book line 1617):
- Lemma 2.7: (d choose k) ≤ (ed/k)^k ≤ (ed)^k,
- Combine: (d choose k) · 6^(k+|θ*|₀) ≤ (6ed)^k · 6^|θ*|₀,
- Use τ² condition: n·τ²·k/(16σ²) ≥ k·(log 6 + 2·log(ed)),
- Absorb: (6ed)^k · exp(-nτ²k/(16σ²)) ≤ exp(-k·log(ed)),
- Relax: 6^|θ*|₀ ≤ 12^|θ*|₀.
Containment of θ=0 bad event in the k≥1 union (corrected statement).
The θ=0 bad event is contained in the union ⋃_{k=1}^d S(k) where S(k) uses the condition l0norm θ ≤ k (allowing θ=0 as a witness for any k ≥ 1, since l0norm(0)=0 ≤ k). This matches the book’s decomposition over support sets S with |S|=k, where supp(θ) ⊆ S includes θ=0 for any nonempty S.
Union bound + Theorem 1.19 intermediate step (book lines 1596-1617).
This theorem encodes the combined result of:
- The support decomposition (book lines 1596-1604),
- ONB construction for column subspaces (book line 1604),
- The ε-net argument from Theorem 1.19 (book lines 1610-1614),
- The union bound over supports and support sizes (book line 1608),
- Lemma 2.7 bounding (d choose k) · exp(-2k·log(ed)) ≤ exp(-k·log(ed)) (book line 1617).
The intermediate result (book line 1617, after Lemma 2.7): μ(bad event) ≤ ∑_{k=1}^d exp(-t/(32σ²) - k·log(ed) + |θ*|₀·log(12))
The proof uses per_size_concentration_bound (axiom, for the by-reference Theorem 1.19 and Lemma 2.7 steps) and zero_bad_event_contained (proved).
Sup-out probability bound (Rigollet, line 1618).
This helper encodes the probability bound derived in lines 1596-1618 of the proof of Theorem 2.14. The argument proceeds by:
- Decomposing the sup over θ as max over support sizes k and supports S (line 1596-1604),
- Applying the ε-net argument from Theorem 1.19 to bound the sub-Gaussian maximum over each subspace (line 1610-1614), and
- Using Lemma 2.7 to bound the binomial coefficients and summing the geometric series (line 1616-1618).
The result is (book line 1618): P(∃θ, 4[εᵀU(θ-θ*)]² - 2nτ²|θ|₀ ≥ t) ≤ exp(-t/(32σ²) + |θ*|₀·log(12))
The proof uses sup_out_intermediate_bound (which uses per_size_concentration_bound for the by-reference steps) and then bounds the geometric series explicitly.
Noise concentration bound (from the sub-Gaussian union/ε-net argument).
This encodes the probabilistic core of Theorem 2.14's proof: with probability ≥ 1 - δ, the "sup-out" noise bound 4⟨ε,X(θ-θ*)⟩² ≤ |X(θ-θ*)|² · (2nτ²|θ|₀ + t) holds simultaneously for all θ, where τ² = (16·log(6) + 32·log(ed))·σ²/n and t = 32σ²·|θ*|₀·log(12) + 32σ²·log(1/δ).
Proved using sup_out_probability_bound (which encapsulates the Theorem 1.19
application and union bound from the book's proof, lines 1596-1618).
Theorem 2.14 (probabilistic form). Assume the linear model Y = Xθ* + ε where ε ~ subG_n(σ²). The BIC estimator θ̂^BIC with regularization parameter τ² = 16·log(6)·σ²/n + 32·σ²·log(ed)/n satisfies, with probability at least 1 - δ (book line 1622): |X(θ̂ - θ*)|₂² ≤ 224·|θ*|₀·σ²·log(ed) + 32·σ²·log(1/δ)
The two-term structure is faithful to the book's proof: the first term comes from 2nτ²|θ*|₀ + 32σ²|θ*|₀log(12), and the second from 32σ²log(1/δ). These have different dependence on |θ*|₀.