Squared Frobenius norm #
Frobenius inner product #
Rank Penalization Objective and Estimator #
The rank penalization objective function: L(Θ) = (1/n)‖Y - XΘ‖_F² + 2τ²·rank(Θ)
This is the spectral-domain analogue of the BIC estimator: instead of penalizing ℓ₀ sparsity of the parameter vector, we penalize the rank of the parameter matrix.
Instances For
IsRankPenalizationEstimator Y X τ Θhat asserts that Θhat is a minimizer of the
rank penalization objective:
Θ̂^RK ∈ argmin_Θ { (1/n)‖Y - XΘ‖_F² + 2τ²·rank(Θ) }
This is the multivariate analogue of the BIC estimator (Definition 2.3) in the spectral domain.
Instances For
Prediction MSE #
Theorem 4.4: Rank Penalization Estimator Bound #
Proof strategy #
The proof uses the minimizer property and a direct bound. From the rank penalization minimality at Θ = Θ*:
(1/n)‖Y - XΘ̂‖_F² + 2τ²rank(Θ̂) ≤ (1/n)‖Y - XΘ*‖_F² + 2τ²rank(Θ*)
Substituting Y = XΘ* + E and expanding: (1/n)‖E - XΔ‖_F² + 2τ²rank(Θ̂) ≤ (1/n)‖E‖_F² + 2τ²rank(Θ*)
where Δ = Θ̂ - Θ*. Expanding ‖E - XΔ‖² = ‖E‖² + ‖XΔ‖² - 2⟨E, XΔ⟩: (1/n)(‖E‖² + ‖XΔ‖² - 2⟨E,XΔ⟩) + 2τ²rank(Θ̂) ≤ (1/n)‖E‖² + 2τ²rank(Θ*)
Cancel ‖E‖²: (1/n)(‖XΔ‖² - 2⟨E,XΔ⟩) + 2τ²rank(Θ̂) ≤ 2τ²rank(Θ*)
Rearranging: (1/n)‖XΔ‖² ≤ (2/n)⟨E,XΔ⟩ + 2τ²rank(Θ*) - 2τ²rank(Θ̂)
Then bound the cross-term using Hölder's inequality and SVD: (2/n)⟨E,XΔ⟩ ≤ (1/(2n))‖XΔ‖² + 2τ²(rank(Θ̂) + rank(Θ*))
This combines Young's inequality (2ab ≤ a² + b²) with the spectral bound from the SVD and the operator norm hypothesis.
Substituting: (1/n)‖XΔ‖² ≤ (1/(2n))‖XΔ‖² + 2τ²(rank(Θ̂)+rank(Θ*)) + 2τ²rank(Θ*) - 2τ²rank(Θ̂) (1/(2n))‖XΔ‖² ≤ 4τ²rank(Θ*) (rank(Θ̂) cancels) (1/n)‖XΔ‖² ≤ 8τ²rank(Θ*)
The Frobenius norm expansion identity: ‖A - B‖_F² = ‖A‖_F² + ‖B‖_F² - 2·Σᵢⱼ Aᵢⱼ·Bᵢⱼ
Cross-term bound (Steps 3-5 of Rigollet's proof) #
The key analytic step that requires SVD and Schatten norm machinery:
Given Φ with orthonormal columns spanning col(X) and the operator norm bound ‖ΦᵀE‖_op² ≤ nτ² (from Lemma 4.2 applied to ΦᵀE), the cross-term satisfies:
(2/n)⟨E, XΔ⟩ ≤ (1/(2n))‖XΔ‖_F² + 2τ²(rank(Θ̂) + rank(Θ*))
This uses:
- Write XΔ = ΦN (since col(X) ⊆ col(Φ)), with rank(XΔ) ≤ rank(Θ̂) + rank(Θ*)
- ⟨E, XΔ/‖XΔ‖_F⟩ = ⟨ΦᵀE, N/‖N‖_F⟩ ≤ ‖ΦᵀE‖_op · ‖N/‖N‖F‖* (Hölder)
- Nuclear-Frobenius: ‖N/‖N‖F‖* ≤ √rank(N) (Cauchy-Schwarz on singular values)
- Apply Young's inequality: 2ab ≤ a²/ε + εb² with appropriate ε
- Use rank(XΔ) ≤ rank(Θ̂) + rank(Θ*) (rank subadditivity for Δ = Θ̂ - Θ*)
This requires SVD decomposition and Schatten/nuclear norm inequalities
that are not available in Mathlib. The SVD-dependent combined bound
svd_inner_product_rank_bound is proved using holder_trace_rank_bound (axiom)
as the single remaining unproved ingredient.
Frobenius inner product transpose property. ⟨E, X·A⟩_F = ⟨Xᵀ·E, A⟩_F
Proof: Expand both sides as triple sums and swap summation order.
Frobenius inner product helper lemmas #
Frobenius inner product with a zero-norm matrix is zero.
SVD-based unit inner product bound #
The following is the SVD-dependent core of the proof of Theorem 4.4. Given the unit direction U = XΔ/‖XΔ‖_F, the textbook shows:
⟨E, U⟩² ≤ n · τ² · (rank(Θ̂) + rank(Θ*))
The proof (Rigollet, lines 2842-2860) uses three sub-results that are stated without proof in Section 4.1 or are standard linear algebra facts:
- SVD decomposition: X = ΦΛΨᵀ yields XΔ = ΦN with Φ orthonormal
- Von Neumann trace / Hölder inequality for Schatten norms (Section 4.1): |⟨A, B⟩| ≤ ‖A‖op · ‖B‖* (nuclear norm × operator norm)
- Nuclear-Frobenius inequality (Cauchy-Schwarz on singular values): ‖M‖_* / ‖M‖_F ≤ √rank(M)
- Operator norm bound via Lemma 4.2: ‖ΦᵀE‖_op ≤ √(nτ²)
The proof factors through three helper lemmas:
holder_trace_rank_bound: the Hölder-Schatten + nuclear-Frobenius combined bound (axiom)frobeniusNormSq_orth_left: ‖ΦC‖²_F = ‖C‖²_F when ΦᵀΦ = Irank_le_of_left_inverse_mul: rank(C) ≤ rank(Φ*C) when Φ has a left inverse
Frobenius norm is preserved by left-multiplication with orthonormal columns. ‖ΦC‖²_F = ‖C‖²_F when ΦᵀΦ = I.
Proof: ‖ΦC‖²_F = ⟨ΦC, ΦC⟩_F = ⟨Φᵀ(ΦC), C⟩_F = ⟨(ΦᵀΦ)*C, C⟩_F = ⟨C, C⟩_F = ‖C‖²_F
Von Neumann trace inequality combined with Cauchy-Schwarz on singular values: |⟨A,B⟩_F|² ≤ ‖A‖²_op · rank(B) · ‖B‖²_F. The textbook states this without proof in Section 4.1.1 (Schatten norms and Hölder's inequality for traces). References: Horn & Johnson, Matrix Analysis, Thm 7.4.1.1; Bhatia, Matrix Analysis, IV.2.5.
Rank is monotone under left-multiplication by a matrix with left inverse. rank(C) ≤ rank(ΦC) when ΦᵀΦ = I.
Proof: C = Φᵀ*(ΦC), so rank(C) = rank(Φᵀ(ΦC)) ≤ rank(ΦC).
SVD-dependent inner product bound.
For U = XΔ/‖XΔ‖_F the unit-Frobenius-norm direction of M = XΔ: ⟨E, U⟩² ≤ n · τ² · rank(M)
Textbook proof (Rigollet, lines 2842-2860): Write M = ΦN (Φ has orthonormal columns spanning col(X)). Then ⟨E, U⟩ = ⟨ΦᵀE, N/‖N‖_F⟩. By Hölder (Section 4.1, stated without proof): |⟨ΦᵀE, N/‖N‖_F⟩| ≤ ‖ΦᵀE‖_op · ‖N/‖N‖F‖. Nuclear-Frobenius (Cauchy-Schwarz on singular values, not proved in book): ‖N/‖N‖F‖ ≤ √rank(N). And ‖ΦᵀE‖_op² ≤ nτ² (Lemma 4.2 + SVD of X).
Proof structure: Factor XΔ = ΦC through the column space basis Φ,
use orthogonal invariance of the Frobenius norm, the Hölder trace-rank
bound (holder_trace_rank_bound, axiom), and rank monotonicity.
SVD-dependent unit inner product bound.
When U is the unit-Frobenius-norm direction of XΔ = X(Θ̂-Θ*), the squared Frobenius inner product ⟨E, U⟩² is bounded by n·τ²·(rank(Θ̂) + rank(Θ*)).
Proof: Combines the SVD inner product bound (svd_inner_product_rank_bound,
which uses holder_trace_rank_bound as the only axiom) giving ⟨E, U⟩² ≤ n·τ²·rank(XΔ),
with the rank subadditivity result rank(X(Θ̂-Θ*)) ≤ rank(Θ̂) + rank(Θ*).
Cross-term bound via Hölder and SVD (core spectral inequality).
This is the key step in the proof of Theorem 4.4 that combines:
- Young's inequality (algebraic, proved here)
- SVD-based unit inner product bound (via
unit_inner_product_svd_bound)
The proof splits into two cases:
- If ‖XΔ‖_F = 0, the cross-term is 0 and the bound holds trivially.
- If ‖XΔ‖_F > 0, we decompose ⟨E, XΔ⟩ = ⟨E, U⟩·‖XΔ‖_F where U = XΔ/‖XΔ‖_F, apply Young's inequality, and use the SVD bound on ⟨E, U⟩².
Cross-term bound (Steps 3-5 of Rigollet's proof).
Given Φ with orthonormal columns spanning col(X) and ‖ΦᵀE‖_op² ≤ nτ², the cross-term in the Frobenius expansion satisfies:
(2/n)⟨E, XΔ⟩_F ≤ (1/(2n))‖XΔ‖_F² + 2τ²(rank(Θ̂) + rank(Θ*))
where Δ = Θ̂ - Θ*.
Proof: This is a direct consequence of crossterm_svd_bound,
which captures the SVD-based spectral argument from the textbook proof.
Step 1: Minimizer inequality after substitution #
The minimizer inequality expressed in terms of E and Δ: After substituting Y = XΘ* + E: (1/n)‖E - X(Θ̂-Θ*)‖² + 2τ²rank(Θ̂) ≤ (1/n)‖E‖² + 2τ²rank(Θ*)
Theorem 4.4 (deterministic core inequality).
Consider the multivariate regression model Y = XΘ* + E.
If Θ̂^RK minimizes (1/n)‖Y - XΘ‖_F² + 2τ²·rank(Θ) and Φ has
orthonormal columns spanning col(X) with ‖ΦᵀE‖_op² ≤ nτ²,
then:
(1/n)‖XΘ̂^RK - XΘ*‖_F² ≤ 8 · rank(Θ*) · τ²
The proof uses:
- The minimizer property (comparing Θ̂ to Θ*)
- Frobenius norm expansion ‖E - XΔ‖² = ‖E‖² + ‖XΔ‖² - 2⟨E,XΔ⟩
- The cross-term bound (proved via Hölder-nuclear-Frobenius helper): (2/n)⟨E,XΔ⟩ ≤ (1/(2n))‖XΔ‖² + 2τ²(rank(Θ̂) + rank(Θ*))
- Algebraic cancellation of rank(Θ̂) terms
Theorem 4.4 (probabilistic version).
Under the multivariate regression model Y(ω) = XΘ* + E(ω) where
E ~ subG_{n×T}(σ²), the rank penalization estimator satisfies:
(1/n)‖XΘ̂^RK - XΘ*‖_F² ≤ 8 · rank(Θ*) · τ²
on the event where ‖ΦᵀE‖_op² ≤ nτ² (which by Lemma 4.2 + SVD of X has
probability at least 1 - δ when τ is chosen appropriately).
This gives the rate ≲ σ²·rank(Θ*)·(d ∨ T + log(1/δ))/n with probability 1-δ.
Rate bound: connecting 8·rank(Θ*)·τ² to σ²·rank(Θ*)·(d∨T+log(1/δ))/n #
The textbook states that the bound 8·rank(Θ*)·τ² with τ chosen from Lemma 4.2
yields the rate ≲ σ²·rank(Θ*)·(d∨T + log(1/δ))/n.
When the "score matrix" (1/n)XᵀE is sub-Gaussian with variance proxy σsq/n
(which follows from the regression model when X has orthonormal rows),
Lemma 4.2 gives:
τ₀ = 4√(σsq/n)·√(log(12)·(d∨T)) + 2√(σsq/n)·√(2·log(1/δ))
Then τ₀² ≤ 2·(16·(σsq/n)·log(12)·(d∨T)) + 2·(4·(σsq/n)·2·log(1/δ)) = 32·log(12)·(σsq/n)·(d∨T) + 16·(σsq/n)·log(1/δ) ≤ 32·log(12)·(σsq/n)·(d∨T + log(1/δ))
So 8·rank(Θ*)·τ₀² ≤ 256·log(12)·σsq·rank(Θ*)·(d∨T + log(1/δ))/n.
The constant 256·log(12) ≈ 636 can be improved; the "≲" notation hides it.
Rate bound for Theorem 4.4.
When τ = 4√σ₀·√(log(12)·D) + 2√σ₀·√(2·L) where σ₀ = σsq/n is the per-sample variance proxy and D = d∨T, L = log(1/δ), then:
8·rank(Θ*)·τ² ≤ 256·log(12) · σsq/n · rank(Θ*) · (D + L)
This uses (a+b)² ≤ 2(a²+b²) and elementary arithmetic. The bound captures the "≲ σ²·rank(Θ*)·(d∨T + log(1/δ))/n" rate.
Theorem 4.4 (full probabilistic version with rate).
Under the multivariate regression model Y(ω) = XΘ* + E(ω) where
E ~ subG_{n×T}(σ²), the rank penalization estimator with
τ = 4√(σsq/n)·√(log(12)·(d∨T)) + 2√(σsq/n)·√(2·log(1/δ))
satisfies:
(1/n)‖XΘ̂^RK - XΘ*‖_F² ≤ 256·log(12) · (σsq/n) · rank(Θ*) · (d∨T + log(1/δ))
on the event where ‖ΦᵀE‖_op² ≤ nτ², which has probability at least
1 - δ by Lemma 4.2 + SVD of X.
This formalizes the textbook's "≲ σ²·rank(Θ*)·(d∨T + log(1/δ))/n" with the explicit constant 256·log(12).
Theorem 4.4 (full probabilistic statement with measure bound).
Under the multivariate regression model Y(ω) = XΘ* + E(ω) where
ΦᵀE ~ subG_{r×T}(σ²) (the projected noise matrix is sub-Gaussian),
the rank penalization estimator with
τ = 4√σ²·√(log(12)·(r∨T)) + 2√σ²·√(2·log(1/δ))
satisfies:
Pμ { ω | predictionMSE X (Θ̂(ω)) Θ* > 8 · rank(Θ*) · τ² } ≤ δ
This is the measure-theoretic formalization of the textbook's "with probability at least 1 - δ" statement of Theorem 4.4.
Proof structure:
- The deterministic theorem gives: on {ω | ‖ΦᵀE(ω)‖_op² ≤ nτ²}, the MSE bound holds.
- The bad MSE event is contained in {ω | ‖ΦᵀE(ω)‖_op > τ} (since ‖ΦᵀE‖_op ≤ τ ⇒ ‖ΦᵀE‖_op² ≤ τ² ≤ nτ² for n≥1).
- By
lemma_4_2_operator_norm_high_prob, this event has probability ≤ δ.