EFT.WP.Core.Density v1.0
Chapter 4 — Kernel Density Estimation and Smoothing
I. Objectives and Scope
- Specify a unified formulation of kernel density estimation kde_h(x), its error decomposition, and principled bandwidth h selection so results are auditable end-to-end via workflow Mx-93.
- Cover univariate and multivariate settings, fixed and adaptive bandwidths, boundary correction, and deconvolution; align with this volume’s S92-* numbering and with Core.Sea’s window energetics U_w, ENBW_Hz.
- Deliverables: minimal equations S92-5, S92-6; implementation binding I90 2; manifest fields and quality thresholds.
II. Kernel Families and Basic Properties
- Kernel definition and constraints.
K(u) ≥ 0, ( ∫ K(u) du = 1 ); a second-order kernel satisfies mu_1(K) = ( ∫ u K(u) du ) = 0 and 0 < mu_2(K) = ( ∫ u^2 K(u) du ) < ∞.
Let R(K) = ( ∫ K(u)^2 du ) (variance constant) and mu_r(K) the r-th raw moment.
- Common examples (notation).
Gaussian: K_gauss(u) = ( 1 / sqrt(2*pi) ) * exp( - u^2 / 2 ) (unbounded support).
Epanechnikov: K_ep(u) = 0.75 * ( 1 - u^2 )_+ (compact support, (•)_+ is nonnegative truncation).
Triweight/Biweight/Uniform/Triangular as needed; among nonnegative second-order kernels, the Epanechnikov kernel minimizes AMISE.
- Alignment with windowing.
When K is used as a smoothing window, report the energy convention U_w = ( 1 / N ) * ∑ w[n]^2 and ENBW_Hz = fs * ( ∑ w[n]^2 ) / ( ∑ w[n] )^2 (see this volume and Core.Sea Chapter 5).
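The kernel constants and window conventions above can be checked numerically. The following is a minimal sketch, assuming a composite midpoint rule is accurate enough for the purpose; the helper names integrate, kernel_constants, and window_energy are hypothetical and not part of I90.

```python
import math

def k_gauss(u):
    """Gaussian kernel K_gauss(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def k_ep(u):
    """Epanechnikov kernel K_ep(u) = 0.75 * (1 - u^2)_+."""
    return 0.75 * max(0.0, 1.0 - u * u)

def integrate(f, lo, hi, n=20000):
    """Composite midpoint rule on [lo, hi] (hypothetical helper)."""
    step = (hi - lo) / n
    return step * sum(f(lo + (i + 0.5) * step) for i in range(n))

def kernel_constants(K, lo, hi):
    """Return (mass, mu_1, mu_2, R) for kernel K integrated over [lo, hi]."""
    mass = integrate(K, lo, hi)
    mu1 = integrate(lambda u: u * K(u), lo, hi)
    mu2 = integrate(lambda u: u * u * K(u), lo, hi)
    r_k = integrate(lambda u: K(u) ** 2, lo, hi)
    return mass, mu1, mu2, r_k

def window_energy(w, fs):
    """Return (U_w, ENBW_Hz) for window samples w at sampling rate fs."""
    s1 = sum(w)
    s2 = sum(v * v for v in w)
    return s2 / len(w), fs * s2 / (s1 * s1)
```

For the Epanechnikov kernel the exact values are mu_2(K) = 1/5 and R(K) = 3/5; for the Gaussian kernel, mu_2(K) = 1 and R(K) = 1 / ( 2 * sqrt(pi) ). A rectangular window of N samples gives U_w = 1 and ENBW_Hz = fs / N.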
III. Univariate KDE: Definition, Bias, and Variance
- Minimal equation S92-5 (KDE).
S92-5 : kde_h(x) = ( 1 / ( N * h ) ) * ∑_{i=1}^N K( ( x - x_i ) / h ).
Weighted form: kde_h^w(x) = ( 1 / ( h * ∑ w_i ) ) * ∑ w_i * K( ( x - x_i ) / h ), with w_i > 0.
- First-order bias and variance (large-sample approximations).
bias( kde_h(x) ) ≈ ( h^2 / 2 ) * mu_2(K) * p''(x);
var( kde_h(x) ) ≈ ( 1 / ( N * h ) ) * R(K) * p(x).
Trade-off: increasing h lowers variance and increases bias; decreasing h does the opposite.
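A direct evaluation of S92-5 and its weighted form can be sketched as follows. This is an O(N) per-point illustration under the assumption of a Gaussian default kernel; the function name kde here is a hypothetical helper and is not the kde_build / kde_eval interface of I90.

```python
import math

def k_gauss(u):
    """Gaussian kernel K_gauss(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, K=k_gauss, weights=None):
    """Evaluate S92-5 at x; if weights are given, evaluate kde_h^w(x)."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    if weights is None:
        weights = [1.0] * len(data)  # unit weights recover S92-5
    if len(weights) != len(data) or any(w <= 0 for w in weights):
        raise ValueError("weights must be positive and match data length")
    total = sum(w * K((x - xi) / h) for xi, w in zip(data, weights))
    return total / (h * sum(weights))
```

Because the weighted form divides by ∑ w_i, multiplying all weights by a common factor leaves the estimate unchanged.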
IV. MISE and AMISE (Minimal Equations)
- ISE(h) = ( ∫ ( kde_h(x) - p(x) )^2 dx ); MISE(h) = E[ ISE(h) ].
- Minimal equation S92-6 (AMISE approximation).
S92-6 : AMISE(h) ≈ ( R(K) / ( N * h ) ) + ( ( h^4 / 4 ) * mu_2(K)^2 * R( p'' ) ), where R( p'' ) = ( ∫ ( p''(x) )^2 dx ).
Ideal bandwidth: h_AMISE = ( R(K) / ( mu_2(K)^2 * R( p'' ) * N ) )^(1/5) (requires a pilot estimate of R( p'' )).
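As a worked instance of S92-6, the sketch below evaluates AMISE(h) and h_AMISE under a normal-reference assumption: if p is N(0, sigma^2), then R( p'' ) = 3 / ( 8 * sqrt(pi) * sigma^5 ). The normal reference is an assumption used for illustration only; the function names are hypothetical.

```python
import math

def amise(h, n, r_k, mu2, r_pdd):
    """S92-6: variance term plus squared-bias term."""
    return r_k / (n * h) + 0.25 * h ** 4 * mu2 ** 2 * r_pdd

def h_amise(n, r_k, mu2, r_pdd):
    """Minimizer of S92-6 with respect to h."""
    return (r_k / (mu2 ** 2 * r_pdd * n)) ** 0.2

def r_pdd_normal(sigma):
    """R(p'') when p is the N(0, sigma^2) density (normal-reference assumption)."""
    return 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)

# Gaussian kernel constants: R(K) = 1 / (2 * sqrt(pi)), mu_2(K) = 1.
R_K_GAUSS = 1.0 / (2.0 * math.sqrt(math.pi))
h_ref = h_amise(1000, R_K_GAUSS, 1.0, r_pdd_normal(1.0))
```

With these constants the expression collapses to h = ( 4 / ( 3 * N ) )^(1/5) * sigma, roughly 1.06 * sigma * N^(-1/5), which is the familiar Gaussian normal-reference bandwidth.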
V. Bandwidth Selection: Rules, Cross-Validation, Plug-In
- Rule-of-thumb (1D).
Scott: h_scott = sigma_x * N^(-1/5);
Silverman: h_silver = 0.9 * min( sigma_x , IQR / 1.34 ) * N^(-1/5);
Robust scale: sigma_robust = min( sigma_x , MAD / 0.6745 ) as a drop-in for sigma_x.
- Least-squares cross-validation (LSCV).
CV(h) = ( ∫ ( kde_h(x) )^2 dx ) - ( 2 / N ) * ∑_{i=1}^N kde_{-i,h}( x_i ),
with kde_{-i,h}( x_i ) = ( 1 / ( (N-1) * h ) ) * ∑_{j ≠ i} K( ( x_i - x_j ) / h ).
Choose h* = argmin_h CV(h) and record CV(h*).
- Likelihood cross-validation (LCV).
LCV(h) = ( 1 / N ) * ∑_{i=1}^N log( kde_{-i,h}( x_i ) ), choose h* = argmax_h LCV(h).
- Plug-in.
Estimate p'' via a pilot kernel or normal approximation to obtain R( p'' ), then back-substitute in h_AMISE.
- Grid and line search.
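Search on a log scale: h = h0 * exp( k * Delta ) for integer k; if CV(h) is multi-modal, smooth the score curve or bracket the best grid point first, then apply golden-section refinement inside that bracket (golden-section search assumes a single minimum within the bracket).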
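The selectors in this section can be sketched for the Gaussian kernel as follows. One assumption beyond the text: for a Gaussian kernel the term ∫ kde_h(x)^2 dx has the closed form ( 1 / N^2 ) * ∑_i ∑_j phi( x_i - x_j ; 0 , 2 h^2 ), which avoids numerical integration. All function names are hypothetical, and the O(N^2) loops are for illustration rather than the KD-tree / FFT evaluation expected in production.

```python
import math
import statistics

def normal_pdf(d, s):
    """Density of N(0, s^2) at d; equals (1/s) * K_gauss(d / s)."""
    return math.exp(-0.5 * (d / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def h_silverman(data):
    """Silverman rule: 0.9 * min(sigma_x, IQR / 1.34) * N^(-1/5)."""
    n = len(data)
    sigma = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 0.9 * min(sigma, (q3 - q1) / 1.34) * n ** (-0.2)

def lscv_gauss(data, h):
    """CV(h) for a Gaussian kernel, using the closed-form integral term."""
    n = len(data)
    sq = 0.0   # all pairs (i, j), including i == j: integral of kde_h^2
    loo = 0.0  # pairs with i != j: leave-one-out term
    for i in range(n):
        for j in range(n):
            d = data[i] - data[j]
            sq += normal_pdf(d, h * math.sqrt(2.0))
            if i != j:
                loo += normal_pdf(d, h)
    return sq / (n * n) - 2.0 * loo / (n * (n - 1))

def lcv_gauss(data, h):
    """LCV(h); math.log raises ValueError if a leave-one-out density underflows to 0."""
    n = len(data)
    total = 0.0
    for i in range(n):
        s = sum(normal_pdf(data[i] - data[j], h) for j in range(n) if j != i)
        total += math.log(s / (n - 1))
    return total / n

def log_grid(h0, delta, k_min, k_max):
    """Log-spaced candidates h = h0 * exp(k * delta), k = k_min..k_max."""
    return [h0 * math.exp(k * delta) for k in range(k_min, k_max + 1)]

def select_lscv(data, grid):
    """Return (h*, CV(h*)) over the candidate grid."""
    score, h_star = min((lscv_gauss(data, h), h) for h in grid)
    return h_star, score
```

A typical use is to center the grid on a rule-of-thumb value, for example grid = log_grid(h_silverman(data), 0.2, -5, 5), and then pass it to select_lscv; the returned score is the CV(h*) value to record.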
VI. Boundary and Support Corrections
- Reflection (interval [a,b]).
Use mirror samples x_i^L = 2a - x_i, x_i^R = 2b - x_i:
kde_h^ref(x) = ( 1 / ( N * h ) ) * ∑ [ K( ( x - x_i ) / h ) + K( ( x - x_i^L ) / h ) + K( ( x - x_i^R ) / h ) ].
- Transform–back (positive support).
y = log( x - a ), estimate kde_h(y) in y-space; map back
p_X(x) = p_Y( log( x - a ) ) * ( 1 / ( x - a ) ).
- Constrained renormalization.
If releasing only on [a,b]: set Z = ( ∫_a^b kde_h(x) dx ), publish kde_h(x)/Z and record Z.
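The three boundary strategies above can be sketched as follows, assuming a Gaussian kernel and a midpoint-rule integral for Z. The function names are hypothetical. The reflection estimator uses a single mirror image per boundary, as in the formula above, so it is only well normalized when h is small relative to b - a.

```python
import math

def k_gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Plain S92-5 estimate with a Gaussian kernel."""
    return sum(k_gauss((x - xi) / h) for xi in data) / (len(data) * h)

def kde_reflect(x, data, h, a, b):
    """Reflection on [a, b]: add mirror samples 2a - x_i and 2b - x_i."""
    if x < a or x > b:
        return 0.0
    total = 0.0
    for xi in data:
        total += k_gauss((x - xi) / h)
        total += k_gauss((x - (2.0 * a - xi)) / h)
        total += k_gauss((x - (2.0 * b - xi)) / h)
    return total / (len(data) * h)

def kde_log_transform(x, data, h, a):
    """Estimate in y = log(x - a), then map back with the Jacobian 1 / (x - a)."""
    if x <= a:
        return 0.0
    ys = [math.log(xi - a) for xi in data]
    return kde(math.log(x - a), ys, h) / (x - a)

def renormalize_on(f, a, b, n=2000):
    """Return (Z, g) with Z the midpoint-rule integral of f on [a, b] and g = f / Z on [a, b]."""
    step = (b - a) / n
    z = step * sum(f(a + (i + 0.5) * step) for i in range(n))
    return z, (lambda x: f(x) / z if a <= x <= b else 0.0)
```

Note that h in kde_log_transform is a bandwidth in y-space (log units), not in the original units of x, so it should be selected on the transformed sample.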
VII. Multivariate KDE and Bandwidth Matrices
- Definition.
kde_H(x) = ( 1 / ( N * |H|^(1/2) ) ) * ∑ K_d( H^(-1/2) * ( x - x_i ) ).
K_d(u) = ∏_{j=1}^d K(u_j) (product kernels) or a spherically symmetric kernel.
- Bandwidth structures.
Scalar: H = h^2 * I_d; diagonal: H = diag( h_1^2 , ... , h_d^2 ); full: H = A A^T.
Scott’s rule (d dimensions): H = c * Sigma * N^(-2/(d+4)), with Sigma the sample covariance and c a kernel constant.
- Sphering and back-transform.
Set z = Sigma^(-1/2) * ( x - mu_x ), pick H_z = h^2 * I_d in the sphered space, and map back H = Sigma^(1/2) * H_z * Sigma^(1/2).
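For the diagonal case H = diag( h_1^2 , ... , h_d^2 ) with a Gaussian product kernel, |H|^(1/2) is the product of the h_j and H^(-1/2) * ( x - x_i ) is a per-coordinate rescaling. The sketch below covers only that case; the Scott helper assumes c = 1 and uses only the diagonal of Sigma, which is a simplification of the rule stated above. Function names are hypothetical.

```python
import math
import statistics

def k_gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_diag(x, data, h):
    """Product-kernel KDE with H = diag(h_1^2, ..., h_d^2); x and rows of data are d-tuples."""
    det_sqrt = math.prod(h)  # |H|^(1/2) for a diagonal H
    total = 0.0
    for xi in data:
        total += math.prod(
            k_gauss((xj - xij) / hj) for xj, xij, hj in zip(x, xi, h)
        )
    return total / (len(data) * det_sqrt)

def scott_diag(data):
    """Per-axis bandwidths h_j = sigma_j * N^(-1/(d+4)) (assumes c = 1, diagonal Sigma)."""
    n = len(data)
    d = len(data[0])
    factor = n ** (-1.0 / (d + 4))
    return [statistics.stdev([row[j] for row in data]) * factor for j in range(d)]
```

Full matrices H = A A^T and the sphering step additionally need a matrix square root of Sigma, which is outside this stdlib-only sketch.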
VIII. Variable Bandwidth (Adaptive KDE)
- Two families.
Balloon: kde(x) = ( 1 / ( N * h(x) ) ) * ∑ K( ( x - x_i ) / h(x) );
Sample-point: kde(x) = ( 1 / N ) * ∑ ( 1 / h_i ) * K( ( x - x_i ) / h_i ).
- Typical setting.
Use a pilot h0 to get kde_0(x), then set h_i = h0 * ( kde_0( x_i ) )^( -alpha ), alpha ∈ [0 , 1/2], often alpha = 1/2.
- Pros/cons.
Larger bandwidth in low-density regions lowers variance; tighter bandwidth in high-density regions lowers bias; always record pilot details and alpha.
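The sample-point construction above can be sketched as follows, assuming a Gaussian kernel and a fixed-bandwidth pilot evaluated at the sample points themselves. The function names are hypothetical. The bandwidths follow the formula above literally; some formulations additionally divide kde_0( x_i ) by its geometric mean so that h0 keeps its scale, which is not done here.

```python
import math

def k_gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Fixed-bandwidth pilot estimate kde_0 (S92-5)."""
    return sum(k_gauss((x - xi) / h) for xi in data) / (len(data) * h)

def sample_point_bandwidths(data, h0, alpha=0.5):
    """h_i = h0 * kde_0(x_i)^(-alpha), with alpha in [0, 1/2]."""
    if not 0.0 <= alpha <= 0.5:
        raise ValueError("alpha must lie in [0, 1/2]")
    return [h0 * kde(xi, data, h0) ** (-alpha) for xi in data]

def kde_sample_point(x, data, h_list):
    """Sample-point estimator: each sample carries its own bandwidth h_i."""
    return sum(k_gauss((x - xi) / hi) / hi for xi, hi in zip(data, h_list)) / len(data)

def kde_balloon(x, data, h_of_x):
    """Balloon estimator: one bandwidth h(x) per evaluation point."""
    return kde(x, data, h_of_x(x))
```

Each sample-point component integrates to 1 / N, so the sample-point estimate stays normalized; the balloon estimate generally does not integrate to 1 and should go through the normalization check before release.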
IX. Deconvolution KDE (with Measurement Noise)
- Observation model.
Y = X + E, noise density phi_e known; target is p_X.
- Frequency-domain construction.
Let Phi_K(t) = Fourier{ K }(t), Phi_e(t) = Fourier{ phi_e }(t):
Define deconvolution kernel spectrum Phi_L(t) = Phi_K(t) / Phi_e( t / h ), then L = Fourier^{-1}{ Phi_L }.
Estimator: kde_h^deconv(x) = ( 1 / ( N * h ) ) * ∑ L( ( x - y_i ) / h ).
- Regularization and stability.
Truncate bands where |Phi_e(•)| is small or apply Tikhonov:
Phi_L(t) = Phi_K(t) * conj( Phi_e( t / h ) ) / ( |Phi_e( t / h )|^2 + lambda ); record lambda.
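The frequency-domain construction can be sketched numerically under two stated assumptions: a Gaussian kernel, so Phi_K(t) = exp( - t^2 / 2 ), and a symmetric noise density whose characteristic function is real and even, so conj( Phi_e ) = Phi_e and the inverse transform reduces to a cosine integral. Laplace noise with scale b, Phi_e(t) = 1 / ( 1 + b^2 * t^2 ), is used as the example noise model. The function names, the truncation t_max, and the midpoint quadrature are all hypothetical choices.

```python
import math

def phi_k_gauss(t):
    """Fourier transform of the Gaussian kernel."""
    return math.exp(-0.5 * t * t)

def phi_e_laplace(t, b):
    """Characteristic function of Laplace noise with scale b (real and even)."""
    return 1.0 / (1.0 + (b * t) ** 2)

def deconv_kernel(u, h, phi_e, lam=0.0, t_max=10.0, n=4000):
    """L(u) = (1/pi) * integral_0^t_max cos(t*u) * Phi_L(t) dt, midpoint rule.

    Phi_L(t) = Phi_K(t) * Phi_e(t/h) / (Phi_e(t/h)^2 + lam); lam = 0 gives the
    unregularized ratio Phi_K(t) / Phi_e(t/h). Valid only for real, even Phi_e.
    """
    step = t_max / n
    acc = 0.0
    for k in range(n):
        t = (k + 0.5) * step
        pe = phi_e(t / h)
        acc += math.cos(t * u) * phi_k_gauss(t) * pe / (pe * pe + lam)
    return acc * step / math.pi

def kde_deconv(x, ys, h, phi_e, lam=0.0):
    """kde_h^deconv(x) = (1 / (N*h)) * sum_i L((x - y_i) / h)."""
    return sum(deconv_kernel((x - y) / h, h, phi_e, lam) for y in ys) / (len(ys) * h)
```

For this Gaussian-kernel, Laplace-noise pair with lam = 0, the deconvolution kernel has the closed form L(u) = K_gauss(u) * ( 1 + ( b / h )^2 * ( 1 - u^2 ) ), which gives a convenient regression check for the quadrature. The estimate can take negative values, so the normalization and support checks still apply before release.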
X. Derivatives and Set Estimation
- Density derivatives.
∂^m kde_h / ∂x^m = ( 1 / ( N * h^(m+1) ) ) * ∑ K^(m)( ( x - x_i ) / h ).
- Level sets and highest density regions.
C_tau = { x : kde_h(x) ≥ tau }; pick tau such that ( ∫_{C_tau} kde_h(x) dx ) = q, q ∈ (0,1).
- Heuristic confidence bands.
Use bootstrap bands on a grid; report resample count and random seed.
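The derivative and level-set formulas can be sketched for the Gaussian kernel, whose m-th derivative is K^(m)(u) = (-1)^m * He_m(u) * K(u) with He_m the probabilists' Hermite polynomial. The threshold search below works on a uniform grid and is a discrete approximation of the integral condition; function names are hypothetical.

```python
import math

def k_gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def hermite_he(m, u):
    """Probabilists' Hermite polynomial via He_{k+1} = u * He_k - k * He_{k-1}."""
    prev, cur = 1.0, u
    if m == 0:
        return prev
    for k in range(1, m):
        prev, cur = cur, u * cur - k * prev
    return cur

def kde_derivative(x, data, h, m=1):
    """m-th derivative of kde_h at x for a Gaussian kernel (m = 0 gives kde_h)."""
    sign = -1.0 if m % 2 else 1.0
    total = 0.0
    for xi in data:
        u = (x - xi) / h
        total += sign * hermite_he(m, u) * k_gauss(u)
    return total / (len(data) * h ** (m + 1))

def hdr_threshold(dens, dx, q):
    """Largest tau on the grid such that {kde >= tau} carries mass >= q."""
    acc = 0.0
    for v in sorted(dens, reverse=True):
        acc += v * dx
        if acc >= q:
            return v
    return min(dens)  # grid mass never reached q; caller should widen the grid
```

Accumulating grid cells from the highest density downward yields the smallest region with the requested mass, which is the defining property of a highest density region.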
XI. Streaming and Time Weighting
- Exponential decay weights.
w_i = exp( - ( ts_now - ts_i ) / tau ), with time constant tau; update kde_h^w(x) online (cf. formula above).
- Rolling windows.
Maintain a deque for q_len and cumulative weight ∑ w_i; update normalization on enqueue/dequeue; align with Core.Threads backpressure strategy.
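A rolling, time-weighted estimator can be sketched as below. It relies on one observation: the weights exp( - ( ts_now - ts_i ) / tau ) differ from exp( ( ts_i - t_ref ) / tau ) only by a factor common to all samples, and kde_h^w(x) is invariant to such a factor, so weights can be stored once relative to a fixed reference time. The class name RollingKDE and its fields are hypothetical; Core.Threads backpressure is not modeled.

```python
import math
from collections import deque

def k_gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

class RollingKDE:
    """Time-decayed KDE over the most recent q_len samples (illustrative sketch)."""

    def __init__(self, h, tau, q_len):
        self.h = h
        self.tau = tau
        self.q_len = q_len
        self.buf = deque()   # entries are (x_i, ts_i, stored weight)
        self.t_ref = None    # reference time for stored weights
        self.wsum = 0.0      # cumulative stored weight, updated incrementally

    def push(self, x, ts):
        if self.t_ref is None:
            self.t_ref = ts
        # Stored weight grows with ts; it can overflow if (ts - t_ref) / tau
        # becomes very large, in which case t_ref should be reset and weights rebuilt.
        w = math.exp((ts - self.t_ref) / self.tau)
        if len(self.buf) == self.q_len:
            _, _, w_old = self.buf.popleft()
            self.wsum -= w_old  # update normalization on dequeue
        self.buf.append((x, ts, w))
        self.wsum += w          # update normalization on enqueue

    def evaluate(self, x):
        total = sum(w * k_gauss((x - xi) / self.h) for xi, _, w in self.buf)
        return total / (self.h * self.wsum)
```

Incremental subtraction can accumulate rounding error in wsum over long runs; recomputing the sum from the buffer at intervals is a simple safeguard.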
XII. Quality Control and Publication
- Normalization check.
Compute Z = ( ∫ kde(x) dx ) (or its discrete sum). If |Z - 1| > eps_norm, renormalize prior to release and record Z.
- Bandwidth stability.
Scale h* by the factors {0.8, 1.0, 1.25} and re-score; the CV/LCV scores at 0.8 * h* and 1.25 * h* should stay within the configured stability threshold of the score at h*.
- Boundary and support.
Publish support, boundary method (reflection|transform|renorm), and parameters.
- Transparency fields.
Record kernel name K, bandwidth h or H, the selection score CV(h) or LCV(h), pilot details, and any regularization parameters.
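The normalization and stability checks can be sketched as below. The chapter does not define the form of the stability threshold, so the absolute-difference test in is_stable is an assumption made for illustration; all function names are hypothetical.

```python
def normalization_check(dens, dx, eps_norm):
    """Return (Z, values): Z is the discrete integral on a uniform grid;
    values are renormalized only if |Z - 1| > eps_norm."""
    z = dx * sum(dens)
    if abs(z - 1.0) > eps_norm:
        return z, [v / z for v in dens]
    return z, list(dens)

def bandwidth_sensitivity(scorer, h_star, factors=(0.8, 1.0, 1.25)):
    """Re-score at f * h_star for each factor f; scorer is CV(h) or LCV(h)."""
    return {f: scorer(f * h_star) for f in factors}

def is_stable(scores, threshold):
    """Assumed criterion: every perturbed score lies within threshold of the score at h*."""
    base = scores[1.0]
    return all(abs(s - base) <= threshold for s in scores.values())
```

The Z returned here is the pre-renormalization value, which is the quantity to record alongside eps_norm.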
XIII. Implementation Binding and Workflow Mx-93 (Bandwidth Selector)
- Inputs & prechecks. Load {x_i, ts_i, w_i?}; validate unit(x) and dim(x); if time-weighted, compute w_i.
- Support identification. Detect positive or bounded support; choose and lock a boundary strategy.
- Kernel & scale. Choose K; compute sigma_robust; build a log-spaced grid h_grid.
- Scorer. Select CV or LCV; implement efficient kde_{-i,h}(x_i) (KD-tree / FFT / blocked evaluation).
- Search & refine. Optimize over h_grid, refine locally; if a plug-in h_AMISE exists, include it among candidates.
- Normalize & validate. Compute Z and renormalize after boundary handling; run sensitivity checks around h*.
- Publish & bind. Produce a PdfRef via kde_build, optionally persist a kde_eval grid; write manifest and diagnostics.
- Audit metrics. {h*, score*, Z, eps_norm, runtime, method, pilot, support, K}.
XIV. Interface Contract (Aligned with I90)
- kde_build(data:any, kernel:str="gaussian", h:float|None=None, rule:str|None=None) -> PdfRef
Inputs: kernel ∈ {"gaussian","epanechnikov",...}; when h=None, choose via rule ∈ {"scott","silverman","cv","plugin"}.
Output: includes {"K":..., "h|H":..., "method":..., "score":..., "support":..., "pilot":...} and an evaluator handle.
- kde_eval(pdf:PdfRef, x:any, normalize:bool=True) -> array
With normalize=False, return raw (pre-renormalization) values to allow custom normalization and boundary correction.
- Also: hist_density as a baseline; renormalize for pre-publication consistency.
XV. Minimal Manifest Fields (for Ingest)
- kde = {"K":"gaussian|ep|...", "bandwidth":{"type":"scalar|diag|full", "h":..., "H":...}, "selection":"cv|lcv|plugin|scott|silverman", "score":..., "pilot":{"method":"...", "alpha":..., "h0":...}, "support":{"type":"R|[a,b]|(a,∞)", "boundary":"reflect|transform|renorm", "params":...}, "weights":{"enabled":true|false, "rule":"time_decay|custom", "tau":...}}
- qc = {"Z":..., "eps_norm":..., "sensitivity":{"0.8h":..., "1.25h":...}, "runtime_ms":..., "notes":"..."}
- timing = {"ts":"UTC", "tau_mono":"...", "Delta_t":...}
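A presence check for the manifest fields listed above can be sketched as follows. The REQUIRED table mirrors the top-level keys of the kde, qc, and timing blocks; the checker name missing_fields is hypothetical, and every value in the example manifest is an illustrative placeholder, not a result.

```python
import json

REQUIRED = {
    "kde": ("K", "bandwidth", "selection", "score", "pilot", "support", "weights"),
    "qc": ("Z", "eps_norm", "sensitivity", "runtime_ms", "notes"),
    "timing": ("ts", "tau_mono", "Delta_t"),
}

def missing_fields(manifest):
    """Return dotted names of required blocks or fields absent from the manifest."""
    missing = []
    for block, keys in REQUIRED.items():
        section = manifest.get(block)
        if not isinstance(section, dict):
            missing.append(block)
            continue
        missing.extend(f"{block}.{k}" for k in keys if k not in section)
    return missing

# Placeholder values only, chosen to show the shape of each block.
example = {
    "kde": {
        "K": "gaussian",
        "bandwidth": {"type": "scalar", "h": 0.5, "H": None},
        "selection": "cv",
        "score": 0.0,
        "pilot": {"method": "none", "alpha": 0.0, "h0": None},
        "support": {"type": "[a,b]", "boundary": "reflect", "params": {"a": 0.0, "b": 1.0}},
        "weights": {"enabled": False, "rule": "custom", "tau": None},
    },
    "qc": {"Z": 1.0, "eps_norm": 1e-3, "sensitivity": {"0.8h": 0.0, "1.25h": 0.0},
           "runtime_ms": 0, "notes": ""},
    "timing": {"ts": "UTC", "tau_mono": "", "Delta_t": 0.0},
}
manifest_json = json.dumps(example, ensure_ascii=False)
```

The check covers only presence of top-level fields; value ranges and nested keys such as pilot.alpha would need their own rules.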
XVI. Cross-Volume Coherence
- If KDE is used to smooth spectral or energy densities, the window-energy conventions must match Core.Sea Chapter 5: U_w, ENBW_Hz, and be echoed in the manifest.
- Compatible with Chapter 9’s standardization z = ( x - mu_x ) / sigma_x; publish the back-transform alongside.
- For uncertainty propagation (Chapter 10), provide bootstrap or Delta-method bands for kde_h(x) and bandwidth uncertainty.
XVII. Chapter Highlights
- Canonicalized S92-5, S92-6; systematized bandwidth selection via rules, CV, and plug-in; covered boundary handling, anisotropy, multivariate, adaptive, and deconvolution settings.
- Delivered the engineering workflow Mx-93, the I90 bindings, and auditable manifest fields and QC thresholds—ensuring cross-volume consistency and traceability.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/