Chapter 4 — Outlier Detection and Data Quality
I. Objectives and Terminology
- Objective: within the data–model loop, identify and handle anomalous samples in a traceable and reproducible way to improve estimation stability and protect the validity of the error budget EB.
- Terms: outlier (an observation departing from the generative mechanism), anomaly (a runtime abnormal event), mask_outlier ∈ {0,1} (hard mask), w ∈ [0,1] (soft weight), r = y - f(x; theta) (residual), r_bar = r / sigma (standardized residual).
- Dimensional consistency: every detection statistic must be dimensionless or carry explicit units; check_dim must pass. When data quality involves arrival time T_arr, the path gamma(ell) and the measure d ell must be explicit.
II. Postulates and General Requirements
- P74-1 (Location–scale invariance): outlier statistics must remain threshold-comparable under affine transforms x_to = a * x_from + b; use standardization or a robust scale s.
- P74-2 (Single-source primacy): for any observation, the outlier label should be decided primarily by one criterion; other criteria serve as supporting evidence in the explanation domain and trace chain.
- P74-3 (Weight first, excision later): prefer soft down-weighting; only apply hard exclusion (mask_outlier = 1) when task risk is unacceptable.
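P74-1 can be checked numerically. The sketch below is a minimal illustration, not part of the specified interface: the helper name robust_scores and the sample values are assumptions. It shows that the MAD-standardized score is unchanged by an affine transform x_to = a * x_from + b, so one threshold stays comparable across unit conventions.

```python
import statistics

def robust_scores(x):
    """Return |x_i - median(x)| / s with s = 1.4826 * MAD (hypothetical helper)."""
    med = statistics.median(x)
    s = 1.4826 * statistics.median([abs(v - med) for v in x])
    return [abs(v - med) / s for v in x]

# Illustrative values only; a and b stand for any unit conversion with a != 0.
x_from = [9.8, 10.0, 10.1, 10.3, 14.0]
a, b = 2.5, -7.0
x_to = [a * v + b for v in x_from]

scores_from = robust_scores(x_from)
scores_to = robust_scores(x_to)

# The two score lists agree up to floating-point rounding.
max_diff = max(abs(p - q) for p, q in zip(scores_from, scores_to))
```

Raw deviations x_i - median(x) scale with a, which is why thresholds on them would not satisfy P74-1.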
III. Univariate Detection (Static Samples)
- Z-score: z_i = ( x_i - mu ) / sigma, example threshold |z_i| > 3.5. Take mu and sigma from a steady segment or a reference sample, so that candidate outliers do not inflate the scale they are tested against.
- MAD rule: MAD = median( | x - median(x) | ), s = 1.4826 * MAD; flag if | x_i - median(x) | / s > t0 (commonly t0 ∈ [3,4.5]).
- IQR rule: IQR = Q3 - Q1; flag if x_i < Q1 - k * IQR or x_i > Q3 + k * IQR (k = 1.5 standard, k = 3 strict).
- With per-sample uncertainty: when a sample carries u(x_i), use z_u,i = ( x_i - mu ) / u(x_i) against a fixed threshold.
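The three static rules above can be sketched with the Python standard library. The function names zscore_mask, mad_mask, and iqr_mask, the sample data, and the choice of the "inclusive" quartile method are assumptions made for illustration; they are not the bound interface of Section XII.

```python
import statistics

def zscore_mask(x, mu, sigma, thresh=3.5):
    """Flag |x_i - mu| / sigma > thresh; mu and sigma come from a reference sample."""
    return [abs(v - mu) / sigma > thresh for v in x]

def mad_mask(x, t0=3.5):
    """Flag |x_i - median(x)| / s > t0 with s = 1.4826 * MAD."""
    med = statistics.median(x)
    s = 1.4826 * statistics.median([abs(v - med) for v in x])
    if s == 0:
        # Assumption: a degenerate (zero) scale flags nothing rather than everything.
        return [False] * len(x)
    return [abs(v - med) / s > t0 for v in x]

def iqr_mask(x, k=1.5):
    """Flag values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    return [v < q1 - k * iqr or v > q3 + k * iqr for v in x]

# Illustrative data: a steady reference segment followed by one suspect reading.
reference = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3]
x = reference + [25.0]
mu, sigma = statistics.mean(reference), statistics.stdev(reference)

z_flags = zscore_mask(x, mu, sigma)
mad_flags = mad_mask(x)
iqr_flags = iqr_mask(x)
```

Estimating mu and sigma from x itself would let the suspect reading inflate sigma and could hide it from the Z-score rule; the MAD and IQR rules are less sensitive to this because they rest on order statistics.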
IV. Multivariate Detection (Correlated Dimensions)
- Mahalanobis distance: d_M^2(x) = ( x - mu )^T * Cov^{-1} * ( x - mu ). Decision: flag if d_M^2 > q_chi2(1 - alpha, p), where p is the dimension of x and alpha is the false-alarm level.
- Robust covariance: replace Cov with a robust estimate Cov_r (e.g., MCD or IRLS-weighted second moments).
- Correlation declaration: when Cov_r is used, reports must state the estimator, sample size, and convergence criteria.
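For the two-dimensional case the Mahalanobis test can be written without a linear-algebra library. The sketch below is restricted to p = 2 on purpose: the chi-square quantile then has the closed form -2 * ln(alpha). The names mahalanobis_sq_2d and chi2_quantile_p2 and the covariance values are illustrative assumptions; a general implementation would need a matrix inverse and a chi-square quantile for arbitrary p, and could substitute a robust Cov_r.

```python
import math

def mahalanobis_sq_2d(point, mu, cov):
    """d_M^2 for p = 2; cov is (s_xx, s_xy, s_yy) of a symmetric 2x2 covariance."""
    s_xx, s_xy, s_yy = cov
    det = s_xx * s_yy - s_xy * s_xy
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = point[0] - mu[0], point[1] - mu[1]
    # Closed-form inverse of a 2x2 matrix applied to (dx, dy).
    return (s_yy * dx * dx - 2.0 * s_xy * dx * dy + s_xx * dy * dy) / det

def chi2_quantile_p2(prob):
    """Quantile of chi-square with 2 degrees of freedom: CDF is 1 - exp(-x / 2)."""
    return -2.0 * math.log(1.0 - prob)

# Illustrative setup: unit variances with correlation 0.8, alpha = 0.01.
mu = (0.0, 0.0)
cov = (1.0, 0.8, 1.0)
threshold = chi2_quantile_p2(1.0 - 0.01)

d2_along = mahalanobis_sq_2d((1.0, 1.0), mu, cov)     # follows the correlation
d2_against = mahalanobis_sq_2d((1.0, -1.0), mu, cov)  # opposes the correlation
flag_along = d2_along > threshold
flag_against = d2_against > threshold
```

Both test points have the same Euclidean distance from mu, yet only the one that contradicts the correlation structure exceeds the threshold, which is the reason for using d_M^2 on correlated dimensions.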
V. Time-Series and Streaming Detection
- Hampel filter (window k): with window median med_t and scale s_t = 1.4826 * MAD_t, flag if | x_t - med_t | / s_t > t0 (typical t0 = 3.0).
- Quick change-point (CUSUM style): S_t = max( 0, S_{t-1} + x_t - mu - k_c ), alarm if S_t > h (k_c, h tuned by risk).
- Rate-of-change threshold: | x_t - x_{t-1} | / s_delta > tau_rc, where s_delta is a robust scale of differences.
- Seasonality/trend removal: for x_t = trend_t + season_t + resid_t, detect on resid_t to avoid false positives.
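The Hampel rule and the one-sided CUSUM recursion above can be sketched as follows. The names hampel_mask and cusum_first_alarm, the truncated windows at the series edges, and the handling of a zero window scale are assumptions of this sketch rather than requirements stated in the text.

```python
import statistics

def hampel_mask(series, k, t0=3.0):
    """Flag x_t if |x_t - med_t| / s_t > t0 over a window of radius k."""
    n = len(series)
    mask = [False] * n
    for t in range(n):
        # Assumption: windows are truncated at the series boundaries.
        window = series[max(0, t - k): min(n, t + k + 1)]
        med = statistics.median(window)
        s = 1.4826 * statistics.median([abs(v - med) for v in window])
        # Assumption: a zero scale (flat window) flags nothing.
        if s > 0 and abs(series[t] - med) / s > t0:
            mask[t] = True
    return mask

def cusum_first_alarm(series, mu, k_c, h):
    """Run S_t = max(0, S_{t-1} + x_t - mu - k_c); return first t with S_t > h."""
    s = 0.0
    for t, x in enumerate(series):
        s = max(0.0, s + x - mu - k_c)
        if s > h:
            return t
    return None

# Illustrative data: one spike for Hampel, one upward level shift for CUSUM.
spiky = [1.0, 1.1, 0.9, 1.0, 8.0, 1.0, 1.1, 0.9, 1.0]
spike_mask = hampel_mask(spiky, k=2, t0=3.0)

shifted = [0.0] * 5 + [1.0] * 5
alarm_at = cusum_first_alarm(shifted, mu=0.0, k_c=0.5, h=1.2)
```

The CUSUM form shown detects upward shifts only; a downward detector would mirror the recursion with mu - x_t. When the series has trend or seasonality, both functions would be applied to resid_t rather than x_t.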
VI. Residual- and Fit-Based Detection (Links to Chapters 2 and 3)
- Standardized residuals: r_bar = r / sigma; flag if | r_bar | > t_r (t_r ∈ [3,5], adjusted by n - p).
- Global fit statistic: chi2 = r^T R r; if chi2 > q_chi2(1 - alpha, n - p) the overall fit is poor; proceed to model review or weight re-estimation.
- IRLS soft reweighting: w_i = psi( r_bar_i ) / r_bar_i with Huber or Tukey to down-weight high-residual points; update theta and recompute u_c(y).
- RANSAC (geometric consistency): repeatedly fit the model to random minimal subsets, keep the largest consensus set as inliers together with its parameters, and mark the remaining points with mask_outlier = 1.
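IRLS soft reweighting can be illustrated on a straight-line model y = a + b * x. This is a sketch under stated assumptions: sigma is known and common to all points, the Huber tuning constant c = 1.345 is a conventional choice not given in the text, and the names huber_weight, weighted_line_fit, and irls_line are hypothetical. Propagation to u_c(y) is not shown.

```python
def huber_weight(r_bar, c=1.345):
    """w = psi(r_bar) / r_bar for the Huber psi: 1 inside [-c, c], c / |r_bar| outside."""
    a = abs(r_bar)
    return 1.0 if a <= c else c / a

def weighted_line_fit(x, y, w):
    """Weighted least squares for y = a + b * x; returns (a, b)."""
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ym - b * xm, b

def irls_line(x, y, sigma, n_iter=10, c=1.345):
    """Alternate between fitting theta = (a, b) and recomputing Huber weights."""
    w = [1.0] * len(x)
    for _ in range(n_iter):
        a, b = weighted_line_fit(x, y, w)
        w = [huber_weight((yi - (a + b * xi)) / sigma, c) for xi, yi in zip(x, y)]
    return a, b, w

# Illustrative data: exact line y = 2 + 0.5 * x with one gross error at index 5.
x = [float(i) for i in range(10)]
y = [2.0 + 0.5 * xi for xi in x]
y[5] += 10.0

a_ols, b_ols = weighted_line_fit(x, y, [1.0] * len(x))
a_hat, b_hat, w = irls_line(x, y, sigma=0.1)
```

The corrupted point is retained with a small weight rather than removed, which follows P74-3; a hard mask would only be applied if the task risk required it.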
VII. Missingness, Duplicates, and Bounds
- Missingness: m_i ∈ {0,1} with m_i = 1 indicating missing. Missingness alone does not make a sample an outlier; exclude missing samples from scale estimation.
- Reproducible imputation (dimension-consistent): median imputation or nearest-neighbor forward fill (with down-weight w_i); reports must include is_imputed.
- Duplicates: de-duplicate by hash or timestamp; define uniqueness = 1 - (#duplicates) / N.
- Physical bounds: samples outside the feasible domain are marked mask_outlier = 1 and recorded as events.
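The bookkeeping in this section can be sketched in a single pass over timestamped records. Everything specific here is an assumption for illustration: the record layout (timestamp, value) with None for a missing value, de-duplication by timestamp, forward fill as the imputation method, the down-weight w_imputed = 0.5, and the function name quality_pass.

```python
def quality_pass(records, lo, hi, w_imputed=0.5):
    """Return (rows, uniqueness) after de-duplication, imputation, and bounds checks."""
    seen = set()
    rows = []
    n_dup = 0
    last_valid = None
    for ts, value in records:
        if ts in seen:
            n_dup += 1          # duplicate timestamp: drop and count
            continue
        seen.add(ts)
        row = {"timestamp": ts, "value": value, "m": 0,
               "is_imputed": False, "mask_outlier": 0, "w": 1.0}
        if value is None:
            row["m"] = 1        # missing, but not an outlier by itself
            if last_valid is not None:
                row["value"] = last_valid   # forward fill, down-weighted
                row["is_imputed"] = True
                row["w"] = w_imputed
            else:
                row["w"] = 0.0  # nothing to fill from yet
        elif not (lo <= value <= hi):
            row["mask_outlier"] = 1         # outside the feasible domain
            row["w"] = 0.0
        else:
            last_valid = value
        rows.append(row)
    uniqueness = 1 - n_dup / len(records)
    return rows, uniqueness

# Illustrative records: one missing value, one duplicate, one infeasible value.
records = [(1, 10.0), (2, None), (2, None), (3, 500.0), (4, 11.0)]
rows, uniqueness = quality_pass(records, lo=0.0, hi=100.0)
```

Imputed and masked rows never update last_valid, so an infeasible reading is not carried forward into later fills.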
VIII. Data-Quality Metrics and Threshold Baselines
- Coverage: coverage = (#valid) / N; freshness: freshness = now - max(timestamp); consistency: consistency = (#pass_invariants) / (#checks).
- Signal-to-noise: SNR = P_signal / P_noise; use PSNR for images/waveforms where applicable.
- Task-weighted quality index: DQI = ∑_j α_j * metric_j, with ∑ α_j = 1. Enter degradation when DQI < tau_DQI.
- Coupling to error budget: if Top-K contributions are driven by “data quality” factors, list them explicitly in EB and track remediation.
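A small numerical example of the task-weighted index may help. The metric selection, the weights α_j, the counts, and the threshold tau_DQI = 0.9 below are illustrative assumptions, as are the helper names ratio and dqi. Metrics that are not already in [0,1], such as freshness, would need a normalization that the text does not specify, so they are left out here.

```python
def ratio(n_pass, n_total):
    """Fraction in [0, 1]; returns 0.0 for an empty denominator (assumption)."""
    return n_pass / n_total if n_total else 0.0

def dqi(metrics, alphas):
    """DQI = sum_j alpha_j * metric_j, requiring sum_j alpha_j = 1."""
    if abs(sum(alphas.values()) - 1.0) > 1e-9:
        raise ValueError("alphas must sum to 1")
    return sum(alphas[name] * metrics[name] for name in alphas)

# Illustrative inputs only.
metrics = {
    "coverage": ratio(980, 1000),    # (#valid) / N
    "consistency": ratio(45, 50),    # (#pass_invariants) / (#checks)
    "uniqueness": 1.0,
}
alphas = {"coverage": 0.5, "consistency": 0.3, "uniqueness": 0.2}
tau_DQI = 0.9

score = dqi(metrics, alphas)
degraded = score < tau_DQI
```

With these inputs the index is 0.5 * 0.98 + 0.3 * 0.9 + 0.2 * 1.0 = 0.96, which stays above the assumed threshold.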
IX. Composite Decisions and Mitigation Strategies
- Hard gating: mask = ( zscore_detect OR hampel_filter OR ( d_M^2 > q_chi2 ) ).
- Soft gating: w = min( w_Huber(r_bar), w_Mahalanobis, w_Hampel ); proceed to IRLS refit.
- Decision priority: physical bounds > safety & compliance module > hard statistical thresholds > soft weight downgrades.
- Exits: keep, down-weight, remove, reacquire. Each exit must include remediation guidance and trace evidence.
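The priority order and the soft-gating rule above can be combined into one decision function. This is a sketch under assumptions: the mapping from each priority level to an exit, the reason-code strings, and the function name decide are all hypothetical, and the reacquire exit is omitted because the text does not state its trigger in this section.

```python
def decide(in_bounds, compliance_ok, hard_flags, soft_weights):
    """Return (exit, w, reason_code) following the stated decision priority.

    hard_flags:   dict of detector name -> bool (hard statistical thresholds)
    soft_weights: non-empty dict of detector name -> weight in [0, 1]
    """
    # 1. Physical bounds take precedence over everything else.
    if not in_bounds:
        return "remove", 0.0, "physical_bounds"
    # 2. Safety and compliance module.
    if not compliance_ok:
        return "remove", 0.0, "safety_compliance"
    # 3. Hard statistical thresholds (logical OR over detectors).
    for name, flagged in hard_flags.items():
        if flagged:
            return "remove", 0.0, "hard:" + name
    # 4. Soft gating: w is the minimum of the individual weights.
    name, w = min(soft_weights.items(), key=lambda kv: kv[1])
    if w < 1.0:
        return "down-weight", w, "soft:" + name
    return "keep", 1.0, "none"

# Illustrative call: no hard flag fires, the Huber weight dominates.
result = decide(
    in_bounds=True,
    compliance_ok=True,
    hard_flags={"zscore": False, "hampel": False, "mahalanobis": False},
    soft_weights={"huber": 0.4, "mahalanobis": 1.0, "hampel": 0.9},
)
```

Returning a single reason code per sample keeps one criterion primary, in line with P74-2; the remaining detector outputs can still be logged as supporting evidence in the trace chain.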
X. Quality-Control Workflow Mx-2 (Executable)
- Data ingress and unit check: convert, check_dim.
- Baseline estimation: compute mu, sigma, MAD, Cov (or robust variants).
- Univariate detection: zscore_detect and MAD/IQR.
- Multivariate detection: d_M^2 and threshold testing.
- Time-series detection: hampel_filter(series, k, t0); optional CUSUM.
- Residual-domain detection: compute r, r_bar, chi2; apply IRLS or RANSAC.
- Mask and weight synthesis: produce mask_outlier and w; log reason codes.
- Refit and propagate: update theta; propagate per Chapter 3 to obtain updated u_c(y) and EB.
- Report and trace: output DQI, mask ratio, Top-K anomaly causes, and traceability_chain.
XI. Path-Level Outliers for Arrival Time T_arr (Cross-Volume Anchor)
- Path discretization: T_arr = ( ∑_k ( n_eff,k / c_ref ) * Δell_k ); each segment yields an observation pair ( n_eff,k, Δell_k ).
- Segment residuals: r_k = y_k - ( n_eff,k / c_ref ) * Δell_k, standardized r_bar,k = r_k / sigma_k.
- Segment detection: apply MAD rule to r_bar,k and zscore_detect to n_eff,k; also monitor bounds and consistency of Δell_k.
- Composite policy: if q consecutive segments trigger anomalies (q ≥ 3), flag the path section and request re-measurement; otherwise down-weight and retain the global solution.
- Reporting: declare gamma(ell) and d ell; update Chapter 3’s u_c(T_arr) and EB accordingly.
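The path-level policy can be sketched in three small functions: the discretized arrival time, the standardized segment residuals, and the search for runs of at least q consecutive flagged segments. The function names and all numerical values are illustrative assumptions; c_ref is treated as a plain parameter in arbitrary consistent units, and the per-segment flags are taken as given rather than computed with the MAD rule.

```python
def arrival_time(n_eff, dell, c_ref):
    """T_arr = sum_k (n_eff,k / c_ref) * dell_k over the path segments."""
    return sum(n / c_ref * d for n, d in zip(n_eff, dell))

def standardized_segment_residuals(y, n_eff, dell, sigma, c_ref):
    """r_bar,k = (y_k - (n_eff,k / c_ref) * dell_k) / sigma_k."""
    return [(yk - n / c_ref * d) / s
            for yk, n, d, s in zip(y, n_eff, dell, sigma)]

def flagged_sections(flags, q=3):
    """Return (start, end) index pairs of runs with at least q consecutive flags."""
    runs, start = [], None
    # A trailing False closes a run that reaches the end of the path.
    for k, flagged in enumerate(list(flags) + [False]):
        if flagged and start is None:
            start = k
        elif not flagged and start is not None:
            if k - start >= q:
                runs.append((start, k - 1))
            start = None
    return runs

# Illustrative values in arbitrary consistent units.
n_eff = [1.0, 1.5]
dell = [2.0, 4.0]
t_arr = arrival_time(n_eff, dell, c_ref=2.0)
r_bar = standardized_segment_residuals([1.2, 3.0], n_eff, dell, [0.1, 0.1], c_ref=2.0)

flags = [False, True, True, True, False, True, False]
sections = flagged_sections(flags, q=3)
```

Segments inside a returned section would be sent for re-measurement; isolated flags, such as index 5 in the example, fall under the down-weight branch of the composite policy.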
XII. Implementation Bindings and Interface Mapping (I50 3)
- zscore_detect(x:array, thresh:float=3.5) -> mask:array
  Input: scalar series or column vector; Output: mask_outlier.
- mad_scale(x:array) -> float
  Returns robust scale s for MAD rules and IRLS initialization.
- hampel_filter(series:array, k:int, t0:float=3.0) -> mask:array
  Sliding-window radius k, threshold t0.
- ransac_fit(model:any, data:any, max_iter:int, tol:float) -> dict
  Output includes inlier indices, theta_hat, and fit-residual statistics.
- Typical sequence:
  - mask1 = zscore_detect(x, 3.5); mask2 = hampel_filter(x_t, k, 3.0); mask = mask1 OR mask2.
  - Generate w via psi_weight, refit with IRLS; call propagate_error_delta to update u_c(y).
  - attach_traceability(report, chain) to record the evidence chain and parameter sources.
XIII. Reporting and Compliance (Minimal Fields)
- Required: method, thresh, window, alpha, mask_ratio, w_summary, DQI, TopK_causes, RefCond, unit_policy, traceability_chain.
- Interoperation: integrate decisions with Core.Metrology round_by_unc, guard_band; synchronize versioning with Chapter 3’s EB.
XIV. Chapter Outputs and Linkage
- Outputs: univariate/multivariate/time-series/residual-domain outlier detection standards; soft/hard gating and mitigation strategies; workflow Mx-2; interface mapping to I50 3; T_arr path-level use case.
- Next: Chapter 5 links numerical errors (rounding and truncation) with this chapter’s detection to close the quality–numerics–uncertainty loop.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/