EFT.WP.Methods.Cleaning v1.0

Chapter 7 Missingness, Masks, and Imputation Governance


One-Sentence Goal
Unify missingness semantics with an explicit mask m ∈ {0,1}, constrain imputation within Delta_t and physical conventions, and record imputation uncertainty u_imp and method signatures so downstream computations can be down-weighted, audited, and fully traceable.


I. Scope & Objects

  1. Applicable targets
    • All numeric and time-series columns in D_arrival (from Chapter 6) and their derivatives.
    • Fields related to path, arrival time, density, and quality scoring when subjected to imputation and persistence.
  2. Target artifacts
    Produce D_missing and manifest.missing = { mask_fields, strategy, params, u_imp, coverage, gap_stats }; emit dashboard-oriented quality metrics and alerts.

II. Terms & Variables (Memory Anchors)


III. Axioms (P107-*)


IV. Minimal Equations (S107-*)


V. Missingness Detection & Mask Generation

  1. Rule normalization
    • Detect NaN/null, out-of-range values, saturation (e.g., x ∈ {x_min_sat, x_max_sat}), sentinel values (e.g., -9999) and map uniformly to m=1.
    • For composite fields (e.g., n_eff/c_ref), derive the mask from the components, so the field is missing whenever any component is missing: m = max( m(n_eff), m(c_ref) ).
  2. Structural missingness
    For fields never measured by certain devices, or path segments lacking medium data: tag structural_missing=1 and forbid automatic imputation.
  3. Mechanism labeling
    Use acquisition logs and statistical tests to label miss_mech ∈ {MCAR, MAR, MNAR} (missing completely at random, missing at random, missing not at random); send MNAR to quarantine or publish read-only with down-weighting.
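
The rule normalization in item 1 can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's reference implementation: the argument names valid_min, valid_max, sat_values, and sentinels are assumptions introduced here, and the function shares only its name with the infer_mask prototype, whose real rule format is not specified in this text.

```python
import math

def infer_mask(values, valid_min, valid_max, sat_values=(), sentinels=(-9999,)):
    """Return a 0/1 mask per sample; 1 marks the sample as missing.

    Hypothetical rule set: None/NaN, sentinel values, saturation values,
    and values outside [valid_min, valid_max] all map to m = 1.
    """
    mask = []
    for x in values:
        missing = (
            x is None
            or (isinstance(x, float) and math.isnan(x))
            or x in sentinels
            or x in sat_values
            or not (valid_min <= x <= valid_max)
        )
        mask.append(1 if missing else 0)
    return mask

def composite_mask(*component_masks):
    """Mask of a composite field: missing if any component is missing."""
    return [max(bits) for bits in zip(*component_masks)]

# Illustrative data only; the ranges below are made up for the example.
n_eff = [1.2, None, 1.3, -9999, 1.1]
c_ref = [3.0, 3.0, float("nan"), 3.0, 99.0]

m_n_eff = infer_mask(n_eff, valid_min=0.0, valid_max=10.0)
m_c_ref = infer_mask(c_ref, valid_min=0.0, valid_max=10.0)
m_ratio = composite_mask(m_n_eff, m_c_ref)  # mask for n_eff/c_ref
```

Taking the element-wise maximum keeps the composite mask conservative: a ratio is never treated as observed when either of its inputs was missing.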

VI. Cleaning Process (M10-7 Missingness, Masking, and Imputation)


VII. Contracts & Assertions (Chapter Must-Pass Items)


VIII. Implementation Binding (I10-7)

  1. Interface prototypes
    • handle_missing(ds, strategy) -> ds', manifest
    • infer_mask(ds, rules) -> m
    • choose_impute(ds, meta) -> { method, params }
    • impute_series(x, ts, method, params) -> { x_tilde, u_imp, w_imp }
    • audit_impute(x, x_tilde, m) -> report
  2. Preconditions
    Chapters 5–6 contracts on ts, ell, and arrival.* are satisfied; Chapter 4 unit/dimension consistency holds.
  3. Postconditions & invariants
    m covers all relevant fields; x_tilde appears only within allowed windows; u_imp and w_imp are written; manifest.missing is replayable.
  4. Failure semantics
    E_MASK_RULE_CONFLICT, E_IMPUTE_WINDOW_EXCEED, E_DIM_FAIL_AFTER_IMPUTE, E_UNCERTAINTY_MISSING.
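
As an illustration of how the postconditions in item 3 could map onto the failure codes in item 4, the sketch below checks two of them. Several things are assumptions made for this example rather than part of the chapter: the CleaningError class, the check_postconditions helper, the per-sample imputed flag, and the rule that an imputed run must be bracketed by observed samples no more than delta_t apart.

```python
class CleaningError(Exception):
    """Hypothetical exception carrying one of the chapter's failure codes."""

    def __init__(self, code, detail=""):
        super().__init__(f"{code}: {detail}" if detail else code)
        self.code = code

def check_postconditions(ts, imputed, u_imp, w_imp, delta_t):
    """Verify window and uncertainty postconditions for one series.

    ts       timestamps, ascending
    imputed  1 where the sample was filled with x_tilde, else 0
    u_imp    imputation uncertainty per sample (None where not imputed)
    w_imp    down-weight per sample (None where not imputed)
    """
    n = len(ts)
    i = 0
    while i < n:
        if not imputed[i]:
            i += 1
            continue
        j = i
        while j < n and imputed[j]:
            j += 1  # imputed run occupies indices [i, j)
        # Assumption: a run needs an observed sample on both sides.
        if i == 0 or j == n:
            raise CleaningError("E_IMPUTE_WINDOW_EXCEED", "run touches series edge")
        if ts[j] - ts[i - 1] > delta_t:
            raise CleaningError("E_IMPUTE_WINDOW_EXCEED", f"gap at index {i}")
        for k in range(i, j):
            if u_imp[k] is None or w_imp[k] is None:
                raise CleaningError("E_UNCERTAINTY_MISSING", f"index {k}")
        i = j
    return True
```

A forward-hold strategy only has a left anchor, so a production check would likely need a per-strategy variant of the bracketing rule.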

IX. Common Imputation Strategies & Guardrails

  1. Linear interpolation
    • Use: small gaps on stationary segments.
    • Guardrail: t1 - t0 ≤ Delta_t; prefer non-imputed endpoints.
  2. Spline interpolation (C2 or piecewise cubic)
    • Use: smooth signals.
    • Guardrail: forbid spanning transitions/steps (marked by anomaly detection or gradient thresholds).
  3. Forward hold (ffill)
    • Use: counters or step states.
    • Guardrail: strict Delta_t_hold; publish held=1 flag.
  4. State-space/Kalman
    • Use: dynamic systems.
    • Guardrail: put model and parameters into the manifest; bind training intervals to drift alerts.
  5. Physics-constrained regression
    • Use: conserved/non-negative/monotone variables.
    • Guardrail: declare constraint set C = { x | A x ≤ b } in the manifest; violations trigger rollback.
  6. Reference-condition adjustment (placeholder)
    • Use: RefCond changes causing systematic shifts.
    • Guardrail: record only here; do not apply corrections in this chapter (see Chapter 12).
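
The guardrails for strategies 1 and 3 can be made concrete with a small Python sketch. It assumes that the gap width is measured between the observed samples bracketing a masked run, and that masked samples left unfilled are represented as None; the function names linear_impute and forward_hold are illustrative and are not the chapter's impute_series interface.

```python
def linear_impute(ts, x, m, delta_t):
    """Linearly fill masked runs whose bracketing observed samples
    satisfy t1 - t0 <= delta_t; wider gaps are left as None."""
    n = len(x)
    x_tilde = [None if m[i] else x[i] for i in range(n)]
    imputed = [0] * n
    i = 0
    while i < n:
        if m[i] == 0:
            i += 1
            continue
        j = i
        while j < n and m[j] == 1:
            j += 1  # masked run occupies indices [i, j)
        # Both endpoints must be observed (non-imputed) samples.
        if i > 0 and j < n and ts[j] - ts[i - 1] <= delta_t:
            t0, t1 = ts[i - 1], ts[j]
            x0, x1 = x[i - 1], x[j]
            for k in range(i, j):
                frac = (ts[k] - t0) / (t1 - t0)
                x_tilde[k] = x0 + frac * (x1 - x0)
                imputed[k] = 1
        i = j
    return x_tilde, imputed

def forward_hold(ts, x, m, delta_t_hold):
    """Hold the last observed value for at most delta_t_hold and
    publish a held flag for every sample filled this way."""
    n = len(x)
    x_tilde = [None if m[i] else x[i] for i in range(n)]
    held = [0] * n
    last = None  # index of the most recent observed sample
    for i in range(n):
        if m[i] == 0:
            last = i
        elif last is not None and ts[i] - ts[last] <= delta_t_hold:
            x_tilde[i] = x[last]
            held[i] = 1
    return x_tilde, held

# Illustrative data: the second gap (t = 2 to t = 9) exceeds delta_t = 3.
ts = [0, 1, 2, 8, 9]
lin_values, lin_flags = linear_impute(ts, [1.0, None, 3.0, None, 5.0], [0, 1, 0, 1, 0], delta_t=3)
hold_values, hold_flags = forward_hold(ts, [7, None, None, None, 9], [0, 1, 1, 1, 0], delta_t_hold=2)
```

In both functions a gap that violates the window is simply left unfilled, so the mask still describes it and nothing is silently extrapolated.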

X. Quality Metrics & Risk Control

  1. Indicators
    • Missing rate: miss_rate = mean(m)
    • Gap stats: p95(Delta_ts_k), gap_ratio = mean(gap_k)
    • Imputed share: imp_ratio = mean( r_tilde = 1 - m_after ) - coverage
    • Uncertainty: mean(u_imp), p95(u_imp)
    • Impact: share_downstream = fraction_of_downstream_ops_using_imputed
  2. Alert suggestions
    • If miss_rate > tol_miss → down-weight or quarantine sources
    • If gap_ratio > tol_gap → shorten Delta_t or increase sampling
    • If p95(u_imp) > tol_uimp → switch to robust strategies or block publication
    • If imp_ratio > tol_imp → publish sparse summaries only or label as preview
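
A possible way to compute some of these indicators and evaluate the alert rules is sketched below. Assumptions are labeled in the code: imp_ratio is taken here as the plain fraction of samples that were imputed, p95 uses the nearest-rank definition, and the tolerance dictionary keys mirror the tol_* names above. gap_ratio is left out because gap_k is not defined in this section.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list (one of several conventions)."""
    s = sorted(values)
    rank = max(1, math.ceil(q * len(s) / 100))
    return s[rank - 1]

def quality_metrics(m, imputed, u_imp):
    """Indicators for one series.

    m        mask before imputation (1 = missing)
    imputed  1 where a value was imputed, else 0
    u_imp    imputation uncertainty, read only where imputed == 1
    """
    n = len(m)
    u = [u_imp[i] for i in range(n) if imputed[i]]
    return {
        "miss_rate": sum(m) / n,
        # Assumption: imputed share = fraction of all samples that were imputed.
        "imp_ratio": sum(imputed) / n,
        "mean_u_imp": sum(u) / len(u) if u else 0.0,
        "p95_u_imp": percentile(u, 95) if u else 0.0,
    }

def alerts(metrics, tol):
    """Return (indicator, suggested action) pairs for exceeded tolerances."""
    rules = [
        ("miss_rate", "tol_miss", "down-weight or quarantine sources"),
        ("p95_u_imp", "tol_uimp", "switch to robust strategies or block publication"),
        ("imp_ratio", "tol_imp", "publish sparse summaries only or label as preview"),
    ]
    return [(name, action) for name, key, action in rules if metrics[name] > tol[key]]

# Illustrative values; the tolerances are placeholders, not recommended settings.
m = [0, 1, 1, 0, 1, 0, 0, 0, 0, 0]
imputed = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
u_imp = [None, 0.1, 0.3, None, None, None, None, None, None, None]

metrics = quality_metrics(m, imputed, u_imp)
raised = alerts(metrics, {"tol_miss": 0.25, "tol_uimp": 0.5, "tol_imp": 0.5})
```

With these placeholder tolerances only the miss_rate rule fires, since three of ten samples are missing while the imputed share and uncertainty stay below their limits.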

XI. Boundaries & Special Cases


XII. Audit & Panel Fields


XIII. Cross-References


Summary
This chapter integrates missingness detection, mask generation, and constrained imputation into the standard loop: explicit semantics via m, imputation bounded by Delta_t and physical conventions, risk controlled by u_imp and w_imp, and full method/parameter provenance in manifest.missing. With causality and dimensional integrity preserved, data gain minimal yet safe usability improvements, laying an auditable foundation for Chapter 8 anomaly governance and Chapter 10 release gating.


Copyright & License (CC BY 4.0)

Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/