16-EFT.WP.Methods.Cleaning v1.0 | Chapter 7 Missingness, Masks, and Imputation Governance

Chapter 7 Missingness, Masks, and Imputation Governance

One-Sentence Goal
Unify missingness semantics with an explicit mask m ∈ {0,1}, constrain imputation within Delta_t and physical conventions, and record imputation uncertainty u_imp and method signatures so downstream computations can be down-weighted, audited, and fully traceable.

I. Scope & Objects

Applicable targets
- All numeric and time-series columns in D_arrival (from Chapter 6) and their derivatives.
- Fields related to path, arrival time, density, and quality scoring when subjected to imputation and persistence.
Target artifacts
Produce D_missing and manifest.missing = { mask_fields, strategy, params, u_imp, coverage, gap_stats }; emit dashboard-oriented quality metrics and alerts.

II. Terms & Variables (Memory Anchors)

Mask & observability: m ∈ {0,1}, r = 1 - m. m=1 denotes missing, r=1 denotes available.
Missingness mechanisms: MCAR, MAR, MNAR (recorded in label field miss_mech).
Time & windows: ts, tau_mono, Delta_t, gap_max.
Imputation operators & publication: impute_{method}(x | context) -> x_tilde, corr_env(x; RefCond) (Chapter 12).
Uncertainty: u(x), u_imp, U = k * u_c.
Quality & weights: q_score ∈ [0,1], w_imp ∈ [0,1] (down-weight for imputed samples).

III. Axioms (P107-*)

P107-01 Explicit-mask axiom
All missingness must be explicitly recorded by m; implicit filling or silent extrapolation is forbidden.
P107-02 Reproducible-imputation axiom
Any imputation must record method, parameters, context, and random seed, forming a replayable signature.
P107-03 Causality & cadence axiom
Imputation respects causal order and publication cadence; it must not introduce time reversal or violate non_decreasing(ts).
P107-04 Dimension-consistency axiom
Values before and after imputation must share the same dimension: check_dim( x_tilde - x ) must hold.
P107-05 Uncertainty-accompaniment axiom
Generate u_imp for imputed segments and combine it with measurement uncertainty; imputation without u_imp is a contract failure.
P107-06 Risk-priority axiom
MNAR defaults to down-weighting or quarantine; unbiasedness may be assumed only under MCAR, or under MAR after conditioning on the observed covariates.

IV. Minimal Equations (S107-*)

S107-01 Mask and coverage
r = 1 - m
coverage = mean( r )
S107-02 Gap detection (time series)
With adjacent differences Delta_ts_k = ts_{k+1} - ts_k:
gap_k = 1 if Delta_ts_k > gap_max, else gap_k = 0
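
A minimal Python sketch of S107-01 and S107-02, assuming the mask and timestamps arrive as plain lists; the function names and sample values are illustrative and not part of the specification.

```python
def coverage(mask):
    """S107-01: coverage = mean(r), with r = 1 - m."""
    if not mask:
        raise ValueError("mask must be non-empty")
    return sum(1 - m for m in mask) / len(mask)


def gap_flags(ts, gap_max):
    """S107-02: gap_k = 1 where ts[k+1] - ts[k] > gap_max, else 0."""
    return [1 if (t_next - t_cur) > gap_max else 0
            for t_cur, t_next in zip(ts, ts[1:])]


mask = [0, 0, 1, 0, 1]            # m = 1 marks a missing sample
ts = [0.0, 1.0, 2.0, 7.0, 8.0]    # hypothetical timestamps
cov = coverage(mask)              # 3 of 5 samples are available
gaps = gap_flags(ts, gap_max=3.0)  # only the 2.0 -> 7.0 step exceeds gap_max
```

Note that gap_flags returns one flag per adjacent pair, so its length is one less than the number of timestamps.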
S107-03 Linear imputation (in-window)
Given t0 < t < t1, w = ( t - t0 ) / ( t1 - t0 ):
x_tilde(t) = w * x(t1) + ( 1 - w ) * x(t0)
valid only if t1 - t0 ≤ Delta_t and gap_k = 0
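
The guard in S107-03 can be sketched as follows. This is an illustration under the assumption that a refused imputation is signalled by returning None (so the sample stays at m=1); the function name and signature are hypothetical.

```python
def linear_impute(t, t0, x0, t1, x1, delta_t, gap_k=0):
    """Return x_tilde(t) per S107-03, or None when the validity guard fails."""
    if not (t0 < t < t1):
        raise ValueError("t must lie strictly between t0 and t1")
    if (t1 - t0) > delta_t or gap_k != 0:
        return None  # leave the sample missing rather than interpolate
    w = (t - t0) / (t1 - t0)
    return w * x1 + (1 - w) * x0


# Same endpoints, two window policies (values are illustrative).
inside = linear_impute(t=2.0, t0=1.0, x0=10.0, t1=5.0, x1=18.0, delta_t=5.0)
too_wide = linear_impute(t=2.0, t0=1.0, x0=10.0, t1=5.0, x1=18.0, delta_t=3.0)
```

With w = 0.25 the first call yields 0.25 * 18 + 0.75 * 10 = 12.0; the second call is refused because the span of 4.0 exceeds Delta_t = 3.0.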
S107-04 Forward hold (restricted)
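x_tilde(t) = x(t_prev) if t - t_prev ≤ Delta_t_hold, where t_prev is the most recent observed timestamp and Delta_t_hold is the maximum hold duration; otherwise the sample remains missing (m=1).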
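
A restricted forward hold might look like the sketch below. It assumes the hold is always measured from the last observed sample (never from a previously held value) and that held samples are reported through a separate held flag and an updated copy of the mask; these choices and the function name are assumptions, not requirements stated in this chapter.

```python
def forward_hold(ts, xs, mask, delta_t_hold):
    """Fill missing samples with the last observed value, within delta_t_hold.

    Returns new lists (x_out, m_after, held); the inputs are not modified.
    """
    x_out, m_after, held = list(xs), list(mask), [0] * len(xs)
    t_prev = x_prev = None
    for i, (t, x, m) in enumerate(zip(ts, xs, mask)):
        if m == 0:
            t_prev, x_prev = t, x          # remember the last real observation
        elif t_prev is not None and (t - t_prev) <= delta_t_hold:
            x_out[i], m_after[i], held[i] = x_prev, 0, 1
        # otherwise the sample stays missing (m = 1)
    return x_out, m_after, held


ts = [0, 1, 2, 3, 10]
xs = [5.0, None, None, None, 7.0]
mask = [0, 1, 1, 1, 0]
x_out, m_after, held = forward_hold(ts, xs, mask, delta_t_hold=2)
```

Here the samples at t=1 and t=2 are held at 5.0, while t=3 is more than Delta_t_hold away from the last observation and remains missing.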
S107-05 Model-based imputation (general form)
x_tilde = f_theta( Z ), where Z is the contextual feature set; record theta and the training interval.
S107-06 Imputation uncertainty composition
u^2_total(x_tilde) = u^2_meas(x) + u^2_imp(x)
For linear imputation with independent endpoints:
u^2_imp = w^2 * u^2(x(t1)) + (1 - w)^2 * u^2(x(t0))
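
As a worked illustration of S107-06, the sketch below evaluates both formulas for scalar inputs. The function names are hypothetical, and the linear case propagates only the endpoint uncertainties, as in the equation above.

```python
import math


def u_imp_linear(w, u_x0, u_x1):
    """u_imp for linear imputation with independent endpoints (S107-06)."""
    return math.sqrt(w ** 2 * u_x1 ** 2 + (1 - w) ** 2 * u_x0 ** 2)


def u_total(u_meas, u_imp):
    """Root-sum-square combination of measurement and imputation uncertainty."""
    return math.sqrt(u_meas ** 2 + u_imp ** 2)


# Midpoint between two endpoints that each carry u = 0.2 (illustrative values).
u_mid = u_imp_linear(w=0.5, u_x0=0.2, u_x1=0.2)
u_all = u_total(u_meas=0.3, u_imp=0.4)
```

At the midpoint the propagated value is 0.2 / sqrt(2), which is smaller than either endpoint uncertainty; this reflects averaging of independent errors only and says nothing about how well a straight line describes the signal between the endpoints.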
S107-07 Down-weighting suggestion
w_imp = clip( 1 - alpha * miss_density_window , 0 , 1 )
where miss_density_window is the missing fraction in a window and alpha ∈ [0,1] is policy-set.
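
A small sketch of S107-07, assuming miss_density_window is computed as the mean of the mask over the window; the helper names are illustrative.

```python
def miss_density(mask_window):
    """Missing fraction in a window: mean of m over the window."""
    return sum(mask_window) / len(mask_window)


def w_imp(miss_density_window, alpha):
    """S107-07: clip(1 - alpha * miss_density_window, 0, 1)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return min(1.0, max(0.0, 1.0 - alpha * miss_density_window))


density = miss_density([0, 1, 1, 0])   # half of the window is missing
weight = w_imp(density, alpha=0.8)     # 1 - 0.8 * 0.5
```

Because both alpha and the missing fraction lie in [0, 1], the clip only matters as a safeguard against out-of-range inputs.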
S107-08 Dimension check
check_dim( x_tilde - x ) = true

V. Missingness Detection & Mask Generation

Rule normalization
- Detect NaN/null, out-of-range values, saturation (e.g., x ∈ {x_min_sat, x_max_sat}), sentinel values (e.g., -9999) and map uniformly to m=1.
- For composite fields (e.g., n_eff/c_ref), define the mask by components: m = max( m(n_eff), m(c_ref) ).
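
The normalization rules above can be sketched per value as follows. The sentinel, range, and saturation settings are hypothetical examples of a rule set, and mask_value is an illustrative helper rather than the infer_mask interface of Section VIII.

```python
import math


def mask_value(x, sentinels=(-9999,), valid_range=None, sat_values=()):
    """Return m = 1 if x counts as missing under the rules, else m = 0."""
    if x is None:
        return 1
    if isinstance(x, float) and math.isnan(x):
        return 1
    if x in sentinels or x in sat_values:
        return 1
    if valid_range is not None:
        lo, hi = valid_range
        if not (lo <= x <= hi):
            return 1
    return 0


def composite_mask(*component_masks):
    """A composite field is missing if any component is missing."""
    return max(component_masks)


values = [1.2, float("nan"), -9999, None, 150.0, 100.0]
masks = [mask_value(v, valid_range=(0.0, 120.0), sat_values=(100.0,))
         for v in values]
m_ratio = composite_mask(mask_value(1.33), mask_value(float("nan")))
```

Only the first value survives: the others are NaN, a sentinel, null, out of range, and saturated, in that order.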
Structural missingness
For fields never measured by certain devices, or path segments lacking medium data: tag structural_missing=1 and forbid automatic imputation.
Mechanism labeling
Use acquisition logs and statistical tests to label miss_mech ∈ {MCAR, MAR, MNAR}; send MNAR to quarantine or publish read-only with down-weighting.

VI. Cleaning Process (M10-7 Missingness, Masking, and Imputation)

Unified mask generation
Aggregate rules and sources to produce m; output coverage and gap_stats.
Mechanism assessment
Perform coarse checks (independence, correlation with covariates) and label miss_mech; route MNAR to quarantine or read-only publication.
Candidate-strategy selection
Prefer conservation- and causality-friendly methods: local linear/spline, constrained forward-hold, physics-constrained regression, Kalman/state-space.
Constrained imputation execution
Run impute_{method} within Delta_t and gap_max; out-of-bounds gaps remain m=1.
Uncertainty & down-weighting
Compute u_imp per S107-06; generate w_imp and persist it.
Dimension & contract checks
Run check_dim( x_tilde - x ); verify monotonic/physical constraints (e.g., x ≥ 0, arc-length monotone) and window coverage.
Manifest & signature
Write manifest.missing = { method, params, seeds, Delta_t, gap_max, u_imp, w_imp, coverage, miss_mech }; update signature and hash.
Artifacts
Output D_missing, then proceed to Chapter 8 (anomalies) and Chapter 10 (contract gate).
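
The manifest and signature step could be sketched as below. It assumes the signature is a SHA-256 digest over a canonical JSON serialization (sorted keys, compact separators); the chapter does not prescribe the serialization, and all field values shown are placeholders.

```python
import hashlib
import json


def sign_manifest(manifest):
    """Hash a canonical JSON form so that key order does not affect the result."""
    blob = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


manifest_missing = {
    "method": "linear",
    "params": {"endpoint_policy": "non_imputed_only"},  # hypothetical parameter
    "seeds": [0],
    "Delta_t": 5.0,
    "gap_max": 3.0,
    "u_imp": {"mean": 0.14},
    "w_imp": {"alpha": 0.8},
    "coverage": 0.6,
    "miss_mech": "MCAR",
}
signature = sign_manifest(manifest_missing)

# Replaying with the same content in a different key order gives the same hash.
reordered = dict(reversed(list(manifest_missing.items())))
changed = dict(manifest_missing, Delta_t=6.0)
```

Any change to a recorded parameter, such as Delta_t, yields a different digest, which is what makes the manifest usable as a replay check.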

VII. Contracts & Assertions (Chapter Must-Pass Items)

Mask completeness: forall x: exists m(x)
No silent filling: sum( flags.silent_fill ) = 0
Imputation boundary: impute_only_if( (t1 - t0) ≤ Delta_t ∧ gap_k = 0 )
Dimension conservation: check_dim( x_tilde - x ) = true
Uncertainty accompaniment: exists u_imp and total uncertainty updated
Mechanism labeling: exists miss_mech
Down-weight availability: exists w_imp ∈ [0,1]
Manifest completeness: exists(manifest.missing) with all fields present
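
Several of these must-pass items can be checked per record, as in the sketch below. The record layout (a dict with keys such as m, imputed, t0, t1) is an assumption made for illustration; dimension conservation and manifest completeness are left out because they depend on Chapter 4 tooling and the manifest schema.

```python
def check_contracts(record, delta_t):
    """Return the names of the Section VII items that fail for one record."""
    failures = []
    if record.get("m") not in (0, 1):
        failures.append("mask_completeness")
    if record.get("silent_fill", 0) != 0:
        failures.append("no_silent_filling")
    if record.get("imputed"):
        span = record["t1"] - record["t0"]
        if span > delta_t or record.get("gap_k", 0) != 0:
            failures.append("imputation_boundary")
        if record.get("u_imp") is None:
            failures.append("uncertainty_accompaniment")
        w = record.get("w_imp")
        if w is None or not 0.0 <= w <= 1.0:
            failures.append("down_weight_availability")
    if record.get("miss_mech") not in ("MCAR", "MAR", "MNAR"):
        failures.append("mechanism_labeling")
    return failures


good = {"m": 0, "imputed": True, "t0": 1.0, "t1": 4.0, "gap_k": 0,
        "u_imp": 0.1, "w_imp": 0.8, "miss_mech": "MAR"}
bad = dict(good, t1=11.0, u_imp=None)   # window too wide, no u_imp
good_failures = check_contracts(good, delta_t=5.0)
bad_failures = check_contracts(bad, delta_t=5.0)
```

Returning a list of failed items, rather than stopping at the first one, keeps the full picture available for the audit report.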

VIII. Implementation Binding (I10-7)

Interface prototypes
- handle_missing(ds, strategy) -> ds', manifest
- infer_mask(ds, rules) -> m
- choose_impute(ds, meta) -> { method, params }
- impute_series(x, ts, method, params) -> { x_tilde, u_imp, w_imp }
- audit_impute(x, x_tilde, m) -> report
Preconditions
Chapters 5–6 contracts on ts, ell, and arrival.* are satisfied; Chapter 4 unit/dimension consistency holds.
Postconditions & invariants
m covers all relevant fields; x_tilde appears only within allowed windows; u_imp and w_imp are written; manifest.missing is replayable.
Failure semantics
E_MASK_RULE_CONFLICT, E_IMPUTE_WINDOW_EXCEED, E_DIM_FAIL_AFTER_IMPUTE, E_UNCERTAINTY_MISSING.
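
One way to surface these failure codes is as exception classes, sketched below for two of the four codes. The class names, the base class, and the guard_imputation helper are hypothetical; only the code strings come from this chapter.

```python
class MissingnessError(Exception):
    """Hypothetical base class; subclasses carry a chapter failure code."""
    code = "E_MISSINGNESS"


class ImputeWindowExceed(MissingnessError):
    code = "E_IMPUTE_WINDOW_EXCEED"


class UncertaintyMissing(MissingnessError):
    code = "E_UNCERTAINTY_MISSING"


def guard_imputation(t0, t1, delta_t, u_imp):
    """Raise the matching failure if an imputed value breaks a postcondition."""
    if (t1 - t0) > delta_t:
        raise ImputeWindowExceed(f"span {t1 - t0} exceeds Delta_t {delta_t}")
    if u_imp is None:
        raise UncertaintyMissing("imputed value has no u_imp")


def error_code(fn, *args):
    """Run fn and return the failure code it raises, or None on success."""
    try:
        fn(*args)
    except MissingnessError as exc:
        return exc.code
    return None


code_window = error_code(guard_imputation, 0.0, 10.0, 5.0, 0.1)
code_u = error_code(guard_imputation, 0.0, 3.0, 5.0, None)
code_ok = error_code(guard_imputation, 0.0, 3.0, 5.0, 0.1)
```

E_MASK_RULE_CONFLICT and E_DIM_FAIL_AFTER_IMPUTE would follow the same pattern once the rule engine and check_dim are available.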

IX. Common Imputation Strategies & Guardrails

Linear interpolation
- Use: small gaps on stationary segments.
- Guardrail: t1 - t0 ≤ Delta_t; prefer non-imputed endpoints.
Spline interpolation (C2 or piecewise cubic)
- Use: smooth signals.
- Guardrail: forbid spanning transitions/steps (marked by anomaly detection or gradient thresholds).
Forward hold (ffill)
- Use: counters or step states.
- Guardrail: strict Delta_t_hold; publish held=1 flag.
State-space/Kalman
- Use: dynamic systems.
- Guardrail: put model and parameters into the manifest; bind training intervals to drift alerts.
Physics-constrained regression
- Use: conserved/non-negative/monotone variables.
- Guardrail: declare constraint set C = { x | A x ≤ b } in the manifest; violations trigger rollback.
Reference-condition adjustment (placeholder)
- Use: RefCond changes causing systematic shifts.
- Guardrail: record only here; do not apply corrections in this chapter (see Chapter 12).
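
For the physics-constrained guardrail, membership in C = { x | A x ≤ b } and the rollback can be sketched with plain lists. The tolerance, the function names, and the example constraints (non-negativity plus monotone ordering) are assumptions for illustration.

```python
def satisfies(A, b, x, tol=1e-9):
    """True if every row of A x <= b holds (within a small tolerance)."""
    return all(
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i + tol
        for row, b_i in zip(A, b)
    )


def accept_or_rollback(x_before, x_candidate, A, b):
    """Keep the imputed candidate only if it lies in C; otherwise roll back."""
    return x_candidate if satisfies(A, b, x_candidate) else x_before


# Constraints on a 2-vector: x0 >= 0, x1 >= 0, and x0 <= x1 (monotone).
A = [[-1.0, 0.0], [0.0, -1.0], [1.0, -1.0]]
b = [0.0, 0.0, 0.0]

x_before = [1.0, None]                  # second value was missing
kept = accept_or_rollback(x_before, [1.0, 1.5], A, b)
rolled_back = accept_or_rollback(x_before, [1.0, 0.5], A, b)
```

The second candidate violates the monotone row (1.0 - 0.5 > 0), so the pre-imputation state is returned and the sample stays missing.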

X. Quality Metrics & Risk Control

Indicators
- Missing rate: miss_rate = mean(m)
- Gap stats: p95(Delta_ts_k), gap_ratio = mean(gap_k)
- Imputed share: imp_ratio = mean( r_tilde ) - coverage, where r_tilde = 1 - m_after and m_after is the mask after imputation
- Uncertainty: mean(u_imp), p95(u_imp)
- Impact: share_downstream = fraction_of_downstream_ops_using_imputed
Alert suggestions
- If miss_rate > tol_miss → down-weight or quarantine sources
- If gap_ratio > tol_gap → shorten Delta_t or increase sampling
- If p95(u_imp) > tol_uimp → switch to robust strategies or block publication
- If imp_ratio > tol_imp → publish sparse summaries only or label as preview
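
The indicators and alert rules can be sketched as below, assuming masks before and after imputation are available as lists. The p95 indicators are omitted because the chapter does not fix a percentile convention, and the tolerance values are placeholders.

```python
def quality_metrics(m_before, m_after, gap_k):
    """Compute miss_rate, coverage, imp_ratio, and gap_ratio from masks and gaps."""
    n = len(m_before)
    miss_rate = sum(m_before) / n
    coverage = 1 - miss_rate
    imp_ratio = sum(1 - m for m in m_after) / n - coverage
    gap_ratio = sum(gap_k) / len(gap_k) if gap_k else 0.0
    return {"miss_rate": miss_rate, "coverage": coverage,
            "imp_ratio": imp_ratio, "gap_ratio": gap_ratio}


def alerts(metrics, tol):
    """Return the indicators whose value exceeds its tolerance."""
    pairs = [("miss_rate", "tol_miss"), ("gap_ratio", "tol_gap"),
             ("imp_ratio", "tol_imp")]
    return [name for name, key in pairs if metrics[name] > tol[key]]


m_before = [0, 0, 1, 0, 1, 1, 0, 0, 0, 0]   # 3 of 10 missing
m_after = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]    # 2 of them imputed
gap_k = [0, 0, 0, 0, 1, 0, 0, 0, 0]
metrics = quality_metrics(m_before, m_after, gap_k)
raised = alerts(metrics, {"tol_miss": 0.25, "tol_gap": 0.2, "tol_imp": 0.3})
```

In this example only miss_rate (0.3) exceeds its tolerance, so the suggested response is to down-weight or quarantine the source.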

XI. Boundaries & Special Cases

Structurally unmeasurable fields
Keep m=1 permanently; imputation is not allowed. If a substitute is needed, derive it from other fields and declare the relationship.
Path-dependent quantities (e.g., n_eff(ell))
Permit only local interpolation along ell; do not cross medium segments or breakpoints (see Chapter 6 segmentation rules).
Arrival time T_arr
Do not impute T_arr directly; if necessary, impute local gaps in n_eff or c_ref and recompute both forms and delta_form.

XII. Audit & Panel Fields

Minimal panel
miss_rate, coverage, gap_ratio, imp_ratio, mean(u_imp), p95(u_imp), held_count, method_share, MNAR_ratio
Traceability fields
strategy.name, params, seed, version, signature, hash_sha256(blob).

XIII. Cross-References

Units & dimensions (check_dim, u(x) composition): Chapter 4 and Appendix E.
Time axis & synchronization (Delta_t, gap_max, causality): Chapter 5.
Path & arrival time (two forms and medium segmentation): Chapter 6.
Anomalies & drift (ordering with imputation): Chapter 8.
Contracts & release freeze: Chapter 10.
Quality scoring & audit: Chapter 14.

Summary
This chapter integrates missingness detection, mask generation, and constrained imputation into the standard loop: explicit semantics via m, imputation bounded by Delta_t and physical conventions, risk controlled by u_imp and w_imp, and full method/parameter provenance in manifest.missing. With causality and dimensional integrity preserved, the data gains a modest but safe improvement in usability, laying an auditable foundation for Chapter 8 anomaly governance and Chapter 10 release gating.

Copyright & License (CC BY 4.0)

Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/