EFT.WP.Methods.Cleaning v1.0
Chapter 1 Definition and Scope of the Cleaning Domain
One-Sentence Goal
Define the objects, boundaries, and compliance objectives of cleaning, provide the minimal executable loop and release criteria, and ensure any input D_raw is transformed by M10-* into an auditable D_clean with a manifest.
I. Scope & Objects
- Covered scenarios
- Offline batch, online services, and event streams operate under one cleaning loop and one release criterion.
- Objects include time series, path-parameterized observations, event logs, scalar and tensor fields, and reference-environment records.
- Inputs and outputs
- Input: D_raw carrying schema_ver and the minimal manifest keys.
- Output: D_clean with a manifest containing four required domains: timing, arrival_forms, qc, contracts.
- Non-goals and boundaries
- Does not perform physical modeling or interpretation, and does not replace calibration or traceability standards.
- Does not prescribe storage implementations or orchestration engines; it specifies interfaces, contracts, and assertions only.
II. Terms & Variables
- Data & keys: D_raw, D_clean, rid, pk, idx_k, schema_ver, TraceID.
- Time & synchronization: tau_mono (internal evaluation baseline), ts (public release time), offset, skew, J, Delta_t.
- Path & measure: gamma(ell), d ell, L_gamma = ( ∫_gamma 1 d ell ).
- Arrival-time conventions: T_arr, n_eff, c_ref, delta_form, tol_Tarr.
- Metrology & units: unit(x), dim(x), check_dim(expr), u(x), U = k * u_c.
- Quality & missingness: m ∈ {0,1}, q_score ∈ [0,1], drift.
- Environment & correction: RefCond, corr_env(x; RefCond).
- Reserved conflicts: never mix T_fil with T_trans; strictly distinguish n from n_eff.
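As an illustration of the metrology symbols above, the relation U = k * u_c can be sketched as follows. The chapter fixes only that relation; the combination rule used here for u_c (root-sum-of-squares of uncorrelated components with unit sensitivity coefficients), the default k = 2, and the helper names are assumptions made for this sketch.

```python
import math
from typing import Iterable


def combined_standard_uncertainty(components: Iterable[float]) -> float:
    """u_c as a root-sum-of-squares (assumes uncorrelated components)."""
    return math.sqrt(math.fsum(u * u for u in components))


def expanded_uncertainty(u_c: float, k: float = 2.0) -> float:
    """U = k * u_c; the coverage factor default k = 2 is a hypothetical choice."""
    return k * u_c


# Hypothetical standard-uncertainty components, all in the same unit.
u_c = combined_standard_uncertainty([3.0, 4.0])
U = expanded_uncertainty(u_c, k=2.0)
```

Whatever rule is used, k and the component list would need to be recorded with the result for U to be auditable.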
III. Axioms (P101-*)
- P101-1 Cleaning-loop axiom
cleaning_loop = { schema, units_dims, timebase, path_arrival, quality_contracts, freeze_release } is the minimal loop, and all elements are mandatory.
- P101-2 Two-form concurrency axiom
Any use of T_arr must compute both forms concurrently and persist their difference.
- P101-3 Explicit domain-and-measure axiom
Every integral must declare its domain and measure explicitly, including gamma(ell) and d ell for path integrals.
- P101-4 Monotone time and path axiom
non_decreasing(tau_mono) and non_decreasing(ell) are hard contracts; violations indicate data errors or the need for correction.
- P101-5 Units and dimensions consistency axiom
Publication requires check_dim(expr) to pass; mixing dimensionless with dimensional quantities is forbidden.
- P101-6 Contracts-before-release axiom
Artifacts that fail required contracts and thresholds must not proceed to freeze and signature.
IV. Minimal Equations (S101-*)
- S101-1 Release criterion
pass = check_dim ∧ arrival_forms ∧ contract_ok ∧ manifest_signed
- S101-2 Arrival-time two forms
T_arr_form1 = ( 1 / c_ref ) * ( ∫ n_eff d ell )
T_arr_form2 = ( ∫ ( n_eff / c_ref ) d ell )
- S101-3 Two-form difference metric
delta_form = | T_arr_form1 - T_arr_form2 |
- S101-4 Path length and monotonicity
L_gamma = ( ∫_gamma 1 d ell ), with non_decreasing(ell) = true
- S101-5 Time mapping
ts = map_to_pub( tau_mono ; offset, skew, J )
- S101-6 Probability and density conventions
( ∫_Omega p(x) dx ) = 1 for probability-density normalization; physical density uses explicit unit(rho) and dim(rho).
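The two arrival-time forms and their difference metric (S101-2, S101-3) can be sketched numerically as below. This is a minimal illustration, not the reference implementation: it assumes n_eff is sampled at discrete values of ell along gamma, uses the trapezoidal rule as the quadrature, and treats c_ref as a constant. The function names and sample values are hypothetical.

```python
from typing import Dict, Sequence


def trapezoid(y: Sequence[float], ell: Sequence[float]) -> float:
    """Integrate samples y over the path parameter ell (trapezoidal rule)."""
    if len(y) != len(ell) or len(ell) < 2:
        raise ValueError("y and ell must have equal length >= 2")
    if any(b < a for a, b in zip(ell, ell[1:])):
        raise ValueError("ell must be non-decreasing (P101-4)")
    return sum(
        0.5 * (y0 + y1) * (l1 - l0)
        for y0, y1, l0, l1 in zip(y, y[1:], ell, ell[1:])
    )


def arrival_forms(n_eff: Sequence[float], ell: Sequence[float],
                  c_ref: float, tol_Tarr: float) -> Dict[str, object]:
    """Compute both forms concurrently and keep their difference (P101-2)."""
    form1 = trapezoid(n_eff, ell) / c_ref                # (1 / c_ref) * integral of n_eff
    form2 = trapezoid([n / c_ref for n in n_eff], ell)   # integral of (n_eff / c_ref)
    delta_form = abs(form1 - form2)
    return {
        "T_arr_form1": form1,
        "T_arr_form2": form2,
        "delta_form": delta_form,
        "tol_Tarr": tol_Tarr,
        "ok": delta_form <= tol_Tarr,
    }


# Hypothetical path: four samples over 3 m, with c_ref in m/s.
result = arrival_forms(
    n_eff=[1.0, 1.1, 1.2, 1.1],
    ell=[0.0, 1.0, 2.0, 3.0],
    c_ref=299_792_458.0,
    tol_Tarr=1e-15,
)
```

With a constant c_ref the two forms are algebraically equal, so in this sketch delta_form reflects floating-point rounding only. A larger value would point to inconsistent path or measure definitions between the two computations.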
V. Inputs, Outputs, and Manifest
- manifest.schema = { schema_ver, registry, units_policy }
- manifest.timing = { tau_mono, ts, offset, skew, J, window = Delta_t }
- manifest.arrival_forms = { gamma(ell), d ell, c_ref, n_eff, T_arr_form1, T_arr_form2, delta_form, tol_Tarr }
- manifest.qc = { q_score, m_mask, drift }
- manifest.contracts = { unique(pk), non_decreasing(ts|ell), check_dim_set, eps_norm, res_mass, tol_Tarr }
- manifest.signature = { hash_sha256(blob), signature, issuer }
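A minimal sketch of assembling manifest.signature follows. It assumes the manifest is held as a plain dictionary and that the four required domains named in Section I must be present before signing. The chapter does not specify a signing scheme, so the signature field is left as a placeholder; the helper names and the issuer value are hypothetical.

```python
import hashlib
from typing import Dict, List

# The four required manifest domains listed in Section I.
REQUIRED_DOMAINS = ("timing", "arrival_forms", "qc", "contracts")


def missing_domains(manifest: Dict[str, dict]) -> List[str]:
    """Return the required domains that are absent from the manifest."""
    return [d for d in REQUIRED_DOMAINS if d not in manifest]


def sign_manifest(manifest: Dict[str, dict], blob: bytes, issuer: str) -> Dict[str, dict]:
    """Attach hash_sha256(blob) and issuer; refuses incomplete manifests."""
    missing = missing_domains(manifest)
    if missing:
        raise ValueError(f"manifest lacks required domains: {missing}")
    signed = dict(manifest)  # shallow copy so the input is left unchanged
    signed["signature"] = {
        "hash_sha256": hashlib.sha256(blob).hexdigest(),
        "signature": None,  # placeholder: the signing scheme is not specified here
        "issuer": issuer,
    }
    return signed


# Hypothetical empty domains and payload, for illustration only.
manifest = {"timing": {}, "arrival_forms": {}, "qc": {}, "contracts": {}}
signed = sign_manifest(manifest, blob=b"example D_clean bytes", issuer="example-issuer")
```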
VI. Cleaning Process (M10-1, Master Flow)
- standardize_names(ds, registry)
Harmonize field names and alias mappings, validate schema_ver and required keys.
- repair_units(ds, policy)
Normalize units and run check_dim(expr); on failure, quarantine or roll back.
- align_timebase(ds, sync_ref)
Establish tau_mono, map ts, estimate and record offset, skew, J.
- enforce_arrival_time_convention(ds)
Parameterize gamma(ell) and d ell, compute T_arr_form1 and T_arr_form2, produce delta_form, and compare to tol_Tarr.
- handle_missing(ds, strategy)
Emit the m mask; perform interpolation or environmental correction via corr_env(x; RefCond) with associated uncertainty.
- detect_outlier(ds, method, fields)
Label outliers, abrupt changes, and drift; adjust q_score and apply down-weighting policies.
- deduplicate(ds, keys, semantics) and referential integrity
Remove duplicates, enforce foreign-key consistency, and clean orphan records.
- assert_contract(ds, tests)
Execute uniqueness, monotonicity, dimensional consistency, coherence, and range assertions; generate an auditable report.
- freeze_release(ds, tag)
Produce the manifest, compute hash_sha256(blob) and sign; complete the release freeze.
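The ordering of the master flow, with freeze_release gated on contract results (P101-6), might be wired together as in the sketch below. It assumes each step is a function from dataset to dataset and that assert_contract stores a dictionary of pass/fail results under a contract_results key. The step bodies are stubs, and the runner, the key name, and the tag are hypothetical.

```python
from typing import Callable, Dict

# Step order taken from the M10-1 list; freeze_release is handled by the runner.
M10_1_ORDER = [
    "standardize_names", "repair_units", "align_timebase",
    "enforce_arrival_time_convention", "handle_missing",
    "detect_outlier", "deduplicate", "assert_contract",
]


def run_m10_1(ds: dict, steps: Dict[str, Callable[[dict], dict]], tag: str) -> dict:
    """Run the steps in order; freeze only if every contract passed."""
    for name in M10_1_ORDER:
        ds = steps[name](ds)
    results = ds.get("contract_results")
    if not results:
        return {"released": False, "failed": ["contract_results_missing"]}
    failed = sorted(name for name, ok in results.items() if not ok)
    if failed:
        # Contracts-before-release: do not proceed to freeze and signature.
        return {"released": False, "failed": failed}
    return {"released": True, "tag": tag, "failed": []}


def identity(ds: dict) -> dict:
    """Stub standing in for a real cleaning step."""
    return ds


def assert_contract_stub(ds: dict) -> dict:
    """Stub that checks arrival-time coherence only."""
    out = dict(ds)
    out["contract_results"] = {"arrival_coherence": ds["delta_form"] <= ds["tol_Tarr"]}
    return out


steps = {name: identity for name in M10_1_ORDER}
steps["assert_contract"] = assert_contract_stub

ok_run = run_m10_1({"delta_form": 1e-18, "tol_Tarr": 1e-15}, steps, tag="demo-tag")
bad_run = run_m10_1({"delta_form": 1e-12, "tol_Tarr": 1e-15}, steps, tag="demo-tag")
```

Treating an empty or missing result set as a failure keeps a run that skipped its assertions from reaching the freeze step.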
VII. Contracts & Assertions
- Uniqueness and integrity: unique(pk), foreign_key.
- Monotonicity: non_decreasing(ts), non_decreasing(ell).
- Dimensional consistency: check_dim(y - f(x)) = 0.
- Arrival-time coherence: delta_form ≤ tol_Tarr.
- Normalization and conservation: eps_norm ≤ tol_norm, res_mass ≤ tol_mass.
- Missingness and quality: coverage = 1 - mean(m), q_score ≥ q_min.
- Drift monitoring: drift ≤ tol_drift, with Delta_t and degrees of freedom fixed in the manifest.
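Several of the assertions above reduce to short predicates. The sketch below assumes a dataset held as a dictionary of equal-length columns with a scalar delta_form and q_score. The column layout, the function names, and the threshold values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def unique(values: Sequence) -> bool:
    """unique(pk): no repeated key values."""
    return len(set(values)) == len(values)


def non_decreasing(values: Sequence[float]) -> bool:
    """non_decreasing(ts) or non_decreasing(ell); equal neighbours are allowed."""
    return all(a <= b for a, b in zip(values, values[1:]))


def coverage(m: Sequence[int]) -> float:
    """coverage = 1 - mean(m), where m is the 0/1 missingness mask."""
    return 1.0 - sum(m) / len(m)


def run_contracts(ds: dict, tol_Tarr: float, q_min: float) -> Dict[str, bool]:
    """Evaluate a subset of the Section VII assertions and report each result."""
    return {
        "unique(pk)": unique(ds["pk"]),
        "non_decreasing(ts)": non_decreasing(ds["ts"]),
        "non_decreasing(ell)": non_decreasing(ds["ell"]),
        "delta_form<=tol_Tarr": ds["delta_form"] <= tol_Tarr,
        "q_score>=q_min": ds["q_score"] >= q_min,
    }


# Hypothetical four-row dataset.
sample = {
    "pk": [101, 102, 103, 104],
    "ts": [0.0, 0.5, 0.5, 1.0],
    "ell": [0.0, 1.0, 2.0, 3.0],
    "m": [0, 0, 1, 0],
    "delta_form": 1e-18,
    "q_score": 0.9,
}
report = run_contracts(sample, tol_Tarr=1e-15, q_min=0.8)
```

Returning one named result per assertion, rather than a single boolean, gives the auditable report that assert_contract is expected to produce.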
VIII. Boundaries, Risks, and Rollback
- Boundaries
- Cleaning does not replace device calibration, does not infer physical ground-truth for missing samples, and does not perform semantic labeling.
- When delta_form exceeds tol_Tarr, first verify the path and measure definitions, then consider environmental corrections.
- Risks
- Non-monotone paths or time axes bias arrival-time estimation.
- Missing unit and dimension declarations introduce implicit errors.
- Rollback
- Keep the prior tag’s freeze_release artifacts available for online rollback.
- On contract failures, emit a minimal diagnostic manifest report and do not publish the data plane.
IX. Cross-References
- Acquisition and time semantics: see EFT.WP.Core.Sea v1.0.
- Schemas, fields, and manifests: see EFT.WP.Core.DataSpec v1.0.
- Channels and back-pressure coordination: see EFT.WP.Core.Threads v1.0.
- Density, measures, and normalization: see EFT.WP.Core.Density v1.0.
- Dimensions and metrological flows: see EFT.WP.Core.Metrology v1.0, EFT.WP.Core.Parameters v1.0, EFT.WP.Core.Errors v1.0.
Summary
This chapter establishes the cleaning domain’s objects and boundaries, defines the six-element loop, the two-form harmonization, and the explicit-measure constraint, and provides the release criterion S101-1 alongside the master process M10-1. Subsequent chapters inherit the numbering, variables, and contracts introduced here and extend them to pattern binding, metrological consistency, time and path handling, quality and compliance, and the freeze-and-audit chain.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/