EFT.WP.Data.DatasetCards v1.0
Chapter 7 Cleaning & Preprocessing
I. Chapter Purpose & Scope
Formalize process-oriented records for cleaning and preprocessing, covering parameter locking, environment and randomness control, contamination removal, and normalization postures, so that results are reproducible and auditable. Use “Volume+Version+Anchor” for clause-level citations; keys use snake_case.
II. Terminology & Dependencies
- Terminology: General terms follow EFT.WP.Core.Terms v1.0; this chapter only adds fields tied to pipelines and parameters.
- Dependent volumes: Core.DataSpec v1.0 (data contract/export); Core.Metrology v1.0 (metrology/units/uncertainty); Core.Equations v1.1 (path/arrival-time equations). All math expressions must be wrapped in backticks and contain no Chinese characters.
III. Fields & Structure (Normative)
preprocess:
  pipeline_id: "<string>"   # Pipeline identifier (semantic name)
  steps:                    # Ordered, idempotent steps
    - name: "<denoise|filter|rfi_clean|normalize|resample|impute|clip|custom>"
      enabled: true
      params: { ... }       # Fully explicit parameters with units
      idempotent: true
      inputs: ["<field>"]
      outputs: ["<field>"]
      notes: "<non-normative>"
  parameter_lock: true      # Freeze parameters prior to release
  randomness:
    seed: 1701
    libraries: {numpy: "1.26.4", torch: "2.3.1"}
  environment:
    os: "ubuntu22.04"
    toolchain: ["python3.11", "fftw3"]
    containers: ["ghcr.io/eift/card-prep:1.0.2"]
  audits: ["nan-check", "range-check", "leakage", "class-imbalance"]
  artifacts:
    - path: "preprocess/logs/step-01.jsonl"
      sha256: "..."
    - path: "preprocess/configs/lock.yaml"
      sha256: "..."
  see:
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
(Exported artifacts must appear in export_manifest.artifacts[] and carry sha256; the anchors under see must appear in export_manifest.references[].)
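The sha256 fields above can be filled in mechanically. The following is a minimal sketch, assuming artifacts are ordinary files under a common root directory; the helper names sha256_of_file and build_artifact_entries are hypothetical and not part of this specification.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_artifact_entries(root: Path, relative_paths: list[str]) -> list[dict]:
    """Build {path, sha256} entries (hypothetical helper) for files under root."""
    return [
        {"path": rel, "sha256": sha256_of_file(root / rel)}
        for rel in relative_paths
    ]
```

Calling build_artifact_entries(Path("."), ["preprocess/configs/lock.yaml"]) would yield one entry in the same shape as the artifacts list, provided that file exists.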
IV. Canonical Postures for Common Operations
- Denoising (denoise): Declare the algorithm (e.g., median-k=3, wavelet-db8), windowing, and boundary handling; numeric fields must pass dimensional consistency via check_dim.
- Filtering (filter): Specify type (lowpass|bandpass|notch), cutoff/band, order, window, and phase posture (zero-phase/causal).
- RFI/anomaly cleaning (rfi_clean|clip): Provide thresholds, masking strategy, and restore/imputation posture; record the removed-sample ratio.
- Normalization (normalize): Specify basis (zscore/minmax/robust), stats window, and leakage control (no stats from val/test).
- Resampling (resample): Declare the target sample rate f_samp, anti-aliasing filter, and interpolation method.
- Missing-data handling (impute): Declare strategy (mean/median/KNN/model), affected fields, and uncertainty bookkeeping (recorded in the uncertainty extension).
- Custom (custom): Provide an executable reference (script/container) and parameter hash; bind artifacts in related_artifacts[] and include them in the export.
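As an illustration of the leakage-control posture for normalization, the sketch below fits z-score statistics on the training split only and reuses them unchanged for other splits. The function names fit_zscore and apply_zscore and the returned dictionary layout are assumptions made for this example; the use of the population standard deviation is likewise an arbitrary choice that a real card would have to declare.

```python
import statistics


def fit_zscore(train_values):
    """Compute z-score statistics from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population std; an assumption
    if std == 0.0:
        raise ValueError("training split has zero variance")
    # Hypothetical record layout, mirroring params in the step definition.
    return {"type": "zscore", "stats_from": "train-only", "mean": mean, "std": std}


def apply_zscore(values, stats):
    """Normalize any split with previously fitted (train-only) statistics."""
    return [(v - stats["mean"]) / stats["std"] for v in values]
```

Validation and test data pass through apply_zscore with the stored statistics, so no val/test values ever influence mean or std.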
V. Parameter Locking & Randomness Control
- parameter_lock=true is a pre-release requirement; persist all params in a lock file (with units/dimensions).
- Randomness: Fix seed and library versions; for parallel/distributed runs, declare deterministic backends or tolerated nondeterminism.
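One way to make a parameter lock verifiable is to hash a canonical serialization of the parameters and to derive all randomness from the recorded seed. The sketch below assumes JSON-serializable parameters; params_hash and make_rng are hypothetical helpers, and only Python's standard random module is seeded here (numpy or torch generators would need their own seeding).

```python
import hashlib
import json
import random


def params_hash(params: dict) -> str:
    """Hash parameters via canonical JSON (sorted keys, fixed separators)."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_rng(seed: int) -> random.Random:
    """Return an isolated generator so global random state is untouched."""
    return random.Random(seed)
```

Because keys are sorted before hashing, two lock files that differ only in key order produce the same digest, while any changed value produces a different one.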
VI. Data Integrity & Contamination Control
- nan-check/range-check: State valid domains per field and out-of-range policy (drop/clip/impute).
- Duplicates/Leakage: De-duplicate at the object/time-window level; cross-split leakage is a blocking issue, so record it under audits and enforce thresholds in quality.gates.
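The two checks above can be sketched as follows. This is an illustrative outline only: range_check and cross_split_leakage are hypothetical names, only the drop and clip policies are shown (impute is omitted), NaN values are always discarded, and samples are assumed to carry a hashable object/time-window key.

```python
import math


def range_check(values, lo, hi, policy="drop"):
    """Apply an out-of-range policy; return (cleaned_values, n_flagged)."""
    if policy not in ("drop", "clip"):
        raise ValueError(f"unsupported policy: {policy}")
    cleaned, flagged = [], 0
    for v in values:
        if math.isnan(v):
            flagged += 1  # NaN cannot be clipped, so it is never kept
        elif lo <= v <= hi:
            cleaned.append(v)
        else:
            flagged += 1
            if policy == "clip":
                cleaned.append(min(max(v, lo), hi))
    return cleaned, flagged


def cross_split_leakage(train_keys, test_keys):
    """Return the sorted keys that occur in both splits."""
    return sorted(set(train_keys) & set(test_keys))
```

A non-empty result from cross_split_leakage would correspond to the blocking condition described above; the flagged count from range_check can be recorded alongside the audit.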
VII. Environment & Reproducibility
- Record OS, dependencies, container images, and launch command; export locked configs and execution logs with sha256.
- If the pipeline touches path-dependent quantities (e.g., `T_arr`), register delta_form, path="gamma(ell)", and measure="d ell" in the card. Two equivalent expressions coexist, and both must pass dimensional consistency via check_dim:
  - `T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )`
  - `T_arr = ( ∫ ( n_eff / c_ref ) d ell )`
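A quick numerical check of the two `T_arr` expressions can be sketched as below. It assumes `c_ref` is constant along the path (the condition under which the factor can be moved outside the integral), samples `n_eff` at discrete path positions, and uses a simple trapezoidal rule; all function names are hypothetical and units are whatever the card declares.

```python
def trapezoid(ys, xs):
    """Trapezoidal integral of samples ys over positions xs."""
    return sum(
        0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    )


def t_arr_factored(n_eff, ell, c_ref):
    """T_arr = ( 1 / c_ref ) * ( integral of n_eff d ell )."""
    return (1.0 / c_ref) * trapezoid(n_eff, ell)


def t_arr_integrand(n_eff, ell, c_ref):
    """T_arr = ( integral of ( n_eff / c_ref ) d ell )."""
    return trapezoid([n / c_ref for n in n_eff], ell)
```

With ell = [0.0, 1.0, 2.0], n_eff = [1.0, 1.5, 2.0], and c_ref = 2.0, both functions return 1.5, so a pipeline can assert agreement within a tolerance before registering the quantity.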
VIII. Quality Checks & Metrics
- Coverage: Before/after sample counts and distribution comparison.
- SNR metrics: SNR_before/after, artifact rate, RFI mask ratio.
- Consistency: Drift in key statistics (quantiles/autocorr/spectrum) is checked against thresholds; a failed check blocks card release.
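The coverage and consistency checks can be approximated with standard-library statistics. The sketch below is one possible reading, not a normative definition: removed_ratio, quantile_drift, and passes_gate are hypothetical names, drift is taken as the largest absolute change among quartile cut points, and the threshold is supplied by the caller.

```python
import statistics


def removed_ratio(n_before: int, n_after: int) -> float:
    """Fraction of samples removed by preprocessing."""
    return (n_before - n_after) / n_before


def quantile_drift(before, after, n: int = 4) -> float:
    """Largest absolute shift among the n-quantile cut points."""
    q_before = statistics.quantiles(before, n=n)
    q_after = statistics.quantiles(after, n=n)
    return max(abs(a - b) for a, b in zip(q_after, q_before))


def passes_gate(before, after, max_drift: float) -> bool:
    """True if quantile drift stays within the declared threshold."""
    return quantile_drift(before, after) <= max_drift
```

A False result from passes_gate corresponds to the release-blocking condition in the consistency item above.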
IX. Coupling with Export Manifest
export_manifest:
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
  artifacts:
    - {path: "preprocess/logs/step-*.jsonl", sha256: "..."}
    - {path: "preprocess/configs/lock.yaml", sha256: "..."}
(All cleaning/preprocessing artifacts must appear in the export manifest and be verifiable.)
X. Example Fragment (drop-in)
preprocess:
  pipeline_id: "rf-frb-clean-v1"
  steps:
    - name: "rfi_clean"
      enabled: true
      params: {method: "spectral-kurtosis", window: 256, thr_sigma: 5}
      idempotent: true
      inputs: ["raw_spec"]
      outputs: ["mask_spec"]
    - name: "filter"
      enabled: true
      params: {type: "bandpass", f_lo_hz: 1.2e6, f_hi_hz: 3.8e6, order: 5, phase: "zero"}
      idempotent: true
      inputs: ["raw_ts"]
      outputs: ["flt_ts"]
    - name: "normalize"
      enabled: true
      params: {type: "zscore", stats_from: "train-only", clip_q: [0.01, 0.99]}
      idempotent: true
      inputs: ["flt_ts"]
      outputs: ["norm_ts"]
  parameter_lock: true
  randomness: {seed: 1701, libraries: {numpy: "1.26.4"}}
  environment: {os: "ubuntu22.04", containers: ["ghcr.io/eift/card-prep:1.0.2"]}
  audits: ["nan-check", "leakage", "class-imbalance"]
(Export configs and logs via export_manifest.artifacts[] and include clause-level anchors.)
XI. Chapter Compliance Checklist
- preprocess.pipeline_id/steps[], parameters, environment, randomness, and audits are fully recorded and locked; logs/configs exported with sha256.
- All formulas/symbols use backticks and parentheses, with no Chinese; cross-volume references use "Volume vX.Y:Anchor".
- For T_arr, delta_form/path/measure are registered and check_dim passes.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/