EFT.WP.Data.DatasetCards v1.0

Chapter 7 Cleaning & Preprocessing


I. Chapter Purpose & Scope

Formalize process-oriented records for cleaning and preprocessing, parameter locking, environment & randomness control, contamination removal, and normalization postures; ensure reproducibility and auditability. Use “Volume+Version+Anchor” for clause-level citations; keys use snake_case.

II. Terminology & Dependencies


III. Fields & Structure (Normative)

preprocess:
  pipeline_id: "<string>"        # Pipeline identifier (semantic name)
  steps:                         # Ordered, idempotent steps
    - name: "<denoise|filter|rfi_clean|normalize|resample|impute|clip|custom>"
      enabled: true
      params: { ... }            # Fully explicit parameters with units
      idempotent: true
      inputs: ["<field>"]
      outputs: ["<field>"]
      notes: "<non-normative>"
  parameter_lock: true           # Freeze parameters prior to release
  randomness:
    seed: 1701
    libraries: {numpy: "1.26.4", torch: "2.3.1"}
  environment:
    os: "ubuntu22.04"
    toolchain: ["python3.11", "fftw3"]
    containers: ["ghcr.io/eift/card-prep:1.0.2"]
  audits: ["nan-check", "range-check", "leakage", "class-imbalance"]
  artifacts:
    - path: "preprocess/logs/step-01.jsonl"
      sha256: "..."
    - path: "preprocess/configs/lock.yaml"
      sha256: "..."
  see:
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"

(Exported artifacts must appear in export_manifest.references[] and carry sha256.)
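
As an illustration of how a record with the fields above might be checked before release, here is a minimal sketch in Python. The function name validate_preprocess, the choice of which keys count as required, and the 64-hex-character test for sha256 are assumptions made for this example; the chapter itself only lists the fields, the allowed step names, and the rule that steps are idempotent and parameters are locked prior to release.

```python
import hashlib
import re

# Step names taken from the schema above.
ALLOWED_STEPS = {"denoise", "filter", "rfi_clean", "normalize",
                 "resample", "impute", "clip", "custom"}
# Assumption: these keys are treated as required; "notes" and "see" as optional.
REQUIRED_TOP = ("pipeline_id", "steps", "parameter_lock",
                "randomness", "environment", "audits", "artifacts")
REQUIRED_STEP = ("name", "enabled", "params", "idempotent", "inputs", "outputs")
HEX64 = re.compile(r"[0-9a-f]{64}")


def validate_preprocess(record):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in REQUIRED_TOP:
        if key not in record:
            problems.append(f"missing key: {key}")
    if record.get("parameter_lock") is not True:
        problems.append("parameter_lock must be true before release")
    for i, step in enumerate(record.get("steps", [])):
        for key in REQUIRED_STEP:
            if key not in step:
                problems.append(f"steps[{i}]: missing key: {key}")
        if step.get("name") not in ALLOWED_STEPS:
            problems.append(f"steps[{i}]: unknown name: {step.get('name')!r}")
        if step.get("idempotent") is not True:
            problems.append(f"steps[{i}]: idempotent must be true")
    if "seed" not in record.get("randomness", {}):
        problems.append("randomness.seed is not set")
    for j, art in enumerate(record.get("artifacts", [])):
        if not HEX64.fullmatch(str(art.get("sha256", ""))):
            problems.append(f"artifacts[{j}]: sha256 is not a 64-char hex digest")
    return problems


# Illustrative record; the digest is computed from placeholder bytes, not real data.
example = {
    "pipeline_id": "rf-frb-clean-v1",
    "steps": [{
        "name": "rfi_clean", "enabled": True,
        "params": {"method": "spectral-kurtosis", "window": 256, "thr_sigma": 5},
        "idempotent": True, "inputs": ["raw_spec"], "outputs": ["mask_spec"],
    }],
    "parameter_lock": True,
    "randomness": {"seed": 1701, "libraries": {"numpy": "1.26.4"}},
    "environment": {"os": "ubuntu22.04"},
    "audits": ["nan-check"],
    "artifacts": [{"path": "preprocess/configs/lock.yaml",
                   "sha256": hashlib.sha256(b"parameter_lock: true\n").hexdigest()}],
}
print(validate_preprocess(example))
```

Returning a list of problems rather than raising on the first one lets a release audit report every gap in a single pass.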


IV. Canonical Postures for Common Operations


V. Parameter Locking & Randomness Control


VI. Data Integrity & Contamination Control


VII. Environment & Reproducibility

  1. Record OS, dependencies, container images, and launch command; export locked configs and execution logs with sha256.
  2. If the pipeline touches path-dependent quantities (e.g., T_arr), register in the card: delta_form, path="gamma(ell)", measure="d ell". Two equivalent expressions coexist:
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell )
      Both forms must pass dimensional consistency via check_dim.
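
The equivalence of the two T_arr expressions can be checked numerically. The sketch below is only an illustration: the n_eff profile, the path length, the trapezoidal integrator, and the numeric value used for c_ref are all assumptions, and c_ref is treated as constant along the path, which is what allows it to move outside the integral. It does not implement check_dim.

```python
import math

C_REF = 299_792_458.0  # m/s; assumed value for c_ref, for illustration only


def n_eff(ell):
    """Hypothetical dimensionless effective index along the path (not from the source)."""
    return 1.0 + 1e-3 * math.sin(ell / 1000.0)


def trapezoid(f, a, b, n):
    """Composite trapezoidal rule for the integral of f over [a, b] with n panels."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h


def t_arr_factored(length_m, n=10_000):
    # T_arr = (1 / c_ref) * integral(n_eff d ell)
    return trapezoid(n_eff, 0.0, length_m, n) / C_REF


def t_arr_inline(length_m, n=10_000):
    # T_arr = integral((n_eff / c_ref) d ell)
    return trapezoid(lambda ell: n_eff(ell) / C_REF, 0.0, length_m, n)


a = t_arr_factored(5_000.0)   # seconds: metres divided by metres per second
b = t_arr_inline(5_000.0)
print(math.isclose(a, b, rel_tol=1e-9))
```

The two results differ only by floating-point rounding, so the comparison uses a relative tolerance rather than exact equality.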

VIII. Quality Checks & Metrics


IX. Coupling with Export Manifest

export_manifest:
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
  artifacts:
    - {path: "preprocess/logs/step-*.jsonl", sha256: "..."}
    - {path: "preprocess/configs/lock.yaml", sha256: "..."}

(All cleaning/preprocessing artifacts must appear in the export manifest and be verifiable.)
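
One way to make the listed artifacts verifiable is to compute each digest when the manifest is written and recompute it at audit time. The sketch below uses only the Python standard library. The helper names (sha256_of, manifest_artifacts, verify) are hypothetical, and expanding the step-*.jsonl pattern into one entry per matched file is an interpretation, since the chapter does not say how a wildcard path maps to a single sha256.

```python
import glob
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 16):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_artifacts(root, patterns):
    """Expand each pattern under root and return {path, sha256} entries."""
    entries = []
    for pattern in patterns:
        for full in sorted(glob.glob(os.path.join(root, pattern))):
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append({"path": rel, "sha256": sha256_of(full)})
    return entries


def verify(root, entries):
    """Return the paths whose current digest no longer matches the manifest."""
    return [e["path"] for e in entries
            if sha256_of(os.path.join(root, e["path"])) != e["sha256"]]


# Demo on throwaway files; the file contents are placeholders, not real logs.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "preprocess", "logs"))
    os.makedirs(os.path.join(root, "preprocess", "configs"))
    samples = [("preprocess/logs/step-01.jsonl", '{"step": "rfi_clean"}\n'),
               ("preprocess/configs/lock.yaml", "parameter_lock: true\n")]
    for rel, text in samples:
        with open(os.path.join(root, rel), "w", encoding="utf-8") as fh:
            fh.write(text)
    entries = manifest_artifacts(root, ["preprocess/logs/step-*.jsonl",
                                        "preprocess/configs/lock.yaml"])
    print(verify(root, entries))  # empty list: nothing has changed since hashing
```

Reading in chunks keeps memory use flat for large logs, and storing paths relative to the export root keeps the manifest portable across machines.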


X. Example Fragment (drop-in)

preprocess:
  pipeline_id: "rf-frb-clean-v1"
  steps:
    - name: "rfi_clean"
      enabled: true
      params: {method: "spectral-kurtosis", window: 256, thr_sigma: 5}
      idempotent: true
      inputs: ["raw_spec"]
      outputs: ["mask_spec"]
    - name: "filter"
      enabled: true
      params: {type: "bandpass", f_lo_hz: 1.2e6, f_hi_hz: 3.8e6, order: 5, phase: "zero"}
      idempotent: true
      inputs: ["raw_ts"]
      outputs: ["flt_ts"]
    - name: "normalize"
      enabled: true
      params: {type: "zscore", stats_from: "train-only", clip_q: [0.01, 0.99]}
      idempotent: true
      inputs: ["flt_ts"]
      outputs: ["norm_ts"]
  parameter_lock: true
  randomness: {seed: 1701, libraries: {numpy: "1.26.4"}}
  environment: {os: "ubuntu22.04", containers: ["ghcr.io/eift/card-prep:1.0.2"]}
  audits: ["nan-check", "leakage", "class-imbalance"]

(Export configs and logs via export_manifest.artifacts[] and include clause-level anchors.)
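
The normalize step in the fragment above takes its statistics from the training split only, which is what keeps test-set information out of the transform. A minimal standard-library sketch of that posture follows. The function names, the linear-interpolation quantile, and the order of operations (clip to the train quantiles first, then z-score) are assumptions; the fragment only names the parameters type, stats_from, and clip_q.

```python
import statistics


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted, non-empty list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def fit_zscore(train, clip_q=(0.01, 0.99)):
    """Derive clip bounds, mean, and std from the training split only."""
    ordered = sorted(train)
    lo, hi = quantile(ordered, clip_q[0]), quantile(ordered, clip_q[1])
    clipped = [min(max(x, lo), hi) for x in train]
    return {"lo": lo, "hi": hi,
            "mean": statistics.fmean(clipped),
            "std": statistics.pstdev(clipped)}


def apply_zscore(values, stats):
    """Clip to the frozen bounds, then standardize with the frozen mean and std."""
    if stats["std"] == 0:
        raise ValueError("training split has zero variance after clipping")
    return [(min(max(x, stats["lo"]), stats["hi"]) - stats["mean"]) / stats["std"]
            for x in values]


# Synthetic values for illustration only.
train = [float(i) for i in range(101)]
test = [-50.0, 42.0, 1e6]

stats = fit_zscore(train)             # fitted once, on train only
norm_train = apply_zscore(train, stats)
norm_test = apply_zscore(test, stats)  # test never influences the statistics
print(norm_test)
```

Because the statistics are fitted once and then frozen, they can be written into the locked config, and re-running the step on the same input reproduces the same output.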


XI. Chapter Compliance Checklist


Copyright & License (CC BY 4.0)

Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/