EFT.WP.Data.Pipeline v1.0
Chapter 6 Data Validation & Quality Gates
I. Chapter Purpose & Scope
This chapter fixes the DQ gate and data validation specifications for pipelines: rule types, sampling & significance, blocking vs. warning levels, exception handling, auditing & exports. It ensures alignment with Σ_in/Σ_out contracts, splits/coverage, metrology, and citation anchors.
II. Terminology & Dependencies
- Terms: dq_rules, pass_rate, shadow, quarantine, significance.alpha, blocking/warning, leakage_guard, drift.
- Dependencies: contracts/exports (Core.DataSpec v1.0); units/dimensions (Core.Metrology v1.0); splits/quality (DatasetCards v1.0); feature & I/O assumptions (ModelCards v1.0).
- Math & symbols: wrap inline symbols (α, QPS, T_inf, ρ, u_c) in backticks; any division/integral/composite operator must use parentheses; if path quantities T_arr appear, register gamma(ell) and d ell; no Chinese in formulas/symbols/definitions.
III. Fields & Structure (Normative)
stage:
  name: "schema.check|dq.scan|leakage.audit"
  type: "validate.schema|validate.dq|validate.leakage"
  impl: "I16-2.schema_check|I16-7.dq_scan|I16-8.leakage_audit"
  inputs: ["<upstream_artifact>"]
  outputs: ["<clean_rows>|<dq_report>|<leakage_report>"]
  schema_ref: "contracts/<name>@vX.Y"
  dq:
    sample: {rows: 50000, strategy: "head|random|stratified"}
    significance: {alpha: 0.05}
    gates:
      - {id: "DQ_001", kind: "not_null", cols: ["id", "ts"], level: "block"}
      - {id: "DQ_002", kind: "unique", cols: [["id", "ts"]], level: "block"}
      - {id: "DQ_003", kind: "range", col: "value", rule: "[0,1e6]", unit: "<SI>", level: "block"}
      - {id: "DQ_004", kind: "enum", col: "status", values: ["ok", "warn", "err"], level: "block"}
      - {id: "DQ_005", kind: "distribution", col: "latency_ms", rule: "p99<=200", level: "warn"}
      - {id: "DQ_006", kind: "freshness", col: "updated_at", max_lag: "PT30M", level: "warn"}
      - {id: "DQ_007", kind: "drift", col: "feature_*", metric: "psi<=0.2", level: "warn"}
      - {id: "DQ_008", kind: "leakage", policy: ["per-object", "per-timewindow"], level: "block"}
  on_fail: "quarantine|skip|block"
  retries: {max: 2, backoff: "expo"}
  timeout_s: 1800
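The deterministic gates in the stage config (not_null, unique, range, enum) can be illustrated with a minimal sketch. The function names `parse_interval` and `check_gate` are hypothetical and not part of the specification, and the sketch assumes rows arrive as plain dictionaries and that null values are left to the not_null gate rather than counted as range violations.

```python
from typing import Any

Row = dict[str, Any]


def parse_interval(rule: str) -> tuple[float, float, bool, bool]:
    """Parse an interval such as '[0,1e6]' or '(0,5]' into (lo, hi, lo_closed, hi_closed)."""
    rule = rule.strip()
    lo_closed = rule[0] == "["
    hi_closed = rule[-1] == "]"
    lo_text, hi_text = rule[1:-1].split(",")
    return float(lo_text), float(hi_text), lo_closed, hi_closed


def check_gate(gate: dict, rows: list[Row]) -> int:
    """Return the number of violating rows for one gate (hypothetical evaluator)."""
    kind = gate["kind"]
    if kind == "not_null":
        # A row violates if any listed column is missing or None.
        return sum(any(r.get(c) is None for c in gate["cols"]) for r in rows)
    if kind == "unique":
        # cols is a list of composite keys, e.g. [["id", "ts"]].
        violations = 0
        for key_cols in gate["cols"]:
            seen = set()
            for r in rows:
                key = tuple(r.get(c) for c in key_cols)
                if key in seen:
                    violations += 1
                seen.add(key)
        return violations
    if kind == "range":
        lo, hi, lo_closed, hi_closed = parse_interval(gate["rule"])

        def out_of_range(v: Any) -> bool:
            if v is None:  # assumption: nulls are handled by not_null
                return False
            above = v >= lo if lo_closed else v > lo
            below = v <= hi if hi_closed else v < hi
            return not (above and below)

        return sum(out_of_range(r.get(gate["col"])) for r in rows)
    if kind == "enum":
        allowed = set(gate["values"])
        return sum(r.get(gate["col"]) not in allowed for r in rows)
    raise ValueError(f"unsupported gate kind: {kind}")
```

A gate whose violation count is non-zero would then be routed according to its level and the stage's on_fail setting.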
IV. Rule Types & Decision Posture
- Integrity: not_null, unique, and primary-key consistency.
- Value & units: range (explicit interval closure), unit (SI check aligned with constraints.units); normalize units before composing derived metrics.
- Enums & semantics: stable enumerations with admission policy for unseen values (unknown|reject|map-to-other).
- Freshness & coverage: freshness.max_lag, sampling coverage and minimum sample counts.
- Distributional consistency: distribution (quantiles/p99/KS/AD); pair with significance level α and report p-values & intervals.
- Data drift: drift.psi/kl/ks; default thresholds are psi<=0.2 (a violation warns) and psi<=0.3 (a violation blocks); both can be overridden.
- Leakage audit: leakage.policy (per-object|per-timewindow|per-scene); cross-splits overlap is blocking.
- Contract consistency: schema_ref fields/types/units/key constraints aligned with Σ_in/Σ_out.
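As an illustration of the drift rule, the sketch below computes a population stability index over shared bin edges and maps it to the default thresholds. The helper names `psi` and `drift_level`, the binning scheme, and the `eps` floor for empty bins are assumptions made for this example; the specification does not prescribe how PSI is binned.

```python
import math


def psi(expected: list[float], actual: list[float],
        edges: list[float], eps: float = 1e-6) -> float:
    """Population stability index of `actual` against `expected` over shared bin edges.

    Bins are half-open [lo, hi) except the last, which is closed.
    Values outside the edges are ignored (an assumption of this sketch).
    """
    n_bins = len(edges) - 1

    def fractions(values: list[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            for i in range(n_bins):
                is_last = i == n_bins - 1
                if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        if total == 0:
            raise ValueError("no values fall inside the bin edges")
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / total, eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_level(value: float, warn_at: float = 0.2, block_at: float = 0.3) -> str:
    """Map a PSI value to pass/warn/block using the default thresholds."""
    if value > block_at:
        return "block"
    if value > warn_at:
        return "warn"
    return "pass"
```

With two equal-width bins, a shift from a 50/50 split to a 75/25 split gives a PSI of 0.25·ln 3, which falls between the two default thresholds and would therefore warn.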
V. Sampling, Significance & Severity
- Sampling: sample.rows and strategy:"head|random|stratified"; for stratified sampling, declare strata keys & quotas.
- Significance: statistical tests at default α=0.05; report p-values and effect sizes; blocking requires dual conditions (threshold violation and p<α).
- Severity levels: level:"block|warn"; block triggers on_fail and quarantine exports; warn logs and alerts only.
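The dual-condition posture for blocking can be sketched as a small decision function. The name `gate_decision` is hypothetical, and two behaviours are assumptions of this sketch rather than statements from the specification: a block-level gate whose threshold is violated without significance is downgraded to a warning, and deterministic rules that carry no p-value block on the threshold violation alone.

```python
from typing import Optional


def gate_decision(level: str, threshold_violated: bool,
                  p_value: Optional[float], alpha: float = 0.05) -> str:
    """Return 'pass', 'warn', or 'block' for a single gate result."""
    if level not in ("block", "warn"):
        raise ValueError(f"unknown level: {level}")
    if not threshold_violated:
        return "pass"
    if level == "warn":
        return "warn"
    # level == "block": require both the violation and significance.
    if p_value is None:
        # Assumption: deterministic rules (e.g. not_null) have no test statistic.
        return "block"
    return "block" if p_value < alpha else "warn"
```

For example, a block-level distribution gate that exceeds its threshold with p = 0.20 would only warn under these assumptions, while the same violation with p = 0.01 would block.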
VI. Exception Handling & Audit Exports
- Handling: on_fail:"quarantine|skip|block"; quarantine artifacts record paths, hashes, and mismatch reasons.
- Audit: produce dq/report.jsonl (per-rule records), dq/summary.csv (rollup), dq/leakage_report.csv; register sha256 in export_manifest.artifacts[].
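Registering audit artifacts with their sha256 digests can be sketched as follows. The helpers `sha256_of` and `build_manifest` are hypothetical names, and the sketch assumes the reports already exist on disk under a common root directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, rel_paths: list[str]) -> dict:
    """Build an export_manifest dict with one artifacts[] entry per report."""
    return {
        "export_manifest": {
            "version": "v1.0",
            "artifacts": [
                {"path": rel, "sha256": sha256_of(root / rel)}
                for rel in rel_paths
            ],
        }
    }
```

Reading in chunks keeps memory use flat for large reports such as dq/report.jsonl.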
VII. Metrology & Units (SI)
- Perf/time metrics: QPS (1/s), T_inf (ms with {p50,p95,p99}), ρ (unitless); bandwidth net_mbps, volume size_bytes.
- metrology:{units:"SI", check_dim:true} is mandatory; range/unit/distribution rules must pass SI checks.
- For path quantities (e.g., T_arr), register in the rule or stage config: delta_form, path="gamma(ell)", measure="d ell", and validate via one of:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ).
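The two forms agree whenever c_ref is constant along the path, which can be checked numerically. The sketch below assumes n_eff is sampled at discrete positions ell along gamma(ell) and uses the trapezoid rule; the function names and the sample values are illustrative only.

```python
import math


def trapezoid(ys: list[float], xs: list[float]) -> float:
    """Trapezoid-rule approximation of the integral of y over x."""
    return sum(
        0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    )


def t_arr_factored(n_eff: list[float], ell: list[float], c_ref: float) -> float:
    """T_arr = ( 1 / c_ref ) * ( integral of n_eff d ell )."""
    return (1.0 / c_ref) * trapezoid(n_eff, ell)


def t_arr_integrand(n_eff: list[float], ell: list[float], c_ref: float) -> float:
    """T_arr = ( integral of ( n_eff / c_ref ) d ell )."""
    return trapezoid([n / c_ref for n in n_eff], ell)


def forms_agree(n_eff: list[float], ell: list[float], c_ref: float,
                rel_tol: float = 1e-9) -> bool:
    """Validate that both registered forms give the same T_arr within tolerance."""
    return math.isclose(
        t_arr_factored(n_eff, ell, c_ref),
        t_arr_integrand(n_eff, ell, c_ref),
        rel_tol=rel_tol,
    )
```

A validation step could call `forms_agree` on the sampled path and treat a mismatch as a sign that c_ref was not held constant or that units were mixed.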
VIII. Machine-Readable Fragment (Drop-in)
layers:
  - name: "validate"
    stages:
      - name: "dq.scan"
        type: "validate.dq"
        impl: "I16-7.dq_scan"
        inputs: ["clean_rows"]
        outputs: ["dq_report"]
        schema_ref: "contracts/clean_rows@v1.3"
        dq:
          sample: {rows: 100000, strategy: "stratified"}
          significance: {alpha: 0.05}
          gates:
            - {id: "DQ_001", kind: "not_null", cols: ["id", "ts"], level: "block"}
            - {id: "DQ_003", kind: "range", col: "power_w", rule: "[0,2e3]", unit: "W", level: "block"}
            - {id: "DQ_005", kind: "distribution", col: "latency_ms", rule: "p99<=150", level: "warn"}
            - {id: "DQ_007", kind: "drift", col: "feature_*", metric: "psi<=0.2", level: "warn"}
            - {id: "DQ_008", kind: "leakage", policy: ["per-object", "per-timewindow"], level: "block"}
        on_fail: "quarantine"
        retries: {max: 2, backoff: "expo"}
        timeout_s: 1800
IX. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: DQ.SCHEMA_REF_REQUIRED
    when: "$.layers[*].stages[?(@.type=='validate.dq')]"
    assert: "has_key('schema_ref')"
    level: error
  - id: DQ.SAMPLE_DEFINED
    when: "$.layers[*].stages[?(@.type=='validate.dq')].dq.sample"
    assert: "value.rows > 0 and value.strategy in ['head','random','stratified']"
    level: error
  - id: DQ.LEVEL_ALLOWED
    when: "$.layers[*].stages[*].dq.gates[*].level"
    assert: "value in ['block','warn']"
    level: error
  - id: DQ.RANGE_UNIT_SI
    when: "$.layers[*].stages[*].dq.gates[?(@.kind=='range')]"
    assert: "is_SI_unit($.unit)"
    level: error
  - id: DQ.DRIFT_THRESHOLDS
    when: "$.layers[*].stages[*].dq.gates[?(@.kind=='drift')]"
    assert: "psi_threshold_ok($.metric)"
    level: warn
  - id: DQ.LEAKAGE_POLICY
    when: "$.layers[*].stages[*].dq.gates[?(@.kind=='leakage')]"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
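A subset of these lint rules can be emulated with plain dictionary traversal, as in the sketch below. It assumes the pipeline config has already been parsed into a Python dict, it does not use a JSONPath engine, and it covers only four rules; the function name `lint` is hypothetical. Unlike the `when` selectors above, which only match keys that exist, this sketch also flags missing sample and level keys, which is a stricter reading.

```python
ALLOWED_STRATEGIES = {"head", "random", "stratified"}
ALLOWED_LEVELS = {"block", "warn"}
ALLOWED_POLICIES = {"per-object", "per-timewindow", "per-scene"}


def lint(config: dict) -> list[tuple[str, str]]:
    """Return (rule_id, location) pairs for every violated lint rule."""
    findings: list[tuple[str, str]] = []
    for layer in config.get("layers", []):
        for stage in layer.get("stages", []):
            name = stage.get("name", "<unnamed>")
            dq = stage.get("dq", {})
            if stage.get("type") == "validate.dq":
                if "schema_ref" not in stage:
                    findings.append(("DQ.SCHEMA_REF_REQUIRED", name))
                sample = dq.get("sample", {})
                rows = sample.get("rows", 0)
                rows_ok = isinstance(rows, int) and rows > 0
                if not (rows_ok and sample.get("strategy") in ALLOWED_STRATEGIES):
                    findings.append(("DQ.SAMPLE_DEFINED", name))
            for gate in dq.get("gates", []):
                where = f"{name}/{gate.get('id', '<no-id>')}"
                if gate.get("level") not in ALLOWED_LEVELS:
                    findings.append(("DQ.LEVEL_ALLOWED", where))
                if gate.get("kind") == "leakage":
                    # At least one recognised policy must be declared.
                    if not ALLOWED_POLICIES.intersection(gate.get("policy", [])):
                        findings.append(("DQ.LEAKAGE_POLICY", where))
    return findings
```

An empty result means the four emulated rules pass; DQ.RANGE_UNIT_SI and DQ.DRIFT_THRESHOLDS are left out because their helper predicates are not defined in this chapter.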
X. Export Manifest & Reports
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "dq/report.jsonl", sha256: "..."}
    - {path: "dq/summary.csv", sha256: "..."}
    - {path: "dq/leakage_report.csv", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.12"
XI. Chapter Compliance Checklist
- dq.sample/significance set; rules cover integrity/value/enum/freshness/distribution/drift/leakage.
- Severity & handling clear: block quarantines and stops; warn logs & alerts; audit artifacts with sha256 registered.
- schema_ref aligns with contracts; units in SI and check_dim=true; consistent units for range/distribution/perf metrics.
- Leakage guardrails effective; cross-splits overlap is blocking; path quantities (if any) registered & validated.
- export_manifest lists reports & citation anchors and meets release gates.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/