Home / Docs-Technical WhitePaper / 53-Model Card Template v1.0
Chapter 5 — Training Data & Lineage
I. Purpose & Scope
- Specify sources, schema & splits, lineage, and license compliance for training data so that train/val/test are fully consistent with the rest of the model card in units/dimensions, coverage mode, versioning, and freshness.
- For path quantities (arrival time/phase), the text must explicitly show gamma(ell) and d ell; the data side records delta_form ∈ {general, factored}; parenthesized unified forms are required; publication requires p_dim = 1.0.
II. Inputs & Dependencies
- Dataset Card: align with Ch. 3/4/6/7/8/10/11/12 (provenance.yaml, schema.json/contract.yaml, split.yaml/split_manifest.json, validate_report.json, report_manifest.yaml).
- Parameter Card: align freshness.policy and cov_group; any calibration/constant used for training must be registered and traceable.
- Error Budget Card: unify coverage mode (k/alpha/quantile) and covariance config (Σ, kernels & params).
- Pipeline Card: inbound contracts and stage interfaces; training/validation/export paths & release layout.
- Citations & versions: “volume + version + anchor (P/S/M/I)”, anchor coverage ≥ 90%, no external links/aliases.
III. Sources & Licenses
- Types: instrument/system/simulator/external; in data_refs.yaml record source_id, producer, license, site, operator.
- License/limits: record license and redistribution terms; if usage/region/audience limits exist, state them in the model card front matter and API docs.
- Time lock & freshness: clock_state="locked", |ts_start − calib.timestamp| ≤ τ_calib; isolate expired samples or tag [Restricted].
IV. Schema & Splits Alignment
- Fields/units/dimensions: training-read fields must match Dataset Card Ch. 4; missingness as null/omitted with reasons in quality.flags.
- Leakage prevention: strictly follow split.yaml time/entity isolation and RNG seed; forbid cross-split entity/window sharing.
- Path consistency: len(gamma_ell)=len(d_ell)=len(n_eff)≥2; step Δell ≤ ( c_ref / f_s ) / max(n_eff); align phase in the reference window before metrics.
V. Sampling & Cleaning
- Sampling: random/stratified/hard-first recorded with RNG and quotas in data_refs.yaml (strata by batch/device/region/quality.flags).
- Cleaning rules: missing/outlier/denoise/normalize in preprocess_spec.yaml; any distribution-altering rule is logged in Bias (Ch. 8 Dataset Card) and evaluated by slices.
- Augmentation/synthesis: record provenance, params, and ratios; count separately in split_manifest.json.
VI. Lineage & Traceability
- Lineage DAG: raw → calibrated → derived → annotated → split → train_batch, nodes/edges labeled with version & checksum; cycles forbidden.
- Event audit: acquisition/clean/split/augment/sample events to audit.jsonl (time, operator, input hashes, change note, signature).
- Reproducibility: train_config.yaml pins data paths/versions/snapshots; provide a minimal replay script reproduce.sh in the appendix.
VII. Normative Path Forms
- Arrival (two equivalent):
T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
T_arr = ( ∫ ( n_eff / c_ref ) d ell ) - Phase:
Phi = ( 2π / λ_ref ) * ( ∫ n_eff d ell )
In text, explicitly show gamma(ell) and d ell; on data side record delta_form; training/eval path/phase conventions must match the Dataset Card.
VIII. Gate Mapping
- G1 Schema completeness: training-read fields match contracts.
- G2 Citation compliance: anchor coverage ≥ 90%.
- G3 Path conventions: gamma/measure/delta_form complete; step & alignment compliant.
- G4 Dimensional closure: end-to-end I70-dim_check passed, p_dim = 1.0.
- G5 Freshness: clock_state="locked", τ_calib valid.
- G6 Coverage consistency: training/evaluation use the same k/alpha/quantile.
- G7 Covariance consistency: Σ PD and aligned with the Error Budget.
- G8 Uniqueness & acyclicity: unique record_id/checksum, Lineage DAG acyclic.
- Any core-gate failure triggers S1–S5 (dimension/freshness/path/covariance/citation), blocking training & release; tag [Restricted] when necessary.
IX. Machine-Readable Artifacts
A. data_refs.yaml
version: "1.0.0"
datasets:
- id: "ds-core"
see:
- "Dataset Card v1.0:Ch.3"
- "Dataset Card v1.0:Ch.4"
- "Dataset Card v1.0:Ch.6"
manifest: "DS_EXPORT/manifests/report_manifest.yaml"
splits: "DS_EXPORT/splits/split_manifest.json"
license: "CC-BY-4.0"
checksum: "sha256:..."
sampling:
seed: 20250924
strategy: { stratified: ["device","region","quality.flags"] }
preprocess_spec: "configs/preprocess_spec.yaml"
B. preprocess_spec.yaml
version: "1.0.0"
missing: { numeric: "null", route_to: "quality.flags" }
normalize: { mean: "μ_train", std: "σ_train" }
path_align: { require: true, delta_form: "general", enforce_delta_ell: true }
filters:
- name: "window_guard"
rule: "drop if ts ∉ [ts_start, ts_end]"
audits: { write_to: "reports/audit.jsonl" }
C. lineage_graph.json (excerpt)
X. Anti-Patterns & Fixes
- Anti: T_arr = ∫ n_eff / c_ref d ell (missing parentheses) → Fix: parenthesize to the unified form.
- Anti: training uses only gamma(ell) without d ell/delta_form → Fix: complete and equalize with n_eff; reject if non-compliant.
- Anti: split leakage (entity across splits) → Fix: recut with group_by(entity) and update split_manifest.json.
- Anti: expired samples or unlocked clocks → Fix: filter per freshness.policy or isolate; tag [Restricted] if needed.
- Anti: inconsistent coverage between training and evaluation → Fix: unify a single coverage.mode and declare in manifests.
XI. Cross-References
- Dataset Card: Ch. 3 (Provenance), Ch. 4 (Schema), Ch. 6 (Splits/Versioning), Ch. 7 (QC Gates), Ch. 8 (UQ & Cov), Ch. 11 (Bench/Score).
- Error Budget Card: Ch. 5/6/8 (covariance & coverage), Ch. 9 (threshold mapping).
- Parameter Card: Ch. 4/6/8/9 (units/freshness/covariance groups/interfaces).
- Pipeline Card: Ch. 3/4/6/12 (graph/contract/stages/release).
XII. Checklist
- data_refs.yaml / preprocess_spec.yaml / lineage_graph.json ready and consistent with Dataset Card manifests.
- For path quantities, explicit gamma/measure/delta_form; len(path) ≥ 2, Δell compliant; phase aligned in reference window.
- I70-dim_check passed, p_dim = 1.0; coverage k/alpha/quantile unified with Error Budget.
- Leakage prevention and license/usage verification complete; audit.jsonl fully recorded.
- /validate passed G1–G8; non-compliances marked [Restricted]; citation anchor coverage ≥ 90%.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11|Current version:v5.1
License link:https://creativecommons.org/licenses/by/4.0/