53-Model Card Template v1.0 | Chapter 5 — Training Data & Lineage

Chapter 5 — Training Data & Lineage

I. Purpose & Scope

Specify sources, schema & splits, lineage, and license compliance for training data so that train/val/test are fully consistent with the rest of the model card in units/dimensions, coverage mode, versioning, and freshness.
For path quantities (arrival time, phase), the text must show gamma(ell) and d ell explicitly, and the data side must record delta_form ∈ {general, factored}. Formulas must use the parenthesized unified forms, and publication requires p_dim = 1.0.

II. Inputs & Dependencies

Dataset Card: align with Ch. 3/4/6/7/8/10/11/12 (provenance.yaml, schema.json/contract.yaml, split.yaml/split_manifest.json, validate_report.json, report_manifest.yaml).
Parameter Card: align freshness.policy and cov_group; any calibration/constant used for training must be registered and traceable.
Error Budget Card: unify coverage mode (k/alpha/quantile) and covariance config (Σ, kernels & params).
Pipeline Card: inbound contracts and stage interfaces; training/validation/export paths & release layout.
Citations & versions: “volume + version + anchor (P/S/M/I)”, anchor coverage ≥ 90%, no external links/aliases.

III. Sources & Licenses

Types: instrument/system/simulator/external; in data_refs.yaml record source_id, producer, license, site, operator.
License/limits: record license and redistribution terms; if usage/region/audience limits exist, state them in the model card front matter and API docs.
Time lock & freshness: clock_state="locked", |ts_start − calib.timestamp| ≤ τ_calib; isolate expired samples or tag [Restricted].
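
The freshness rule above can be sketched as a small filter. This is a minimal illustration, not the normative check: the sample layout (a dict with ts_start, calib_timestamp, clock_state), the helper names, and the "free" clock state are assumptions made for the example.

```python
from datetime import datetime, timedelta

def is_fresh(ts_start, calib_timestamp, tau_calib, clock_state):
    """True when the clock is locked and |ts_start - calib_timestamp| <= tau_calib."""
    if clock_state != "locked":
        return False
    return abs(ts_start - calib_timestamp) <= tau_calib

def partition_by_freshness(samples, tau_calib):
    """Split samples into (fresh, isolated); isolated samples are tagged [Restricted]."""
    fresh, isolated = [], []
    for sample in samples:
        ok = is_fresh(sample["ts_start"], sample["calib_timestamp"],
                      tau_calib, sample["clock_state"])
        if ok:
            fresh.append(sample)
        else:
            isolated.append({**sample, "tag": "[Restricted]"})
    return fresh, isolated

# Hypothetical samples: one fresh, one expired, one with an unlocked clock.
calib = datetime(2025, 9, 24, 12, 0, 0)
samples = [
    {"id": "a", "ts_start": calib + timedelta(minutes=30),
     "calib_timestamp": calib, "clock_state": "locked"},
    {"id": "b", "ts_start": calib + timedelta(hours=5),
     "calib_timestamp": calib, "clock_state": "locked"},
    {"id": "c", "ts_start": calib,
     "calib_timestamp": calib, "clock_state": "free"},
]
fresh, isolated = partition_by_freshness(samples, tau_calib=timedelta(hours=1))
print([s["id"] for s in fresh], [s["id"] for s in isolated])
```

Keeping the isolated samples, rather than dropping them, leaves them available for audit.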

IV. Schema & Splits Alignment

Fields/units/dimensions: training-read fields must match Dataset Card Ch. 4; missingness as null/omitted with reasons in quality.flags.
Leakage prevention: strictly follow split.yaml time/entity isolation and RNG seed; forbid cross-split entity/window sharing.
Path consistency: len(gamma_ell)=len(d_ell)=len(n_eff)≥2; step Δell ≤ ( c_ref / f_s ) / max(n_eff); align phase in the reference window before metrics.
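
The path-consistency rule can be checked mechanically. The sketch below assumes that d_ell holds one step length Δell per sample and that all lengths share the units of c_ref / f_s; the function name and the violation messages are illustrative only.

```python
def check_path(gamma_ell, d_ell, n_eff, c_ref, f_s):
    """Return a list of violations; an empty list means the path passes."""
    # Rule 1: the three arrays must have equal length.
    if not (len(gamma_ell) == len(d_ell) == len(n_eff)):
        return ["length mismatch between gamma_ell, d_ell and n_eff"]
    # Rule 2: a path needs at least two samples.
    if len(n_eff) < 2:
        return ["path needs at least 2 samples"]
    # Rule 3: every step must satisfy d_ell <= (c_ref / f_s) / max(n_eff).
    max_step = (c_ref / f_s) / max(n_eff)
    return [
        f"step {i}: d_ell={step} exceeds {max_step:.6g}"
        for i, step in enumerate(d_ell)
        if step > max_step
    ]

# Hypothetical values: c_ref in m/s, f_s in Hz, lengths in metres.
c_ref, f_s = 299_792_458.0, 1.0e9
n_eff = [1.0, 1.5, 1.2]
ok_path = check_path([0.0, 0.1, 0.2], [0.1, 0.1, 0.1], n_eff, c_ref, f_s)
bad_path = check_path([0.0, 0.1, 0.35], [0.1, 0.25, 0.1], n_eff, c_ref, f_s)
print(ok_path, bad_path)
```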

V. Sampling & Cleaning

Sampling: random/stratified/hard-first recorded with RNG and quotas in data_refs.yaml (strata by batch/device/region/quality.flags).
Cleaning rules: missing/outlier/denoise/normalize in preprocess_spec.yaml; any distribution-altering rule is logged in Bias (Ch. 8 Dataset Card) and evaluated by slices.
Augmentation/synthesis: record provenance, params, and ratios; count separately in split_manifest.json.
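
A seeded, quota-bound stratified draw of the kind described under Sampling might look like the following. This is a sketch under stated assumptions: records are flat dicts, the quota is a fixed count per stratum, and the helper name is hypothetical. The seed is the one recorded in data_refs.yaml.

```python
import random
from collections import defaultdict

def stratified_sample(records, keys, quota, seed):
    """Draw up to `quota` records per stratum, reproducibly for a given seed."""
    rng = random.Random(seed)          # private RNG, so global state is untouched
    strata = defaultdict(list)
    for record in records:
        strata[tuple(record[k] for k in keys)].append(record)
    picked = []
    for stratum in sorted(strata):     # fixed iteration order keeps the draw reproducible
        group = strata[stratum]
        picked.extend(rng.sample(group, min(quota, len(group))))
    return picked

# Hypothetical records: 2 devices x 2 regions x 3 samples each.
records = [
    {"id": f"{d}-{r}-{i}", "device": d, "region": r}
    for d in ("A", "B") for r in ("eu", "us") for i in range(3)
]
first = stratified_sample(records, ["device", "region"], quota=2, seed=20250924)
second = stratified_sample(records, ["device", "region"], quota=2, seed=20250924)
print(len(first), first == second)
```

Sorting the strata before drawing matters: without a fixed order, the same seed could yield different samples when the input order changes.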

VI. Lineage & Traceability

Lineage DAG: raw → calibrated → derived → annotated → split → train_batch, nodes/edges labeled with version & checksum; cycles forbidden.
Event audit: acquisition/clean/split/augment/sample events to audit.jsonl (time, operator, input hashes, change note, signature).
Reproducibility: train_config.yaml pins data paths/versions/snapshots; provide a minimal replay script reproduce.sh in the appendix.
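
One line of audit.jsonl, carrying the fields listed under Event audit, could be produced as sketched below. The text does not specify a signature scheme; HMAC-SHA256 over the canonical JSON payload is an assumption made for illustration, as are the key handling and the exact field names.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

def make_audit_event(event, operator, inputs, note, key):
    """Build one JSON line with time, operator, input hashes, change note, signature."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "operator": operator,
        "input_hashes": ["sha256:" + hashlib.sha256(blob).hexdigest() for blob in inputs],
        "change_note": note,
    }
    # Sign the canonical (sorted-key) serialization; HMAC is an assumed scheme.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return json.dumps(record, sort_keys=True)

KEY = b"demo-key"  # placeholder; a real key would come from a secret store
line = make_audit_event("split", "operator-1", [b"raw bytes of input file"],
                        "recut by entity", KEY)
print(line)
# Appending to the log would then be:
# with open("reports/audit.jsonl", "a", encoding="utf-8") as fh:
#     fh.write(line + "\n")
```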

VII. Normative Path Forms

Arrival (two equivalent):
T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
T_arr = ( ∫ ( n_eff / c_ref ) d ell )
Phase:
Phi = ( 2π / λ_ref ) * ( ∫ n_eff d ell )

In text, explicitly show gamma(ell) and d ell; on data side record delta_form; training/eval path/phase conventions must match the Dataset Card.
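
On sampled data the integrals above reduce to sums over n_eff and d ell. The sketch below uses a plain Riemann sum and checks that the two arrival forms agree. Which of the two forms corresponds to delta_form "general" and which to "factored" is an assumption here (the form with 1 / c_ref pulled out is treated as "factored").

```python
import math

def optical_path(n_eff, d_ell):
    """Riemann-sum approximation of the integral of n_eff over d ell."""
    return sum(n * d for n, d in zip(n_eff, d_ell))

def t_arr_factored(n_eff, d_ell, c_ref):
    """T_arr = ( 1 / c_ref ) * ( integral of n_eff d ell ); assumed 'factored' form."""
    return (1.0 / c_ref) * optical_path(n_eff, d_ell)

def t_arr_general(n_eff, d_ell, c_ref):
    """T_arr = ( integral of ( n_eff / c_ref ) d ell ); assumed 'general' form."""
    return sum((n / c_ref) * d for n, d in zip(n_eff, d_ell))

def phase(n_eff, d_ell, lambda_ref):
    """Phi = ( 2*pi / lambda_ref ) * ( integral of n_eff d ell )."""
    return (2.0 * math.pi / lambda_ref) * optical_path(n_eff, d_ell)

# Hypothetical path: three steps of 0.1 m, lambda_ref in metres.
n_eff = [1.0, 1.5, 1.2]
d_ell = [0.1, 0.1, 0.1]
c_ref = 299_792_458.0
t_f = t_arr_factored(n_eff, d_ell, c_ref)
t_g = t_arr_general(n_eff, d_ell, c_ref)
phi = phase(n_eff, d_ell, lambda_ref=1.55e-6)
print(math.isclose(t_f, t_g, rel_tol=1e-12))
```

The two arrival forms are equal only because c_ref is constant along the path, which is what allows it to be moved outside the integral.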

VIII. Gate Mapping

G1 Schema completeness: training-read fields match contracts.
G2 Citation compliance: anchor coverage ≥ 90%.
G3 Path conventions: gamma/measure/delta_form complete; step & alignment compliant.
G4 Dimensional closure: end-to-end I70-dim_check passed, p_dim = 1.0.
G5 Freshness: clock_state="locked", τ_calib valid.
G6 Coverage consistency: training/evaluation use the same k/alpha/quantile.
G7 Covariance consistency: Σ PD and aligned with the Error Budget.
G8 Uniqueness & acyclicity: unique record_id/checksum, Lineage DAG acyclic.
Any core-gate failure triggers S1–S5 (dimension/freshness/path/covariance/citation), blocking training & release; tag [Restricted] when necessary.
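
A gate evaluation along these lines could be sketched as follows. The text does not spell out which gate triggers which stop; the mapping below (G4→S1, G5→S2, G3→S3, G7→S4, G2→S5) is an assumption read off the order of the parenthetical list, and treating exactly those five gates as "core" is likewise assumed.

```python
# Assumed mapping of core gates to stops, following the order
# dimension / freshness / path / covariance / citation.
CORE_GATE_TO_STOP = {"G4": "S1", "G5": "S2", "G3": "S3", "G7": "S4", "G2": "S5"}

def anchor_coverage_ok(anchored, total, threshold=0.90):
    """G2: the share of citations carrying an anchor must reach the threshold."""
    return total > 0 and anchored / total >= threshold

def evaluate_gates(results):
    """results maps gate id -> bool; returns failed gates, triggered stops, block flag."""
    failed = sorted(gate for gate, ok in results.items() if not ok)
    stops = sorted({CORE_GATE_TO_STOP[g] for g in failed if g in CORE_GATE_TO_STOP})
    return {"failed": failed, "stops": stops, "blocked": bool(stops)}

# Hypothetical run: everything passes except citation coverage (17 of 20 anchored).
results = {f"G{i}": True for i in range(1, 9)}
results["G2"] = anchor_coverage_ok(anchored=17, total=20)
outcome = evaluate_gates(results)
print(outcome)
```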

IX. Machine-Readable Artifacts
A. data_refs.yaml

version: "1.0.0"
datasets:
  - id: "ds-core"
    see:
      - "Dataset Card v1.0:Ch.3"
      - "Dataset Card v1.0:Ch.4"
      - "Dataset Card v1.0:Ch.6"
    manifest: "DS_EXPORT/manifests/report_manifest.yaml"
    splits: "DS_EXPORT/splits/split_manifest.json"
    license: "CC-BY-4.0"
    checksum: "sha256:..."
sampling:
  seed: 20250924
  strategy: { stratified: ["device","region","quality.flags"] }
preprocess_spec: "configs/preprocess_spec.yaml"

B. preprocess_spec.yaml

version: "1.0.0"
missing: { numeric: "null", route_to: "quality.flags" }
normalize: { mean: "μ_train", std: "σ_train" }
path_align: { require: true, delta_form: "general", enforce_delta_ell: true }
filters:
  - name: "window_guard"
    rule: "drop if ts ∉ [ts_start, ts_end]"
audits: { write_to: "reports/audit.jsonl" }

C. lineage_graph.json (excerpt)

{
  "nodes": [
    { "id": "RAW-telemetry", "version": "1.0.0", "checksum": "sha256:..." },
    { "id": "CAL-telemetry", "version": "1.0.1", "checksum": "sha256:..." },
    { "id": "DER-train", "version": "1.0.0", "checksum": "sha256:..." }
  ],
  "edges": [
    { "from": "RAW-telemetry", "to": "CAL-telemetry", "type": "calibrate" },
    { "from": "CAL-telemetry", "to": "DER-train", "type": "derive" }
  ]
}
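
The uniqueness and acyclicity requirements on a lineage graph of this shape can be verified with the standard library. The sketch below inlines the excerpt's nodes and edges (checksums omitted) and uses graphlib.TopologicalSorter, which raises CycleError on a cycle. The function name and error messages are illustrative.

```python
from graphlib import CycleError, TopologicalSorter

def check_lineage(graph):
    """Return node ids in dependency order; raise on duplicates, dangling edges, or cycles."""
    ids = [node["id"] for node in graph["nodes"]]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate node id in lineage graph")
    known = set(ids)
    sorter = TopologicalSorter({node_id: set() for node_id in ids})
    for edge in graph["edges"]:
        if edge["from"] not in known or edge["to"] not in known:
            raise ValueError(f"edge references unknown node: {edge}")
        sorter.add(edge["to"], edge["from"])   # "from" must be processed before "to"
    return list(sorter.static_order())         # raises CycleError if a cycle exists

graph = {
    "nodes": [{"id": "RAW-telemetry"}, {"id": "CAL-telemetry"}, {"id": "DER-train"}],
    "edges": [
        {"from": "RAW-telemetry", "to": "CAL-telemetry", "type": "calibrate"},
        {"from": "CAL-telemetry", "to": "DER-train", "type": "derive"},
    ],
}
order = check_lineage(graph)
print(order)

# Adding a back edge makes the graph cyclic, which the check rejects.
cyclic = {"nodes": graph["nodes"],
          "edges": graph["edges"] + [{"from": "DER-train", "to": "RAW-telemetry", "type": "bad"}]}
try:
    check_lineage(cyclic)
    cycle_detected = False
except CycleError:
    cycle_detected = True
```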

X. Anti-Patterns & Fixes

Anti: T_arr = ∫ n_eff / c_ref d ell (missing parentheses) → Fix: parenthesize to the unified form, T_arr = ( ∫ ( n_eff / c_ref ) d ell ).
Anti: training uses only gamma(ell) without d ell/delta_form → Fix: supply d ell and delta_form so that len(gamma_ell) = len(d_ell) = len(n_eff); reject if non-compliant.
Anti: split leakage (entity across splits) → Fix: recut with group_by(entity) and update split_manifest.json.
Anti: expired samples or unlocked clocks → Fix: filter per freshness.policy or isolate; tag [Restricted] if needed.
Anti: inconsistent coverage between training and evaluation → Fix: unify a single coverage.mode and declare in manifests.
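
The split-leakage fix in the list above (recut with group_by(entity)) can be sketched as a detection step plus an entity-level reassignment. The helper names, the split fractions, and the seed are hypothetical; the actual isolation rules come from split.yaml.

```python
import random
from collections import defaultdict

def find_leaks(assignments):
    """Map each entity that appears in more than one split to its list of splits."""
    seen = defaultdict(set)
    for entity, split in assignments:
        seen[entity].add(split)
    return {e: sorted(s) for e, s in seen.items() if len(s) > 1}

def recut_by_entity(entities, seed, val_frac=0.2, test_frac=0.2):
    """Assign each entity, as a whole, to exactly one split using a seeded shuffle."""
    ordered = sorted(set(entities))
    random.Random(seed).shuffle(ordered)
    n_val = int(len(ordered) * val_frac)
    n_test = int(len(ordered) * test_frac)
    plan = {}
    for i, entity in enumerate(ordered):
        if i < n_val:
            plan[entity] = "val"
        elif i < n_val + n_test:
            plan[entity] = "test"
        else:
            plan[entity] = "train"
    return plan

# Hypothetical rows: dev-1 leaks across train and test.
rows = [("dev-1", "train"), ("dev-2", "val"), ("dev-1", "test"), ("dev-3", "train")]
leaks = find_leaks(rows)
plan = recut_by_entity([entity for entity, _ in rows], seed=20250924)
recut_rows = [(entity, plan[entity]) for entity, _ in rows]
print(leaks, find_leaks(recut_rows))
```

After a recut, split_manifest.json would need to be regenerated from the new assignment.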

XI. Cross-References

Dataset Card: Ch. 3 (Provenance), Ch. 4 (Schema), Ch. 6 (Splits/Versioning), Ch. 7 (QC Gates), Ch. 8 (UQ & Cov), Ch. 11 (Bench/Score).
Error Budget Card: Ch. 5/6/8 (covariance & coverage), Ch. 9 (threshold mapping).
Parameter Card: Ch. 4/6/8/9 (units/freshness/covariance groups/interfaces).
Pipeline Card: Ch. 3/4/6/12 (graph/contract/stages/release).

XII. Checklist

data_refs.yaml / preprocess_spec.yaml / lineage_graph.json ready and consistent with Dataset Card manifests.
For path quantities, explicit gamma/measure/delta_form; len(path) ≥ 2, Δell compliant; phase aligned in reference window.
I70-dim_check passed, p_dim = 1.0; coverage k/alpha/quantile unified with Error Budget.
Leakage prevention and license/usage verification complete; audit.jsonl fully recorded.
/validate passed G1–G8; non-compliances marked [Restricted]; citation anchor coverage ≥ 90%.

Copyright & License: Unless otherwise stated, the copyright of “Energy Filament Theory” (including text, charts, illustrations, symbols, and formulas) is held by the author (屠广林).
License (CC BY 4.0): With attribution to the author and source, you may copy, repost, excerpt, adapt, and redistribute.
Attribution (recommended): Author: 屠广林｜Work: “Energy Filament Theory”｜Source: energyfilament.org｜License: CC BY 4.0
Call for verification: Independent and self-funded—no employer and no sponsorship. Next, we will prioritize venues that welcome public discussion, public reproduction, and public critique, with no country limits. Media and peers worldwide are invited to organize verification during this window and contact us.
Version info: First published: 2025-11-11 ｜ Current version: v6.0+5.05