Dataset Card Template v1.0

Chapter 6 — Splits, Versioning & Freshness


I. Purpose & Scope


II. Prerequisites & Inputs


III. Splits Strategy

  1. Split set: train / val / test / holdout / slice_k; each split uniquely named in split.yaml with intent documented.
  2. Leakage prevention:
    • Temporal leakage: cut by time windows so that timestamps satisfy TS(train) < TS(val) < TS(test); forbid cross-window mixing.
    • Entity leakage: split by group_by(entity) so entities do not cross splits.
    • Path consistency: within any split, arrays gamma_ell/d_ell/n_eff have consistent length and sampling step.
  3. Stratification & sampling: stratify by batch/device/region/quality.flags; maintain class/hard-case ratios when needed.
  4. Slices: define slice_k for key subgroups (extreme conditions/low SNR/specific regions) with explicit selection rules.
  5. Reproducibility: record random seed, algorithm, and params in split.yaml; produce split_manifest.json with counts & checksums.
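
The entity and temporal rules in this list can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the template's reference implementation: the record fields `entity_id` and `ts`, and the helper names `split_by_entity`, `entity_leaks`, and `is_time_ordered`, are hypothetical. The seeded shuffle stands in for whatever algorithm is recorded in split.yaml.

```python
import random
from collections import defaultdict


def split_by_entity(records, fractions, seed):
    """Assign whole entities to splits so that no entity crosses a boundary."""
    entities = sorted({r["entity_id"] for r in records})  # sorted for determinism
    random.Random(seed).shuffle(entities)                  # seeded, reproducible order
    names = list(fractions)
    assignment, start, cumulative = {}, 0, 0.0
    for i, name in enumerate(names):
        cumulative += fractions[name]
        # The last split takes the remainder so every entity is assigned.
        end = len(entities) if i == len(names) - 1 else round(cumulative * len(entities))
        for entity in entities[start:end]:
            assignment[entity] = name
        start = end
    splits = defaultdict(list)
    for r in records:
        splits[assignment[r["entity_id"]]].append(r)
    return dict(splits)


def entity_leaks(splits):
    """Return the entity ids that appear in more than one split."""
    first_seen, leaks = {}, set()
    for name, rows in splits.items():
        for r in rows:
            if first_seen.setdefault(r["entity_id"], name) != name:
                leaks.add(r["entity_id"])
    return leaks


def is_time_ordered(splits, order=("train", "val", "test")):
    """True when every timestamp in an earlier split precedes all later splits."""
    latest = None
    for name in order:
        stamps = [r["ts"] for r in splits.get(name, [])]
        if not stamps:
            continue
        if latest is not None and min(stamps) <= latest:
            return False
        latest = max(stamps)
    return True


# Hypothetical toy data: 20 entities, 10 records each.
records = [{"entity_id": f"e{i % 20}", "ts": i} for i in range(200)]
splits = split_by_entity(records, {"train": 0.70, "val": 0.15, "test": 0.15}, seed=20250924)
```

An entity-level shuffle alone does not guarantee temporal ordering, so `is_time_ordered` is kept as a separate check; a pipeline that enforces both rules would cut by time window first and then drop or reassign entities that straddle a boundary.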

IV. Versioning (SemVer)


V. Freshness / Validity


VI. Gate Mapping


VII. Machine-Readable Configs
A. split.yaml

version: "1.0.0"
seed: 20250924
strategy:
  group_by: ["entity_id"]
  time_ordered: true
splits:
  train: 0.70
  val: 0.15
  test: 0.15
constraints:
  leakage:
    time: { enforce: true }
    entity: { enforce: true }
  path:
    require_alignment: true
    delta_form: "general"
coverage:
  mode: "k" # k|alpha|quantile
  k: 2
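
A config like this is easy to get subtly wrong (fractions that do not sum to 1, an unknown coverage mode), so a small pre-flight check can help. The sketch below assumes the YAML has already been parsed into a Python dict by any YAML loader and that `splits` and `coverage` are top-level keys. Both the nesting and the function name `validate_split_config` are assumptions made for illustration.

```python
import math

ALLOWED_COVERAGE_MODES = {"k", "alpha", "quantile"}  # taken from the inline comment


def validate_split_config(cfg):
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    if not isinstance(cfg.get("seed"), int):
        problems.append("seed must be an integer")

    fractions = cfg.get("splits", {})
    if not fractions:
        problems.append("splits must name at least one split")
    else:
        total = sum(fractions.values())
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            problems.append(f"split fractions sum to {total}, expected 1.0")
        if any(not 0 < f <= 1 for f in fractions.values()):
            problems.append("each split fraction must lie in (0, 1]")

    coverage = cfg.get("coverage", {})
    mode = coverage.get("mode")
    if mode not in ALLOWED_COVERAGE_MODES:
        problems.append(f"coverage.mode {mode!r} is not one of {sorted(ALLOWED_COVERAGE_MODES)}")
    elif mode == "k" and "k" not in coverage:
        problems.append("coverage.mode is 'k' but coverage.k is missing")
    return problems


# Dict mirroring the listing above (nesting is an assumption).
config = {
    "version": "1.0.0",
    "seed": 20250924,
    "splits": {"train": 0.70, "val": 0.15, "test": 0.15},
    "coverage": {"mode": "k", "k": 2},
}
problems = validate_split_config(config)
```

Returning a list of problems instead of raising on the first one lets a gate report every issue in a single run.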

B. split_manifest.json (excerpt)

{
  "dataset_version": "1.2.0",
  "splits": {
    "train": { "count": 120345, "checksum": "sha256:..." },
    "val": { "count": 25780, "checksum": "sha256:..." },
    "test": { "count": 25812, "checksum": "sha256:..." }
  },
  "slices": {
    "low_snr": { "count": 8142, "rule": "snr<5" }
  },
  "freshness": {
    "valid_from": "2025-09-01T00:00:00Z",
    "valid_to": "2026-03-01T00:00:00Z",
    "policy": { "tau_calib_s_max": 86400, "clock_state": "locked" }
  }
}
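
As an illustration of how the `count`, `checksum`, and `freshness` fields might be produced and consumed, here is a minimal Python sketch. The template does not specify how checksums are computed, so hashing each row's canonical JSON form is an assumption. Treating `valid_to` as an exclusive bound is also an assumption, and the helper names `split_entry`, `parse_utc`, and `is_fresh` are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def split_entry(rows):
    """Count rows and hash their canonical JSON form, one row per line."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the hash independent of dict key order.
        line = json.dumps(row, sort_keys=True, separators=(",", ":"))
        digest.update(line.encode("utf-8") + b"\n")
    return {"count": len(rows), "checksum": "sha256:" + digest.hexdigest()}


def parse_utc(stamp):
    """Parse an ISO 8601 timestamp; 'Z' is rewritten for Python < 3.11."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def is_fresh(manifest, now=None):
    """True when `now` falls in [valid_from, valid_to)."""
    now = now or datetime.now(timezone.utc)
    window = manifest["freshness"]
    return parse_utc(window["valid_from"]) <= now < parse_utc(window["valid_to"])


manifest = {
    "freshness": {
        "valid_from": "2025-09-01T00:00:00Z",
        "valid_to": "2026-03-01T00:00:00Z",
    }
}
inside_window = is_fresh(manifest, datetime(2025, 12, 1, tzinfo=timezone.utc))
```

The `policy` fields (`tau_calib_s_max`, `clock_state`) are left uninterpreted here, since this section does not define their semantics.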


C. version_matrix.yaml

dataset: "ds-core"
current: "1.2.0"
compatibility:
  "1.2.x": { api: ">=1.2,<2.0", schema: ">=1.2,<2.0" }
  "1.1.x": { api: ">=1.1,<1.3", schema: ">=1.1,<1.3" }
migration:
  from: "1.1.x"
  to: "1.2.x"
  steps:
    - change: "add slice 'low_snr'"
    - change: "add field quality.score_Q"
rollback:
  tag: "v1.1.3-lock"
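
The range strings in `compatibility` (for example ">=1.2,<2.0") can be evaluated with a small comparator. The sketch below is one possible reading: it assumes purely numeric, dot-separated versions and comma-separated clauses that must all hold. The helper names `parse_version`, `satisfies`, and `compatibility_key` are hypothetical, and tags such as "v1.1.3-lock" are not handled.

```python
import operator

_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0); numeric components only."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, spec):
    """True when `version` meets every comma-separated clause in `spec`."""
    v = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        # Two-character operators are tried first so '>=' is not read as '>'.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                break
        else:
            raise ValueError(f"unrecognised clause: {clause!r}")
        # Pad with zeros so '1.2' compares equal to '1.2.0'.
        width = max(len(v), len(bound))
        padded_v = v + (0,) * (width - len(v))
        padded_bound = bound + (0,) * (width - len(bound))
        if not _OPS[symbol](padded_v, padded_bound):
            return False
    return True


def compatibility_key(dataset_version):
    """Map '1.2.0' to the '1.2.x' row used in the compatibility table."""
    major, minor, _patch = dataset_version.split(".")
    return f"{major}.{minor}.x"


compatibility = {
    "1.2.x": {"api": ">=1.2,<2.0", "schema": ">=1.2,<2.0"},
    "1.1.x": {"api": ">=1.1,<1.3", "schema": ">=1.1,<1.3"},
}
row = compatibility[compatibility_key("1.2.0")]
api_ok = satisfies("1.4.1", row["api"])
```

A production pipeline would more likely rely on an established version-parsing library; the hand-rolled comparator is used here only to keep the example free of dependencies.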


VIII. Anti-Patterns & Fixes


IX. Cross-References


X. Checklist


Copyright & License (CC BY 4.0)

Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/