EFT.WP.Data.DatasetCards v1.0

Chapter 11 Splits & Distribution


I. Chapter Purpose & Scope

This chapter fixes the definitions, ratios, and consistency constraints for train/validation/test splits, and standardizes distribution manifests, mirrors and sharding, integrity checks, and rate-limit and regional compliance. All keys use snake_case; cross-volume citations follow the “Volume+Version+Anchor” form.

II. Terminology & Dependencies


III. Fields & Structure (Normative)

splits:
  train: {count: <int>, ratio: <0..1>}
  validation: {count: <int>, ratio: <0..1>}
  test: {count: <int>, ratio: <0..1>}
  policy:
    leakage_guard: ["per-object", "per-timewindow"]   # leakage-prevention granularity
    stratify_by: ["class", "region", "snr_bin"]       # align with sampling.strata
    freeze_indices: true                              # freeze indices for reproducibility
  audit:
    coverage: {by: "class", report: true}
    leakage: {cross_split: "forbid"}
    imbalance: {metric: "gini", threshold: 0.2}
distribution:
  packaging:
    format: "tgz"            # tgz | zip | parquet | zarr | other
    shard_bytes: 134217728   # example: 128 MiB
    layout: ["train", "validation", "test"]
  mirrors: ["https://mirror-a.example/foo/", "s3://bucket/foo/"]
  rate_limit: {mbps: 50}
  regional_compliance: ["EU-GDPR", "CN-DSR"]   # example only
  checksums:
    package: {sha256: "<hex>"}   # top-level package integrity
    shards:
      - {path: "train-000.tgz", sha256: "<hex>"}
      - {path: "train-001.tgz", sha256: "<hex>"}
see:
  - "EFT.WP.Core.DataSpec v1.0:EXPORT"
  - "EFT.WP.Core.Metrology v1.0:check_dim"

(Exported artifacts and anchors are recorded and verifiable in the export_manifest.)
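The audit entry leakage: {cross_split: "forbid"} together with the "per-object" leakage guard can be illustrated with a small check. The following is a minimal sketch, not a normative implementation: it assumes each split's index can be reduced to a list of object identifiers, and the function name find_cross_split_leaks and the input shape are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def find_cross_split_leaks(split_to_object_ids):
    """Map each object id that appears in more than one split to those splits.

    `split_to_object_ids` is an assumed shape: {"train": [...], "test": [...]}.
    An empty result means no object crosses a split boundary.
    """
    seen = defaultdict(set)
    for split_name, object_ids in split_to_object_ids.items():
        for object_id in object_ids:
            seen[object_id].add(split_name)
    # Keep only the ids that were observed in two or more splits.
    return {oid: sorted(names) for oid, names in seen.items() if len(names) > 1}


indices = {
    "train": ["obj-a", "obj-b"],
    "validation": ["obj-c"],
    "test": ["obj-b", "obj-d"],
}
print(find_cross_split_leaks(indices))  # prints {'obj-b': ['test', 'train']}
```

Under a "forbid" policy, any non-empty result would be treated as an audit failure. A "per-timewindow" guard could reuse the same function by passing time-window identifiers instead of object identifiers, assuming windows are assigned to splits as whole units.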


IV. Split Definitions & Consistency Constraints


V. Distribution & Artifact Organization


VI. Linkage to Quality & Baselines


VII. Metrology & Units (when splitting by time/frequency/space)


VIII. Export Manifest & References (Normative)

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "splits/train.index", sha256: "..."}
    - {path: "splits/validation.index", sha256: "..."}
    - {path: "splits/test.index", sha256: "..."}
    - {path: "packages/train-000.tgz", sha256: "..."}
    - {path: "packages/train-001.tgz", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"

(All artifacts must be listed and verifiable; references carry Volume+Version+Anchor.)
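One way to make "listed and verifiable" concrete is to recompute each artifact's SHA-256 digest and compare it with the manifest entry. The sketch below assumes the manifest has already been parsed into a Python dict with the artifacts list shown above, and that artifact paths are relative to a local root directory; the helper names sha256_of and verify_artifacts are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(manifest, root):
    """Return {artifact path: reason} for every entry that fails verification.

    `manifest` is assumed to be the parsed `export_manifest` mapping;
    `root` is the assumed local directory that holds the artifacts.
    """
    failures = {}
    for entry in manifest["artifacts"]:
        target = Path(root) / entry["path"]
        if not target.is_file():
            failures[entry["path"]] = "missing"
        elif sha256_of(target) != entry["sha256"].lower():
            failures[entry["path"]] = "sha256 mismatch"
    return failures
```

An empty dict from verify_artifacts(manifest, root) would indicate that every listed artifact exists and matches its recorded digest. The same comparison applies to the per-shard entries under checksums.shards.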


IX. Example Fragment (drop-in)

splits:
  train: {count: 12000, ratio: 0.8}
  validation: {count: 1500, ratio: 0.1}
  test: {count: 1500, ratio: 0.1}
  policy:
    leakage_guard: ["per-object", "per-timewindow"]
    stratify_by: ["class", "snr_bin"]
    freeze_indices: true
distribution:
  packaging: {format: "tgz", shard_bytes: 134217728, layout: ["train", "validation", "test"]}
  mirrors: ["https://mirror-a.example/datasets/foo/", "s3://bucket/foo/"]
  rate_limit: {mbps: 50}
  checksums:
    package: {sha256: "…"}
    shards:
      - {path: "train-000.tgz", sha256: "…"}
      - {path: "train-001.tgz", sha256: "…"}
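The counts and ratios in the fragment above agree with each other: 12000 + 1500 + 1500 = 15000, so the implied ratios are 0.8, 0.1, and 0.1. A minimal sketch of such a consistency check follows. The specific rules (ratios sum to 1, each declared ratio matches count/total within a tolerance) and the tolerance values are assumptions made for illustration, and check_splits is a hypothetical helper name.

```python
import math

SPLIT_NAMES = ("train", "validation", "test")


def check_splits(splits, sum_tol=1e-6, ratio_tol=0.005):
    """Return a list of problems found in a `splits` block; empty means consistent."""
    missing = [name for name in SPLIT_NAMES if name not in splits]
    if missing:
        return ["missing split(s): " + ", ".join(missing)]

    problems = []
    for name in SPLIT_NAMES:
        count, ratio = splits[name]["count"], splits[name]["ratio"]
        if not isinstance(count, int) or count < 0:
            problems.append(f"{name}: count must be a non-negative integer")
        if not isinstance(ratio, (int, float)) or not 0.0 <= ratio <= 1.0:
            problems.append(f"{name}: ratio must lie in [0, 1]")
    if problems:
        return problems  # the checks below need well-formed values

    ratio_sum = sum(splits[name]["ratio"] for name in SPLIT_NAMES)
    if not math.isclose(ratio_sum, 1.0, abs_tol=sum_tol):
        problems.append(f"ratios sum to {ratio_sum:.6f}, expected 1")

    total = sum(splits[name]["count"] for name in SPLIT_NAMES)
    if total > 0:
        for name in SPLIT_NAMES:
            declared = splits[name]["ratio"]
            observed = splits[name]["count"] / total
            if abs(observed - declared) > ratio_tol:
                problems.append(
                    f"{name}: count implies ratio {observed:.4f}, declared {declared}"
                )
    return problems


example = {
    "train": {"count": 12000, "ratio": 0.8},
    "validation": {"count": 1500, "ratio": 0.1},
    "test": {"count": 1500, "ratio": 0.1},
}
print(check_splits(example))  # prints []
```

The ratio tolerance allows for declared ratios that were rounded before being written to the card; a stricter or looser bound may suit a given dataset.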


X. Chapter Compliance Checklist


Copyright & License (CC BY 4.0)

Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/