EFT.WP.Data.Benchmarks v1.0

Chapter 5 Data Sources, Sampling & Frozen Splits


I. Chapter Purpose & Scope

This chapter fixes the specifications for data sources, sampling, and frozen splits: source compliance and citation, sampling strategies and stratification, frozen indices and consistency, and leakage prevention and audit exports. It also ensures alignment with Dataset/Model Cards/Pipeline, metrology, and citation anchors.

II. Terminology & Dependencies


III. Fields & Structure (Normative)

data:
  dataset_ref: "datasets/<name>@vX.Y"    # reference, do not copy
  sources: ["<uri-or-citation>", "..."]  # data sources & citations
  licensing: "CC-BY-4.0|ODC-BY|custom"
  provenance:
    collection_window: "<YYYY-MM-DD..YYYY-MM-DD>"
    geography: ["<region>"]
    permits: ["<ethics/permit-ref>"]
sampling:
  strategy: "random|stratified|time-based|spatial-tiles|systematic"
  strata: [{by: "<label|locale|domain|difficulty|snr_bin>", buckets: {"A": 100, "B": 200}}]
  weights: {class: "inverse_freq|none"}  # training re-weighting statement
  seed: 1701
splits:
  train: {frozen: true, index: "splits/train.index", sha256: "<hex>"}
  val: {frozen: true, index: "splits/val.index", sha256: "<hex>"}
  test: {frozen: true, index: "splits/test.index", sha256: "<hex>"}
  ratio: {train: 0.8, val: 0.1, test: 0.1}
  freeze_indices: true
leakage_guard:
  policy: ["per-object", "per-timewindow", "per-scene"]
  audits:
    report: "splits/leakage_report.csv"
    sha256: "<hex>"
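As an illustration of how the frozen flag and the sha256 field of each split could be enforced, the following Python sketch recomputes the digest of every index file and compares it with the recorded value. The function names (sha256_of_file, verify_frozen_splits) and the root argument are hypothetical, and the sketch assumes the digest is taken over the raw bytes of the index file, which this chapter does not state explicitly.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_frozen_splits(splits: dict, root: Path) -> dict:
    """Map each split name to True if it is frozen and its index hash matches.

    `splits` mirrors the `splits:` block (train/val/test entries with
    frozen, index, sha256). Assumption: hashes cover the raw index bytes.
    """
    results = {}
    for name in ("train", "val", "test"):
        entry = splits[name]
        actual = sha256_of_file(root / entry["index"])
        results[name] = bool(entry["frozen"]) and actual == entry["sha256"].lower()
    return results
```

A pipeline might call verify_frozen_splits before every training or evaluation run and refuse to proceed if any value is False.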


IV. Source Compliance & Citation Posture


V. Sampling Strategy & Stratification


VI. Frozen Splits & Consistency


VII. Leakage Prevention & Audit Exports


VIII. Metrology & Units (SI)

  1. Performance & volume: QPS(1/s), T_inf(ms), ρ(—), net_mbps, size_bytes.
  2. Mandatory: metrology: {units: "SI", check_dim: true}; normalize units before composing derived quantities.
  3. Path quantities (T_arr): if sampling/splits couple to path-dependent quantities, register delta_form, path="gamma(ell)", measure="d ell", and use
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ), or
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell ),
      with check_dim.
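The two forms of T_arr above agree whenever c_ref is constant along the path, since a constant factor can be moved in or out of the integral. The Python sketch below checks this numerically with a trapezoid rule. The sample path, the n_eff profile, and the value used for c_ref are illustrative assumptions, not values from this document; units are assumed to be metres for ell, m/s for c_ref, and dimensionless for n_eff, so the result is in seconds.

```python
def trapezoid(ys, xs):
    """Trapezoid-rule integral of samples ys over abscissae xs."""
    return sum(
        0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    )


def t_arr_factored(n_eff, ell, c_ref):
    """T_arr = ( 1 / c_ref ) * ( integral of n_eff d ell )."""
    return (1.0 / c_ref) * trapezoid(n_eff, ell)


def t_arr_integrand(n_eff, ell, c_ref):
    """T_arr = ( integral of ( n_eff / c_ref ) d ell )."""
    return trapezoid([n / c_ref for n in n_eff], ell)


# Illustrative inputs (assumptions): a 1000 m path sampled every 10 m,
# a linearly varying n_eff, and a constant reference speed in m/s.
C_REF = 299_792_458.0
ELL = [i * 10.0 for i in range(101)]
N_EFF = [1.0 + 1e-4 * (s / 1000.0) for s in ELL]

t_factored = t_arr_factored(N_EFF, ELL, C_REF)
t_integrand = t_arr_integrand(N_EFF, ELL, C_REF)
print(t_factored, t_integrand)  # both in seconds
```

If c_ref varied along the path, only the second form would remain meaningful, which is one reason to register the chosen form explicitly.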

IX. Machine-Readable Fragment (Drop-in)

data:
  dataset_ref: "datasets/core_cls@v1.0"
  sources: ["doi:10.1234/core-ds", "arXiv:2501.01234"]
  licensing: "CC-BY-4.0"
  provenance: {collection_window: "2024-01-01..2025-06-30", geography: ["EU", "US"], permits: ["ethics-IRB-2024-09"]}
sampling:
  strategy: "stratified"
  strata: [{by: "label", buckets: {"A": 520, "B": 2100, "C": 12380}}]
  weights: {class: "inverse_freq"}
  seed: 1701
splits:
  train: {frozen: true, index: "splits/train.index", sha256: "..."}
  val: {frozen: true, index: "splits/val.index", sha256: "..."}
  test: {frozen: true, index: "splits/test.index", sha256: "..."}
  ratio: {train: 0.8, val: 0.1, test: 0.1}
  freeze_indices: true
leakage_guard:
  policy: ["per-object", "per-timewindow"]
  audits: {report: "splits/leakage_report.csv", sha256: "..."}
metrology: {units: "SI", check_dim: true}
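To show how the seed, ratio, strata, and per-object policy in a fragment like this might interact, here is a minimal Python sketch of a seeded, label-stratified split that assigns whole objects to a single split, followed by a per-object leakage check. The record layout (sample_id, object_id, label) and the function names are hypothetical, and the sketch assumes each object carries exactly one label; it is not the procedure mandated by this chapter.

```python
import random
from collections import defaultdict


def stratified_group_split(records, ratio, seed=1701):
    """Split (sample_id, object_id, label) records by label stratum.

    Objects, not samples, are shuffled and cut, so all samples of one
    object land in the same split. Assumes one label per object.
    """
    rng = random.Random(seed)
    by_label = defaultdict(lambda: defaultdict(list))
    for sample_id, object_id, label in records:
        by_label[label][object_id].append(sample_id)

    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):          # fixed order keeps runs reproducible
        objects = sorted(by_label[label])
        rng.shuffle(objects)
        n_train = round(len(objects) * ratio["train"])
        n_val = round(len(objects) * ratio["val"])
        cuts = {
            "train": objects[:n_train],
            "val": objects[n_train:n_train + n_val],
            "test": objects[n_train + n_val:],
        }
        for name, objs in cuts.items():
            for obj in objs:
                splits[name].extend(by_label[label][obj])
    return {name: sorted(ids) for name, ids in splits.items()}


def leaked_objects(splits, object_of):
    """Return the set of object ids that appear in more than one split."""
    first_seen = {}
    leaked = set()
    for name, ids in splits.items():
        for sample_id in ids:
            obj = object_of[sample_id]
            if first_seen.setdefault(obj, name) != name:
                leaked.add(obj)
    return leaked
```

Because the shuffle uses a dedicated random.Random(seed) instance and strata are visited in sorted order, the same inputs and seed reproduce the same split, which is what makes freezing the resulting index files meaningful.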


X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: DATA.REF_FORMAT
    when: "$.data.dataset_ref"
    assert: "matches('^datasets/[a-z0-9_\\-]+@v\\d+\\.\\d+$')"
    level: error
  - id: SAMPLE.STRATEGY_ALLOWED
    when: "$.sampling.strategy"
    assert: "value in ['random','stratified','time-based','spatial-tiles','systematic']"
    level: error
  - id: SPLITS.RATIO_SUM
    when: "$.splits.ratio"
    assert: "abs(value.train + value.val + value.test - 1) <= 1e-6"
    level: error
  - id: SPLITS.FROZEN_REQUIRED
    when: "$.splits"
    assert: "splits.train.frozen and splits.val.frozen and splits.test.frozen and splits.freeze_indices == true"
    level: error
  - id: LEAKAGE.GUARD_PRESENT
    when: "$.leakage_guard.policy"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
  - id: AUDIT.REPORT_HASH
    when: "$.leakage_guard.audits"
    assert: "has_keys(report, sha256)"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
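The rules above are written in an assertion mini-language whose evaluator is not specified here. As a rough guide to their intent, the Python sketch below hand-codes each rule as a plain check on a parsed document (a nested dict). The function name lint, the EXAMPLE document, and the choice to treat a missing key as a failure are all assumptions of this sketch; the when clauses may instead be meant to skip rules whose path is absent.

```python
import re

REF_PATTERN = re.compile(r"^datasets/[a-z0-9_\-]+@v\d+\.\d+$")
ALLOWED_STRATEGIES = {"random", "stratified", "time-based", "spatial-tiles", "systematic"}
GUARD_POLICIES = {"per-object", "per-timewindow", "per-scene"}
SPLIT_NAMES = ("train", "val", "test")


def lint(doc: dict) -> list:
    """Return the ids of failed rules, in rule order (empty list = clean)."""
    data = doc.get("data", {})
    sampling = doc.get("sampling", {})
    splits = doc.get("splits", {})
    ratio = splits.get("ratio", {})
    guard = doc.get("leakage_guard", {})
    audits = guard.get("audits", {})
    metrology = doc.get("metrology", {})

    checks = {
        "DATA.REF_FORMAT": bool(REF_PATTERN.match(str(data.get("dataset_ref", "")))),
        "SAMPLE.STRATEGY_ALLOWED": sampling.get("strategy") in ALLOWED_STRATEGIES,
        "SPLITS.RATIO_SUM": abs(sum(ratio.get(k, 0.0) for k in SPLIT_NAMES) - 1) <= 1e-6,
        "SPLITS.FROZEN_REQUIRED": (
            all(splits.get(k, {}).get("frozen") is True for k in SPLIT_NAMES)
            and splits.get("freeze_indices") is True
        ),
        "LEAKAGE.GUARD_PRESENT": bool(GUARD_POLICIES & set(guard.get("policy", []))),
        "AUDIT.REPORT_HASH": {"report", "sha256"} <= set(audits),
        "METROLOGY.SI_AND_CHECKDIM": (
            metrology.get("units") == "SI" and metrology.get("check_dim") is True
        ),
    }
    return [rule_id for rule_id, ok in checks.items() if not ok]


# The drop-in fragment of Section IX, reduced to the fields the rules read.
EXAMPLE = {
    "data": {"dataset_ref": "datasets/core_cls@v1.0"},
    "sampling": {"strategy": "stratified"},
    "splits": {
        "train": {"frozen": True, "index": "splits/train.index", "sha256": "..."},
        "val": {"frozen": True, "index": "splits/val.index", "sha256": "..."},
        "test": {"frozen": True, "index": "splits/test.index", "sha256": "..."},
        "ratio": {"train": 0.8, "val": 0.1, "test": 0.1},
        "freeze_indices": True,
    },
    "leakage_guard": {
        "policy": ["per-object", "per-timewindow"],
        "audits": {"report": "splits/leakage_report.csv", "sha256": "..."},
    },
    "metrology": {"units": "SI", "check_dim": True},
}

print(lint(EXAMPLE))  # prints []
```

Note that AUDIT.REPORT_HASH only checks that the keys exist, so the placeholder hash "..." passes; validating the digest itself would be a separate step.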


XI. Cross-Reference Anchors


XII. Chapter Compliance Checklist


Copyright & License (CC BY 4.0)

Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/