Report-Level Methods Appendix Template v1.0
Chapter 5 Data & Experimental Design
I. Chapter Goals & Scope (Mandatory)
- Fix a reproducible caliber for data and experimental design at report level, covering: datasets@version, sampling & controls, variables & batch control, observation windows & randomization, compliance & licensing, and a minimal repro example.
- Aligned with Ch.3 machine structure, Ch.7 metrology & calibration, Ch.8 evaluation protocol, and Ch.11 implementation binding.
II. Data Asset Inventory (Mandatory)
- Dataset identifier: name@version; source, license, and availability window (UTC).
- Partitions & splits: train/valid/test or equivalent; preserve identical hashes and row order across environments.
- Accompanying metadata: acquisition caliber, time coverage, geography, de-identification, missing & outlier handling.
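The "identical hashes across environments" requirement can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of the template: the function names `sha256_tag` and `verify_dataset` are hypothetical, and it assumes each dataset partition is a single file whose recorded hash uses the `sha256:` prefix listed in Section IX.

```python
import hashlib


def sha256_tag(path, chunk_size=1 << 20):
    """Return the file digest in the 'sha256:<64 hex chars>' form."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large dataset files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def verify_dataset(path, expected_tag):
    """True when the file on disk matches the hash recorded in the inventory."""
    return sha256_tag(path) == expected_tag
```

Because the digest is computed over raw bytes, it also detects a changed row order, which is why the inventory asks for both to be preserved together.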
III. Sampling Framework & Randomization (Mandatory)
- Sampling: stratified / cluster / systematic / simple random; specify strata variables and proportions.
- Observation window: [t0, t1]; for rolling experiments specify step/overlap.
- Randomization: fix seeds and provide repro examples; enforce cross-batch leakage prevention (e.g., by subject or geography).
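The sampling and randomization items above can be sketched as follows. This is an illustrative example under stated assumptions, not the template's reference implementation: records are plain dictionaries, the function names are hypothetical, and partition ratios are applied to the number of groups (for example subjects), not to the number of rows.

```python
import random
from collections import defaultdict


def stratified_sample(rows, strata_keys, fraction, seed):
    """Draw the same fraction from every stratum with a fixed seed."""
    rng = random.Random(seed)  # local generator, global random state is untouched
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[k] for k in strata_keys)].append(row)
    sample = []
    for key in sorted(strata):  # fixed stratum order keeps the draw sequence stable
        members = strata[key]
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


def group_split(rows, group_key, ratios, seed):
    """Assign whole groups to partitions so that no group straddles two of them."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    names = list(ratios)
    assignment = {}
    cumulative = 0.0
    start = 0
    for index, name in enumerate(names):
        cumulative += ratios[name]
        # The last partition takes whatever remains, so every group is assigned.
        is_last = index == len(names) - 1
        end = len(groups) if is_last else round(cumulative * len(groups))
        for group in groups[start:end]:
            assignment[group] = name
        start = end
    parts = {name: [] for name in names}
    for row in rows:
        parts[assignment[row[group_key]]].append(row)
    return parts
```

Splitting by group rather than by row is what prevents leakage: all rows from one subject or one geography land in the same partition. Reusing the same seed and the same input order reproduces both the sample and the split.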
IV. Control & Variable Management (Mandatory)
- Control/baseline: state Option Base; list controlled vs allowed-to-drift variables.
- Covariates & confounders: provide include/exclude lists; link to ablation & sensitivity plans (see Ch.10).
- Batches & regions: use batch and region as grouping keys; report batch-effect correction methods.
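One way to make the controlled versus allowed-to-drift separation checkable is sketched below. It rests on an assumption the template does not state: "controlled" is read here as "held at a single value within each batch". The function name `check_controls` and the record layout are hypothetical.

```python
from collections import defaultdict


def check_controls(records, controlled, drifting, group_key="batch"):
    """Return a list of violations; an empty list means the check passed."""
    problems = []
    overlap = sorted(set(controlled) & set(drifting))
    if overlap:
        problems.append(f"listed as both controlled and drifting: {overlap}")
    seen = defaultdict(set)  # (group, variable) -> distinct values observed
    for record in records:
        for name in controlled:
            seen[(record[group_key], name)].add(record[name])
    for group, name in sorted(seen):
        values = seen[(group, name)]
        if len(values) > 1:
            problems.append(
                f"{name!r} varies within {group_key} {group!r}: {sorted(values)}"
            )
    return problems
```

Drifting variables are deliberately not checked for constancy; they only need to be absent from the controlled list.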
V. Measures, Units & Metrological Consistency (Mandatory)
- For any measured quantity tied to equations, provide unit and dim; set check_dim=true in machine layer.
- If the experiment involves arrival-time criteria, use one of the unified forms below, and in the same paragraph declare the path gamma(ell) and the measure d ell:
  - Factored (valid only when c_ref is constant along the path): T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
  - General: T_arr = ( ∫ ( n_eff / c_ref ) d ell )
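The relation between the two arrival-time forms can be illustrated numerically. The sketch below assumes the path is sampled at discrete arc-length points and uses the trapezoid rule; the function names, the n_eff profile, and the numeric value chosen for c_ref are illustrative assumptions, not values from the template.

```python
def arrival_time_general(ell, n_eff, c_ref):
    """Trapezoid estimate of T_arr = ∫ ( n_eff / c_ref ) d ell; c_ref is sampled per point."""
    total = 0.0
    for i in range(1, len(ell)):
        f_prev = n_eff[i - 1] / c_ref[i - 1]
        f_next = n_eff[i] / c_ref[i]
        total += 0.5 * (f_prev + f_next) * (ell[i] - ell[i - 1])
    return total


def arrival_time_factored(ell, n_eff, c_ref):
    """T_arr = ( 1 / c_ref ) * ∫ n_eff d ell, for a scalar c_ref constant along the path."""
    integral = 0.0
    for i in range(1, len(ell)):
        integral += 0.5 * (n_eff[i - 1] + n_eff[i]) * (ell[i] - ell[i - 1])
    return integral / c_ref


C_REF = 299_792_458.0                         # m·s^-1, example value only
ell = [i * 1000.0 for i in range(11)]         # arc length in metres, 0 to 10 km
n_eff = [1.0 + 1e-4 * i for i in range(11)]   # dimensionless, made-up linear profile

t_general = arrival_time_general(ell, n_eff, [C_REF] * len(ell))
t_factored = arrival_time_factored(ell, n_eff, C_REF)
```

With a constant c_ref the two results agree to floating-point precision. Dimensionally, n_eff is dimensionless, d ell carries L, and c_ref carries L T^-1, so both forms yield T, which is what check_dim is meant to confirm.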
VI. Compliance, Ethics & Security (Mandatory)
- Licensing & restrictions: list licenses, PII/PHI de-identification strategies, and redistribution limits.
- Security: access tiers, audit trail & retention; data destruction & regeneration flows.
- Bias & fairness: monitoring metrics and gates for sampling/evaluation phases.
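The schema in Section X records a `pii_policy` of "k-anonymity>=10". A minimal sketch of how such a gate could be tested is given below; the choice of quasi-identifier columns and the function names are assumptions for illustration, and the template itself does not prescribe a method.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the rarest combination of quasi-identifier values (0 for no records)."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


def is_k_anonymous(records, quasi_identifiers, k=10):
    """True when every quasi-identifier combination occurs in at least k records."""
    return smallest_group(records, quasi_identifiers) >= k
```

A failing check names neither the record nor the remedy; generalizing or suppressing the rare combinations is a separate de-identification step.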
VII. Minimal Repro Example (Mandatory)
- Container image, data mounts, scripts & command; exported artifacts (yaml/json/pdf) with hashes.
- Repro pass criteria: row-count and hash checks, partition consistency, and statistical summaries (mean/variance/quantiles).
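The statistical-summary part of the pass check can be sketched with the standard library alone. The function names and the tolerance values below are assumptions; a report would state its own tolerances.

```python
import math
import statistics


def summarize(values):
    """Row count, mean, sample variance and quartiles for one numeric column."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "rows": len(values),
        "mean": statistics.fmean(values),
        "variance": statistics.variance(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def repro_passes(reference, replay, rel_tol=1e-9, abs_tol=1e-12):
    """Compare a replayed summary with the published one."""
    if reference["rows"] != replay["rows"]:
        return False  # row counts must match exactly
    return all(
        math.isclose(reference[key], replay[key], rel_tol=rel_tol, abs_tol=abs_tol)
        for key in reference
        if key != "rows"
    )
```

Summaries complement the hash check: a hash mismatch says that something changed, while the summaries indicate roughly how far the replayed data has moved.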
VIII. Human × Machine Fields (Mandatory)
| Human Item | Machine Field | Validation Focus |
|---|---|---|
| Dataset list | data.datasets[] | name@version with license/hash complete |
| Sampling & window | data.sampling, data.window | Strata variables and time window explicit |
| Control & variables | data.controls, data.baseline | Controlled vs drifting clearly separated |
| Metrological consistency | metrology.*, math.check_dim | Units/dimensions/CI complete |
| Arrival-time caliber | math.statements[], math.path/measure | gamma(ell) and d ell explicit |
| Repro example | reproducibility.* | Container/scripts/command/hashes replayable |
| Compliance & security | compliance.*, security.* | License/access tiers/retention present |
IX. Field & Constraint List (copy-ready)
| Field Path | Type | Required | Constraint |
|---|---|---|---|
| data.datasets[].name | string | Yes | Unique |
| data.datasets[].version | string | Yes | Semver or date |
| data.datasets[].hash | string | Rec. | sha256: prefix |
| data.datasets[].license | string | Yes | SPDX or text |
| data.partition | obj | Yes | {train,valid,test} ratios |
| data.window.{t0,t1,timezone} | obj | Yes | ISO8601 / UTC |
| data.sampling.method | enum | Yes | stratified / cluster / systematic / srs |
| data.sampling.variables[] | list | No | Strata keys |
| data.controls.controlled[] | list | No | Controlled vars |
| data.controls.drifting[] | list | No | Allowed to drift |
| data.baseline | string | Yes | Option Base |
| metrology.units[] | list | Yes | e.g., ["s","m","kg"] |
| metrology.dimensions[] | list | Yes | e.g., ["T","L","L T^-1"] |
| math.statements[] | list | No | Include main equations |
| math.path/measure | string | Req. if T_arr | gamma(ell) / d ell |
| math.check_dim | bool | Yes | true |
| compliance.license_terms | text | Yes | Reuse limits |
| security.access_tiers[] | list | Yes | owner/contrib/read |
| reproducibility.container | string | Yes | image@sha256:… |
| reproducibility.scripts[] | list | Yes | script@commit |
| reproducibility.repro_cmd | string | Yes | Minimal repro command |
X. Machine Schema (YAML; JSON-equivalent, Mandatory)
data:
  datasets:
    - { name: "cmb_set_v3", version: "v3", hash: "sha256:…", license: "CC-BY-4.0" }
    - { name: "lens_v1", version: "v1", hash: "sha256:…", license: "CC-BY-4.0" }
  partition: { train: 0.7, valid: 0.15, test: 0.15 }
  window: { t0: "2025-01-01T00:00:00Z", t1: "2025-06-30T23:59:59Z", timezone: "UTC" }
  sampling:
    method: "stratified"
    variables: ["region","band"]
  controls:
    controlled: ["instrument","pipeline_version"]
    drifting: ["weather","solar_activity"]
  baseline: "Option Base"
metrology:
  units: ["s","m","kg"]
  dimensions: ["T","L","L T^-1"]
  calibration: ["Mx-*"]
math:
  statements:
    - "T_arr = ( ∫ ( n_eff / c_ref ) d ell )"
  path: "gamma(ell)"
  measure: "d ell"
  symbols:
    - { name: "n_eff", unit: "1", dim: "1" }
    - { name: "c_ref", unit: "m·s^-1", dim: "L T^-1" }
  check_dim: true
compliance:
  license_terms: "No re-identification; redistribution restricted to partners."
  pii_policy: "de-identified; k-anonymity>=10"
security:
  access_tiers: ["owner","contrib","read"]
  retention_days: 365
  audit_log: true
reproducibility:
  container: "registry/replay:2025.09@sha256:…"
  scripts: ["prep_data.py@a1b2c3","split_partitions.py@9f8e7d"]
  repro_cmd: "docker run … prep_data.py --in /mnt/raw --out /mnt/curated && split_partitions.py --seed 20250927"
  artifacts: ["yaml","json","pdf"]
XI. Minimal Sample (human abstract × machine snippet, Mandatory)
- Human abstract:
  - Use stratified sampling (region × band); observation window 2025-01-01 to 2025-06-30 (UTC). Baseline Option Base.
  - Arrival-time uses the general form: T_arr = ( ∫ ( n_eff / c_ref ) d ell ); declare gamma(ell) and d ell in the same paragraph; check_dim=true.
- Machine snippet:
  data:
    sampling: { method: "stratified", variables: ["region","band"] }
  math:
    statements: ["T_arr = ( ∫ ( n_eff / c_ref ) d ell )"]
    path: "gamma(ell)"
    measure: "d ell"
    check_dim: true
XII. Validation Rules (regex/consistency, Mandatory)
- Dataset ID: ^[A-Za-z0-9_\-]+@v?\d+(\.\d+)*$; hash: ^sha256:[0-9a-f]{64}$.
- Time window: t0 ≤ t1 and timezone UTC.
- Sampling method: method ∈ {stratified,cluster,systematic,srs}.
- Arrival-time: if T_arr appears, math.path and math.measure are required; math.check_dim=true.
- Partitions: train+valid+test = 1.0 (tolerance ≤ 1e-6).
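The rules above translate directly into a small validator. The sketch below assumes the machine schema has already been loaded into a plain dictionary (by whatever YAML or JSON loader the pipeline uses); the function names and error messages are hypothetical. The hash is validated only when present, since the field list marks it as recommended rather than required.

```python
import re
from datetime import datetime

DATASET_ID = re.compile(r"^[A-Za-z0-9_\-]+@v?\d+(\.\d+)*$")
HASH = re.compile(r"^sha256:[0-9a-f]{64}$")
METHODS = {"stratified", "cluster", "systematic", "srs"}


def parse_utc(stamp):
    # fromisoformat accepts a trailing "Z" only from Python 3.11, so normalise it.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def validate(doc):
    """Return a list of rule violations; an empty list means every rule passed."""
    errors = []
    data = doc.get("data", {})
    math_block = doc.get("math", {})

    for ds in data.get("datasets", []):
        ident = f"{ds.get('name')}@{ds.get('version')}"
        if not DATASET_ID.match(ident):
            errors.append(f"bad dataset id: {ident}")
        if "hash" in ds and not HASH.match(ds["hash"]):
            errors.append(f"bad hash for {ident}")

    window = data.get("window", {})
    if window.get("timezone") != "UTC":
        errors.append("window.timezone must be UTC")
    if "t0" in window and "t1" in window:
        if parse_utc(window["t0"]) > parse_utc(window["t1"]):
            errors.append("window requires t0 <= t1")
    else:
        errors.append("window requires t0 and t1")

    if data.get("sampling", {}).get("method") not in METHODS:
        errors.append("sampling.method not in allowed set")

    # Arrival-time rule: only triggered when a statement mentions T_arr.
    if any("T_arr" in s for s in math_block.get("statements", [])):
        for field in ("path", "measure"):
            if not math_block.get(field):
                errors.append(f"math.{field} required when T_arr appears")
        if math_block.get("check_dim") is not True:
            errors.append("math.check_dim must be true")

    partition = data.get("partition", {})
    total = sum(partition.get(k, 0.0) for k in ("train", "valid", "test"))
    if abs(total - 1.0) > 1e-6:
        errors.append(f"partition ratios sum to {total}, expected 1.0")
    return errors
```

Note that the dataset ID pattern accepts dotted numeric versions such as `v3` or `2025.01`, but not hyphenated ISO dates; a report that versions datasets by date would need to use the dotted form or extend the pattern.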
XIII. Citation & Cross-Reference Style (Mandatory)
- Fixed format: “See 《 vX.Y》 Ch.x S/P/M/I…”; EFT.WP.* citations must include an explicit version and anchor, with a machine-readable list in references.see[].
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/