Chapter 8 — Benchmarks & Comparative Scoring (Bench/Score)
I. Purpose & Scope
- Standardize benchmark tasks and comparative scoring: task definitions, statistical conventions, leakage prevention, and release rules; unify metric intervals, weights, and gate mapping so evaluations are reproducible, auditable, and comparable.
- For path-related visualizations/scoring (arrival/phase), the text must explicitly show gamma(ell) and d ell, record delta_form ∈ {general, factored} on the data side; use parenthesized forms; publication requires p_dim = 1.0.
II. Prerequisites & Inputs
- Data & splits: align with Dataset Card Ch. 4/6/7/11 (schema/splits/QC/bench); forbid cross-split entity/window mixing.
- Training protocol: align with this volume’s Ch. 6 (train_config.yaml, seeds & env snapshot).
- Coverage & covariance: unify with Error Budget (coverage ∈ {k, alpha, quantile}; covariance matrix Σ positive definite, PD).
- Citations & versions: “volume + version + anchor (P/S/M/I)”, anchor coverage ≥ 90%; public v1.* only.
III. Bench Tasks & Comparability
- Task definition: classification/regression/time-series/path/multimodal; specify input/output fields, units/dimensions, evaluation window.
- Contract alignment: for public baselines, list field mappings & differences (bench_plan.yaml); internal benchmarks require fixed splits & seeds.
- Repetition & convergence: report point estimates and intervals (k/alpha/quantile) for each metric, with repeat/bootstrap convergence diagnostics.
IV. Leakage Prevention & Consistency
- Temporal: monotone TS → {train < val < test} splitting; forbid cross-window feature derivation.
- Entity: group_by(entity); ensure entities do not cross splits.
- Path: len(gamma_ell)=len(d_ell)=len(n_eff)≥2, Δell ≤ ( c_ref / f_s ) / max(n_eff); align phase in the reference window before computing r_phi.
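The three leakage rules above can be checked mechanically before any metric is computed. The following is a minimal sketch, not the reference validator. The function names, the record layout (dicts with `entity`, `t`, and `split` keys), the split names, and the reading of each `d_ell` entry as one step Δell are assumptions made for illustration.

```python
from collections import defaultdict

SPLIT_ORDER = ("train", "val", "test")  # assumed split names


def check_temporal(records):
    """True if every timestamp in an earlier split precedes all timestamps in later splits."""
    bounds = {}
    for r in records:
        lo, hi = bounds.get(r["split"], (r["t"], r["t"]))
        bounds[r["split"]] = (min(lo, r["t"]), max(hi, r["t"]))
    present = [s for s in SPLIT_ORDER if s in bounds]
    return all(bounds[a][1] < bounds[b][0] for a, b in zip(present, present[1:]))


def leaking_entities(records):
    """Return the entities that appear in more than one split (should be empty)."""
    seen = defaultdict(set)
    for r in records:
        seen[r["entity"]].add(r["split"])
    return sorted(e for e, splits in seen.items() if len(splits) > 1)


def check_path(gamma_ell, d_ell, n_eff, c_ref, f_s):
    """Check equal lengths >= 2 and the step bound (c_ref / f_s) / max(n_eff)."""
    if not (len(gamma_ell) == len(d_ell) == len(n_eff) >= 2):
        return False
    max_step = (c_ref / f_s) / max(n_eff)
    return all(step <= max_step for step in d_ell)
```

A run is treated as leak-free in this sketch only when `check_temporal` and `check_path` return `True` and `leaking_entities` returns an empty list. Cross-window feature derivation and phase alignment are not covered here.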
V. Metrics & Intervals
- Primary metrics (examples): AUC, ACC, MAE, RMSE, r_phi, ε_flux, Q_res, Latency_P95/Throughput (if perf-constrained).
- Interval rules:
- k coverage: U = k·u_c;
- alpha: use t_{ν,1−α/2} or normal approx;
- quantile: e.g., [0.025, 0.975]; choose one mode across the volume.
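The three interval modes can be compared side by side on repeated runs of one metric. The sketch below uses only the Python standard library and makes several assumptions that are not in the text: `u_c` is approximated by the standard error of the mean (a full budget may include other components), the alpha mode uses the normal approximation rather than the t quantile, and the quantile mode is a bootstrap percentile band of the mean. Function names and defaults are illustrative.

```python
import random
import statistics
from statistics import NormalDist


def k_interval(values, k=2.0):
    """k coverage: return (mean, U) with U = k * u_c; u_c is taken as the standard error."""
    mean = statistics.fmean(values)
    u_c = statistics.stdev(values) / len(values) ** 0.5
    return mean, k * u_c


def alpha_interval(values, alpha=0.05):
    """Two-sided interval using the normal approximation z_{1 - alpha/2}."""
    mean = statistics.fmean(values)
    u_c = statistics.stdev(values) / len(values) ** 0.5
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return mean - z * u_c, mean + z * u_c


def quantile_interval(values, p=(0.025, 0.975), n_boot=2000, seed=0):
    """Bootstrap percentile band for the mean, with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = means[int(p[0] * (n_boot - 1))]
    hi = means[int(p[1] * (n_boot - 1))]
    return lo, hi
```

For small repeat counts the t quantile gives wider intervals than the normal approximation, so the alpha mode in this sketch is optimistic when only a few repeats are available.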
VI. Scoring Mapping
- Normalization: z_m = ( m − m_baseline ) / σ_baseline (flip the sign when “higher is better”, so that a negative z_m always means better than baseline).
- Sigmoid score: q_m = 1 / ( 1 + exp( a z_m + b ) ) (default a=1, b=0, adjustable in manifests).
- Aggregate: Q = ( ∑_i w_i q_{m_i} ) / ( ∑_i w_i ); weights w_i fixed & disclosed in bench_plan.yaml.
- Stability check: aggregate Q must trend consistently with key-metric intervals; if not, trigger bias review and re-evaluation.
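The normalization, sigmoid, and aggregation steps above compose as follows. This is a minimal sketch with illustrative function names. The formulas are those given in this section; the `higher_is_better` flag is an assumed way of expressing the sign flip.

```python
import math


def z_score(m, m_baseline, sigma_baseline, higher_is_better=False):
    """z_m = (m - m_baseline) / sigma_baseline, sign-flipped for higher-is-better metrics."""
    z = (m - m_baseline) / sigma_baseline
    return -z if higher_is_better else z


def sigmoid_score(z, a=1.0, b=0.0):
    """q_m = 1 / (1 + exp(a * z_m + b)); defaults a=1, b=0 as in the text."""
    return 1.0 / (1.0 + math.exp(a * z + b))


def aggregate(scores, weights):
    """Q = sum(w_i * q_i) / sum(w_i) over the metrics present in `scores`."""
    total = sum(weights[name] for name in scores)
    return sum(weights[name] * q for name, q in scores.items()) / total
```

With the default a=1, b=0, a metric equal to its baseline maps to q_m = 0.5, and a metric better than baseline maps above 0.5, so a larger Q indicates improvement over the baseline.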
VII. Gate Mapping & Decisions
- Align thresholds with Error Budget:
- |ΔT_arr| + U(T_arr) ≤ τ_T;
- LB(r_phi) ≥ r_phi_min;
- P95(ε_flux) ≤ ε_flux_guard;
- p_dim = 1.0, Σ PD.
- Release decision: core gates pass and Q ≥ Q_base + δQ_min → Pass; else Fail / [Restricted] (qualitative plots & diagnostics only).
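The gate comparison and release rule above reduce to a handful of boolean checks. The sketch below is one possible encoding. The dictionary keys (`dT_arr`, `U_T_arr`, `r_phi_lb`, `eps_flux_p95`, `sigma_pd`, and the gate names) are hypothetical, and the choice between Fail and [Restricted] is not specified in the text, so the sketch only distinguishes "pass" from "fail".

```python
def core_gates(m, gates):
    """Evaluate each core gate; returns a dict of gate name -> bool."""
    return {
        "arrival": abs(m["dT_arr"]) + m["U_T_arr"] <= gates["tau_T"],
        "phase": m["r_phi_lb"] >= gates["r_phi_min"],
        "flux": m["eps_flux_p95"] <= gates["eps_flux_guard"],
        "p_dim": m["p_dim"] == 1.0,
        "sigma_pd": bool(m["sigma_pd"]),  # covariance reported as positive definite
    }


def release_decision(m, gates, Q, Q_base, dQ_min):
    """Pass only if all core gates hold and Q >= Q_base + dQ_min."""
    if all(core_gates(m, gates).values()) and Q >= Q_base + dQ_min:
        return "pass"
    return "fail"  # Fail vs [Restricted] is left to the reviewer in this sketch
```

Returning the per-gate results from `core_gates` keeps the decision auditable, since a report can list which gate blocked a release.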
VIII. Normative Path Forms
- Arrival:
T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ) or T_arr = ( ∫ ( n_eff / c_ref ) d ell ).
- Phase:
Phi = ( 2π / λ_ref ) * ( ∫ n_eff d ell ).
Explicitly show path & measure; record delta_form; all expressions parenthesized.
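On sampled paths the integrals above become sums over path steps. The sketch below is a discrete approximation under stated assumptions: `n_eff` and `d_ell` are equal-length sequences sampled along gamma(ell), the integral is a simple sum of n_eff · d_ell, and the mapping of delta_form "factored" to the first arrival form and "general" to the second is an assumption, not something the text states.

```python
import math


def path_integral(n_eff, d_ell):
    """Discrete approximation of ( ∫ n_eff d ell ) as a sum over path steps."""
    if len(n_eff) != len(d_ell) or len(n_eff) < 2:
        raise ValueError("n_eff and d_ell must have equal length >= 2")
    return sum(n * d for n, d in zip(n_eff, d_ell))


def arrival_time(n_eff, d_ell, c_ref, delta_form="factored"):
    """T_arr in either normative form; delta_form is recorded by the caller."""
    if delta_form == "factored":
        # T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
        return (1.0 / c_ref) * path_integral(n_eff, d_ell)
    if delta_form == "general":
        # T_arr = ( ∫ ( n_eff / c_ref ) d ell )
        if len(n_eff) != len(d_ell) or len(n_eff) < 2:
            raise ValueError("n_eff and d_ell must have equal length >= 2")
        return sum((n / c_ref) * d for n, d in zip(n_eff, d_ell))
    raise ValueError("delta_form must be 'general' or 'factored'")


def phase(n_eff, d_ell, lambda_ref):
    """Phi = ( 2π / λ_ref ) * ( ∫ n_eff d ell )."""
    return (2.0 * math.pi / lambda_ref) * path_integral(n_eff, d_ell)
```

When c_ref is constant along the path, the two arrival forms agree up to floating-point rounding, which gives a cheap consistency check on recorded delta_form values.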
IX. Machine-Readable
A. bench_plan.yaml
version: "1.0.0"
tasks:
  - id: "bench-arrival"
    split: "test"
    metrics: ["DeltaT_arr_s", "Q_res", "p_dim"]
    coverage: { mode: "k", k: 2 }
  - id: "bench-phase"
    split: "test"
    metrics: ["r_phi", "epsilon_flux"]
    coverage: { mode: "quantile", p: [0.025, 0.975] }
baseline: { id: "base-001", version: "1.2.3" }
weights: { DeltaT_arr_s: 0.35, r_phi: 0.25, epsilon_flux: 0.15, p_dim: 0.15, Q_res: 0.10 }
B. scorecard.json (example)
{
  "version": "1.0.0",
  "baseline": { "id": "base-001", "Q": 0.62 },
  "method": { "id": "mdl-core", "Q": 0.78 },
  "weights": { "DeltaT_arr_s": 0.35, "r_phi": 0.25, "epsilon_flux": 0.15, "p_dim": 0.15, "Q_res": 0.10 },
  "metrics": {
    "DeltaT_arr_s": { "mean": -2.3e-9, "Uk2": 1.5e-9 },
    "r_phi": { "value": 0.72, "lb95": 0.61, "ub95": 0.80 },
    "epsilon_flux": { "median": 0.004, "p95": 0.011 },
    "p_dim": 1.0,
    "Q_res": 0.13
  },
  "decision": "pass",
  "see": ["EFT.WP.Core.Equations v1.1:S20-1", "Error Budget Card v1.0:Ch.8"]
}
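A scorecard in this shape can be sanity-checked before it is attached to a report. The sketch below is illustrative only: the function names are hypothetical, and the specific checks (positive weights, every weighted metric present, no "pass" decision without p_dim = 1.0) are assumptions drawn from the rules in this chapter rather than a published schema.

```python
import json


def check_scorecard(card):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    weights = card.get("weights", {})
    metrics = card.get("metrics", {})
    for name, w in weights.items():
        if w <= 0:
            problems.append(f"non-positive weight for {name}")
        if name not in metrics:
            problems.append(f"weighted metric {name} missing from metrics")
    if card.get("decision") == "pass" and metrics.get("p_dim") != 1.0:
        problems.append("decision is pass but p_dim != 1.0")
    return problems


def check_scorecard_file(path):
    """Load a scorecard JSON file and run the checks above."""
    with open(path, encoding="utf-8") as fh:
        return check_scorecard(json.load(fh))
```

Applied to the example scorecard above, these checks find nothing to report. They do not recompute Q, which would also need the baseline statistics used for normalization.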
C. eval_report.md (outline)
# Evaluation Report
- Tasks, splits, seeds
- Metrics with intervals & convergence
- Score mapping, weights, final Q
- Gate comparison & decision
X. Anti-Patterns & Fixes
- Anti: reporting means without intervals → Fix: add U = k·u_c or quantile bands with convergence diagnostics.
- Anti: T_arr = ∫ n_eff / c_ref d ell (missing parentheses) → Fix: parenthesize to normative form.
- Anti: undisclosed weights/modes → Fix: declare in bench_plan.yaml/scorecard.json.
- Anti: temporal/entity/path leakage → Fix: recut per split.yaml and record seed.
XI. Cross-References
- Dataset Card: Ch. 6 (Splits/Versioning), Ch. 11 (Bench/Score).
- Error Budget Card: Ch. 8/9 (intervals & threshold mapping).
- Pipeline Card: Ch. 12 (Outputs & Release).
- This volume: Ch. 6 (Training Protocol), Ch. 7 (UQ) for interval & threshold alignment.
XII. Checklist
- bench_plan.yaml / scorecard.json / eval_report.md generated and aligned with Data/Error volumes.
- Coverage mode unified (k/alpha/quantile); report point estimates + intervals with convergence diagnostics.
- Leakage prevention effective (time/entity/path); split.yaml and seeds recorded.
- Comparison vs τ_T / r_phi_min / ε_flux_guard / p_dim completed; release decision transparent & auditable.
- /validate passed G1–G8; figures dual-exported with units, see[]/version, and coverage notes; non-compliances tagged [Restricted].
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/