53-Model Card Template v1.0 | Chapter 8 — Benchmarks & Comparative Scoring (Bench/Score)

Chapter 8 — Benchmarks & Comparative Scoring (Bench/Score)

I. Purpose & Scope

Standardize benchmark tasks and comparative scoring: task definitions, statistical conventions, leakage prevention, and release rules; unify metric intervals, weights, and gate mapping so evaluations are reproducible, auditable, and comparable.
For path-related visualizations and scoring (arrival/phase), the text must explicitly show the path gamma(ell) and the measure d ell, the data side must record delta_form ∈ {general, factored}, and all expressions must use parenthesized forms. Publication requires p_dim = 1.0.

II. Prerequisites & Inputs

Data & splits: align with Dataset Card Ch. 4/6/7/11 (schema/splits/QC/bench); forbid cross-split entity/window mixing.
Training protocol: align with this volume’s Ch. 6 (train_config.yaml, seeds & env snapshot).
Coverage & covariance: unify with Error Budget (coverage mode ∈ {k, alpha, quantile}; covariance Σ positive definite, PD).
Citations & versions: “volume + version + anchor (P/S/M/I)”, anchor coverage ≥ 90%; public v1.* only.

III. Bench Tasks & Comparability

Task definition: classification/regression/time-series/path/multimodal; specify input/output fields, units/dimensions, evaluation window.
Contract alignment: for public baselines, list field mappings & differences (bench_plan.yaml); internal benchmarks require fixed splits & seeds.
Repetition & convergence: report point estimates and intervals (k/alpha/quantile) for each metric, with repeat/bootstrap convergence diagnostics.

IV. Leakage Prevention & Consistency

Temporal: split on monotonically ordered timestamps so that train < val < test; forbid deriving features across windows.
Entity: group_by(entity); ensure entities do not cross splits.
Path: len(gamma_ell)=len(d_ell)=len(n_eff)≥2, Δell ≤ ( c_ref / f_s ) / max(n_eff); align phase in the reference window before computing r_phi.
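The three leakage rules above can be sketched as small pre-flight checks. This is a minimal illustration, not the volume's validator: the function names are hypothetical, timestamps are assumed to be plain numbers, and Δell is read here as the largest entry of d_ell.

```python
from typing import Dict, Hashable, Sequence, Set


def temporal_order_ok(train_ts: Sequence[float],
                      val_ts: Sequence[float],
                      test_ts: Sequence[float]) -> bool:
    """True when every train timestamp precedes every val timestamp,
    and every val timestamp precedes every test timestamp."""
    return max(train_ts) < min(val_ts) and max(val_ts) < min(test_ts)


def crossing_entities(splits: Dict[str, Sequence[Hashable]]) -> Set[Hashable]:
    """Return the entities that appear in more than one split."""
    seen: Dict[Hashable, str] = {}
    crossing: Set[Hashable] = set()
    for name, entities in splits.items():
        for e in set(entities):
            if e in seen:
                crossing.add(e)
            else:
                seen[e] = name
    return crossing


def path_ok(gamma_ell: Sequence[float], d_ell: Sequence[float],
            n_eff: Sequence[float], c_ref: float, f_s: float) -> bool:
    """Check equal array lengths (at least 2) and the step bound
    d_ell <= (c_ref / f_s) / max(n_eff) for every step (assumed reading)."""
    if not (len(gamma_ell) == len(d_ell) == len(n_eff) >= 2):
        return False
    bound = (c_ref / f_s) / max(n_eff)
    return max(d_ell) <= bound
```

A non-empty result from crossing_entities, or False from either of the other two checks, would mean the split needs to be recut before any metric is reported.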

V. Metrics & Intervals

Primary metrics (examples): AUC, ACC, MAE, RMSE, r_phi, ε_flux, Q_res, Latency_P95/Throughput (if perf-constrained).
Interval rules:
- k coverage: U = k·u_c;
- alpha: use t_{ν,1−α/2} or normal approx;
- quantile: e.g., [0.025, 0.975].
Choose one mode and use it across the volume.
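The three interval modes can be sketched with the Python standard library alone. This is an illustration under assumptions: the alpha mode uses the normal approximation rather than t_{ν,1−α/2}, the quantile mode is a percentile bootstrap of the mean with a simple lower-rank index, and all function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def k_interval(x: float, u_c: float, k: float = 2.0) -> Tuple[float, float]:
    """k coverage: expanded uncertainty U = k * u_c around the estimate x."""
    U = k * u_c
    return (x - U, x + U)


def alpha_interval_normal(samples: Sequence[float],
                          alpha: float = 0.05) -> Tuple[float, float]:
    """Two-sided interval for the mean using the normal approximation."""
    n = len(samples)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * stdev(samples) / n ** 0.5
    m = mean(samples)
    return (m - half, m + half)


def bootstrap_quantile_interval(samples: Sequence[float],
                                p: Tuple[float, float] = (0.025, 0.975),
                                n_boot: int = 2000,
                                seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean (seeded for repeatability)."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(mean(rng.choices(samples, k=n)) for _ in range(n_boot))
    lo = means[int(p[0] * (n_boot - 1))]
    hi = means[int(p[1] * (n_boot - 1))]
    return (lo, hi)
```

Rerunning the bootstrap with a larger n_boot or a different seed and comparing the endpoints is one simple convergence diagnostic; the source does not prescribe a specific one.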

VI. Scoring Mapping

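Normalization: z_m = ( m − m_baseline ) / σ_baseline; flip the sign when “higher is better”, so that a lower z_m always means an improvement over the baseline.
Sigmoid score: q_m = 1 / ( 1 + exp( a z_m + b ) ) (default a=1, b=0, adjustable in manifests). With a > 0, q_m decreases as z_m grows, so at the defaults a metric equal to the baseline scores 0.5 and an improved metric scores above 0.5.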
Aggregate: Q = ( ∑_i w_i q_{m_i} ) / ( ∑_i w_i ); weights w_i fixed & disclosed in bench_plan.yaml.
Stability check: aggregate Q must trend consistently with key-metric intervals; if not, trigger bias review and re-evaluation.
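A minimal sketch of the mapping from raw metric to aggregate Q, following the three formulas above. Function names and the overflow guard are assumptions added for illustration; the sign flip is applied to z before the sigmoid.

```python
import math
from typing import Dict


def z_score(m: float, m_baseline: float, sigma_baseline: float,
            higher_is_better: bool) -> float:
    """z_m = (m - m_baseline) / sigma_baseline, sign-flipped when higher is better."""
    z = (m - m_baseline) / sigma_baseline
    return -z if higher_is_better else z


def sigmoid_score(z: float, a: float = 1.0, b: float = 0.0) -> float:
    """q_m = 1 / (1 + exp(a*z + b)); guarded against exp overflow."""
    t = a * z + b
    if t > 700.0:  # math.exp would overflow; the score is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(t))


def aggregate(q: Dict[str, float], w: Dict[str, float]) -> float:
    """Q = sum(w_i * q_i) / sum(w_i) over the metrics present in q."""
    total = sum(w[name] for name in q)
    return sum(w[name] * q[name] for name in q) / total
```

For example, an error metric one baseline sigma below the baseline gives z = −1 and a score above 0.5, while a correlation-type metric one sigma above its baseline gives the same score after the sign flip.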

VII. Gate Mapping & Decisions

Align thresholds with Error Budget:
- |ΔT_arr| + U(T_arr) ≤ τ_T;
- LB(r_phi) ≥ r_phi_min;
- P95(ε_flux) ≤ ε_flux_guard;
- p_dim = 1.0 and covariance Σ positive definite (PD).
Release decision: core gates pass and Q ≥ Q_base + δQ_min → Pass; else Fail / [Restricted] (qualitative plots & diagnostics only).
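The gate list and the release rule can be sketched as one function that reports which gates failed. The class and field names are hypothetical, and any threshold values passed in are placeholders; real values come from the Error Budget. The sketch only separates "pass" from "not pass", because the text does not say how to choose between Fail and [Restricted].

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Measured:
    delta_T_arr: float   # signed arrival-time offset
    U_T_arr: float       # expanded uncertainty U(T_arr)
    lb_r_phi: float      # lower bound LB(r_phi)
    p95_eps_flux: float  # P95(eps_flux)
    p_dim: float         # p_dim as reported in the scorecard
    sigma_is_pd: bool    # outcome of the covariance (Sigma PD) check
    Q: float             # aggregate score of the method


@dataclass
class Thresholds:
    tau_T: float
    r_phi_min: float
    eps_flux_guard: float
    Q_base: float
    delta_Q_min: float


def evaluate(m: Measured, t: Thresholds) -> Tuple[bool, List[str]]:
    """Return (passed, names of failed gates)."""
    checks = {
        "T_arr": abs(m.delta_T_arr) + m.U_T_arr <= t.tau_T,
        "r_phi": m.lb_r_phi >= t.r_phi_min,
        "eps_flux": m.p95_eps_flux <= t.eps_flux_guard,
        "p_dim": m.p_dim == 1.0,
        "sigma_pd": m.sigma_is_pd,
        "Q_margin": m.Q >= t.Q_base + t.delta_Q_min,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)
```

Returning the failed gate names alongside the boolean keeps the decision auditable: the list can go straight into the gate-comparison section of the evaluation report.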

VIII. Normative Path Forms

Arrival:
T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ) or T_arr = ( ∫ ( n_eff / c_ref ) d ell ).
Phase:
Phi = ( 2π / λ_ref ) * ( ∫ n_eff d ell ).

Explicitly show path & measure; record delta_form; all expressions parenthesized.
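A discrete sketch of the two arrival forms and the phase form, assuming per-sample arrays n_eff and d_ell of equal length and a constant c_ref, and approximating each integral by a Riemann sum. The function names are hypothetical, and labelling the first form "factored" and the second "general" is an assumed reading of delta_form.

```python
import math
from typing import Sequence


def arrival_factored(n_eff: Sequence[float], d_ell: Sequence[float],
                     c_ref: float) -> float:
    """T_arr = (1 / c_ref) * ( sum of n_eff * d_ell )."""
    return (1.0 / c_ref) * sum(n * d for n, d in zip(n_eff, d_ell))


def arrival_general(n_eff: Sequence[float], d_ell: Sequence[float],
                    c_ref: float) -> float:
    """T_arr = ( sum of (n_eff / c_ref) * d_ell )."""
    return sum((n / c_ref) * d for n, d in zip(n_eff, d_ell))


def phase(n_eff: Sequence[float], d_ell: Sequence[float],
          lambda_ref: float) -> float:
    """Phi = (2*pi / lambda_ref) * ( sum of n_eff * d_ell )."""
    return (2.0 * math.pi / lambda_ref) * sum(n * d for n, d in zip(n_eff, d_ell))
```

With a constant c_ref the two arrival forms agree up to floating-point rounding, which makes a cheap consistency check on the recorded delta_form.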

IX. Machine-Readable
A. bench_plan.yaml

version: "1.0.0"
tasks:
  - id: "bench-arrival"
    split: "test"
    metrics: ["DeltaT_arr_s", "Q_res", "p_dim"]
    coverage: { mode: "k", k: 2 }
  - id: "bench-phase"
    split: "test"
    metrics: ["r_phi", "epsilon_flux"]
    coverage: { mode: "quantile", p: [0.025, 0.975] }
baseline: { id: "base-001", version: "1.2.3" }
weights: { DeltaT_arr_s: 0.35, r_phi: 0.25, epsilon_flux: 0.15, p_dim: 0.15, Q_res: 0.10 }
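One way to sanity-check such a plan before a run is to confirm that every metric listed under tasks has a disclosed weight, and that no weight refers to an unlisted metric. The sketch below mirrors the plan as a Python dict to avoid a YAML dependency; the helper name weight_gaps is hypothetical, and treating baseline and weights as top-level keys is an assumption about the intended layout.

```python
from typing import Dict, List

# Plan contents copied from the example above (only the fields the check needs).
plan = {
    "tasks": [
        {"id": "bench-arrival", "metrics": ["DeltaT_arr_s", "Q_res", "p_dim"]},
        {"id": "bench-phase", "metrics": ["r_phi", "epsilon_flux"]},
    ],
    "weights": {
        "DeltaT_arr_s": 0.35, "r_phi": 0.25, "epsilon_flux": 0.15,
        "p_dim": 0.15, "Q_res": 0.10,
    },
}


def weight_gaps(plan: dict) -> Dict[str, List[str]]:
    """List metrics lacking a weight and weights lacking a metric."""
    used = {m for task in plan["tasks"] for m in task["metrics"]}
    weighted = set(plan["weights"])
    return {
        "unweighted": sorted(used - weighted),
        "unused": sorted(weighted - used),
    }


gaps = weight_gaps(plan)
weight_sum = sum(plan["weights"].values())
print(gaps, round(weight_sum, 6))
```

The aggregate Q divides by the sum of the weights, so the weights do not strictly need to sum to 1; the sum is printed only as a readability check.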

B. scorecard.json (example)

{
  "version": "1.0.0",
  "baseline": { "id": "base-001", "Q": 0.62 },
  "method": { "id": "mdl-core", "Q": 0.78 },
  "weights": { "DeltaT_arr_s": 0.35, "r_phi": 0.25, "epsilon_flux": 0.15, "p_dim": 0.15, "Q_res": 0.10 },
  "metrics": {
    "DeltaT_arr_s": { "mean": -2.3e-9, "Uk2": 1.5e-9 },
    "r_phi": { "value": 0.72, "lb95": 0.61, "ub95": 0.80 },
    "epsilon_flux": { "median": 0.004, "p95": 0.011 },
    "p_dim": 1.0,
    "Q_res": 0.13
  },
  "decision": "pass",
  "see": ["EFT.WP.Core.Equations v1.1:S20-1", "Error Budget Card v1.0:Ch.8"]
}

C. eval_report.md (outline)

# Evaluation Report

- Tasks, splits, seeds
- Metrics with intervals & convergence
- Score mapping, weights, final Q
- Gate comparison & decision

X. Anti-Patterns & Fixes

Anti: reporting means without intervals → Fix: add U = k·u_c or quantile bands with convergence diagnostics.
Anti: T_arr = ∫ n_eff / c_ref d ell (missing parentheses) → Fix: parenthesize to normative form.
Anti: undisclosed weights/modes → Fix: declare in bench_plan.yaml/scorecard.json.
Anti: temporal/entity/path leakage → Fix: recut per split.yaml and record seed.

XI. Cross-References

Dataset Card: Ch. 6 (Splits/Versioning), Ch. 11 (Bench/Score).
Error Budget Card: Ch. 8/9 (intervals & threshold mapping).
Pipeline Card: Ch. 12 (Outputs & Release).
This volume: Ch. 6 (Training Protocol), Ch. 7 (UQ) for interval & threshold alignment.

XII. Checklist

bench_plan.yaml / scorecard.json / eval_report.md generated and aligned with Data/Error volumes.
Coverage mode unified (k/alpha/quantile); report point estimates + intervals with convergence diagnostics.
Leakage prevention effective (time/entity/path); split.yaml and seeds recorded.
Comparison vs τ_T / r_phi_min / ε_flux_guard / p_dim completed; release decision transparent & auditable.
/validate passed G1–G8; figures dual-exported with units, see[]/version, and coverage notes; non-compliances tagged [Restricted].

Copyright & License: Unless otherwise stated, the copyright of “Energy Filament Theory” (including text, charts, illustrations, symbols, and formulas) is held by the author (屠广林).
License (CC BY 4.0): With attribution to the author and source, you may copy, repost, excerpt, adapt, and redistribute.
Attribution (recommended): Author: 屠广林｜Work: “Energy Filament Theory”｜Source: energyfilament.org｜License: CC BY 4.0
Call for verification: Independent and self-funded—no employer and no sponsorship. Next, we will prioritize venues that welcome public discussion, public reproduction, and public critique, with no country limits. Media and peers worldwide are invited to organize verification during this window and contact us.
Version info: First published: 2025-11-11 ｜ Current version: v6.0+5.05