12-EFT.WP.Methods.Repro v1.0 | Chapter 8 Benchmark Suite and Scoring


Chapter 8 Benchmark Suite and Scoring

I. Scope and Objectives

This chapter defines the benchmark-suite structure, execution gauges, and scoring functions required for reproducibility assessment, so that runs at different sites and on different versions are judged with the same metrics and gates.
Outputs include:
- Benchmark case types and manifest format, covering a minimal trio: deterministic / statistical / spectral.
- Score synthesis and gate system with core measures delta_rep, R_coef, delta_psd, R_spectrum, r_tb.
- Run and sampling rules, confidence evaluation, report fields, and the binding flow to PipelineCard / ParamCard.

II. Terms and Symbols

delta_rep = ( norm( y_new - y_ref ) / max( norm( y_ref ), eps_floor ) ) — relative difference for reproducibility results.
R_coef = 1 - delta_rep — reproducibility coefficient.
S_xx(f) — power spectral density; U_w — window energy; ENBW — equivalent noise bandwidth.
delta_psd — spectral discrepancy (see S32-21 in this chapter); R_spectrum = 1 - delta_psd.
r_tb — time-base residual; alpha, beta satisfy ts = alpha + beta * tau_mono.
gate.rep — reproducibility gate; tau_psd — spectral gate; tau_tb — time-base gate.
Name collisions: do not mix T_fil with T_trans; keep n strictly distinct from n_eff; if a case declares T_arr, compute both gauges in parallel and publish delta_form.
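The definitions of delta_rep and R_coef above can be sketched as follows. This is a minimal illustration, not a reference implementation: the source does not say which norm is meant, so the Euclidean (L2) norm is an assumption here, as is the default value of eps_floor.

```python
import math

def l2_norm(v):
    """Euclidean norm of a flat sequence of numbers (assumed norm)."""
    return math.sqrt(sum(x * x for x in v))

def delta_rep(y_new, y_ref, eps_floor=1e-12):
    """norm(y_new - y_ref) / max(norm(y_ref), eps_floor)."""
    if len(y_new) != len(y_ref):
        raise ValueError("y_new and y_ref must have the same length")
    diff = [a - b for a, b in zip(y_new, y_ref)]
    # eps_floor keeps the ratio finite when the reference is (near) zero.
    return l2_norm(diff) / max(l2_norm(y_ref), eps_floor)

def r_coef(y_new, y_ref, eps_floor=1e-12):
    """Reproducibility coefficient R_coef = 1 - delta_rep."""
    return 1.0 - delta_rep(y_new, y_ref, eps_floor)
```

Note that R_coef becomes negative whenever delta_rep exceeds 1, so a consumer that expects a value in [0,1] would need to clamp it explicitly.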

III. Postulates and Minimal Equations

P31-15 Portable benchmark postulate
With EnvLock effective and PipelineCard/ParamCard matched, cross-site runs produce benchmark statistics that are identically distributed, with deviations constrained by delta_rep and delta_psd.
S32-20 Primary score synthesis
- score = w1 * R_coef + w2 * R_spectrum + w3 * R_timebase + w4 * R_stability
- where w1 + w2 + w3 + w4 = 1, and
  R_timebase = max( 0 , 1 - ( r_tb / tau_tb ) );
  R_stability = pass_rate( checks ) (see Section VI), valued in [0,1].
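A minimal sketch of the S32-20 synthesis follows. The weight values are hypothetical placeholders chosen only so that they sum to 1; the source does not prescribe them.

```python
def r_timebase(r_tb, tau_tb):
    """R_timebase = max(0, 1 - r_tb / tau_tb)."""
    if tau_tb <= 0:
        raise ValueError("tau_tb must be positive")
    return max(0.0, 1.0 - r_tb / tau_tb)

def synthesize_score(r_coef, r_spectrum, r_tb, tau_tb, r_stability,
                     weights=(0.4, 0.3, 0.2, 0.1)):  # hypothetical weights
    """score = w1*R_coef + w2*R_spectrum + w3*R_timebase + w4*R_stability."""
    if len(weights) != 4 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be four values summing to 1")
    w1, w2, w3, w4 = weights
    return (w1 * r_coef + w2 * r_spectrum
            + w3 * r_timebase(r_tb, tau_tb) + w4 * r_stability)
```

With the placeholder weights, R_coef = 0.9, R_spectrum = 0.8, r_tb = 0.5, tau_tb = 1.0, and R_stability = 1.0 give a score of 0.8.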
S32-21 Spectral discrepancy (log-domain L2 gauge)
- Let L_x(f) = 10 * log10( S_xx(f) ), L_y(f) = 10 * log10( S_yy(f) ), with weighting w(f) satisfying ( ∫ w(f) d f ) = 1, then
  delta_psd = ( ∫ w(f) * ( L_x(f) - L_y(f) )^2 d f )^(1/2).
- Recommended w(f) ∝ ( |W(f)|^2 / ENBW ), where W(f) is the window frequency response.
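The S32-21 gauge can be approximated on a discrete frequency grid as sketched below. The sketch assumes a uniform grid, so the frequency step is absorbed into the normalized weights, and it defaults to uniform weights when none are given. Because L_x and L_y are in dB, the resulting delta_psd is also in dB.

```python
import math

def delta_psd(s_xx, s_yy, weights=None):
    """Discrete S32-21: sqrt( sum_k w_k * (L_x[k] - L_y[k])^2 ), sum_k w_k = 1."""
    if not s_xx or len(s_xx) != len(s_yy):
        raise ValueError("PSD arrays must be non-empty and of equal length")
    if weights is None:
        weights = [1.0] * len(s_xx)   # assumption: uniform weighting
    if len(weights) != len(s_xx):
        raise ValueError("weights must match the PSD length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    acc = 0.0
    for sx, sy, w in zip(s_xx, s_yy, weights):
        if sx <= 0 or sy <= 0:
            raise ValueError("PSD values must be positive for the log gauge")
        d = 10.0 * math.log10(sx) - 10.0 * math.log10(sy)
        acc += (w / total) * d * d     # normalize so the weights sum to 1
    return math.sqrt(acc)
```

Setting a weight to zero masks that bin, which is one way to express band masking in the same function.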
S32-22 Confidence evaluation and gate decision
With bootstrap upper bounds delta_rep^+ and delta_psd^+, the pass condition is
delta_rep^+ <= gate.rep and delta_psd^+ <= tau_psd and r_tb <= tau_tb.
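The gate decision in S32-22 might be sketched as below. The source does not state which statistic the bootstrap bounds or which bootstrap variant is used; this sketch assumes a percentile bootstrap on the mean of per-run values at a 95% level, with a fixed seed for repeatability. All function names are illustrative.

```python
import random

def bootstrap_upper(samples, level=0.95, n_boot=2000, seed=0):
    """Percentile-bootstrap upper bound on the mean of samples (assumed statistic)."""
    if not samples:
        raise ValueError("need at least one sample")
    rng = random.Random(seed)          # fixed seed keeps the bound reproducible
    n = len(samples)
    means = sorted(sum(rng.choices(samples, k=n)) / n for _ in range(n_boot))
    return means[min(n_boot - 1, int(level * n_boot))]

def gate_pass(delta_rep_runs, delta_psd_runs, r_tb, gate_rep, tau_psd, tau_tb):
    """delta_rep^+ <= gate.rep and delta_psd^+ <= tau_psd and r_tb <= tau_tb."""
    rep_upper = bootstrap_upper(delta_rep_runs, seed=0)
    psd_upper = bootstrap_upper(delta_psd_runs, seed=1)
    return rep_upper <= gate_rep and psd_upper <= tau_psd and r_tb <= tau_tb
```

Gating on the upper bound rather than the point estimate makes the decision conservative: a case passes only if the metric stays under the gate even at the pessimistic end of its resampling distribution.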

IV. Benchmark Suite and Manifest Gauges

Minimal benchmark set (≥ 1 case per class)
- Deterministic
  Fixed seed and inputs; expect delta_rep → 0. Cover boundary conditions, path gamma(ell), and deterministic replay of the operator chain.
- Statistical
  RNG enabled; compare distributional statistics (mean, variance, quantiles), evaluate the distribution of delta_rep and the confidence interval of R_coef.
- Spectral
  For time series or fields, assess S_xx(f) consistency and compute delta_psd and R_spectrum; the window must report U_w and ENBW.
Benchmark manifest fields (minimal ingestion set)
- benchmark.id, schema.version
- category ∈ {deterministic, statistical, spectral}
- inputs, reference.uri/hash, window, rng
- paths:[ gamma(ell) ] (when path integrals are involved)
- metrics = { delta_rep, R_coef, delta_psd, R_spectrum, r_tb }
- gates = { gate.rep, tau_psd, tau_tb }
- notes
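One way to check a manifest against the minimal ingestion set is sketched below. The nesting (for example, reference as a mapping with uri and hash), the per-category requirements, and every value in the example are assumptions for illustration; placeholders in angle brackets stand for values the source does not give.

```python
REQUIRED = ("benchmark.id", "schema.version", "category", "inputs",
            "reference", "metrics", "gates")
CATEGORIES = {"deterministic", "statistical", "spectral"}

def validate_manifest(manifest):
    """Return a list of problems; an empty list means the minimal set is present."""
    problems = [f"missing field: {key}" for key in REQUIRED if key not in manifest]
    category = manifest.get("category")
    if category not in CATEGORIES:
        problems.append(f"unknown category: {category!r}")
    # Assumed rules: spectral cases carry a window, statistical cases carry rng.
    if category == "spectral" and "window" not in manifest:
        problems.append("spectral case needs a window")
    if category == "statistical" and "rng" not in manifest:
        problems.append("statistical case needs rng settings")
    return problems

example_manifest = {
    "benchmark.id": "example-spectral-001",          # hypothetical id
    "schema.version": "1.0",
    "category": "spectral",
    "inputs": {"signal": "<input-uri>"},
    "reference": {"uri": "<reference-uri>", "hash": "<reference-hash>"},
    "window": {"name": "hann", "U_w": None, "ENBW": None},
    "metrics": ["delta_rep", "R_coef", "delta_psd", "R_spectrum", "r_tb"],
    "gates": {"gate.rep": 0.01, "tau_psd": 0.5, "tau_tb": 1e-6},  # illustrative
    "notes": "illustrative values only",
}
```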
Dual arrival-time gauges (when a case declares T_arr)
Publish both forms in parallel:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- T_arr = ( ∫ ( n_eff / c_ref ) d ell )
Include delta_form together with the path gamma(ell) and the measure d ell.
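The two arrival-time gauges can be evaluated numerically along a sampled path as sketched here. The source does not define delta_form in this chapter; the sketch assumes it is the absolute difference between the two forms. It also assumes trapezoidal integration over samples of n_eff at path positions ell, and a c_ref that is constant along the path.

```python
def trapezoid(values, ell):
    """Trapezoidal integral of sampled values over path positions ell."""
    if len(values) != len(ell) or len(ell) < 2:
        raise ValueError("need matching arrays with at least two samples")
    return sum(0.5 * (values[i] + values[i + 1]) * (ell[i + 1] - ell[i])
               for i in range(len(ell) - 1))

def dual_t_arr(n_eff, ell, c_ref):
    """Return (T_arr with 1/c_ref outside, T_arr with 1/c_ref inside, delta_form)."""
    t_outside = trapezoid(n_eff, ell) / c_ref
    t_inside = trapezoid([n / c_ref for n in n_eff], ell)
    delta_form = abs(t_outside - t_inside)   # assumed definition of delta_form
    return t_outside, t_inside, delta_form
```

With a constant c_ref the two forms agree up to floating-point rounding, so a delta_form well above rounding level would point to an implementation inconsistency rather than a physical effect.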

V. Algorithms and Implementation Bindings

I30-3 run_benchmark_suite(card:dict) -> BenchReport
- Parse the benchmark manifest and RunPlan; execute each case N times (statistical class requires N >= N_min).
- Collect outputs; compute delta_rep, R_coef; call I30-7 to compute delta_psd.
- Estimate r_tb (see Chapter 6); produce per-case reports and synthesized score.
I30-4 verify_reproduction(golden:any, candidate:any, metrics:dict) -> RepReport
- Align time bases ts = alpha + beta * tau_mono.
- Compute confidence upper bounds for delta_rep and delta_psd.
- Output pass:bool with gate comparisons.
I30-7 compare_psd(x:any, y:any, window:dict) -> { delta_psd:float, pass:bool }
- Estimate S_xx(f) and S_yy(f); audit U_w and ENBW.
- Compute delta_psd per S32-21 and compare to tau_psd.
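The time-base alignment step of I30-4 (ts = alpha + beta * tau_mono) can be illustrated with an ordinary least-squares fit. The definition of r_tb belongs to Chapter 6; treating it as the RMS of the fit residuals is an assumption made only for this sketch.

```python
import math

def fit_timebase(tau_mono, ts):
    """Least-squares fit ts = alpha + beta * tau_mono.

    Returns (alpha, beta, r_tb), with r_tb taken as the RMS residual
    (an assumed definition).
    """
    n = len(tau_mono)
    if n != len(ts) or n < 2:
        raise ValueError("need at least two paired timestamps")
    mean_x = sum(tau_mono) / n
    mean_y = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in tau_mono)
    if sxx == 0:
        raise ValueError("tau_mono must not be constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tau_mono, ts))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    residuals = [y - (alpha + beta * x) for x, y in zip(tau_mono, ts)]
    r_tb = math.sqrt(sum(r * r for r in residuals) / n)
    return alpha, beta, r_tb
```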

VI. Metrology Flows and Run Graph

Mx-42 benchmark-plan
- Select the benchmark manifest and bind it to the PipelineCard.
- Audit EnvLock and ParamCard.
- Generate a RunPlan and preset TS.* observation points.
Mx-43 execute-and-measure
- Execute each case, recording TS.* and output traces.
- Compute delta_rep, R_coef, delta_psd, r_tb.
- Run the stability check set checks = { monotonicity, bounds, mass_conservation, jitter }, yielding R_stability = pass_rate( checks ).
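A possible shape for the stability check set and its pass rate is sketched below. The source names the four checks but does not define them, so each concrete test here (strictly increasing timestamps, values within fixed bounds, output sum matching an expected mass, step deviation from the median step) is an assumption.

```python
import statistics

def stability_checks(ts, values, lower, upper, mass_in, mass_tol, jitter_tol):
    """Evaluate the four named checks with assumed, simplified definitions."""
    steps = [b - a for a, b in zip(ts, ts[1:])]
    if not steps:
        raise ValueError("need at least two timestamps")
    nominal = statistics.median(steps)
    return {
        "monotonicity": all(s > 0 for s in steps),
        "bounds": all(lower <= v <= upper for v in values),
        "mass_conservation": abs(sum(values) - mass_in) <= mass_tol,
        "jitter": max(abs(s - nominal) for s in steps) <= jitter_tol,
    }

def pass_rate(checks):
    """R_stability: fraction of checks that passed, valued in [0, 1]."""
    if not checks:
        raise ValueError("check set must not be empty")
    return sum(1 for ok in checks.values() if ok) / len(checks)
```

For example, a trace whose last sampling step is 50% longer than the median fails only the jitter check under a 0.1 tolerance, giving R_stability = 0.75.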
Mx-44 score-and-publish
- Synthesize score and confidence intervals; decide pass.
- Produce a signed BenchReport and ingest it for archival.

VII. Verification and Test Matrix

Minimum required
- Deterministic replay: same seed and inputs; require delta_rep <= gate.rep.
- Spectral consistency: fixed window; delta_psd <= tau_psd, and verify var( x ) ≈ ( ∫ S_xx(f) d f ).
- Time-base gate: r_tb <= tau_tb.
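The check var( x ) ≈ ( ∫ S_xx(f) d f ) can be illustrated with a rectangular-window periodogram, for which the relation holds exactly by Parseval's theorem once the mean is removed. This is a sketch for small inputs: it uses a direct O(N^2) DFT to stay dependency-free, and the two-sided normalization S[k] = |X_k|^2 / (fs * N) is the one assumed here. With other windows or averaged estimators, the match is only approximate.

```python
import cmath
import math

def remove_mean(x):
    m = sum(x) / len(x)
    return [v - m for v in x]

def periodogram(x, fs):
    """Two-sided rectangular-window periodogram, S[k] = |X_k|^2 / (fs * N)."""
    n = len(x)
    spectrum = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
        spectrum.append(abs(acc) ** 2 / (fs * n))
    return spectrum

def integrated_power(s_xx, fs):
    """Riemann sum of S_xx over the full band with df = fs / N."""
    return sum(s_xx) * (fs / len(s_xx))

def variance(x):
    """Population variance (divides by N), matching the Parseval identity."""
    x0 = remove_mean(x)
    return sum(v * v for v in x0) / len(x0)
```

A test signal made of a unit sine at bin 3 plus a half-amplitude cosine at bin 5 over 64 samples has variance 0.625, and the integrated periodogram reproduces that value to rounding error.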
Boundary and extreme cases
- Extreme window leakage (degenerate U_w or anomalous ENBW) must trigger E_WINDOW_INVALID (thrown by the binding).
- Low-SNR segments should harden w(f) (e.g., band masking) and list the masked bands in the report.
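A window audit that raises E_WINDOW_INVALID might look like the sketch below. It assumes U_w is the sum of squared window samples and expresses ENBW in bins as N * sum(w^2) / (sum(w))^2; the exception class name and the enbw_max_bins threshold are hypothetical.

```python
import math

class WindowInvalidError(ValueError):
    """Hypothetical exception carrying the E_WINDOW_INVALID code."""
    code = "E_WINDOW_INVALID"

def hann_periodic(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def audit_window(w, enbw_max_bins=4.0):   # threshold is an illustrative choice
    """Return (U_w, ENBW in bins); raise WindowInvalidError on degenerate input."""
    n = len(w)
    if n == 0:
        raise WindowInvalidError("empty window")
    u_w = sum(v * v for v in w)           # assumed definition of window energy
    s = sum(w)
    if not math.isfinite(u_w) or u_w <= 0.0 or s == 0.0:
        raise WindowInvalidError("degenerate U_w or zero window sum")
    enbw = n * u_w / (s * s)
    if enbw > enbw_max_bins:
        raise WindowInvalidError(f"anomalous ENBW: {enbw:.3f} bins")
    return u_w, enbw
```

Under these definitions a rectangular window has ENBW of 1 bin, the minimum possible, and a periodic Hann window has 1.5 bins.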
Statistical power
With single-run variance estimate s^2, choose the sample size N such that
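N >= ceil( z_{1-β}^2 * s^2 / tau^2 ), where z_{1-β} is the standard-normal quantile at probability 1-β, tau is the target precision, and β is the type-II error cap (not to be confused with the time-base slope beta).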
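The sample-size rule above translates directly into code. The sketch uses the standard-normal quantile from Python's statistics module; the function name and the floor of one run are illustrative choices.

```python
import math
from statistics import NormalDist

def min_sample_size(s2, tau, beta):
    """Smallest N with N >= z_{1-beta}^2 * s^2 / tau^2 (at least one run)."""
    if s2 < 0 or tau <= 0 or not 0.0 < beta < 1.0:
        raise ValueError("require s2 >= 0, tau > 0, and 0 < beta < 1")
    z = NormalDist().inv_cdf(1.0 - beta)   # standard-normal quantile
    return max(1, math.ceil(z * z * s2 / (tau * tau)))
```

For s^2 = 4.0, tau = 0.5, and β = 0.05, the quantile is about 1.645 and the rule yields N = 44.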

VIII. Report Fields and Visualization

report = {
  suite.id, schema.version, runs:[ ... ],
  metrics:{ delta_rep, R_coef, delta_psd, R_spectrum, r_tb, R_timebase, R_stability, score },
  gates:{ gate.rep, tau_psd, tau_tb },
  confint:{ delta_rep:[lo,hi], delta_psd:[lo,hi] },
  windows:{ U_w, ENBW, w(f) },
  timebase:{ alpha, beta },
  arrival:{ delta_form?, paths? }
}
Visualization suggestions
- Error-bar comparisons for R_coef and R_spectrum.
- Overlaid log-domain curves L_x(f) and L_y(f) with residuals.
- Scatter of pre-/post-alignment times with linear-fit residual distributions.

IX. Cross-References and Dependencies

Core.Metrology: S_xx(f) estimation, window U_w and ENBW, uncertainty gauges.
Core.Threads: runtime TS.* observation, hb and bp.
Chapter 5 EnvLock; Chapter 6 time-base and randomness; Chapter 7 PipelineCard / ParamCard; Chapter 12 acceptance and publication.

X. Risks, Limits, and Open Questions

Risks
Band selection can bias scores; poorly chosen windows introduce spectral leakage; small cross-device time-base drifts can cause delta_psd to be underestimated.
Limits
High-dimensional outputs may require a blockwise or featurized delta_rep to avoid masking local mismatches; scores are sensitive to empirical choices of w(f).
Open questions
Learning adaptive weights w1..w4 and domain adaptation; standard frequency bands for cross-domain spectral benchmarks; multi-path fusion of dual T_arr gauges with attribution of delta_form.

Copyright & License (CC BY 4.0)

Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.

First published： 2025-11-11｜Current version：v5.1
License link：https://creativecommons.org/licenses/by/4.0/