56-Report-Level Methods Appendix Template v1.0 | Chapter 8 Evaluation Protocol & Metrics


Chapter 8 Evaluation Protocol & Metrics

I. Chapter Goals & Scope (Mandatory)

Fix a unified caliber (measurement convention) for the evaluation protocol, metric definitions, gates, statistics, observation windows, and pass lines, so that results are comparable, replayable, and auditable.
This chapter is aligned with Ch.5 (Data & Experimental Design), Ch.6 (Math & Pseudocode), Ch.7 (Metrology & Calibration), and Ch.11 (Implementation Binding).

II. Protocol Structure & General Requirements (Mandatory)

Minimal elements: target under test, data partitions, metric list, gate thresholds, statistical method & intervals, script locators (script@commit), exported results & hashes.
Observation window: ISO 8601 timestamps in UTC; rolling evaluations must specify step/overlap.
Repeatability: fix random seeds; reruns across environments must give consistent results (pin the container image and lock dependencies).
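
A minimal sketch of how a rolling evaluation could declare its step and overlap explicitly. The helper names rolling_windows and iso_utc, and the 3-day length with a 1-day step, are illustrative assumptions and not part of the template.

```python
from datetime import datetime, timedelta, timezone

def rolling_windows(start, end, length, step):
    """Yield (window_start, window_end) pairs in UTC.

    Consecutive windows overlap by length - step; step == length gives
    back-to-back windows. Only windows that fit fully before `end` are yielded.
    """
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    if length <= timedelta(0) or step <= timedelta(0):
        raise ValueError("length and step must be positive")
    cur = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    while cur + length <= end:
        yield cur, cur + length
        cur += step

def iso_utc(ts):
    """Format a datetime as ISO 8601 in UTC with a trailing 'Z'."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# Hypothetical week of data, 3-day windows advanced by 1 day (2-day overlap).
start = datetime(2025, 9, 1, tzinfo=timezone.utc)
end = datetime(2025, 9, 8, tzinfo=timezone.utc)
windows = list(rolling_windows(start, end, timedelta(days=3), timedelta(days=1)))
for w_start, w_end in windows:
    print(iso_utc(w_start), "->", iso_utc(w_end))
```

Recording length and step together with each window boundary lets a rerun reconstruct exactly the same partitions.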

III. Metric Definitions & Direction (Mandatory)

Naming: metric_name; direction uses arrows — ↑ higher-is-better, ↓ lower-is-better, ≈ closer-is-better.
Common metrics (examples):
- gate_accuracy (↑) — vs analytic or fine-grid baseline;
- gate_latency (↓) — end-to-end latency for batch/stream;
- compat_rate (↑) — replay/production compatibility pass rate;
- error_rate (↓) — error/exception ratio;
- cal_residuals (↓) — calibration residuals (see Ch.7).
If arrival-time criteria are involved, restate the caliber in the same paragraph:
- Factored: T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- General: T_arr = ( ∫ ( n_eff / c_ref ) d ell )
  and explicitly declare path gamma(ell) and measure d ell, with check_dim=true.
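
As an illustration of the two forms above, the sketch below integrates a made-up n_eff profile along a straight path and checks that the factored and general forms agree when c_ref is constant along the path. The numeric value of C_REF, the linear profile, and the function names are assumptions chosen for the example only.

```python
import math

C_REF = 299_792_458.0  # assumed reference speed in m/s, for illustration only

def trapezoid(values, d_ell):
    """Composite trapezoid rule on a uniform grid with spacing d_ell."""
    return d_ell * (sum(values) - 0.5 * (values[0] + values[-1]))

def arrival_times(n_eff, length, steps, c_ref=C_REF):
    """Return (factored, general) T_arr in seconds.

    Path gamma(ell): straight segment, ell in [0, length] metres.
    Measure d ell: uniform step length / steps.
    Dimension check: (s/m) * m = s, since n_eff is dimensionless.
    """
    d_ell = length / steps
    n_vals = [n_eff(i * d_ell) for i in range(steps + 1)]
    factored = (1.0 / c_ref) * trapezoid(n_vals, d_ell)
    general = trapezoid([n / c_ref for n in n_vals], d_ell)
    return factored, general

def profile(ell):
    """Hypothetical profile: n_eff rises linearly from 1.0 to 1.001 over 1000 m."""
    return 1.0 + 1e-6 * ell

t_factored, t_general = arrival_times(profile, length=1000.0, steps=1000)
print(t_factored, t_general)
```

If c_ref varied along the path, only the general form would remain valid, which is one reason to state the form, path, and measure side by side.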

IV. Gates & Pass Lines (Mandatory)

Naming rule: gate_<metric><comparator><threshold>@<window>, e.g., gate_accuracy>=0.99@7d, gate_latency<=2h@7d, compat_rate>=0.995@replay.
Layers: hard gates (any failure blocks) and soft gates (may pass with signed clauses, per Decision & Sign-off).
Statistical caliber: specify the method (e.g., bootstrap), the 95% confidence interval (CI_95%), the sample size, and the sampling strategy; record both raw and aggregated measures.
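
The following sketch shows one way the statistical caliber and the gate layers might fit together: a percentile bootstrap for the 95% interval, then a decision that blocks on any hard-gate failure and requires sign-off for failed soft gates. The accuracy values, the status strings, and the choice to gate on the point estimate are all assumptions for illustration.

```python
import random
import statistics

def bootstrap_ci(samples, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(samples)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo_idx = max(0, round((1.0 - level) / 2.0 * n_resamples))
    hi_idx = min(n_resamples - 1, round((1.0 + level) / 2.0 * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]

def decide(hard, soft, signed_off=frozenset()):
    """Combine gate outcomes; `hard` and `soft` map gate expression -> passed."""
    if not all(hard.values()):
        return "blocked"  # any hard-gate failure blocks
    failed_soft = {gate for gate, ok in soft.items() if not ok}
    if not failed_soft:
        return "pass"
    if failed_soft <= set(signed_off):
        return "pass_with_sign_off"
    return "pending_sign_off"

# Hypothetical per-day accuracy over a 7d window.
accuracy = [0.991, 0.994, 0.992, 0.995, 0.993, 0.990, 0.996]
ci_low, ci_high = bootstrap_ci(accuracy, n_resamples=1000, level=0.95, seed=42)
hard = {"gate_accuracy>=0.99@7d": statistics.fmean(accuracy) >= 0.99}
soft = {"unit_cost<=1.0x@30d": False}  # pretend the soft gate failed
print(ci_low, ci_high, decide(hard, soft))
```

A stricter protocol could compare the lower CI bound, not the mean, against the threshold; whichever is chosen should be recorded with the gate.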

V. Execution & Recording (Mandatory)

Inputs to record: versioned data/model/config, run parameters, random seeds, numeric accuracy params such as Δell/ε_int, environment hashes.
Exports: yaml/json/pdf artifacts containing raw result summaries + gate decisions + confidence intervals + script locators + artifact hashes.
Baseline policy: name the baseline Option Base; tabulate per-metric and overall comparisons against the candidates.
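
A minimal sketch of an export record that carries gate decisions, confidence intervals, script locators, and artifact hashes. The field names, the SHA-256 choice, and the sample values are assumptions; the template itself only requires that these items be recorded.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

def build_record(summary, gate_decisions, scripts, artifacts):
    """Assemble an export record and a hash over its canonical JSON form.

    `artifacts` maps an artifact name to its raw bytes.
    """
    record = {
        "summary": summary,
        "gate_decisions": gate_decisions,
        "scripts": scripts,
        "artifact_hashes": {name: sha256_hex(blob) for name, blob in artifacts.items()},
    }
    # Sorted keys and fixed separators make the serialization reproducible.
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return record, sha256_hex(canonical.encode("utf-8"))

# Hypothetical values for illustration.
record, record_hash = build_record(
    summary={"gate_accuracy": {"mean": 0.993, "ci_95": [0.991, 0.995]}},
    gate_decisions={"gate_accuracy>=0.99@7d": "pass"},
    scripts={"eval": "eval.py@a1b2c3"},
    artifacts={"report.json": b'{"ok": true}'},
)
print(json.dumps(record, indent=2, ensure_ascii=False))
print(record_hash)
```

Hashing a canonical serialization means two runs with identical inputs yield the same record hash, which supports the replay and audit goals.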

VI. Benchmarks & Controls (Mandatory)

Benchmark sources: analytic solutions, ultra-fine-grid computations (smaller Δell), authoritative public benchmarks; state applicability and max deviation.
Control dimensions: performance/cost/time/risk/dependency/reproducibility/compatibility; default weights and discount policy per Ch.5 & Ch.6.

VII. Evaluation Checklist Template (copy-ready)

Dimension	Metric	Dir.	Gate	Window	Stats/CI	Script	Notes
Performance	gate_accuracy	↑	>=0.99	@7d	bootstrap/95%	eval.py@a1b2c3	Analytic or fine-grid baseline
Latency	gate_latency	↓	<=2h	@7d	quantiles/mean	runner.py@9f8e7d	Batch
Compatibility	compat_rate	↑	>=0.995	@replay	binomial/95%	replay.sh@d4e5f6	Prod replay
Metrology	cal_residuals	↓	<=3σ	@validation	normal assumption	calib.py@c0ffee	See Ch.7
Errors	error_rate	↓	<=1e-3	@24h	Poisson/Binomial	monitor@abcd12	Online
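
The checklist rows can be turned into gate expressions mechanically. The sketch below joins the Metric, Gate, and Window columns and checks that the comparator agrees with the declared direction. Mapping ≈ to == is an assumption, as are the function and variable names.

```python
import csv
import io

CHECKLIST_TSV = (
    "Dimension\tMetric\tDir.\tGate\tWindow\tStats/CI\tScript\tNotes\n"
    "Performance\tgate_accuracy\t↑\t>=0.99\t@7d\tbootstrap/95%\teval.py@a1b2c3\tAnalytic or fine-grid baseline\n"
    "Latency\tgate_latency\t↓\t<=2h\t@7d\tquantiles/mean\trunner.py@9f8e7d\tBatch\n"
    "Compatibility\tcompat_rate\t↑\t>=0.995\t@replay\tbinomial/95%\treplay.sh@d4e5f6\tProd replay\n"
    "Metrology\tcal_residuals\t↓\t<=3σ\t@validation\tnormal assumption\tcalib.py@c0ffee\tSee Ch.7\n"
    "Errors\terror_rate\t↓\t<=1e-3\t@24h\tPoisson/Binomial\tmonitor@abcd12\tOnline\n"
)

# Comparator expected for each direction arrow; "≈" -> "==" is an assumption.
EXPECTED_COMPARATOR = {"↑": ">=", "↓": "<=", "≈": "=="}

def read_rows(tsv_text):
    """Parse the tab-separated checklist into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

def gate_expression(row):
    """Build <metric><comparator><threshold>@<window> from one checklist row."""
    return row["Metric"] + row["Gate"] + row["Window"]

def direction_consistent(row):
    """True when the gate comparator matches the metric's direction arrow."""
    expected = EXPECTED_COMPARATOR.get(row["Dir."])
    return expected is not None and row["Gate"].startswith(expected)

rows = read_rows(CHECKLIST_TSV)
expressions = [gate_expression(r) for r in rows]
for row, expr in zip(rows, expressions):
    print(expr, "ok" if direction_consistent(row) else "direction mismatch")
```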

VIII. Machine Structure (YAML; JSON-equivalent, Mandatory)

evaluation:
  window: { start: "2025-09-01T00:00:00Z", end: "2025-09-07T23:59:59Z", timezone: "UTC" }
  metrics:
    - { name: "gate_accuracy", direction: "↑", desc: "vs. analytic or fine-grid baseline" }
    - { name: "gate_latency", direction: "↓", desc: "E2E latency for batch/stream" }
    - { name: "compat_rate", direction: "↑", desc: "replay/prod compatibility pass rate" }
    - { name: "cal_residuals", direction: "↓", desc: "calibration residuals" }
    - { name: "error_rate", direction: "↓", desc: "operational error rate" }
  gates:
    hard: ["gate_accuracy>=0.99@7d","compat_rate>=0.995@replay","gate_latency<=2h@7d"]
    soft: ["unit_cost<=1.0x@30d"]
  stats:
    method: "bootstrap"
    ci: "95%"
    samples: 1000
  baseline:
    name: "Option Base"
    reference:
      type: "analytic|fine_grid"
      script: "fine_grid.py@bead55"
  artifacts: ["yaml","json","pdf"]
  scripts:
    eval: "eval.py@a1b2c3"
    runner: "runner.py@9f8e7d"
    replay: "replay.sh@d4e5f6"
arrival_time:
  caliber:
    forms:
      - { name: "general", expr: "( ∫ ( n_eff / c_ref ) d ell )" }
      - { name: "factored", expr: "( 1 / c_ref ) * ( ∫ n_eff d ell )" }
    path: "gamma(ell)"
    measure: "d ell"
    check_dim: true

IX. Human × Machine Mapping (Mandatory)

Human Section	Machine Field	Validation Focus
Protocol elements	evaluation.window, evaluation.scripts.*	Window/scripts/environment explicit
Metrics & direction	evaluation.metrics[]	↑/↓/≈ consistent, clear descriptions
Gates & thresholds	evaluation.gates.hard/soft	Naming/comparator/window unified
Statistics & intervals	evaluation.stats.*	Method/CI/sample size complete
Benchmark & controls	evaluation.baseline.*	Benchmark type and script locator
Arrival-time caliber	arrival_time.caliber.*	Two forms + path/measure + check_dim

X. Validation Rules (regex/consistency, Mandatory)

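Gate expression: ^gate_[a-z0-9_]+(>=|<=|==)[^@\s]+@[^\s]+$ (write \s as \\s only when the pattern is embedded in a JSON or double-quoted YAML string); expressions on metrics without the gate_ prefix, such as compat_rate>=0.995@replay, are accepted equivalently.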
Time window: start ≤ end and timezone="UTC".
Metric direction: direction ∈ {↑, ↓, ≈}.
Arrival-time: if evaluation involves T_arr, arrival_time.caliber.path/measure must exist and check_dim=true.
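
The rules above could be checked with a small validator along the following lines. It is a sketch under stated assumptions: the \\s in the gate pattern is read as the string-escaped form of \s, un-prefixed metrics are matched by the same pattern without gate_, and the function names and error messages are illustrative.

```python
import re
from datetime import datetime

GATE_RE = re.compile(r"^gate_[a-z0-9_]+(>=|<=|==)[^@\s]+@[^\s]+$")
# Assumption: metrics without the gate_ prefix use the same shape.
BARE_RE = re.compile(r"^[a-z0-9_]+(>=|<=|==)[^@\s]+@[^\s]+$")
DIRECTIONS = {"↑", "↓", "≈"}

def is_valid_gate(expr):
    """True when the expression matches the prefixed or un-prefixed pattern."""
    return bool(GATE_RE.match(expr) or BARE_RE.match(expr))

def parse_ts(text):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is read as +00:00."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))

def validate(cfg, uses_t_arr=False):
    """Return a list of rule violations; an empty list means the config passes."""
    errors = []
    ev = cfg["evaluation"]
    for layer in ("hard", "soft"):
        for expr in ev["gates"].get(layer, []):
            if not is_valid_gate(expr):
                errors.append(f"bad gate expression: {expr!r}")
    window = ev["window"]
    if window.get("timezone") != "UTC":
        errors.append("window.timezone must be UTC")
    if parse_ts(window["start"]) > parse_ts(window["end"]):
        errors.append("window.start must not be after window.end")
    for metric in ev["metrics"]:
        if metric["direction"] not in DIRECTIONS:
            errors.append(f"bad direction for {metric['name']}")
    if uses_t_arr:
        cal = cfg.get("arrival_time", {}).get("caliber", {})
        if not cal.get("path") or not cal.get("measure") or cal.get("check_dim") is not True:
            errors.append("arrival_time.caliber needs path, measure, check_dim=true")
    return errors

cfg = {
    "evaluation": {
        "window": {"start": "2025-09-01T00:00:00Z", "end": "2025-09-07T23:59:59Z",
                   "timezone": "UTC"},
        "metrics": [{"name": "gate_accuracy", "direction": "↑"}],
        "gates": {"hard": ["gate_accuracy>=0.99@7d", "compat_rate>=0.995@replay"],
                  "soft": ["unit_cost<=1.0x@30d"]},
    },
    "arrival_time": {"caliber": {"path": "gamma(ell)", "measure": "d ell",
                                 "check_dim": True}},
}
print(validate(cfg, uses_t_arr=True))
```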

XI. Citation & Cross-Reference Style (Mandatory)

Fixed format: “See 《 vX.Y》 Ch.x S/P/M/I…”; all EFT.WP.* citations must include an explicit version and anchor, with a machine-readable list in references.see[].

Copyright & License: Unless otherwise stated, the copyright of “Energy Filament Theory” (including text, charts, illustrations, symbols, and formulas) is held by the author (屠广林).
License (CC BY 4.0): With attribution to the author and source, you may copy, repost, excerpt, adapt, and redistribute.
Attribution (recommended): Author: 屠广林｜Work: “Energy Filament Theory”｜Source: energyfilament.org｜License: CC BY 4.0
Call for verification: Independent and self-funded—no employer and no sponsorship. Next, we will prioritize venues that welcome public discussion, public reproduction, and public critique, with no country limits. Media and peers worldwide are invited to organize verification during this window and contact us.
Version info: First published: 2025-11-11 ｜ Current version: v6.0+5.05