46-EFT.WP.Data.Benchmarks v1.0 | Chapter 1 Overview & Scope

Chapter 1 Overview & Scope

I. Chapter Purpose & Scope

Establish a unified approach to benchmarks: how to construct suites, define tasks and subtasks, specify evaluation protocols and metrology, and publish reproducible experiments under leaderboard governance.
Clarify applicability: offline, online, streaming, and interactive evaluations; single-model and end-to-end systems; single-dataset and multi-dataset joint evaluations; cross-modal and cross-lingual settings.
Align interfaces and anchors with companion volumes: DatasetCards, ModelCards, Pipeline, and SI units with dimensional checks.

II. Definitions & Terms

Benchmark: a reproducible comparison of targets under given data and protocol.
Suite: an organizational unit composed of tasks and subtasks, with shared protocol, aggregation, and governance rules.
Task/Subtask: an evaluation unit specifying io_mode, input assumptions, constraints, and target metrics.
Track: a branch of a task that differs in resources, tools, or openness (e.g., “closed-book/open-book”, “no-tools/tools-allowed”).
Submission: an accepted evaluation run and its artifacts (with run_id, environment lock, and metric report).
Artifact: a verifiable file/object (bound by sha256).
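Frozen splits: the sets S_train/S_val/S_test, immutable at index level so that no sample can leak between splits.
Statistical significance: a statistical decision on whether a metric difference is distinguishable from noise; report the p-value (p), the 95% confidence interval (CI_95), and the multiple-comparison correction method.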
Path quantities (e.g., arrival time): if T_arr appears, use one of the following forms:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ), or
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ),
  and explicitly declare the path gamma(ell) and the measure d ell, then run dimensional consistency checks.
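
As a worked illustration of the two T_arr forms above, the following minimal sketch integrates a hypothetical n_eff profile along a straight path with the trapezoidal rule. The profile, the 10 km path, the sampling step, and the helper names are assumptions made for illustration only. c_ref is taken here as the SI speed of light in vacuum, which this chapter does not fix. The two forms agree because c_ref is treated as constant along the path.

```python
import math

C_REF = 299_792_458.0  # m/s; assumed reference speed (not fixed by this chapter)

def trapezoid(ys, xs):
    """Trapezoidal rule for samples ys taken at positions xs."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))

def n_eff(ell):
    """Hypothetical dimensionless effective index along the path."""
    return 1.0 + 1e-3 * math.exp(-ell / 1000.0)

# Path gamma(ell): a straight 10 km segment sampled every 10 m (ell in metres).
ells = [10.0 * i for i in range(1001)]
n_vals = [n_eff(ell) for ell in ells]

# Form 1: the constant factor 1 / c_ref is pulled out of the integral.
t_arr_1 = (1.0 / C_REF) * trapezoid(n_vals, ells)

# Form 2: the factor stays inside the integrand.
t_arr_2 = trapezoid([n / C_REF for n in n_vals], ells)

# Dimensional check: [1] * [m] / [m/s] = [s] for both forms.
assert math.isclose(t_arr_1, t_arr_2, rel_tol=1e-12)
```

If c_ref were allowed to vary along the path, only the second form would remain valid, which is one reason to declare the path and the measure explicitly.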

III. Background & Motivation

Common pitfalls: uncontrolled leakage, ambiguous protocols, incomparable metrics, irreproducible environments, weak leaderboard governance.
Objective: provide an end-to-end framework that is protocol-first and built on frozen data, unified units, statistical rigor, and transparent governance.

IV. Design Principles (P01–P05)

P01 Reproducible: inputs, environment, randomness, and implementation are all lockable; seed and deps_lock are mandatory.
P02 Measurable: all metrics use SI units; composite metrics are normalized first and then combined; check_dim=true.
P03 Comparable: fixed protocols, frozen splits, unified aggregation (macro/micro/weighted) and confidence intervals.
P04 Governable: submission workflow, review gates, retraction/correction, and versioning are open and transparent.
P05 Extensible: tasks/data/metrics/protocols evolve via semantic versioning vMAJOR.MINOR.PATCH.
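
To make the aggregation modes named in P03 concrete, here is a minimal sketch of macro, micro, and weighted averaging over per-group accuracy. The function name, the input layout, and the example counts and weights are hypothetical; the chapter itself does not prescribe an implementation.

```python
def aggregate(groups, weights=None):
    """Aggregate per-group accuracy three ways.

    groups:  {name: (n_correct, n_total)}
    weights: optional {name: weight} used by the weighted mean (hypothetical input)
    """
    acc = {g: c / n for g, (c, n) in groups.items()}

    # Macro: every group counts equally, regardless of size.
    macro = sum(acc.values()) / len(acc)

    # Micro: pool all examples, so large groups dominate.
    micro = sum(c for c, _ in groups.values()) / sum(n for _, n in groups.values())

    # Weighted: explicit weights, normalised to sum to one.
    if weights is None:
        weights = {g: 1.0 for g in groups}
    w_total = sum(weights[g] for g in groups)
    weighted = sum(acc[g] * weights[g] for g in groups) / w_total

    return {"macro": macro, "micro": micro, "weighted": weighted}

# Group A is large and easy, group B is small and hard.
scores = aggregate({"A": (90, 100), "B": (5, 10)}, weights={"A": 1.0, "B": 3.0})
```

With these counts the three modes give noticeably different numbers (about 0.70, 0.86, and 0.60), which is why a protocol needs to fix the aggregation mode before results are compared.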

V. In Scope & Out of Scope

In scope: classification/regression/ranking/retrieval/generation/multimodal; offline batch, online A/B, streaming, and interactive evaluation.
Out of scope: optimization of training recipes per se; non‑public protocols bound to sensitive commercial data; datasets that cannot be frozen at index level.
Cross‑volume dependencies:
- Data: see EFT.WP.Data.DatasetCards v1.0.
- Models: see EFT.WP.Data.ModelCards v1.0.
- Pipelines: see EFT.WP.Data.Pipeline v1.0.

VI. Deliverables & Release Gates

Mandatory exports: benchmark.yaml/json, protocol.yaml, metrics.yaml, env.lock, splits/*.index, and reports/*.jsonl, each with its sha256 digest.
Gates:
- Frozen splits and leakage guardrails enabled;
- SI metrology and dimensional checks pass;
- Significance and uncertainty reports included;
- Privacy, residency, and third‑party processing registered.
Leaderboard governance: stability line, shadow comparisons, submission cooldown, and arbitration process.
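
The sha256 requirement on exported artifacts could be checked along the following lines. This is a minimal sketch using only the Python standard library; the function names and the manifest layout (a list of entries with "path" and "sha256" keys) are assumptions modelled on the export_manifest fragment in Section VIII, not a prescribed tool.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hex sha256 digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(root, artifacts):
    """Return (path, reason) pairs for artifacts that are missing or altered.

    artifacts: list of {"path": ..., "sha256": ...} entries (assumed layout).
    An empty result means every listed artifact matches its digest.
    """
    failures = []
    for entry in artifacts:
        candidate = Path(root) / entry["path"]
        if not candidate.is_file():
            failures.append((entry["path"], "missing"))
        elif sha256_of(candidate) != entry["sha256"]:
            failures.append((entry["path"], "sha256 mismatch"))
    return failures
```

A release gate could then block publication whenever verify_manifest returns a non-empty list.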

VII. Cross‑References & Dependencies

Evaluation protocol & metrics: see EFT.WP.Data.ModelCards v1.0, Chapter 11.
Performance, cost & scaling: see EFT.WP.Data.Pipeline v1.0, Chapter 13.
Units & dimensions: see EFT.WP.Core.Metrology v1.0:check_dim.
Fixed cross‑volume phrasing example: “See companion white paper Energy Threads, Chapter x, S/P/M/I…”.

VIII. Machine‑Readable Overview (Normative)

suite:
  id: "eift.benchmarks.core"
  title: "EIFT Core Benchmarks"
  version: "v1.0.0"
  modalities: ["text", "image", "audio"]
  risks: ["leakage", "bias", "spurious_correlation"]
tasks:
  - id: "cls.binary"
    io_mode: "offline"
    tracks: ["closed-book"]
    dataset_ref: "datasets/core_cls@v1.0"
    sampling: {strategy: "stratified", strata: [{by: "label"}]}
    splits:
      train: {frozen: true, index: "splits/train.index", sha256: "<hex>"}
      val: {frozen: true, index: "splits/val.index", sha256: "<hex>"}
      test: {frozen: true, index: "splits/test.index", sha256: "<hex>"}
    leakage_guard: ["per-object", "per-scene"]
    protocol:
      seed: 1701
      repeats: 5
      temperature: 0.0
      tools_allowed: false
      runtime_limits: {timeout_s: 3600}
    metrics:
      - {name: "Acc", unit: "—", higher_is_better: true, agg: "macro"}
      - {name: "ECE", unit: "—", higher_is_better: false}
aggregation:
  levels: ["task", "suite"]
  weights: {task: "uniform"}
  normalize: {scheme: "zscore", anchors: ["baseline.logreg", "baseline.rf"]}
significance:
  method: "bootstrap"
  B: 10000
  alpha: 0.05
  correction: "Holm-Bonferroni"
env:
  hardware: {cpu: "16c", mem_gb: 64, gpu: 0}
  os: "ubuntu-22.04"
  containers: ["ghcr.io/eift/runner@sha256:<hex>"]
  deps_lock: "env.lock"
baselines:
  - {id: "baseline.logreg", impl: "I15-1.logreg", params: {C: 1.0}}
  - {id: "baseline.rf", impl: "I15-2.rf", params: {n_trees: 200}}
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "benchmark.yaml", sha256: "<hex>"}
    - {path: "splits/train.index", sha256: "<hex>"}
    - {path: "reports/summary.json", sha256: "<hex>"}
  references:
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
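
One way to read the significance settings above (method, B, alpha, correction, together with the protocol seed) is sketched below. It is an illustration only: the percentile bootstrap over paired per-example differences and both function names are assumptions, since the fragment names the method but does not say what is resampled.

```python
import random

def bootstrap_ci(deltas, B=10000, alpha=0.05, seed=1701):
    """Percentile bootstrap CI for the mean of paired per-example differences.

    deltas: per-example score differences between two submissions (assumed input).
    Returns (lo, hi), the bounds of the (1 - alpha) interval.
    """
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(deltas)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(B))
    k = int((alpha / 2) * B)   # number of resamples trimmed from each tail
    return means[k], means[B - 1 - k]

def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure; returns reject decisions in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value is tested at alpha / m, the next at alpha / (m - 1), ...
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are too
    return reject
```

For example, holm_bonferroni([0.01, 0.04, 0.03]) rejects only the first hypothesis: 0.01 passes the threshold 0.05 / 3, but 0.03 fails the next threshold 0.05 / 2, which stops the procedure.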

IX. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: SUITE.ID_FORMAT
    when: "$.suite.id"
    assert: "matches('^[a-z0-9_.\\-]+$')"
    level: error
  - id: SPLITS.FROZEN_REQUIRED
    when: "$..splits"
    assert: "train.frozen == true and val.frozen == true and test.frozen == true"
    level: error
  - id: LEAKAGE.GUARDS
    when: "$..leakage_guard"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
  - id: METRICS.UNITS_SI
    when: "$..metrics[*].unit"
    assert: "all_units_in_SI(value) or value == '—'"
    level: error
  - id: PROTOCOL.SEED_AND_REPEATS
    when: "$..protocol"
    assert: "has_keys(seed, repeats)"
    level: error
  - id: SIGNIFICANCE.PARAMS
    when: "$..significance"
    assert: "has_keys(method, B, alpha)"
    level: error
  - id: EXPORT.REFERENCES_FORMAT
    when: "$.export_manifest.references[*]"
    assert: "matches('^[^:]+ v\\d+\\.\\d+:[A-Za-z].+$')"
    level: error
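
A few of the rules above can be approximated in plain Python over an already-parsed document, as in the sketch below. The function name and the assumption that splits, leakage_guard, and protocol sit under each task are illustrative. The sketch also treats a missing key as a failure, which is stricter than the `when` selectors, and it covers only five of the listed rules.

```python
import re

SUITE_ID_RE = re.compile(r"^[a-z0-9_.\-]+$")
ALLOWED_GUARDS = {"per-object", "per-timewindow", "per-scene"}

def lint(doc):
    """Return (rule_id, message) errors for a parsed benchmark document (a dict)."""
    errors = []

    # SUITE.ID_FORMAT
    suite_id = doc.get("suite", {}).get("id", "")
    if not SUITE_ID_RE.match(suite_id):
        errors.append(("SUITE.ID_FORMAT", f"bad suite id: {suite_id!r}"))

    for task in doc.get("tasks", []):
        tid = task.get("id", "<unknown>")

        # SPLITS.FROZEN_REQUIRED
        splits = task.get("splits", {})
        for name in ("train", "val", "test"):
            if splits.get(name, {}).get("frozen") is not True:
                errors.append(("SPLITS.FROZEN_REQUIRED", f"{tid}: {name} not frozen"))

        # LEAKAGE.GUARDS: at least one recognised guard must be present
        if not ALLOWED_GUARDS & set(task.get("leakage_guard", [])):
            errors.append(("LEAKAGE.GUARDS", f"{tid}: no recognised guard"))

        # PROTOCOL.SEED_AND_REPEATS
        missing = {"seed", "repeats"} - set(task.get("protocol", {}))
        if missing:
            errors.append(("PROTOCOL.SEED_AND_REPEATS", f"{tid}: missing {sorted(missing)}"))

    # SIGNIFICANCE.PARAMS
    missing = {"method", "B", "alpha"} - set(doc.get("significance", {}))
    if missing:
        errors.append(("SIGNIFICANCE.PARAMS", f"missing {sorted(missing)}"))

    return errors
```

In a CI job, a non-empty result would be treated as a blocking failure, matching the `level: error` setting on each rule.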

X. Chapter Compliance Checklist

Concepts and terminology are unified; definitions of suite, task, subtask, and track are clear.
Principles P01–P05 are actionable and aligned with anchors in DatasetCards/ModelCards/Pipeline.
Applicability and boundaries are explicit; cross‑volume dependencies resolve.
Deliverables and gates are verifiable (sha256, frozen splits, SI units, significance, compliance materials).
The machine-readable fragment is drop-in; lint rules are enforceable as blocking checks in portal/CI.

Copyright & License: Unless otherwise stated, the copyright of “Energy Filament Theory” (including text, charts, illustrations, symbols, and formulas) is held by the author (屠广林).
License (CC BY 4.0): With attribution to the author and source, you may copy, repost, excerpt, adapt, and redistribute.
Attribution (recommended): Author: 屠广林｜Work: “Energy Filament Theory”｜Source: energyfilament.org｜License: CC BY 4.0
Version info: First published: 2025-11-11 ｜ Current version: v6.0+5.05