EFT.WP.Data.Benchmarks v1.0

Chapter 7 Evaluation Protocol (Offline/Online/Streaming/Interactive)


I. Chapter Purpose & Scope

Fix the evaluation protocol specification across offline/online/streaming/interactive modes: randomness & reproducibility, track & tool constraints, context & prompting, concurrency & rate, streaming windows & interaction rounds, A/B & shadow traffic, logging & metric reporting; ensure consistency with the task definition, metric system, frozen splits, metrology, and citation anchors.

II. Terminology & Dependencies

  1. Terms: mode (offline/online/stream/interactive), seed, repeats, temperature, context_length, rounds, canary/shadow, traffic_allocation, caching, tools_allowed, retrieval/open_book, runtime_limits, concurrency, rate_limit.
  2. Dependencies: frozen splits & distribution (DatasetCards v1.0, Ch.11), evaluation & aggregation (ModelCards v1.0, Ch.11), monitoring & online windows (Pipeline v1.0, Ch.12), units & dimensions (Core.Metrology v1.0:check_dim).
  3. Math & symbols: wrap inline symbols; any division, integral, or composite operator must use parentheses. For the path quantity T_arr, use either
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ), or
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell ),
      and declare the path gamma(ell) and the measure d ell explicitly. No Chinese in formulas, symbols, or definitions.
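
The two T_arr forms above can be compared numerically. The following is a minimal sketch, not a normative implementation: the value of c_ref, the path length, and the linear n_eff profile are all illustrative assumptions, and the path gamma(ell) is taken to be a uniformly sampled straight segment. The forms agree here only because c_ref is held constant along the path.

```python
# Sketch: evaluate T_arr along a discretized path in both permitted forms.
# c_ref, the path length, and the n_eff profile are illustrative assumptions.

def trapezoid(values, step):
    """Composite trapezoidal rule on a uniform grid."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))

def t_arr_factored(n_eff, d_ell, c_ref):
    # T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
    return (1.0 / c_ref) * trapezoid(n_eff, d_ell)

def t_arr_inline(n_eff, d_ell, c_ref):
    # T_arr = ( ∫ ( n_eff / c_ref ) d ell )
    return trapezoid([n / c_ref for n in n_eff], d_ell)

c_ref = 299_792_458.0   # assumed reference speed, m/s
length_m = 1000.0       # assumed path length, m
samples = 1001
d_ell = length_m / (samples - 1)

# Hypothetical profile: n_eff rises linearly from 1.0 to 1.5 along the path.
n_eff = [1.0 + 0.5 * i / (samples - 1) for i in range(samples)]

a = t_arr_factored(n_eff, d_ell, c_ref)
b = t_arr_inline(n_eff, d_ell, c_ref)
print(a, b)
```

If c_ref varied with ell, only the second form would remain meaningful, which is one reason to state gamma(ell) and d ell explicitly.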

III. Fields & Structure (Normative)

protocol:
  mode: "offline|online|stream|interactive"
  seed: 1701
  repeats: 5
  temperature: 0.0
  max_tokens: 0
  context:
    length: 4096
    template_ref: "prompts/<id>@vX.Y"
  tools:
    allowed: false
    retrieval: false
    open_book: false
    registry_ref: null
  runtime_limits:
    timeout_s: 3600
    memory_gb: 16
  execution:
    concurrency: 8
    rate_limit_qps: 50
    batching: {enabled: true, max_batch: 32}
    caching: {enabled: false, policy: "none|warm|full"}
  stream:
    window_ms: 1000
    hop_ms: 250
    max_latency_ms: 200
    watermark: "event_time|processing_time"
  interactive:
    rounds: 3
    turn_timeout_s: 30
    max_context_turns: 8
  online:
    traffic_allocation: {control: 0.5, treatment: 0.5}
    exposure: {shadow: true, canary: 0.05}
    guardrails: ["latency_ms.p99<=200", "error_rate<=0.01"]
  logging:
    format: "jsonl"
    fields: ["ts", "task_id", "item_id", "run_id", "trace_id", "input_hash", "output_hash", "latency_ms", "error_code"]
    retention: "P30D"
  reporting:
    metrics: ["F1_macro", "ECE", "latency_ms.p99", "QPS"]
    target_ci: {method: "bootstrap", level: 0.95}
  see:
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
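
To illustrate how the stream fields window_ms and hop_ms interact, the sketch below enumerates windows over a time range and finds the windows covering one event. It assumes sliding windows with half-open bounds and drops any trailing partial window; the chapter does not state these details, so treat them as assumptions. The function names are hypothetical, and the watermark and max_latency_ms fields are not modeled.

```python
# Sketch: sliding windows derived from stream.window_ms and stream.hop_ms.
# Half-open [lo, hi) bounds and "no partial trailing window" are assumptions.

def sliding_windows(start_ms, end_ms, window_ms=1000, hop_ms=250):
    """Return [lo, hi) windows fully contained in [start_ms, end_ms)."""
    windows = []
    lo = start_ms
    while lo + window_ms <= end_ms:
        windows.append((lo, lo + window_ms))
        lo += hop_ms
    return windows

def covering(event_ts_ms, windows):
    """Return every window whose half-open range contains the event time."""
    return [w for w in windows if w[0] <= event_ts_ms < w[1]]

windows = sliding_windows(0, 2000)
hits = covering(1100, windows)
print(len(windows), hits)
```

With window_ms = 1000 and hop_ms = 250, an event away from the range edges falls into window_ms / hop_ms = 4 overlapping windows, so per-window metrics are not independent samples.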


IV. Protocol Modes


V. Tracks & Resource Constraints


VI. Statistics & Significance


VII. Machine-Readable Fragments (Drop-in)

# Offline protocol example
protocol:
  mode: "offline"
  seed: 1701
  repeats: 5
  temperature: 0.0
  context: {length: 4096, template_ref: "prompts/qa_v1@v1.0"}
  tools: {allowed: false, retrieval: false, open_book: false}
  runtime_limits: {timeout_s: 3600, memory_gb: 16}
  execution: {concurrency: 8, rate_limit_qps: 50, batching: {enabled: true, max_batch: 32}}
  logging: {format: "jsonl", fields: ["ts", "task_id", "item_id", "run_id", "latency_ms"], retention: "P30D"}
  reporting: {metrics: ["F1_macro", "ECE"], target_ci: {method: "bootstrap", level: 0.95}}
  see: ["EFT.WP.Data.ModelCards v1.0:Ch.11", "EFT.WP.Core.Metrology v1.0:check_dim"]
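
The target_ci entry in the offline example (method "bootstrap", level 0.95) could be computed along the following lines. This is a minimal percentile-bootstrap sketch over per-repeat scores, using only the Python standard library. The score values are hypothetical placeholders rather than measured results, and the resample count and the reuse of the protocol seed for the resampler are assumptions.

```python
import random

def bootstrap_ci(scores, level=0.95, resamples=2000, seed=1701):
    """Percentile bootstrap interval for the mean of `scores`."""
    rng = random.Random(seed)          # seeded so the interval is reproducible
    n = len(scores)
    means = []
    for _ in range(resamples):
        # Resample n scores with replacement and record the mean.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    alpha = (1.0 - level) / 2.0
    lo_idx = round(alpha * resamples)
    hi_idx = round((1.0 - alpha) * resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical per-repeat F1_macro values for repeats: 5 (not real results).
f1_runs = [0.812, 0.807, 0.815, 0.809, 0.811]
low, high = bootstrap_ci(f1_runs)
print(low, high)
```

With only five repeats the resulting interval is coarse; resampling per-item scores instead of per-repeat aggregates is a common alternative, but the chapter does not say which unit is intended.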

# Online protocol example
protocol:
  mode: "online"
  seed: 1701
  repeats: 1
  online:
    traffic_allocation: {control: 0.5, treatment: 0.5}
    exposure: {shadow: true, canary: 0.05}
    guardrails: ["latency_ms.p99<=200", "error_rate<=0.01"]
  execution: {concurrency: 64, rate_limit_qps: 500, batching: {enabled: false}}
  logging: {format: "jsonl", fields: ["ts", "trace_id", "latency_ms", "error_code"], retention: "P30D"}
  reporting: {metrics: ["QPS", "latency_ms.p99"], target_ci: {method: "t", level: 0.95}}
  see: ["EFT.WP.Data.Pipeline v1.0:Ch.12", "EFT.WP.Core.Metrology v1.0:check_dim"]
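
One way to apply traffic_allocation reproducibly is to hash a stable identifier into [0, 1) and compare it against cumulative shares. The sketch below assumes the assignment unit is a trace_id-like string and uses a hypothetical salt; neither is specified by this chapter. Shadow and canary exposure are not modeled.

```python
import hashlib

def bucket(unit_id, salt="exp-1701"):
    """Map an id to a stable fraction in [0, 1]; the salt is hypothetical."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def assign_arm(unit_id, allocation):
    """Pick an arm by walking the cumulative shares of `allocation`."""
    total = sum(allocation.values())
    if abs(total - 1.0) > 1e-6:        # mirrors the ONLINE.TRAFFIC_SUM tolerance
        raise ValueError("traffic_allocation must sum to 1")
    x = bucket(unit_id)
    edge = 0.0
    arm = None
    for arm, share in allocation.items():
        edge += share
        if x < edge:
            return arm
    return arm  # fallback for rounding at the upper edge

allocation = {"control": 0.5, "treatment": 0.5}
arms = [assign_arm(f"trace-{i}", allocation) for i in range(10_000)]
treatment_share = arms.count("treatment") / len(arms)
print(treatment_share)
```

Because the assignment depends only on the identifier and the salt, the same unit always lands in the same arm across restarts, which keeps control and treatment logs comparable.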


VIII. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: PROTOCOL.MODE_ALLOWED
    when: "$.protocol.mode"
    assert: "value in ['offline','online','stream','interactive']"
    level: error
  - id: PROTOCOL.SEED_REPEATS
    when: "$.protocol"
    assert: "has_key(seed) and (mode != 'online' -> has_key(repeats))"
    level: error
  - id: PROTOCOL.FROZEN_SPLITS_REQUIRED
    when: "$.splits"
    assert: "splits.train.frozen and splits.val.frozen and splits.test.frozen"
    level: error
  - id: PROTOCOL.TOOLS_TRACK_CONSISTENCY
    when: "$.protocol.tools"
    assert: "value.allowed == false or has_key($.tracks)"
    level: error
  - id: ONLINE.TRAFFIC_SUM
    when: "$.protocol.online.traffic_allocation"
    assert: "abs(value.control + value.treatment - 1) <= 1e-6"
    level: error
  - id: STREAM.WINDOW_PARAMS
    when: "$.protocol.mode == 'stream'"
    assert: "has_keys($.protocol.stream.window_ms, $.protocol.stream.hop_ms, $.protocol.stream.max_latency_ms)"
    level: error
  - id: INTERACTIVE.ROUNDS_DEFINED
    when: "$.protocol.mode == 'interactive'"
    assert: "has_keys($.protocol.interactive.rounds, $.protocol.interactive.turn_timeout_s)"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
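
A few of the rules above can be hand-translated into plain Python to show how they behave on a parsed document. This is a hedged sketch, not the lint engine itself: the lint function and its return shape are hypothetical, the "->" operator is read as logical implication, and only five rules are covered.

```python
# Sketch: hand-coded checks for a subset of the lint rules on a parsed dict.
MODES = ("offline", "online", "stream", "interactive")

def lint(doc):
    """Return a list of (rule_id, message) pairs for violated rules."""
    errors = []
    protocol = doc.get("protocol", {})
    mode = protocol.get("mode")

    if mode not in MODES:
        errors.append(("PROTOCOL.MODE_ALLOWED", f"unknown mode {mode!r}"))

    # seed is always required; repeats is required unless mode is 'online'.
    if "seed" not in protocol or (mode != "online" and "repeats" not in protocol):
        errors.append(("PROTOCOL.SEED_REPEATS", "seed or repeats missing"))

    alloc = protocol.get("online", {}).get("traffic_allocation")
    if alloc is not None:
        total = alloc.get("control", 0) + alloc.get("treatment", 0)
        if abs(total - 1) > 1e-6:
            errors.append(("ONLINE.TRAFFIC_SUM", f"shares sum to {total}"))

    if mode == "stream":
        stream = protocol.get("stream", {})
        missing = [k for k in ("window_ms", "hop_ms", "max_latency_ms") if k not in stream]
        if missing:
            errors.append(("STREAM.WINDOW_PARAMS", f"missing {missing}"))

    if mode == "interactive":
        inter = protocol.get("interactive", {})
        missing = [k for k in ("rounds", "turn_timeout_s") if k not in inter]
        if missing:
            errors.append(("INTERACTIVE.ROUNDS_DEFINED", f"missing {missing}"))

    return errors

good = {"protocol": {"mode": "offline", "seed": 1701, "repeats": 5}}
bad = {"protocol": {"mode": "stream", "seed": 1701,
                    "online": {"traffic_allocation": {"control": 0.5, "treatment": 0.4}}}}
print(lint(good))
print([rule_id for rule_id, _ in lint(bad)])
```

The second document fails three rules at once (missing repeats, shares that do not sum to 1, and no stream block), which shows that the rules are evaluated independently rather than stopping at the first error.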


IX. Cross-Reference Anchors


X. Chapter Compliance Checklist


Copyright & License (CC BY 4.0)

Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/