HomeDocs-Technical WhitePaper14-EFT.WP.Methods.Inference v1.0

Chapter 8: Performance Metrics & SLO


I. Scope & Objectives

  1. Unify the object model, computation conventions, and publishing format for inference performance metrics and Service-Level Objectives (SLOs), covering offline load tests and online observability, single-instance and distributed inference, and CPU/GPU/accelerator modalities.
  2. Provide reusable score synthesis, gate.slo, and error-budget allocation that can be exercised in parallel with Chapter 6 (online/offline consistency) and Chapter 7 (calibration gates).
  3. Target outputs
    • Metrics & conventions: TS.latency_{p50,p95,p99}, TS.thrpt, TS.error, tail_ampl, cost_u, R_infer.
    • SLO spec: SLO = { name, sli, target, window, objective, budget }.
    • Score synthesis: score = Σ w_k * s_k, with ScoreReport and SLOReport.
    • Metrology flow: Mx-47 → Mx-52.

II. Terms & Symbols

  1. Metrics & decompositions
    • Latency breakdown: TS.lat_total = TS.lat_io + TS.lat_queue + TS.lat_sched + TS.lat_model.
    • Throughput: TS.thrpt = N_req / W; concurrency approximation: WIP ≈ TS.arrival_rate * E[T].
    • Tail amplification: tail_ampl = TS.latency_p99 / TS.latency_p50.
    • Availability: avail = 1 - ( N_err / N_req ), where N_err includes timeout, 5xx, policy_denied.
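The decompositions above can be written down directly. A minimal sketch, with the TS.* fields passed as plain arguments (an illustrative assumption, not a fixed schema):

```python
def lat_total(lat_io: float, lat_queue: float,
              lat_sched: float, lat_model: float) -> float:
    # TS.lat_total = TS.lat_io + TS.lat_queue + TS.lat_sched + TS.lat_model
    return lat_io + lat_queue + lat_sched + lat_model

def tail_ampl(latency_p99: float, latency_p50: float) -> float:
    # tail_ampl = TS.latency_p99 / TS.latency_p50
    return latency_p99 / latency_p50

def avail(n_err: int, n_req: int) -> float:
    # avail = 1 - N_err / N_req; N_err counts timeout, 5xx, policy_denied
    return 1.0 - n_err / n_req
```

For example, a p99 of 180 ms over a p50 of 30 ms gives tail_ampl = 6, a common signal of queueing or cold-start trouble.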
  2. SLI / SLO / SLA
    SLI is an observable (e.g., TS.latency_p99); SLO is the target (e.g., TS.latency_p99 <= L_target over window W); SLA (external contract) is out of scope in this volume.
  3. Cost & budgets
    • Unit cost: cost_u = ( cost_cpu + cost_gpu + cost_mem + cost_io + cost_net ) / N_req.
    • Resource budgets: budget.cpu/gpu/mem/power; error budget: budget.err = 1 - target.avail.
  4. Normalization & scoring
    • Linear downwards normalization: norm_down(x; a,b) = clamp( ( b - x ) / ( b - a ), 0, 1 ) (smaller is better).
    • Linear upwards normalization: norm_up(x; a,b) = clamp( ( x - a ) / ( b - a ), 0, 1 ) (larger is better).
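The two normalizers are pure functions and can be sketched without dependencies:

```python
def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def norm_down(x: float, a: float, b: float) -> float:
    # 1.0 at x <= a (best), 0.0 at x >= b (worst): smaller is better
    return clamp((b - x) / (b - a), 0.0, 1.0)

def norm_up(x: float, a: float, b: float) -> float:
    # 0.0 at x <= a, 1.0 at x >= b: larger is better
    return clamp((x - a) / (b - a), 0.0, 1.0)
```

Note that norm_down(x; a, b) = 1 - norm_up(x; a, b) for all x, so either form may appear in a score as long as the inversion is applied exactly once.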

III. Postulates & Minimal Equations

  1. P41-21 Invariant observability postulate
    With EnvLock locked and the aggregator fixed, the computation convention of the same SLI is equivalent offline/online: SLI_off ≡ SLI_on.
  2. P41-22 Multi-objective monotonicity postulate
    If any sub-metric s_k improves (others unchanged), the overall score does not decrease: ∂score/∂s_k >= 0.
  3. S42-31 Score synthesis
    score = w_acc * acc + w_cal * ( 1 - ECE_norm ) + w_lat * ( 1 - lat_p99_norm ) + w_thr * thrpt_norm + w_cost * ( 1 - cost_u_norm ) + w_cons * R_infer, with Σ w_* = 1.
    lat_p99_norm = norm_up( TS.latency_p99; L_target, L_worst );
    thrpt_norm = norm_up( TS.thrpt; QPS_min, QPS_goal );
    cost_u_norm = norm_up( cost_u; C_min, C_max );
    ECE_norm = norm_up( ECE; 0, ECE_max ).
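S42-31 as a sketch. All smaller-is-better SLIs are mapped with norm_up and inverted inside the sum, which is equivalent to applying norm_down directly, since 1 - norm_up(x; a, b) = norm_down(x; a, b). The dict keys are illustrative assumptions:

```python
def _clamp(x, lo, hi):
    return max(lo, min(hi, x))

def _norm_up(x, a, b):
    return _clamp((x - a) / (b - a), 0.0, 1.0)

def compose_score(sli: dict, t: dict, w: dict) -> float:
    # S42-31: weighted sum of normalized sub-metrics; requires sum(w) == 1
    assert abs(sum(w.values()) - 1.0) < 1e-9
    lat_n  = _norm_up(sli["lat_p99"], t["L_target"], t["L_worst"])
    thr_n  = _norm_up(sli["thrpt"],   t["QPS_min"],  t["QPS_goal"])
    cost_n = _norm_up(sli["cost_u"],  t["C_min"],    t["C_max"])
    ece_n  = _norm_up(sli["ECE"],     0.0,           t["ECE_max"])
    return (w["acc"]  * sli["acc"]
          + w["cal"]  * (1 - ece_n)
          + w["lat"]  * (1 - lat_n)
          + w["thr"]  * thr_n
          + w["cost"] * (1 - cost_n)
          + w["cons"] * sli["R_infer"])
```

A system at every target simultaneously (acc = 1, ECE = 0, p99 at L_target, throughput at QPS_goal, cost at C_min, R_infer = 1) scores exactly 1.0, which is a quick sanity check on any weight vector.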
  4. S42-32 SLO decision & error budget
    • Latency-type: pass_lat = 1[ TS.latency_p99 <= L_target ].
    • Availability-type: pass_avail = 1[ avail >= A_target ].
    • Budget consumption: budget.used = violations / opportunities, with violations = Σ 1[ SLI_i fails ].
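S42-32 reduces to indicator checks. A minimal sketch; the returned dict shape is an assumption:

```python
def gate_slo(lat_p99: float, availability: float,
             L_target: float, A_target: float) -> dict:
    # S42-32: pass_lat = 1[lat_p99 <= L_target], pass_avail = 1[avail >= A_target]
    pass_lat = lat_p99 <= L_target
    pass_avail = availability >= A_target
    return {"pass_lat": pass_lat, "pass_avail": pass_avail,
            "pass": pass_lat and pass_avail}

def budget_used(violations: int, opportunities: int) -> float:
    # budget.used = violations / opportunities
    return violations / opportunities
```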
  5. S42-33 Cost model
    • cost_cpu = price_cpu * cpu_time; cost_gpu = price_gpu * gpu_time;
      cost_mem = price_mem * mem_GB * time; similarly for cost_io/net.
    • cost_u = ( cost_cpu + cost_gpu + cost_mem + cost_io + cost_net ) / N_req.
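S42-33 as a sketch; the price and usage keys are illustrative, and io/net are priced per GB by analogy with the "similarly for cost_io/net" clause:

```python
def cost_components(price: dict, usage: dict) -> dict:
    # Per-component costs per S42-33
    return {
        "cpu": price["cpu"] * usage["cpu_time"],
        "gpu": price["gpu"] * usage["gpu_time"],
        "mem": price["mem"] * usage["mem_GB"] * usage["time"],
        "io":  price["io"]  * usage["io_GB"],
        "net": price["net"] * usage["net_GB"],
    }

def cost_u(components: dict, n_req: int) -> float:
    # cost_u = (cost_cpu + cost_gpu + cost_mem + cost_io + cost_net) / N_req
    return sum(components.values()) / n_req
```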
  6. S42-34 Queue consistency & Little’s-law approximation
    WIP ≈ λ * E[T], where λ = TS.arrival_rate and E[T] is approximated by TS.latency_p50 (the median standing in for the mean), for capacity and backpressure checks.
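S42-34 as a one-line check; comparing the estimate against a configured concurrency limit for backpressure is a usage assumption:

```python
def expected_wip(arrival_rate: float, latency_s: float) -> float:
    # WIP ~= lambda * E[T] (Little's law); TS.latency_p50 stands in for E[T]
    return arrival_rate * latency_s

# 200 req/s at a 50 ms median implies roughly 10 requests in flight;
# if the deployment caps in-flight requests below that, queueing is expected
```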

IV. Data & Manifest Conventions

  1. Per-request minimal observability fields
    • ts_start, ts_end, route, batch_size, device, dtype_policy, quant_scheme, status, bytes_in/out, retries, cold_start, z_logit_opt.
    • Resource samples: cpu_pct, gpu_util, mem_GB, power_W, sm_occupancy, bw_in/out.
    • Bucketing & aggregation: hist.latency (supports kll/tdigest), window W, step Δt.
  2. Convention consistency
    Measure all latency on tau_mono and map to ts: ts = alpha + beta * tau_mono. Use the same quantile approximator and compression parameters for percentiles.
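A sketch of the convention, substituting a simple nearest-rank percentile for a kll/tdigest sketch (the substance of the rule is only that every party pins the same approximator and parameters):

```python
def map_to_ts(tau_mono: float, alpha: float, beta: float) -> float:
    # ts = alpha + beta * tau_mono (Section IV.2)
    return alpha + beta * tau_mono

def percentile(samples: list, q: float) -> float:
    # Fixed nearest-rank convention: a stand-in for kll/tdigest with
    # pinned compression parameters
    s = sorted(samples)
    idx = min(len(s) - 1, max(0, round(q * (len(s) - 1))))
    return s[idx]
```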
  3. Cost conventions
    Declare unit-price baselines and currency. For mixed tenancy, record share_ratio to apportion cost_mem and cost_net.

V. Algorithms & Implementation Bindings

  1. Prototypes
    • I40-11 compute_sli(stream:any, spec:dict) -> SLIReport
    • I40-12 compose_score(sli:dict, weights:dict) -> ScoreReport
    • I40-13 plan_capacity(target:dict, priors:dict) -> Plan
    • I40-10 compare_offline_online(off:any, on:any, policy:dict) -> ConsistencyReport
  2. compute_sli highlights
    Maintain TS.latency_{p50,p95,p99} via kll/tdigest; windowed aggregation over W with step Δt; slice by route/device.
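A toy version of the I40-11 aggregation path, keeping raw latency lists where a real implementation would hold a kll/tdigest sketch per (window, route) bucket; the event tuple layout is an assumption:

```python
from collections import defaultdict

def compute_sli(events, window_s: float, q: float = 0.99) -> dict:
    # events: iterable of (ts, latency, route); returns the q-quantile
    # latency keyed by (window_index, route)
    buckets = defaultdict(list)
    for ts, lat, route in events:
        buckets[(int(ts // window_s), route)].append(lat)
    report = {}
    for key, lats in buckets.items():
        s = sorted(lats)
        idx = min(len(s) - 1, round(q * (len(s) - 1)))
        report[key] = s[idx]
    return report
```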
  3. compose_score highlights
    Normalize per S42-31 and synthesize; return the overall score, per-dimension s_k, gate.slo, and sensitivities ∂score/∂s_k.
  4. plan_capacity
    Produce the feasible region over ( λ, batch_size, replica ) satisfying pass_lat ∧ pass_avail ∧ cost_u <= C_cap. If infeasible, return E_RESOURCE_EXCEEDED.
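A grid-search sketch of I40-13; predict is any model mapping an operating point to (lat_p99, avail, cost_u), whether a queueing formula or a profiled lookup table (both assumptions here):

```python
def plan_capacity(grid, predict, L_target, A_target, C_cap):
    # Feasible region over (lambda, batch_size, replicas) satisfying
    # pass_lat AND pass_avail AND cost_u <= C_cap
    feasible = []
    for lam, batch, replicas in grid:
        lat_p99, availability, cost_u = predict(lam, batch, replicas)
        if lat_p99 <= L_target and availability >= A_target and cost_u <= C_cap:
            feasible.append((lam, batch, replicas))
    if not feasible:
        raise RuntimeError("E_RESOURCE_EXCEEDED")
    return feasible
```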

VI. Metrology Flows & Run Diagram (Mx-47 → Mx-52)


VII. Verification & Test Matrix


VIII. Cross-References & Dependencies

This chapter shares TS.*, R_infer, rollback, and canary orchestration with Chapter 6; shares ECE_norm and the calibration gates used in the score with Chapter 7; aligns with the scoring and publication conventions of EFT.WP.Methods.Repro Chapter 8; and adheres to the hb/bp/makespan/critical-path semantics from Core.Threads.

IX. Risks, Limitations & Open Questions


X. Deliverables & Versioning

  1. Deliverables
    • SLOSpec.yaml (SLO definitions and budgets);
    • SLIReport.json (windowed statistics with approximator parameters across dimensions);
    • ScoreReport.json (score, s_k, sensitivities and gates);
    • Plan.yaml (capacity plan and rollback thresholds);
    • Audit bundle (aggregator fingerprint, signatures, and release fingerprints).
  2. Versioning policy
    • Changes to SLI definitions, aggregators or window W/Δt, w_*, or any target/budget must bump the minor version and be recorded in Appendix C.
    • If the scoring structure or cost-model terms change, bump the major version and update
      fingerprint = hash( SLOSpec || ScoreSpec ).
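The fingerprint can be computed over the serialized specs; reading "||" as byte concatenation and choosing SHA-256 as the hash are assumptions:

```python
import hashlib

def spec_fingerprint(slo_spec: bytes, score_spec: bytes) -> str:
    # fingerprint = hash( SLOSpec || ScoreSpec )
    return hashlib.sha256(slo_spec + score_spec).hexdigest()
```

Because the order of concatenation matters, the SLOSpec-then-ScoreSpec order must itself be part of the published convention.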

Copyright & License (CC BY 4.0)

Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/