Home / Docs-Technical WhitePaper / 07-EFT.WP.Core.Threads v1.0
Chapter 8 — Observability and SLO
I. Scope and Objectives
- Establish a unified specification for observability and SLOs that spans metric_emit, trace_span, trace_link, structured logs, and alert gating, in service of performance and reliability evaluation for concurrent execution graphs G=(V,E).
- Define postulates P78-, minimal equations S78-, and the operational flow Mx-7, aligned with I70-7 and I70-8, and coordinated with Chapters 3/5/7 on chan/bp, timeout/retry, and rate_limiter.
- Map SLOs to error budgets and rollback gates; within SLA_window, require approximate stability (rho < 1) and require the P99 latency and error-rate targets to hold.
II. Terms and SLI Families
- Indicator families: SLI (service level indicator), SLO (target), SLA_window (evaluation window), EB (error budget).
- Event identities: eid, pid_thr, gid, chan, idemp_key; clocks tau_mono (runtime metrics) and ts (audit).
- Primary SLI dimensions
- Availability: SLI_avail = Good / Total.
- Latency: P50/P90/P99, W, W_q, W_service.
- Quality / Errors: ErrRate = 1 - SLI_avail; semantic success rate SemOK / Total.
- Throughput & saturation: QPS, q_len, cap, bp, rho = lambda / mu.
III. Postulates P78 (Instrumentation, Labels, and Time)
- P78-1 (Monotone timing): compute latencies, quantiles, queueing, and limiting with tau_mono; publish/audit in ts.
- P78-2 (Label discipline): restrict metric label set to {gid, pid_thr, chan, endpoint, prio}; forbid high-cardinality free text; logs must carry eid and be joinable to trace_span.
- P78-3 (Histogram consistency): fix bucket boundaries for latency histograms so upgrades preserve comparability; derive all Pxx from the same bucket family.
- P78-4 (Predeclared “Good”): define “Good” explicitly (e.g., 2xx and sem_ok and hb respected and no dedup_fail).
- P78-5 (Hierarchical conservation): aggregated metrics satisfy sum(child) <= parent + epsilon_window, where epsilon_window absorbs the window deltas caused by sampling and clock drift.
- P78-6 (Causal ordering): connect cross-thread/channel metrics and traces via trace_link(span, eid) to enforce hb; forbid racy overwrites of a final state for the same eid.
- P78-7 (Minimal disturbance): bound the observability system’s own R_cpu/R_mem/R_io; control metric/log sampling rates by quotas to avoid observer-induced backpressure.
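The label discipline in P78-2 can be checked mechanically before a metric is emitted. The following is a minimal sketch, assuming Python; the helper name validate_labels and the free-text heuristic (length limit and whitespace check) are hypothetical and not part of the specification. Only the allow-list itself is taken from P78-2.

```python
# Allow-list taken from P78-2. Everything else here is an illustrative assumption.
ALLOWED_LABELS = frozenset({"gid", "pid_thr", "chan", "endpoint", "prio"})


def validate_labels(labels, allowed=ALLOWED_LABELS, max_value_len=64):
    """Return a list of violations; an empty list means the label set passes."""
    problems = []
    for key, value in labels.items():
        text = str(value)
        if key not in allowed:
            problems.append(f"label not in allow-list: {key}")
        elif len(text) > max_value_len or any(ch.isspace() for ch in text):
            # Heuristic only: long or whitespace-bearing values are treated as free text.
            problems.append(f"label value looks like free text: {key}")
    return problems
```

A caller could reject or drop the metric when the returned list is non-empty; which of the two is appropriate is a policy choice this sketch does not make.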
IV. Minimal Equations S78 (SLOs and Budgets)
- S78-1 (Availability SLI): SLI_avail = Good / Total, with “Good” per P78-4.
- S78-2 (Latency SLI): with target L_obj, SLI_lat = count( latency <= L_obj ) / Total.
- S78-3 (Error budget): SLO = 1 - e_target, EB = e_target; observed error e_obs = 1 - SLI_avail.
- S78-4 (Budget consumption & burn): Burn = e_obs / EB; Burn_rate = ( e_obs / window ) / ( EB / SLA_window ).
- S78-5 (Multi-window guard): only allow ramp-up or release when both Burn_rate(w1) < b1 and Burn_rate(w2) < b2.
- S78-6 (Path availability, independence approx): for the critical path crit(G), A_path approx ∏ A_i (i in crit(G)).
- S78-7 (Class-weighted aggregation): with weights w_k over traffic classes, SLI* = ∑ w_k * SLI*_k, ∑ w_k = 1.
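The equations S78-1, S78-3/S78-4 (the Burn ratio), S78-6, and S78-7 can be sketched directly in Python. The function names below are illustrative and are not the I70-7/I70-8 interfaces; inputs are plain counts and ratios.

```python
import math


def sli_avail(good, total):
    """S78-1: SLI_avail = Good / Total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def burn(good, total, e_target):
    """S78-3 and S78-4: Burn = e_obs / EB, with e_obs = 1 - SLI_avail and EB = e_target."""
    if e_target <= 0:
        raise ValueError("e_target must be positive")
    e_obs = 1.0 - sli_avail(good, total)
    return e_obs / e_target


def path_availability(availabilities):
    """S78-6: A_path approx product of A_i over crit(G), assuming independence."""
    return math.prod(availabilities)


def weighted_sli(slis, weights):
    """S78-7: SLI* = sum(w_k * SLI*_k) with sum(w_k) = 1."""
    if len(slis) != len(weights):
        raise ValueError("slis and weights must have the same length")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, slis))
```

For instance, two critical-path nodes that are each 0.999 available give A_path of about 0.998, which shows how quickly serial dependencies consume a 0.999 target.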
V. Metric Design (Names and Examples)
- Counters
- threads.qps{endpoint,prio}, threads.req_total{...}, threads.good_total{...}.
- chan.admit_total{chan}, chan.drop_total{chan,reason}.
- Histograms
- threads.latency_ms_bucket{endpoint} (fixed buckets), used to derive P50/P90/P99.
- chan.wait_ms_bucket{chan} (queue wait).
- Gauges
- chan.q_len{chan}, bp.level{chan}, lim.tokens{lim}, rho{service}.
- Quality
- threads.err_total{code,reason}, sem.ok_total{rule}.
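To illustrate how P50/P90/P99 might be derived from a fixed-bucket histogram such as threads.latency_ms_bucket, here is a minimal Python sketch. The bucket edges are hypothetical, and linear interpolation inside a bucket is an assumption; the text only requires that all Pxx come from the same bucket family.

```python
# Hypothetical fixed bucket upper bounds in milliseconds (see P78-3).
BUCKET_EDGES_MS = [5, 10, 25, 50, 100, 200, 400, 800]


def quantile_from_buckets(counts, q, edges=BUCKET_EDGES_MS):
    """Estimate the q-quantile from per-bucket (non-cumulative) counts.

    counts[i] holds samples with edges[i-1] < latency <= edges[i];
    the final entry of counts is the overflow bucket (latency > edges[-1]).
    """
    if len(counts) != len(edges) + 1:
        raise ValueError("counts must have len(edges) + 1 entries")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    rank = q * total
    cumulative = 0
    lower = 0.0
    for i, upper in enumerate(edges):
        previous = cumulative
        cumulative += counts[i]
        if counts[i] > 0 and cumulative >= rank:
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - previous) / counts[i]
            return lower + fraction * (upper - lower)
        lower = float(upper)
    # The rank falls in the overflow bucket: report the last finite edge as a lower bound.
    return float(edges[-1])
```

Because the estimate is bounded by bucket edges, its resolution is only as fine as the bucket family, which is one reason to keep the edges fixed across upgrades.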
VI. Tracing and Correlation (I70-7)
- Basics
- Create a root span per request: span = trace_span("req", attrs={gid,pid_thr,endpoint}).
- On thread hops or channel handoffs call trace_link(span, eid); mark ACK with attrs={"ack":true}.
- Key events
- Enqueue: eid_in; Dequeue: eid_out; Retry: eid_retry (with attempt); Idempotency hit: idemp_key.
- Sampling
- Baseline p_sample; force-retain when latency > L_obj or on error; boost sampling for soft-429/limited requests from Chapter 7.
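The sampling rule in this section can be sketched as a single decision function. This is an illustration in Python, not the trace_span implementation; the function name keep_trace, the boost multiplier, and the injectable rng parameter are assumptions introduced here.

```python
import random


def keep_trace(latency_ms, is_error, limited, l_obj_ms, p_sample,
               boost=4.0, rng=random.random):
    """Decide whether to retain a trace.

    Force-retain on error or when latency exceeds L_obj; otherwise sample at
    p_sample, multiplied by a hypothetical boost factor for limited requests.
    """
    if is_error or latency_ms > l_obj_ms:
        return True
    probability = min(1.0, p_sample * boost) if limited else p_sample
    return rng() < probability
```

Passing rng explicitly keeps the decision deterministic in tests; production code would normally leave the default.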
VII. Logging (Structured and Privacy-aware)
- Structured shape: {"ts":ts,"eid":eid,"gid":gid,"pid_thr":pid_thr,"event":"...","fields":{...}}.
- Apply mask_fields / anonymize from Core.DataSpec to red-line fields; raw sensitive identifiers are forbidden.
- Retention and sampling: align with SLA_window and audit tier; jitter-sample noisy high-frequency events.
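A minimal Python sketch of emitting one log line in the structured shape above with red-line fields masked. The signatures of mask_fields / anonymize from Core.DataSpec are not given in this chapter, so the local _mask helper and the RED_LINE_FIELDS set are hypothetical stand-ins, not the real interface.

```python
import hashlib
import json

# Hypothetical red-line field names; the real list would come from Core.DataSpec.
RED_LINE_FIELDS = {"user_id", "email"}


def _mask(value):
    """Stand-in for mask_fields / anonymize: replace the value with a short digest.

    An unsalted hash is only pseudonymization; treat this as a placeholder.
    """
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return "masked:" + digest[:12]


def log_record(ts, eid, gid, pid_thr, event, fields):
    """Build one JSON log line; raw red-line values never reach the output."""
    safe_fields = {
        key: (_mask(value) if key in RED_LINE_FIELDS else value)
        for key, value in fields.items()
    }
    record = {"ts": ts, "eid": eid, "gid": gid, "pid_thr": pid_thr,
              "event": event, "fields": safe_fields}
    return json.dumps(record, sort_keys=True)
```

Carrying eid in every record is what makes the line joinable to a trace_span, as P78-2 requires.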
VIII. Alerting and Rollback Gates
- Multi-window burn: suggest b1 ∈ [2,6] (short-window agile) and b2 ∈ [1,2] (long-window stable).
- Trigger when Burn_rate(w1) >= b1 or Burn_rate(w2) >= b2.
- Latency gate: if P99 > L_obj*(1+alpha) for ≥ w1/2, tighten limiting and degrade; alpha ∈ [0.1, 0.3].
- Escalation order: lower rps (Chapter 7) → lower K_thr (Chapter 6) → enable fallback → rollback release → freeze changes.
- Reset: only relax when Burn_rate(w2) < 0.5 continuously over w2 and P99 <= L_obj.
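The three gates in this section (burn trigger, latency gate, reset) can be sketched as pure functions over already-computed burn rates and P99 samples. This is a Python illustration under stated assumptions: the function names are hypothetical, P99 is assumed to be sampled at a fixed step, and the default thresholds are merely values inside the suggested bands.

```python
def burn_alert(burn_w1, burn_w2, b1=4.0, b2=1.5):
    """Trigger when Burn_rate(w1) >= b1 or Burn_rate(w2) >= b2."""
    return burn_w1 >= b1 or burn_w2 >= b2


def latency_gate(p99_series_ms, l_obj_ms, alpha, step_s, w1_s):
    """True when the most recent P99 samples exceed L_obj*(1+alpha) for at least w1/2.

    p99_series_ms is ordered oldest to newest, one sample every step_s seconds.
    """
    limit = l_obj_ms * (1.0 + alpha)
    run = 0
    for p99 in reversed(p99_series_ms):
        if p99 > limit:
            run += 1
        else:
            break
    return run * step_s >= w1_s / 2


def may_relax(burn_w2_samples, p99_ms, l_obj_ms):
    """Reset rule: every Burn_rate(w2) sample over w2 is below 0.5 and P99 <= L_obj."""
    return all(b < 0.5 for b in burn_w2_samples) and p99_ms <= l_obj_ms
```

Note that burn_alert is the logical negation of the S78-5 release condition, so a system is always in exactly one of the two states for a given pair of burn rates.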
IX. Contract Assertion Templates (I70-8)
- Availability: {"type":"slo_avail","target":0.999,"window":"30d"}
- Latency: {"type":"slo_latency","threshold_ms":200,"quantile":0.99,"window":"7d"}
- Budget burn: {"type":"burn_guard","w1":"5m","w2":"1h","b1":4.0,"b2":1.5}
- Channel stability: {"type":"queue_bound","chan":"ingress","q_hi_frac":0.8}
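As an illustration of how these templates might be evaluated, the Python sketch below checks one contract against a dictionary of observed values. It is not assert_thread_contract; the keys of the observed dictionary are hypothetical, and reading queue_bound as q_len <= q_hi_frac * cap is an assumption about the template's meaning.

```python
def check_contract(contract, observed):
    """Return True when the observed values satisfy the contract template."""
    kind = contract["type"]
    if kind == "slo_avail":
        return observed["sli_avail"] >= contract["target"]
    if kind == "slo_latency":
        # observed latency is assumed to be measured at contract["quantile"].
        return observed["latency_ms"] <= contract["threshold_ms"]
    if kind == "burn_guard":
        return (observed["burn_w1"] < contract["b1"]
                and observed["burn_w2"] < contract["b2"])
    if kind == "queue_bound":
        # Assumed meaning: queue length stays at or below a fraction of capacity.
        return observed["q_len"] <= contract["q_hi_frac"] * observed["cap"]
    raise ValueError(f"unknown contract type: {kind}")
```

Raising on unknown types rather than returning True keeps a mistyped template from silently passing.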
X. Coupling to the Execution Graph (Chapter 2 Alignment)
- Node-level SLIs: expose samples for w(v) and P99(v) per v ∈ V; for edges e, expose transfer cost c(e) and retry rate.
- Critical-path report: T_make(G) approx ∑(P50(w) on crit(G)) + ∑(P50(c) on crit(G)), plus P90/P99 to assess tail risk.
- Path availability: estimate A_path via S78-6 and compare to targets.
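The critical-path estimate can be sketched as a longest-path computation over the execution graph. This Python sketch assumes that G is a DAG and that crit(G) is the path maximizing the sum of node work and edge transfer cost; the definition in Chapter 2 may differ. The function name and input shapes are illustrative.

```python
from graphlib import TopologicalSorter


def critical_path(node_w, edge_c):
    """Return (T_make, path) for a DAG.

    node_w maps each node v to P50(w(v)); edge_c maps (u, v) to P50(c(e)).
    Every edge endpoint must appear in node_w.
    """
    preds = {v: set() for v in node_w}
    for (u, v) in edge_c:
        preds[v].add(u)
    best = {}    # heaviest accumulated cost of any path ending at v
    parent = {}  # predecessor of v on that heaviest path
    for v in TopologicalSorter(preds).static_order():
        best[v] = node_w[v]
        parent[v] = None
        for u in preds[v]:
            candidate = best[u] + edge_c[(u, v)] + node_w[v]
            if candidate > best[v]:
                best[v] = candidate
                parent[v] = u
    end = max(best, key=best.get)
    total = best[end]
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return total, path
```

Running the same function with P90 or P99 inputs gives a rough view of tail risk, although sums of per-node quantiles only approximate the quantile of the end-to-end latency.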
XI. SLO Parameter Suggestions (Baseline Bands)
- Online APIs: SLO_avail ∈ [0.999, 0.9999], L_obj ∈ [150 ms, 300 ms], SLA_window ∈ {"7d","30d"}.
- Event streams: SLO_avail ∈ [0.995, 0.999]; with batch ACKs, target P99(W_q) <= 1 s.
- Batch: primary SLI is batch completion: SLI_batch = succeeded_jobs / total_jobs, with a deadline-attainment SLO.
XII. Operational Flow Mx-7 (Rollout and Governance)
- Define “Good” and L_obj/SLO; bake into contracts and code comments.
- Deploy histograms and counters; validate label sets and bucket boundaries; enable trace_span and trace_link.
- Install multi-window burn guards: w1, w2, b1, b2; wire alerts and rollback scripts.
- Shadow observe during canary: compare QPS/P99/ErrRate/rho/bp.level to baseline.
- Open gates and release; if a gate trips, follow the escalation order in VIII; emit an SLA_window report post-release.
- Periodic review: recalibrate epsilon_window, run χ² tests for bucket-drift, tune bucket edges and sampling rates.
XIII. Interface Bindings (I70-7 / I70-8)
- Metric emission
- metric_emit("threads.req_total", 1, {endpoint, prio})
- metric_emit("threads.good_total", 1, {endpoint})
- metric_emit("chan.q_len", q_len, {chan})
- Tracing
- span = trace_span("svc.handle", attrs={gid,pid_thr,endpoint})
- trace_link(span, eid)
- Contract compute
- sli_slo_compute({"type":"latency","quantile":0.99,"threshold_ms":200}, window="7d")
- assert_thread_contract(G, tests=[...])
XIV. Cross-Volume Anchors and Time Calibration
When SLIs straddle domains, T_arr may serve as a synchronization anchor; provide gamma(ell) and measure d ell. Use both forms T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ) and T_arr = ( ∫ ( n_eff / c_ref ) d ell ), and compute the gap delta_form = | ( 1 / c_ref ) * ( ∫ n_eff d ell ) - ( ∫ ( n_eff / c_ref ) d ell ) | as the timing-alignment uncertainty.
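A minimal numerical sketch of the two T_arr forms and their gap, in Python. It assumes that n_eff is sampled at points ell along the path gamma(ell), that c_ref is a constant, and that the integrals are approximated with the trapezoid rule; the function name is hypothetical.

```python
def t_arr_two_forms(n_eff, ell, c_ref):
    """Return (form1, form2, delta_form) from samples n_eff[i] taken at ell[i]."""
    if len(n_eff) != len(ell) or len(ell) < 2:
        raise ValueError("need at least two matching samples")
    steps = range(len(ell) - 1)
    # Form 1: integrate n_eff along the path, then divide by c_ref.
    integral = sum(0.5 * (n_eff[i] + n_eff[i + 1]) * (ell[i + 1] - ell[i])
                   for i in steps)
    form1 = integral / c_ref
    # Form 2: divide by c_ref inside the integrand, then integrate.
    form2 = sum(0.5 * (n_eff[i] / c_ref + n_eff[i + 1] / c_ref) * (ell[i + 1] - ell[i])
                for i in steps)
    return form1, form2, abs(form1 - form2)
```

With a constant c_ref the two forms are algebraically equal, so in this sketch delta_form reflects only floating-point rounding of the discretized sums.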
XV. Deliverables and Acceptance Checklist
- Metric dictionary and bucket config; label allow-list; “Good” adjudication rules.
- SLO contracts and error-budget policy; multi-window burn alerting.
- Trace sampling and retention policy; log schema and privacy mask plan.
- Baseline report for QPS/P99/ErrRate/W_q/rho/bp.level and A_path; drill records and successful rollback evidence.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/