EFT.WP.Core.Errors v1.0
Chapter 7 — Recovery Strategies and Robust Operation
I. Objectives and Scope
- Objective: on top of the established logging–traceability–diagnostics loop, define verifiable and composable recovery strategies that keep the system available and traceable in the presence of error e, heavy-tailed residuals r, environmental drift, and numerical instabilities.
- Scope: three classes of strategies—retry(policy), fallback(models, voting), and graceful_degradation(state, rules)—and their coupling to SLI/SLO, the error budget EB, and the arrival-time quantity T_arr = ( ∫_gamma ( n_eff / c_ref ) d ell ).
II. State and Terminology
- Runtime states: OK, WARN, ERROR, DEGRADED, FALLBACK, RECOVERED.
- Triggers: chi2 = r^T R r, |r_bar|, pass_rate, drift_score(p,q,"KL"), latency_ms.
- Cost and utility: C(op) (operational cost), U_svc(mode) (service utility, mode ∈ {full,partial,minimal}).
- Success probability and budget: p_succ, B_retry (retry budget in attempts or time).
III. Postulates (Recovery and Robustness)
- P77-1 (Finite budget): any strategy must satisfy expected_cost ≤ B_retry; if exceeded, switch to degradation or fallback.
- P77-2 (Monotone improvement): if a strategy does not improve all of chi2/dof, |r_bar|_max, and U, do not re-enter the same strategy branch.
- P77-3 (Evidence binding): every recovery decision must share the trace_id of its trigger event and must attach the used artifacts into the traceability_chain (see Chapter 6).
IV. Retry Strategies (Retry)
- Scheduling model
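- Exponential backoff: t_k = min( b_k + jitter_k , t_max ), with base delay b_k = t_init * alpha^k, alpha > 1, 0 ≤ beta < 1, and jitter_k ~ Uniform( -beta * b_k , beta * b_k ). The jitter is drawn around the un-jittered base delay b_k, so the definition of t_k is not circular.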
- Bounded attempts: after N_max tries, enter fallback or degradation.
- Success probability and budget
- Under i.i.d. approximation, P_succ(N) = 1 - (1 - p_succ)^N.
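- Expected cost: E[C_retry] = ∑_{k=0}^{N-1} P(fail^k) * C_attempt(k), where P(fail^k) = (1 - p_succ)^k is the probability that the first k attempts all failed, so that attempt k is actually run. The result must satisfy P77-1.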
- Numerical-stability retries
- Mesh refinement: if p_hat < p_target (see Chapter 5), retry with h_{k+1} = h_k / 2; note that C_attempt(k) rises with workload.
- Rounding compensation: enable compensated summation, then retry; record the delta in E_round_hat into EB.
- Triggers and stopping (S77-1)
- Trigger: (chi2/dof > chi2_max) OR (pass_rate < target) OR (drift_score ≥ drift_max).
- Stop: (P_succ(N) ≥ P_min) OR (E[C_retry] > B_retry) OR (k ≥ N_max).
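The schedule, success probability, expected cost, and stop rule above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the jitter is assumed to be drawn around the un-jittered base delay t_init * alpha^k, and P(fail^k) is taken as (1 - p_succ)^k under the i.i.d. approximation.

```python
import random

def backoff_delays(t_init, alpha, beta, t_max, n_max, rng=None):
    """Delays t_0..t_{n_max-1}: jittered exponential backoff, capped at t_max."""
    rng = rng or random.Random()
    delays = []
    for k in range(n_max):
        base = t_init * alpha ** k                     # un-jittered base delay
        jitter = rng.uniform(-beta * base, beta * base)
        delays.append(min(base + jitter, t_max))
    return delays

def p_succ_after(n, p_succ):
    """P_succ(N) = 1 - (1 - p_succ)^N under the i.i.d. approximation."""
    return 1.0 - (1.0 - p_succ) ** n

def expected_retry_cost(n, p_succ, cost_of_attempt):
    """E[C_retry] = sum over k = 0..n-1 of (1 - p_succ)^k * C_attempt(k)."""
    return sum((1.0 - p_succ) ** k * cost_of_attempt(k) for k in range(n))

def plan_attempts(p_succ, cost_of_attempt, b_retry, p_min, n_max):
    """Apply the S77-1 stop rule; return (number of attempts, stop reason)."""
    for n in range(1, n_max + 1):
        if expected_retry_cost(n, p_succ, cost_of_attempt) > b_retry:
            return n - 1, "budget"                     # attempt n would break P77-1
        if p_succ_after(n, p_succ) >= p_min:
            return n, "p_min"
    return n_max, "n_max"
```

For example, with p_succ = 0.5 and a per-attempt cost that doubles with each mesh refinement, `plan_attempts(0.5, lambda k: 2.0 ** k, b_retry=10.0, p_min=0.9, n_max=8)` returns `(4, "p_min")`; with b_retry = 2.5 the same call stops at `(2, "budget")` and hands over to fallback or degradation.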
V. Fallback Strategies (Fallback)
- Fallback pool and voting
- Model set: models = [m_0, m_1, ..., m_K], where m_0 is the primary model.
- Voting schemes:
- weighted: y_hat = ∑_j w_j * m_j(x), w_j >= 0, ∑ w_j = 1; take w_j ∝ 1 / RMSE_j on validation or 1 / chi2_j online.
- median-of-means: median across block means, robust to heavy tails.
- Switching criterion (S77-2)
If there exists j such that chi2_j/dof < chi2_0/dof - delta_chi2 and U_j ≤ U_budget, switch to m_j or adopt a weighted ensemble.
- Traceability and consistency
Log model_id_from -> model_id_to, evidence_refs, and expected_delta(chi2); report fallback=models, voting=....
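The two voting schemes and the S77-2 switching test can be illustrated with a short Python sketch. The helper names are hypothetical, and the rule for choosing among several qualifying candidates (lowest chi2/dof) is an assumption; the text only requires that some m_j satisfy the criterion.

```python
from statistics import mean, median

def inverse_error_weights(errors):
    """w_j proportional to 1 / error_j (RMSE_j or chi2_j), normalized to sum to 1."""
    inv = [1.0 / e for e in errors]
    total = sum(inv)
    return [v / total for v in inv]

def weighted_vote(predictions, weights):
    """y_hat = sum_j w_j * m_j(x), given the per-model predictions m_j(x)."""
    return sum(w * y for w, y in zip(weights, predictions))

def median_of_means(values, block_size):
    """Median across block means; a few extreme values cannot move it far."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return median(mean(b) for b in blocks)

def pick_fallback(chi2_per_dof, u_values, delta_chi2, u_budget):
    """Index of the model to switch to under S77-2, or 0 to stay on m_0.

    Assumption: among qualifying candidates, prefer the lowest chi2/dof.
    """
    best = 0
    for j in range(1, len(chi2_per_dof)):
        qualifies = (chi2_per_dof[j] < chi2_per_dof[0] - delta_chi2
                     and u_values[j] <= u_budget)
        if qualifies and (best == 0 or chi2_per_dof[j] < chi2_per_dof[best]):
            best = j
    return best
```

With eight samples equal to 1 and one outlier of 1000, `median_of_means(values, 3)` stays at 1, whereas the plain mean is pulled above 100.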
VI. Degradation Strategies (Graceful Degradation)
- Modes
- full: all features enabled;
- partial: disable non-critical or expensive paths while keeping compliant outputs for the key measurand and U;
- minimal: output the minimally viable set only (value, U, EB, traceability_chain).
- Rule expression (S77-3)
- Predicate–action pairs: if cond(x, r, SLI) then action(mode, knobs).
- Example: if (latency_ms > L_max) AND (chi2/dof ≤ chi2_soft) then action(partial, {disable_heavy_postproc=true}).
- Decisions and guard band
Conformity decisions still use the metrology guard_band(result, U, tol); in DEGRADED, reports must state mode and effective_tol.
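The predicate–action form of S77-3 maps naturally onto a small rule table. The sketch below is one possible encoding, assuming a first-match-wins evaluation order; the thresholds L_MAX and CHI2_SOFT and the state keys are placeholder values chosen for illustration.

```python
L_MAX = 200.0      # hypothetical latency ceiling in ms
CHI2_SOFT = 2.0    # hypothetical soft limit on chi2/dof

def make_rule(cond, mode, knobs):
    """Bundle a predicate with the mode and knobs it selects."""
    return {"cond": cond, "mode": mode, "knobs": knobs}

def apply_rules(state, rules):
    """Return (mode, knobs) of the first rule whose predicate holds, else full mode."""
    for rule in rules:
        if rule["cond"](state):
            return rule["mode"], dict(rule["knobs"])
    return "full", {}

# The example rule from the text: slow but statistically acceptable -> partial mode.
rules = [
    make_rule(
        lambda s: s["latency_ms"] > L_MAX and s["chi2_per_dof"] <= CHI2_SOFT,
        "partial",
        {"disable_heavy_postproc": True},
    ),
]
```

Calling `apply_rules({"latency_ms": 350.0, "chi2_per_dof": 1.4}, rules)` selects partial mode with heavy post-processing disabled; a state within the latency limit falls through to full mode.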
VII. Strategy Composition and Priority
- Composition sequence (S77-4)
- Default order: retry → fallback → graceful_degradation.
- If drift_score exceeds threshold, prefer fallback (model switch) over blind retries.
- Decision function
policy_decision = argmax_{strategy} { E[U_svc] - lambda * E[C(strategy)] }, with lambda > 0 the cost weight.
- Parallelism and mutual exclusion
Do not run fallback and degradation concurrently within the same span_id; allow cross-span parallelism but share the same trace_id.
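As a rough illustration of the decision function and the drift override in S77-4, the following Python sketch scores each strategy by E[U_svc] - lambda * E[C]. The candidate table and the function names are hypothetical; the expected utilities and costs would come from the monitors and budgets defined earlier.

```python
def policy_decision(candidates, lam):
    """candidates maps strategy name -> (expected_utility, expected_cost)."""
    if lam <= 0:
        raise ValueError("lambda must be positive")
    return max(candidates, key=lambda name: candidates[name][0] - lam * candidates[name][1])

def choose_strategy(candidates, lam, drift_score, drift_max):
    """Prefer fallback when drift exceeds the threshold, else use the argmax."""
    if drift_score >= drift_max and "fallback" in candidates:
        return "fallback"
    return policy_decision(candidates, lam)
```

A small cost weight favors the high-utility, high-cost branch, and a large one favors the cheap branch. The drift check runs before the argmax so that a drifting input does not trigger blind retries.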
VIII. Coupling with SLI/SLO
- Use sli_slo_compute outputs as gates:
Example: pass_rate ≥ 0.99 and latency_p95 ≤ 200 ms → OK; otherwise enter WARN/ERROR and trigger strategies.
- Adaptive thresholds
Under heavy-tail StudentT(nu) noise, gate with quantiles: |r_bar|_q ≤ t_q, q ∈ {0.90, 0.95}, to avoid mean sensitivity.
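A minimal sketch of the gate and the quantile threshold, in Python. Two points are assumptions rather than part of the specification: the split between WARN (one gate fails) and ERROR (both fail), and the nearest-rank estimator used for the quantile of the absolute residuals.

```python
import math

def slo_state(pass_rate, latency_p95_ms, pass_min=0.99, latency_max_ms=200.0):
    """Map the two example gates to a runtime state.

    Assumption: exactly one failing gate -> WARN, both failing -> ERROR.
    """
    ok_pass = pass_rate >= pass_min
    ok_latency = latency_p95_ms <= latency_max_ms
    if ok_pass and ok_latency:
        return "OK"
    return "WARN" if (ok_pass or ok_latency) else "ERROR"

def abs_quantile(residuals, q):
    """Nearest-rank q-quantile of the absolute residuals."""
    ordered = sorted(abs(r) for r in residuals)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def quantile_gate(residuals, q, t_q):
    """True when the q-quantile of the absolute residuals is within t_q."""
    return abs_quantile(residuals, q) <= t_q
```

With 98 residuals of magnitude 0.1 and two outliers of 50, the 0.95 quantile stays at 0.1 and the gate passes for t_q = 0.5, while the mean absolute residual (about 1.1) would fail the same threshold.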
IX. Robust Operation Example for Arrival Time T_arr
- Trigger context
While computing T_arr = ( ∫_gamma ( n_eff / c_ref ) d ell ), observe chi2/dof = 1.9, p_hat = 2.7, drift_score = 0.13.
- Strategy execution
- retry: refine h -> h/2 and enable compensated summation; stop if E[C_retry] > B_retry.
- fallback: switch the n_eff estimator from m_0 to m_1 (heavy-tail regression or StudentT(nu)); record model_id migration and delta(chi2).
- graceful_degradation: if latency_p95 still fails, move to partial mode, pin c_ref to nominal, and defer fine-grained corr_env(•; RefCond) evaluation.
- Reporting essentials
Report mode="partial" and include value, U, EB, path_spec, h, p_hat, and traceability_chain; state which of the two T_arr forms was chosen and the rationale.
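The retry step of this example (halve h, sum with compensation, check convergence) can be sketched numerically. The code below assumes the path gamma is parameterized by arc length on [0, length], that n_eff is a callable of arc length, and that c_ref is constant; it uses the trapezoid rule and Python's `math.fsum`, an exactly rounded summation that stands in here for compensated summation. The observed-order estimate is the standard three-level formula and is only an assumed reading of p_hat from Chapter 5.

```python
import math

def t_arr_trapezoid(n_eff, c_ref, length, n_steps):
    """Trapezoid estimate of T_arr = integral of n_eff / c_ref over [0, length]."""
    h = length / n_steps
    interior = (n_eff(i * h) for i in range(1, n_steps))
    total = math.fsum([0.5 * n_eff(0.0), 0.5 * n_eff(length), *interior])
    return total * h / c_ref

def refine_until(n_eff, c_ref, length, n_steps, rel_tol, max_halvings):
    """Halve h until successive estimates agree to rel_tol; return (value, n_steps)."""
    prev = t_arr_trapezoid(n_eff, c_ref, length, n_steps)
    for _ in range(max_halvings):
        n_steps *= 2                                   # h_{k+1} = h_k / 2
        cur = t_arr_trapezoid(n_eff, c_ref, length, n_steps)
        if abs(cur - prev) <= rel_tol * abs(cur):
            return cur, n_steps
        prev = cur
    return prev, n_steps                               # budget exhausted: caller falls back

def observed_order(t_h, t_h2, t_h4):
    """Estimate the convergence order from estimates at steps h, h/2, h/4."""
    return math.log2(abs(t_h - t_h2) / abs(t_h2 - t_h4))
```

Each halving doubles the number of integrand evaluations, which is the rising C_attempt(k) that the retry budget has to absorb. For a smooth integrand the trapezoid rule should show an observed order close to 2.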
X. Interface Mapping and Constraints
- retry(policy:dict) -> callable
Keys: t_init, alpha, beta, t_max, N_max, B_retry, P_min, p_target.
- fallback(models:list, voting:str="weighted") -> any
Keys: weights or block_size (for median-of-means), delta_chi2, U_budget.
- graceful_degradation(state:any, rules:dict) -> any
Keys: modes={full,partial,minimal}, knobs, effective_tol.
- Common constraints
Each strategy action must call log_event and update the traceability_chain (see Chapter 6); the minimal evidence set may not be bypassed.
XI. Recovery Workflow Mx-5 (Decide → Execute → Verify)
- Read monitors: chi2/dof, |r_bar|_max, pass_rate, drift_score, latency_p95.
- Select strategy: compute policy_decision per S77-4; if drift_score exceeds threshold, prioritize fallback.
- Execute action: run retry or fallback or graceful_degradation; log costs and evidence.
- Post verification: re-measure SLIs and metrology items (value, U, EB); if P77-2 shows no improvement, switch branches.
- Converge and exit: reach SLO or exhaust branches; emit RECOVERED or persist a DEGRADED report.
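The Decide → Execute → Verify loop can be sketched as a small driver. This is an illustrative outline only: the callables `read_monitors`, `meets_slo`, and `improved` are hypothetical hooks, the strategy order is supplied by the caller (for example from policy_decision), and the logging and traceability calls required by Chapter 6 are omitted for brevity.

```python
def run_recovery(read_monitors, strategies, meets_slo, improved, max_rounds=3):
    """Run strategies in order; return (final_state, list of executed branch names).

    strategies: ordered list of (name, action) pairs; action() performs one step.
    A branch is re-entered only while it keeps improving the monitors (P77-2).
    """
    history = []
    before = read_monitors()
    for name, action in strategies:
        for _ in range(max_rounds):
            action()                                   # execute
            after = read_monitors()                    # verify
            history.append(name)
            if meets_slo(after):
                return "RECOVERED", history
            got_better = improved(before, after)
            before = after
            if not got_better:
                break                                  # no improvement: switch branches
    return "DEGRADED", history                         # all branches exhausted
```

In a simulated run where retry leaves chi2/dof unchanged and each fallback step lowers it by 0.5 from a start of 1.9, the driver tries retry once, then runs fallback twice and exits with RECOVERED once chi2/dof drops to 1.2 or below.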
XII. Recovery-Specific Minimal Reporting Fields
- strategy, params_used, attempts, E[C], P_succ_hat, delta(chi2), delta(latency_p95).
- mode (if degraded), effective_tol, features_disabled.
- model_transition (if fallback), and weights or block_size.
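A report can be checked against this minimal field set before it is emitted. The sketch below assumes the report is a plain dictionary keyed by the field names listed above and that the strategy field takes the values retry, fallback, or graceful_degradation; the helper name is hypothetical.

```python
BASE_FIELDS = ("strategy", "params_used", "attempts", "E[C]", "P_succ_hat",
               "delta(chi2)", "delta(latency_p95)")
DEGRADED_FIELDS = ("mode", "effective_tol", "features_disabled")

def missing_report_fields(report):
    """List the minimal reporting fields that are absent from a report dict."""
    missing = [f for f in BASE_FIELDS if f not in report]
    if report.get("strategy") == "graceful_degradation":
        missing += [f for f in DEGRADED_FIELDS if f not in report]
    if report.get("strategy") == "fallback":
        if "model_transition" not in report:
            missing.append("model_transition")
        if "weights" not in report and "block_size" not in report:
            missing.append("weights|block_size")       # either one satisfies the rule
    return missing
```

An empty list means the report carries the minimal set for its strategy; any other result names the fields to add before the report is persisted.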
XIII. Safety and Compliance
- Resource gates: enforce B_retry and N_max for retry and fallback to prevent cascading amplification.
- Auditability: retain events and artifacts for every failed branch with hash and created_at.
XIV. Chapter Outputs and Linkage
- Outputs: postulates P77-1…P77-3, strategy criteria and composition S77-1…S77-4, workflow Mx-5, and the minimal reporting set.
- Next: Chapter 8 will integrate these strategies with the two T_arr conventions and the I40/I50 interfaces for end-to-end regression and cross-volume consistency validation.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/