Chapter 12 Robustness, Shift & Adversarial
I. Chapter Purpose & Scope
Fix specifications for robustness, distribution shift, and adversarial evaluation in benchmarks: shift types and severity scales, adversarial threat models and parameters, evaluation protocol and thresholds, reporting format and statistical significance, and linkage with scoring/ranking/gates; ensure consistency with task definitions, metric system, evaluation protocol, metrology, and citation anchors.
II. Terminology & Dependencies
- Terms: synthetic_shift, natural_shift, severity, Δ_rel (relative drop), adv.threat_model (whitebox|blackbox|transfer), ‖δ‖_p ≤ ε, attack_steps/restarts/targeted, robust_accuracy, auc_robust.
- Dependencies: metrics & units (Ch.6), evaluation protocol (Ch.7), runtime environment (Ch.10), scoring & gates (Ch.8), units & dimensions (Core.Metrology v1.0:check_dim).
- Math & symbols: wrap inline symbols; any division/integral/composite operator must use parentheses; for T_arr use
  - T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ), or
  - T_arr = ( ∫ ( n_eff / c_ref ) d ell ),
  declaring gamma(ell) and d ell. No Chinese in formulas/symbols/definitions.
III. Fields & Structure (Normative)
robustness:
  shift_tests:
    - {name: "snr_drop", severity: [3,6,9], unit: "dB", policy: "additive-noise"}
    - {name: "time_jitter", ms: [5,10,20], policy: "shuffle-window"}
    - {name: "spec_notch", bands: [["0.3","0.5"],["0.6","0.7"]], unit: "fraction"}
  natural_shifts:
    axes: ["device","region","season","domain","locale"]
    splits: ["val","test"]
  adversarial:
    enabled: false
    threat_model: "whitebox|blackbox|transfer"
    norm: "Linf|L2|L1"
    epsilon: 0.01
    steps: 10
    restarts: 1
    targeted: false
  metrics:
    primary: ["Δ_rel","acc_robust","auc_robust"]
    curves: ["acc-vs-ε","acc-vs-SNR","acc-vs-mask"]
  thresholds:
    drop_rel_max: 0.10
    acc_robust_min: 0.80
    ece_max_under_shift: 0.05
  reporting:
    table_axes: ["shift","severity","metric"]
    include_ci: true
    significance: {method: "bootstrap", B: 10000, alpha: 0.05}
online_consistency:
  shadow_mode: true
  window: "7d"
  drift_monitors: ["drift_kl","psi"]
  alert_rules:
    - {name: "robust_drop", rule: "Δ_rel>0.10 for 60m", severity: "high"}
IV. Shift Types & Severity Scales
- Synthetic shifts:
  - snr_drop: additive noise with severity in dB; declare noise type (Gaussian/colored), random seed, and injection point (pre/post normalization).
  - time_jitter: window reshuffle/jitter; specify ms window and boundary handling.
  - spec_notch: spectral band notches; declare normalized band ranges and mask policy (zero/median).
- Natural shifts: axes device/region/season/domain/locale; report coverage and sample counts and check consistency with Dataset Card coverage.
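The snr_drop declaration above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes white Gaussian noise, treats the input as noise-free so that the severity is read as the target SNR in dB, and injects noise before any normalization. The function names add_gaussian_noise and measured_snr_db are hypothetical.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise so that signal power / noise power ~= snr_db.

    Assumption: `signal` is treated as noise-free; a benchmark may instead
    define severity as a drop relative to a measured baseline SNR.
    """
    rng = random.Random(seed)  # declared seed keeps the shift reproducible
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_energy = sum(c * c for c in clean)
    noise_energy = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10 * math.log10(signal_energy / noise_energy)


if __name__ == "__main__":
    # Toy sine wave standing in for a benchmark sample.
    clean = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(20000)]
    for severity in (3, 6, 9):
        noisy = add_gaussian_noise(clean, severity, seed=42)
        print(severity, round(measured_snr_db(clean, noisy), 2))
```

Recording the noise type, seed, and injection point alongside each severity level is what makes results at the same severity comparable across runs.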
V. Adversarial Evaluation (Threat Models & Parameters)
- Threat models: whitebox (e.g., PGD), blackbox (score/decision-based), transfer.
- Constraints: enforce ‖δ‖_p ≤ ε; provide steps/restarts/targeted.
- Safety guardrails: use adversarial samples offline or as shadow traffic only; never route them into production paths without isolation.
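The whitebox case with the constraint ‖δ‖_p ≤ ε can be illustrated with a PGD loop under the Linf norm. This is a sketch under stated assumptions: grad_fn is a hypothetical callable that returns the loss gradient with respect to the input, step_size is an extra parameter not named in this chapter, and random restarts and targeted mode are omitted for brevity.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pgd_linf(x, grad_fn, epsilon, steps, step_size):
    """Projected gradient ascent on the loss inside an L-infinity ball.

    x        : clean input as a list of floats
    grad_fn  : hypothetical callable returning d(loss)/d(input) at a point
    epsilon  : perturbation budget, enforcing max|delta_i| <= epsilon
    """
    x_adv = list(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        # Untargeted attack: step in the direction that increases the loss.
        x_adv = [xa + step_size * sign(g) for xa, g in zip(x_adv, grad)]
        # Projection: clip each coordinate back into [x_i - eps, x_i + eps].
        x_adv = [min(max(xa, xi - epsilon), xi + epsilon)
                 for xa, xi in zip(x_adv, x)]
    return x_adv


if __name__ == "__main__":
    # Toy linear loss L(x) = w . x, whose input gradient is the constant w.
    w = [0.5, -2.0, 0.0]
    x0 = [0.1, 0.2, 0.3]
    adv = pgd_linf(x0, lambda _x: w, epsilon=0.01, steps=10, step_size=0.0025)
    print(adv)
```

The projection step is what guarantees the declared budget regardless of steps or step size, so a report can state ε as a hard bound rather than a target.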
VI. Metrics, Thresholds & Coupling
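- Relative drop: Δ_rel = ( baseline - under_shift ) / max( baseline, ε ), where ε here is a small positive floor that guards against division by zero, not the adversarial budget ε of Section V.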
- Robust accuracy: acc_robust at a given severity or worst-case over a set.
- Area metrics: auc_robust over ε/SNR/mask spans.
- Calibration drift: report ECE/Brier under shift and compare with ece_max_under_shift.
- Gates: if Δ_rel > drop_rel_max, acc_robust < acc_robust_min, or ECE under shift exceeds ece_max_under_shift, block the release; align with Ch.8 scoring gates.
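The metrics and gate logic above can be sketched as follows. The function names are hypothetical, and several choices are assumptions rather than requirements of this chapter: acc_robust is taken as the worst case over the severity set, auc_robust is a trapezoid area normalized by the span, and ECE uses equal-width confidence bins.

```python
def delta_rel(baseline, under_shift, floor=1e-12):
    """Relative drop; `floor` plays the role of the small epsilon guard."""
    return (baseline - under_shift) / max(baseline, floor)


def auc_robust(xs, accs):
    """Trapezoid area of accuracy over a severity axis, normalized by span."""
    area = sum((xs[i + 1] - xs[i]) * (accs[i] + accs[i + 1]) / 2
               for i in range(len(xs) - 1))
    return area / (xs[-1] - xs[0])


def ece(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, hit))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(h for _, h in b) / len(b)
            err += len(b) / total * abs(acc - avg_conf)
    return err


def gate_failures(baseline, acc_by_severity, ece_under_shift, thresholds):
    """Return the names of violated thresholds; empty means the gate passes."""
    worst = min(acc_by_severity.values())  # assumption: worst case over the set
    failures = []
    if delta_rel(baseline, worst) > thresholds["drop_rel_max"]:
        failures.append("drop_rel_max")
    if worst < thresholds["acc_robust_min"]:
        failures.append("acc_robust_min")
    if ece_under_shift > thresholds["ece_max_under_shift"]:
        failures.append("ece_max_under_shift")
    return failures


if __name__ == "__main__":
    thresholds = {"drop_rel_max": 0.10, "acc_robust_min": 0.80,
                  "ece_max_under_shift": 0.05}
    acc = {3: 0.88, 6: 0.85, 9: 0.79}  # illustrative numbers only
    print(gate_failures(0.90, acc, 0.04, thresholds))
    print(auc_robust(list(acc), list(acc.values())))
```

Returning the list of violated thresholds, rather than a single boolean, keeps the blocking reason visible when the result is handed to the Ch.8 gates.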
VII. Statistics & Reporting
- Significance: default bootstrap (B≥10k, α=0.05) with CI_95; apply Holm–Bonferroni for multi-model/multi-axis comparisons.
- Format: tables keyed by shift/severity/metric; include curves (acc-vs-ε/SNR/mask) and key point estimates.
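The significance settings can be illustrated with a percentile bootstrap for an accuracy estimate and a Holm-Bonferroni step-down over several p-values. This is a sketch only: the percentile method is one of several bootstrap interval constructions and is an assumption here, and the function names are hypothetical.

```python
import random


def bootstrap_ci(correct, B=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from per-sample 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample with replacement B times and record each resampled accuracy.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(B))
    lo_idx = max(0, int((alpha / 2) * B))
    hi_idx = min(B - 1, int((1 - alpha / 2) * B) - 1)
    return stats[lo_idx], stats[hi_idx]


def holm_bonferroni(pvals, alpha=0.05):
    """Return a reject/keep flag per hypothesis using Holm's step-down rule."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is tested at alpha / (m - rank).
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are kept
    return reject


if __name__ == "__main__":
    outcomes = [1] * 80 + [0] * 20  # illustrative: 80 of 100 correct
    print(bootstrap_ci(outcomes, B=10000, alpha=0.05, seed=7))
    print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

Holm's procedure controls the family-wise error rate like plain Bonferroni but is never less powerful, which matters when many shift axes and models are compared at once.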
VIII. Metrology & Units (SI)
- Performance & resources: QPS(1/s), latency_ms.{p50,p95,p99}, ρ(—), net_mbps, size_bytes.
- Mandatory: metrology:{units:"SI", check_dim:true}; normalize units before composition/comparison.
- Path quantities: if robustness experiments involve T_arr-related processing or metrics, register delta_form/path/measure and validate using the two equivalent T_arr forms given in Section II.
IX. Machine-Readable Fragment (Drop-in)
robustness:
  shift_tests:
    - {name: "snr_drop", severity: [3,6,9], unit: "dB", policy: "additive-noise"}
    - {name: "time_jitter", ms: [5,10,20], policy: "shuffle-window"}
    - {name: "spec_notch", bands: [["0.3","0.5"],["0.6","0.7"]], unit: "fraction"}
  natural_shifts: {axes: ["device","region"], splits: ["val","test"]}
  adversarial: {enabled: false, threat_model: "whitebox", norm: "Linf", epsilon: 0.01, steps: 10, restarts: 1, targeted: false}
  metrics: {primary: ["Δ_rel","acc_robust"], curves: ["acc-vs-ε","acc-vs-SNR"]}
  thresholds: {drop_rel_max: 0.10, acc_robust_min: 0.80, ece_max_under_shift: 0.05}
  reporting: {table_axes: ["shift","severity","metric"], include_ci: true, significance: {method: "bootstrap", B: 10000, alpha: 0.05}}
online_consistency:
  shadow_mode: true
  window: "7d"
  drift_monitors: ["drift_kl","psi"]
  alert_rules: [{name: "robust_drop", rule: "Δ_rel>0.10 for 60m", severity: "high"}]
metrology: {units: "SI", check_dim: true}
X. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: SHIFT.SPEC_DEFINED
    when: "$.robustness.shift_tests[*]"
    assert: "has_keys(name) and (has_key(severity) or has_key(ms) or has_key(bands))"
    level: error
  - id: ADV.THREAT_ALLOWED
    when: "$.robustness.adversarial.threat_model"
    assert: "value in ['whitebox','blackbox','transfer']"
    level: error
  - id: ADV.PARAMS_VALID
    when: "$.robustness.adversarial"
    assert: "value.enabled == false or (has_keys(norm, epsilon, steps) and epsilon > 0 and steps >= 1)"
    level: error
  - id: METRIC.THRESHOLDS_DEFINED
    when: "$.robustness.thresholds"
    assert: "has_keys(drop_rel_max, acc_robust_min)"
    level: error
  - id: REPORT.CI_REQUIRED
    when: "$.robustness.reporting"
    assert: "value.include_ci == true and has_keys(significance.method, significance.alpha)"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
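The rules above can be hand-translated into plain predicates over a parsed fragment. This is a sketch, not the lint engine itself: the rule expression language is not specified in this chapter, so the code assumes that a rule applies only when its `when` path exists, and the function name lint is hypothetical.

```python
ALLOWED_THREATS = {"whitebox", "blackbox", "transfer"}


def lint(doc):
    """Return the ids of failed rules for a parsed fragment (a dict)."""
    failed = []
    rob = doc.get("robustness", {})

    # SHIFT.SPEC_DEFINED: each test needs a name and one severity-like key.
    for test in rob.get("shift_tests", []):
        if "name" not in test or not ({"severity", "ms", "bands"} & set(test)):
            failed.append("SHIFT.SPEC_DEFINED")
            break

    adv = rob.get("adversarial")
    if adv is not None:
        if "threat_model" in adv and adv["threat_model"] not in ALLOWED_THREATS:
            failed.append("ADV.THREAT_ALLOWED")
        params_ok = adv.get("enabled") is False or (
            all(k in adv for k in ("norm", "epsilon", "steps"))
            and adv["epsilon"] > 0 and adv["steps"] >= 1)
        if not params_ok:
            failed.append("ADV.PARAMS_VALID")

    thr = rob.get("thresholds")
    if thr is not None and not {"drop_rel_max", "acc_robust_min"} <= set(thr):
        failed.append("METRIC.THRESHOLDS_DEFINED")

    rep = rob.get("reporting")
    if rep is not None:
        sig = rep.get("significance", {})
        if not (rep.get("include_ci") is True
                and "method" in sig and "alpha" in sig):
            failed.append("REPORT.CI_REQUIRED")

    met = doc.get("metrology")
    if met is not None and not (
            met.get("units") == "SI" and met.get("check_dim") is True):
        failed.append("METROLOGY.SI_AND_CHECKDIM")
    return failed


if __name__ == "__main__":
    fragment = {
        "robustness": {
            "shift_tests": [{"name": "snr_drop", "severity": [3, 6, 9]}],
            "adversarial": {"enabled": False, "threat_model": "whitebox"},
            "thresholds": {"drop_rel_max": 0.10, "acc_robust_min": 0.80},
            "reporting": {"include_ci": True,
                          "significance": {"method": "bootstrap", "alpha": 0.05}},
        },
        "metrology": {"units": "SI", "check_dim": True},
    }
    print(lint(fragment))  # an empty list means the fragment is lint-clean
```

All six rules are at level error, so any non-empty result would count as a failed lint under the compliance checklist.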
XI. Cross-Reference Anchors
- Metric system & units: EFT.WP.Data.Benchmarks v1.0, Ch.6.
- Scoring, normalization & gates: Ch.8.
- Evaluation protocol & runtime environment: EFT.WP.Data.ModelCards v1.0, Ch.11; this volume, Ch.10.
- Unit & dimension checks: EFT.WP.Core.Metrology v1.0:check_dim.
XII. Chapter Compliance Checklist
- Synthetic/natural shift and adversarial settings complete; threat model, norm, and ε/steps/restarts/targeted explicit.
- Metrics & thresholds present; jointly report Δ_rel/acc_robust/auc_robust and calibration drift; gates aligned with Ch.8.
- Significance & CI configuration (with multiple-comparison correction) active; reports include tables and curves.
- SI metrology with check_dim=true; if T_arr appears, delta_form/path/measure registered and validated.
- Machine-readable fragment is drop-in and lint-clean; online consistency (if applicable) includes shadow/drift monitors and alert rules.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/