15-EFT.WP.Methods.Falsification v1.0 | Chapter 7: Statistical Testing & Error Control

Chapter 7: Statistical Testing & Error Control

I. Scope & Objectives

Establish the statistical testing and error-control framework for falsification, covering significance & power, sample-size planning, equivalence & non-inferiority tests, multiple-testing control (FDR/FWER), and sequential/adaptive tests with significance-budget spending. Unify offline batch evaluation and online gating so that GateDecision ∈ {pass, hold, block} is driven by statistical evidence and risk budgets. All procedures execute under a locked environment EnvLock and the shared time base ts = alpha + beta * tau_mono.
Conflict-name disambiguation
To avoid confusion with the time-base mapping parameters alpha, beta, this chapter denotes significance and Type-II error as alpha_sig and beta_err, respectively; power is power = 1 - beta_err.

II. Terms & Symbols

Hypotheses & statistics
H0, H1, T(x) (test statistic), C_alpha (rejection region), p_value.
Effect sizes: d = ( mu_1 - mu_0 ) / sigma_pooled, OR (odds ratio), ΔAUC, ΔECE, ΔNLL.
Equivalence / non-inferiority margins: delta_equiv, delta_noninf.
Errors & power
alpha_sig (Type-I error), beta_err (Type-II error), power = 1 - beta_err.
Multiple testing: m (number of tests), R (rejections), V (false rejections),
FDR = E[ V / max(R,1) ], FWER = P( V ≥ 1 ), q_star (target FDR).
Sequential testing & budgets
Likelihood-ratio sequence Lambda_n, thresholds A, B, spending function alpha_spend(t); family-wise budget alpha_family.
Sample size & quantiles
z_{p} (normal quantile), t_{p,df} (t quantile), n_per_group (per-group size), N_min (minimum total size).

III. Postulates & Minimal Equations

P51-10 (Familywise significance-budget postulate)
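For a family {H0_i} with allocations {alpha_i} satisfying Σ alpha_i ≤ alpha_family, testing each H0_i at its own level alpha_i (a conservative single-step, Bonferroni-type adjustment) gives FWER ≤ alpha_family by the union bound, with no assumption on dependence among the tests.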
P51-11 (Consistency of significance spending)
In sequential/adaptive testing, if the per-look spends satisfy Σ_{t=1..T} alpha_spend(t) ≤ alpha_family and the stopping rule depends only on data observed up to the current look (that is, it is measurable with respect to the sample path under H0), then global Type-I error is controlled: P_H0( reject ) ≤ alpha_family.
S52-18 (p-value & rejection region)
One-sided: p_value = P( T ≥ T_obs | H0 ); two-sided:
p_value = min{ 1, 2 * min{ P( T ≥ T_obs | H0 ), P( T ≤ T_obs | H0 ) } }, where the cap at 1 matters only for discrete T; rule: p_value ≤ alpha_sig → reject H0.
S52-19 (Power definition)
power = P( T ∈ C_{alpha_sig} | H1 ) = 1 - beta_err.
S52-20 (Two independent means, z-test, known variance)
n_per_group = ( ( z_{1 - alpha_sig/2} + z_{1 - beta_err} )^2 * 2 * sigma^2 ) / delta_min^2,
where delta_min = | mu_1 - mu_0 | is the minimal detectable effect.
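The S52-20 formula can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function names and the default values alpha_sig = 0.05, beta_err = 0.20 are illustrative assumptions, not part of the I50-12 binding.

```python
import math
from statistics import NormalDist


def n_per_group_means(delta_min: float, sigma: float,
                      alpha_sig: float = 0.05, beta_err: float = 0.20) -> int:
    """Per-group size for a two-sided, two-sample z-test with known sigma (S52-20)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha_sig / 2)  # z_{1 - alpha_sig/2}
    z_beta = std_normal.inv_cdf(1 - beta_err)        # z_{1 - beta_err}
    n = (z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / delta_min ** 2
    return math.ceil(n)  # round up so the power target is not undershot


def approx_power(n: int, delta_min: float, sigma: float,
                 alpha_sig: float = 0.05) -> float:
    """Approximate power at the planned n, ignoring the negligible far tail."""
    std_normal = NormalDist()
    se = sigma * math.sqrt(2 / n)  # standard error of the mean difference
    return std_normal.cdf(delta_min / se - std_normal.inv_cdf(1 - alpha_sig / 2))


if __name__ == "__main__":
    n = n_per_group_means(delta_min=0.5, sigma=1.0)
    print(n, round(approx_power(n, 0.5, 1.0), 3))
```

Rounding up with ceil is a design choice: the back-computed power then sits at or slightly above power = 1 - beta_err.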
S52-21 (Two-proportions sample-size approximation)
With target proportions p1, p2, p_bar = ( p1 + p2 ) / 2:
n_per_group = ( z_{1 - alpha_sig/2} * sqrt( 2 * p_bar * ( 1 - p_bar ) ) +
z_{1 - beta_err} * sqrt( p1 * ( 1 - p1 ) + p2 * ( 1 - p2 ) ) )^2 / ( p1 - p2 )^2.
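As a sketch of S52-21 under the same conventions, the approximation can be coded as below. The function name and defaults are assumptions introduced for illustration.

```python
import math
from statistics import NormalDist


def n_per_group_proportions(p1: float, p2: float,
                            alpha_sig: float = 0.05,
                            beta_err: float = 0.20) -> int:
    """Per-group size for comparing two proportions (S52-21 approximation)."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ; the effect size is zero")
    std_normal = NormalDist()
    p_bar = (p1 + p2) / 2
    # Null-hypothesis term uses the pooled proportion p_bar.
    term_null = std_normal.inv_cdf(1 - alpha_sig / 2) * math.sqrt(2 * p_bar * (1 - p_bar))
    # Alternative term uses the separate binomial variances.
    term_alt = std_normal.inv_cdf(1 - beta_err) * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil((term_null + term_alt) ** 2 / (p1 - p2) ** 2)


if __name__ == "__main__":
    print(n_per_group_proportions(0.10, 0.15))
```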
S52-22 (Benjamini–Hochberg, FDR control)
Sort p_(1) ≤ ... ≤ p_(m), take
k = max{ i : p_(i) ≤ ( i / m ) * q_star }, with k = 0 (no rejections) if no i qualifies;
reject {H0_(1)..H0_(k)}, ensuring FDR ≤ q_star when the p-values are independent or positively dependent.
S52-23 (Holm step-down, FWER control)
Sort p_(1) ≤ ... ≤ p_(m); for i = 1, 2, ..., reject H0_(i) while p_(i) ≤ alpha_sig / ( m - i + 1 ). At the first failure, stop and retain (do not reject) that hypothesis and all remaining ones; FWER ≤ alpha_sig under arbitrary dependence.
S52-24 (Hierarchical gatekeeping & priorities)
With tiers L1 → L2 → ... and budgets {alpha_l}, if Lk fails, do not release the budget of Lk+1; if Lk passes, roll unspent alpha_k into the next tier:
alpha_{k+1} ← alpha_{k+1} + unspent(alpha_k).
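The roll-forward rule of S52-24 can be sketched as a loop over tiers. The callback run_tier and its return convention (passed, spent) are hypothetical and introduced only for illustration; the chapter does not specify how unspent(alpha_k) is measured, so this sketch assumes each tier reports how much of its available budget it consumed.

```python
from typing import Callable, Dict, List, Tuple


def run_gatekeeping(budgets: List[float],
                    run_tier: Callable[[int, float], Tuple[bool, float]]
                    ) -> List[Dict[str, object]]:
    """Run tiers L1, L2, ... in order, rolling unspent budget into the next tier."""
    carry = 0.0  # unspent budget inherited from the previous passing tier
    log: List[Dict[str, object]] = []
    for k, alpha_k in enumerate(budgets, start=1):
        available = alpha_k + carry  # alpha_{k} plus unspent(alpha_{k-1})
        passed, spent = run_tier(k, available)
        spent = min(max(spent, 0.0), available)  # a tier cannot overspend
        log.append({"tier": k, "available": available,
                    "spent": spent, "passed": passed})
        if not passed:
            break  # later tiers are never released
        carry = available - spent
    return log
```

Because a failing tier stops the loop, the budgets of later tiers are never released, and total spend cannot exceed the sum of the tier budgets.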
S52-25 (TOST equivalence test)
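H0: | mu - mu0 | ≥ delta_equiv; H1: | mu - mu0 | < delta_equiv.
Two one-sided tests, with mu_hat the sample estimate of mu and SE its standard error:
T1 = ( ( mu_hat - mu0 ) - ( - delta_equiv ) ) / SE, with upper-tail p-value p1 = P( T ≥ T1 ),
T2 = ( ( mu_hat - mu0 ) - ( + delta_equiv ) ) / SE, with lower-tail p-value p2 = P( T ≤ T2 );
conclude equivalence iff p1 ≤ alpha_sig and p2 ≤ alpha_sig (here p1, p2 are the TOST p-values, not the proportions of S52-21).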
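A minimal sketch of the TOST decision follows. It assumes a normal reference distribution and takes the estimated difference and its standard error as inputs; the name tost_z and this summary-statistic interface are illustrative assumptions and differ from the I50-13 signature, which takes raw samples.

```python
from statistics import NormalDist


def tost_z(diff_hat: float, se: float, delta_equiv: float,
           alpha_sig: float = 0.05) -> dict:
    """Two one-sided z-tests for equivalence within +/- delta_equiv (S52-25)."""
    std_normal = NormalDist()
    t1 = (diff_hat - (-delta_equiv)) / se  # against the lower margin
    t2 = (diff_hat - (+delta_equiv)) / se  # against the upper margin
    p1 = 1 - std_normal.cdf(t1)  # upper tail: small when diff is well above -delta
    p2 = std_normal.cdf(t2)      # lower tail: small when diff is well below +delta
    return {"p1": p1, "p2": p2,
            "equivalent": p1 <= alpha_sig and p2 <= alpha_sig}


if __name__ == "__main__":
    print(tost_z(diff_hat=0.1, se=0.2, delta_equiv=0.5))
```

For small samples a t reference distribution with the appropriate degrees of freedom would replace the normal one.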
S52-26 (Non-inferiority test)
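H0: mu_ref - mu_cand ≥ delta_noninf; H1: mu_ref - mu_cand < delta_noninf (larger metric values are better). If
P( mu_ref - mu_cand < delta_noninf ) ≥ 1 - alpha_sig, read as the one-sided 1 - alpha_sig upper confidence bound for mu_ref - mu_cand falling below delta_noninf,
or an equivalent one-sided test is significant at alpha_sig, declare non-inferiority.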
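The one-sided test in S52-26 can be sketched as follows, assuming a normal approximation and that larger metric values are better. The function name, the summary-statistic inputs, and the default alpha_sig = 0.025 are assumptions made for this example.

```python
from statistics import NormalDist


def non_inferiority_z(mean_ref: float, mean_cand: float, se_diff: float,
                      delta_noninf: float, alpha_sig: float = 0.025) -> dict:
    """One-sided z-test of H0: mu_ref - mu_cand >= delta_noninf (S52-26)."""
    std_normal = NormalDist()
    diff = mean_ref - mean_cand
    stat = (diff - delta_noninf) / se_diff
    p_value = std_normal.cdf(stat)  # lower tail: small when diff is well below the margin
    # Upper limit of the one-sided (1 - alpha_sig) confidence interval for diff.
    upper = diff + std_normal.inv_cdf(1 - alpha_sig) * se_diff
    return {"p_value": p_value, "upper_bound": upper,
            "non_inferior": p_value <= alpha_sig}


if __name__ == "__main__":
    print(non_inferiority_z(0.90, 0.89, 0.01, 0.05))
```

The p-value rule and the confidence-bound rule agree: p_value ≤ alpha_sig exactly when upper_bound ≤ delta_noninf.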
S52-27 (SPRT boundaries)
Lambda_n = Π_{i=1..n} ( f_1( x_i ) / f_0( x_i ) );
if Lambda_n ≥ A = ( 1 - beta_err ) / alpha_sig → reject H0;
if Lambda_n ≤ B = beta_err / ( 1 - alpha_sig ) → accept H0; otherwise continue sampling.
S52-28 (Significance spending functions)
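Given total budget alpha_family and per-look spends alpha_spend(t), ensure Σ_{i=1..t} alpha_spend(i) ≤ alpha_family at every look t. Example (O’Brien–Fleming type), stated for the cumulative spend at information fraction u ∈ (0, 1]:
alpha_cum^{OF}(u) = 2 - 2 * Phi( z_{1 - alpha_family/2} / sqrt(u) ),
with per-look spend alpha_spend(t) = alpha_cum^{OF}(u_t) - alpha_cum^{OF}(u_{t-1}) and alpha_cum^{OF}(u_0) = 0, so that the spends sum to alpha_cum^{OF}(1) = alpha_family.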
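As an illustration, the sketch below evaluates an O’Brien–Fleming-type spending curve in its cumulative form. It assumes the upper normal quantile z_{1 - alpha_family/2}, matching the quantile convention of S52-20, and an information fraction in (0, 1] as the argument. Per-look spends are taken as differences of the cumulative curve. The function names and the equally spaced looks are assumptions for this example.

```python
import math
from statistics import NormalDist
from typing import List


def obf_cumulative_spend(info_frac: float, alpha_family: float) -> float:
    """Cumulative Type-I error spent by information fraction info_frac in (0, 1]."""
    if not 0 < info_frac <= 1:
        raise ValueError("info_frac must lie in (0, 1]")
    std_normal = NormalDist()
    z_upper = std_normal.inv_cdf(1 - alpha_family / 2)
    return 2 - 2 * std_normal.cdf(z_upper / math.sqrt(info_frac))


def per_look_spend(info_fracs: List[float], alpha_family: float) -> List[float]:
    """Incremental spends alpha_spend(t) at increasing information fractions."""
    spends, previous = [], 0.0
    for frac in info_fracs:
        cumulative = obf_cumulative_spend(frac, alpha_family)
        spends.append(cumulative - previous)
        previous = cumulative
    return spends


if __name__ == "__main__":
    spends = per_look_spend([0.25, 0.5, 0.75, 1.0], alpha_family=0.05)
    print([round(s, 5) for s in spends], round(sum(spends), 5))
```

The curve spends very little at early looks and most of the budget near the end, and the increments sum to alpha_family when the last look is at full information.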

IV. Data & Manifest Conventions

HypothesisRegistry (minimum fields)
{hid, H0, H1, effect_size_spec, delta_equiv?, delta_noninf?, metric, tail ∈ {one, two}, alpha_sig, beta_err, power_target, assumptions}.
TestPlan.card
{design ∈ {two-sample, paired, proportion, nonparam}, n_per_group|N_min, allocation_ratio, blocking, stratification, seeds, prereg_sig: alpha_sig, prereg_beta: beta_err}.
MultiTest.family
{scope, members[hid], control ∈ {BH, Holm, Bonferroni, gatekeeping}, q_star|alpha_family, dependency_assumption}.
SeqTest.rule
{type ∈ {SPRT, alpha-spending}, params{A,B|alpha_spend(•)}, stop ∈ {accept, reject, maxN}, monitoring_window}.
Provenance outputs
Each run emits {p_table.csv, adj_p.csv, decision.log, power_check.json, ci_table.csv, alpha_budget.yaml, hash(•), fingerprint}.

V. Algorithms & Implementation Bindings

Mapping to I50-*
Multiple testing: I50-6 sequential_test (when type = alpha-spending), I50-9 gate_release (consuming FDR/FWER reports & evidence bundle).
Statistical computation extensions
- I50-11 adjust_pvalues(p:list, method:str, q_or_alpha:float) -> {p_adj:list, reject:list}
- I50-12 plan_sample_size(spec:dict) -> {n_per_group:int, power:float}
- I50-13 tost_equivalence(x:any, y:any, delta_equiv:float, alpha_sig:float) -> Verdict
Reference flow (BH step-up)
- Input p[1..m], q_star; sort to p_(i).
- Compute thresholds tau_i = ( i / m ) * q_star.
- k = max{ i : p_(i) ≤ tau_i }, with k = 0 if no i qualifies; set reject[1..k] = true, others false.
- Produce adjusted p-values:
  p_adj_(i) = min_{j ≥ i} ( m / j ) * p_(j), then map back to original indices.
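The BH flow above can be sketched in plain Python as follows. The function name and return layout are illustrative and only loosely mirror the I50-11 signature.

```python
from typing import List, Tuple


def benjamini_hochberg(p: List[float], q_star: float) -> Tuple[List[float], List[bool]]:
    """BH step-up: return (p_adj, reject) in the original order of p."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])  # original indices, ascending p

    # Largest 1-based rank k with p_(k) <= (k / m) * q_star; k = 0 if none qualifies.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p[idx] <= rank / m * q_star:
            k = rank

    p_adj = [0.0] * m
    reject = [False] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum covers all j >= i.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, m / rank * p[idx])
        p_adj[idx] = running_min  # written at the original index
        reject[idx] = rank <= k
    return p_adj, reject


if __name__ == "__main__":
    print(benjamini_hochberg([0.001, 0.2, 0.03, 0.5], q_star=0.05))
```

A useful consistency check is that reject[i] holds exactly when p_adj[i] ≤ q_star.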
Reference flow (Holm step-down)
- Sort p_(i); for i = 1..m, test
  p_(i) ≤ alpha_sig / ( m - i + 1 ).
- If the first failure occurs at i*, reject {1..i*-1} and accept {i*..m}; if none fail, reject {1..m}.
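A corresponding sketch of the Holm step-down flow, again in plain Python with an illustrative function name:

```python
from typing import List


def holm_step_down(p: List[float], alpha_sig: float) -> List[bool]:
    """Holm step-down: return reject flags in the original order of p."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])  # original indices, ascending p
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if p[idx] <= alpha_sig / (m - rank + 1):
            reject[idx] = True
        else:
            break  # first failure: this and all larger p-values are retained
    return reject


if __name__ == "__main__":
    print(holm_step_down([0.01, 0.04, 0.03, 0.005], alpha_sig=0.05))
```

On the same inputs Holm rejects no more hypotheses than BH, reflecting the stricter FWER criterion.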
Reference flow (SPRT)
- Initialize A, B; update Lambda_n per observation.
- If Lambda_n ≥ A → reject; if Lambda_n ≤ B → accept; if n ≥ N_cap without crossing either boundary → stop = maxN and GateDecision = hold.
- Output {decision, n_used, alpha_spent ≈ P_H0( reject )}.
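The SPRT flow can be illustrated for a simple-vs-simple Bernoulli problem (H0: p = p0 against H1: p = p1). The Bernoulli model, the function name, and the use of a log-likelihood ratio for numerical stability are choices made for this sketch; the boundaries are those of S52-27.

```python
import math
from typing import Dict, Iterable


def sprt_bernoulli(xs: Iterable[int], p0: float, p1: float,
                   alpha_sig: float, beta_err: float, n_cap: int) -> Dict[str, object]:
    """Wald SPRT on 0/1 observations, tracked as log Lambda_n."""
    log_a = math.log((1 - beta_err) / alpha_sig)  # upper boundary, log A
    log_b = math.log(beta_err / (1 - alpha_sig))  # lower boundary, log B
    llr = 0.0
    n = 0
    for x in xs:
        n += 1
        # Add log( f_1(x) / f_0(x) ) for one Bernoulli observation.
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= log_a:
            return {"decision": "reject", "n_used": n}
        if llr <= log_b:
            return {"decision": "accept", "n_used": n}
        if n >= n_cap:
            break
    # Cap reached, or data exhausted, without crossing a boundary.
    return {"decision": "hold", "n_used": n}


if __name__ == "__main__":
    print(sprt_bernoulli([1] * 10, p0=0.5, p1=0.8,
                         alpha_sig=0.05, beta_err=0.2, n_cap=100))
```

Working in log space avoids underflow or overflow of the product Lambda_n over long monitoring windows.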

VI. Metrology Flows & Run Diagram

Mx-59 Sample-size planning & pre-registration
From effect_size_spec, alpha_sig, beta_err compute n_per_group; produce TestPlan.card and alpha_budget.yaml; freeze seeds and analysis script hashes.
Mx-60 Multiple testing & family-wise control
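Define families and hierarchies; choose BH/Holm/gatekeeping; output adj_p.csv and decision.log. FDR and FWER are expectations and probabilities, not quantities observed in a single run, so the gate acts on estimates: if the estimated FDR exceeds q_star or the estimated FWER exceeds alpha_family (for example, in the null-simulation calibration of Section VII), set GateDecision = hold.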
Mx-61 Sequential/online testing with gating
Configure SeqTest.rule and monitoring windows; run I50-6 sequential_test; integrate with TS.error / TS.latency, and upon block/hold record stopping evidence and cumulative Σ alpha_spend.

VII. Verification & Test Matrix

Type-I calibration (null simulations)
- Under H0, repeat B times (B ≥ 10^4) to estimate P( reject ); require | P( reject ) - alpha_sig | ≤ tau_calib.
- In multiple-testing settings, estimate FDR/FWER; verify they do not exceed budgets.
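A null-simulation calibration check of this kind can be sketched as follows for the two-sample z-test with known sigma = 1. The function name, the per-group size of 20, the seed, and any tolerance compared against the result are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def estimate_type1_rate(n_per_group: int = 20, alpha_sig: float = 0.05,
                        b_reps: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P( reject ) under H0 for a two-sided z-test."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    z_crit = NormalDist().inv_cdf(1 - alpha_sig / 2)
    se = math.sqrt(2 / n_per_group)  # known sigma = 1 in both groups
    rejections = 0
    for _ in range(b_reps):
        # Both groups share the same mean, so H0 is true by construction.
        mean_x = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        mean_y = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        if abs((mean_x - mean_y) / se) >= z_crit:
            rejections += 1
    return rejections / b_reps


if __name__ == "__main__":
    rate = estimate_type1_rate()
    print(rate, abs(rate - 0.05))
```

With B = 10^4 replicates the Monte Carlo standard error of the estimate is about 0.002 at alpha_sig = 0.05, which bounds how tight tau_calib can sensibly be set.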
Power & sample-size backchecks
- Under H1, estimate power_hat; require power_hat ≥ power_target - tau_power.
- CI coverage: two-sided 1 - alpha_sig intervals cover at 1 - alpha_sig ± tau_cov.
Sequential robustness
Optional stopping / data-peeking simulations: under alpha_spend constraints, verify no Type-I inflation; compare expected sample size of SPRT against N_cap.
Assumption checks & robustness
When normality/homoscedasticity fail, use permutation or bootstrap for p_value and CIs; record deviations.

VIII. Cross-References & Dependencies

Depends on: Core.Metrology (metrics & confidence), Core.Errors (error types & thresholds), Core.DataSpec (data conventions).
Cross-links: Chapter 3 (postulates; power, FDR, sequential tests), Chapter 8 (uncertainty; tests for ECE/MCE/NLL), Chapter 9 (online gating; linkage to GateDecision and budget spending).

IX. Risks, Limitations & Open Questions

Risks & limitations
BH can lose its FDR guarantee when p-values are dependent in ways other than positive dependence; test assumptions can break under distribution shift; scanning many metrics introduces implicit multiplicity; uncontrolled optional stopping inflates Type-I error; asymptotic approximations are unreliable under extreme sparsity.
Open questions
Investment-style online FDR fused with gatekeeping; cross-domain/device calibration of a shared alpha_budget; precise power analysis for complex metrics such as ΔECE/ΔNLL.

X. Deliverables & Versioning

Deliverables
HypothesisRegistry.json, TestPlan.card, alpha_budget.yaml, p_table.csv, adj_p.csv, decision.log, power_check.json, ci_table.csv, SeqTest.rule, SeqTest.log, Evidence.bundle (with hash(•) and fingerprint).
Versioning policy
- Adjusting alpha_sig / beta_err / power_target or the family-control method → minor bump; changing the significance-budgeting or sequential rules → major bump.
- All changes require updated signatures and Appendix C history entries.

Copyright & License: Unless otherwise stated, the copyright of “Energy Filament Theory” (including text, charts, illustrations, symbols, and formulas) is held by the author (屠广林).
License (CC BY 4.0): With attribution to the author and source, you may copy, repost, excerpt, adapt, and redistribute.
Attribution (recommended): Author: 屠广林｜Work: “Energy Filament Theory”｜Source: energyfilament.org｜License: CC BY 4.0
Call for verification: Independent and self-funded—no employer and no sponsorship. Next, we will prioritize venues that welcome public discussion, public reproduction, and public critique, with no country limits. Media and peers worldwide are invited to organize verification during this window and contact us.
Version info: First published: 2025-11-11 ｜ Current version: v6.0+5.05