EFT.WP.Methods.SynthData v1.0
Chapter 4 — Generation Engines I: Statistical & Explicit Models (Copula/GLM/Rules)
I. Scope & Objects
- Goals
- Implement controllable and auditable synthetic data generation via statistical and explicit modeling, covering tabular data, counts, proportions, and partially sequential events.
- Express joint dependence with copulas, conditional distributions with GLMs, and enforce business/physical consistency with rules and constraints.
- Inputs
Normalized data D_ref (see Chapter 3), canonical schema SRef, constraint sets Rules and Constraints, time/path anchors tau_mono, ts, gamma(ell).
- Outputs
- Statistical generation engine engine_stat, including marginals, dependence copula, conditional models glm_family, and compiled ruleset.
- Synthetic samples D_syn, evaluation & contract report report_stat, and manifest manifest.synth.stat.*.
- Applicability
Prefer this engine for medium dimensionality and medium sample sizes, and for compliance-critical scenarios. For extremely high dimensions or complex textures, defer to Chapter 5 (deep generators) or Chapter 6 (physics/scene methods).
II. Terms & Variables
- Data & distributions: x = (x_1,...,x_d), F_i, f_i, u_i = F_i(x_i), C(u; psi).
- Copula parameters: R (correlation for Gaussian copula), nu (degrees of freedom for t copula), tau_K (Kendall’s tau).
- GLM: X ∈ R^{N×p}, y, eta = X beta, g(mu) = eta, mu = E[y|X], phi (dispersion).
- Rules & constraints: A x ≤ b, h(x) = 0, rule_k, domain_k, discrete constraints x_j ∈ enum_j.
- Time & arrival: tau_mono, ts, T_arr, gamma(ell), delta_form, offset/skew/J.
- Other: seed, rng, w (weights), pi0 (zero inflation), U = k * u_c (expanded uncertainty).
III. Axioms P404-*
- P404-1 (Explicit Measures): Decompose the joint via a copula and marginals; declare domains and measures explicitly.
- P404-2 (Dimensional Conservation): unit(x), dim(x) are mandatory; check_dim(expr) must pass.
- P404-3 (Separated Dependence): Dependence is carried solely by the copula; marginal transforms preserve ranks.
- P404-4 (GLM Family First): Prefer exponential-family models with canonical links; handle over-dispersion with NB or quasi-likelihood.
- P404-5 (Constraints First): Prefer constructive satisfaction of constraints over rejection sampling; if rejection is used, record the acceptance rate.
- P404-6 (Time-Base Consistency): Generate on tau_mono; publish mapped ts and record offset/skew/J.
- P404-7 (Dual Arrival-Time Forms): When T_arr is involved, compute both formulations and emit delta_form.
- P404-8 (Reproducibility): Persist seed/rng in the manifest to guarantee reproducibility.
- P404-9 (Privacy Upfront): If the engine is DP-enabled (DP(eps, delta)), account for the privacy budget here and propagate it through to publication.
- P404-10 (Stability): Provide convergence criteria and fallback strategies for all optimization/fitting steps.
IV. Minimal Equations S404-*
- S404-1 (Sklar Decomposition)
p(x) = c(u; psi) * ∏_{i=1}^d f_i(x_i), where u_i = F_i(x_i).
- S404-2 (Gaussian Copula Density)
- z_i = Phi^{-1}(u_i), c_R(u) = |R|^{-1/2} * exp( - 0.5 * z^T * ( R^{-1} - I ) * z ).
- rho = sin( ( pi / 2 ) * tau_K ) (maps Kendall’s tau_K to the off-diagonal entries rho of R; the relation is exact for Gaussian and t copulas).
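As a minimal sketch of S404-2, the following evaluates the Gaussian copula density for the bivariate case only, where R = [[1, rho], [rho, 1]] and the quadratic form can be expanded by hand. The function names are illustrative assumptions, not part of the I40-* bindings.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for Phi^{-1}


def tau_to_rho(tau_k: float) -> float:
    """Map Kendall's tau to the copula correlation: rho = sin((pi / 2) * tau_K)."""
    return math.sin(0.5 * math.pi * tau_k)


def gaussian_copula_density_2d(u1: float, u2: float, rho: float) -> float:
    """Evaluate c_R(u) from S404-2 for R = [[1, rho], [rho, 1]]."""
    if not (0.0 < u1 < 1.0 and 0.0 < u2 < 1.0):
        raise ValueError("u1 and u2 must lie strictly inside (0, 1)")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly inside (-1, 1)")
    z1 = _STD_NORMAL.inv_cdf(u1)
    z2 = _STD_NORMAL.inv_cdf(u2)
    det_r = 1.0 - rho * rho  # |R| in the 2x2 case
    # z^T (R^{-1} - I) z, expanded by hand for the 2x2 case
    quad = (rho * rho * (z1 * z1 + z2 * z2) - 2.0 * rho * z1 * z2) / det_r
    return math.exp(-0.5 * quad) / math.sqrt(det_r)
```

With rho = 0 the density is 1 everywhere (independence), which gives a quick sanity check on the formula.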
- S404-3 (Sampling Steps)
- u ~ C(u; psi)
- x_i = F_i^{-1}( u_i )
- For discrete fields x_j, apply quantize_to_enum( x_j, enum_j ) and resolution constraints.
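The three sampling steps of S404-3 can be sketched as follows for two fields. The exponential and uniform marginals, the nearest-value reading of quantize_to_enum, and all other names are assumptions made for illustration; the source does not specify them.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()
_EPS = 1e-12  # keeps u strictly inside (0, 1) before inverting marginals


def _clamp_unit(u):
    return min(max(u, _EPS), 1.0 - _EPS)


def sample_gaussian_copula_2d(n, rho, rng):
    """Step 1: draw n pairs u ~ C(u; R) for a bivariate Gaussian copula."""
    pairs = []
    for _ in range(n):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1 = e1
        z2 = rho * e1 + math.sqrt(1.0 - rho * rho) * e2  # Cholesky factor of R
        pairs.append((_clamp_unit(_STD_NORMAL.cdf(z1)),
                      _clamp_unit(_STD_NORMAL.cdf(z2))))
    return pairs


def exponential_ppf(u, rate):
    """Step 2: inverse CDF F^{-1}(u) of an exponential marginal (assumed)."""
    return -math.log(1.0 - u) / rate


def quantize_to_enum(x, enum_values):
    """Step 3: snap x to the nearest allowed value (hypothetical semantics)."""
    return min(enum_values, key=lambda v: abs(v - x))


def sample_rows(n, rho, seed):
    """Run u -> x for one continuous field and one discrete field."""
    rng = random.Random(seed)
    rows = []
    for u1, u2 in sample_gaussian_copula_2d(n, rho, rng):
        x1 = exponential_ppf(u1, rate=0.5)
        x2 = quantize_to_enum(10.0 * u2, enum_values=(0, 5, 10))  # uniform(0, 10) marginal
        rows.append((x1, x2))
    return rows
```

Calling sample_rows twice with the same seed returns identical rows, which is the behavior P404-8 asks for.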
- S404-4 (GLM Base Forms)
g( mu ) = X beta, mu = E[y|X]; canonical families:
- Bernoulli: mu = 1 / ( 1 + exp( - X beta ) ).
- Poisson: mu = exp( X beta ).
- NegativeBinomial: Var(y|X) = mu + kappa * mu^2.
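The base forms of S404-4 reduce to a few one-line functions. This is a sketch for a single row of X; the helper names are assumptions, and fitting beta is out of scope here.

```python
import math


def linear_predictor(x_row, beta):
    """eta = X beta for a single row of the design matrix."""
    return sum(x * b for x, b in zip(x_row, beta))


def bernoulli_mean(eta):
    """Inverse logit link, written to avoid overflow for large |eta|."""
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def poisson_mean(eta):
    """Inverse log link: mu = exp(eta)."""
    return math.exp(eta)


def negbin_variance(mu, kappa):
    """Var(y|X) = mu + kappa * mu^2; kappa = 0 recovers the Poisson variance."""
    return mu + kappa * mu * mu
```

For example, with mu = 2 and kappa = 0.5 the negative binomial variance is 4, twice the Poisson variance at the same mean.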
- S404-5 (Zero-Inflated Mixture)
p(y) = pi0 * 1[y = 0] + ( 1 - pi0 ) * p_base( y ; mu, ... ).
- S404-6 (Constraint Projection)
x_proj = argmin || x - x0 ||_2 s.t. A x ≤ b , h(x) = 0 (constructive satisfaction).
- S404-7 (Time Mapping & Arrival Time)
- ts = map_tau_to_ts( tau_mono ; offset, skew );
- delta_form = | ( 1 / c_ref ) * ( ∫ n_eff d ell ) - ( ∫ ( n_eff / c_ref ) d ell ) |.
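A minimal numerical sketch of S404-7 follows. It assumes an affine clock model for map_tau_to_ts (the source does not give its form) and evaluates both arrival-time formulations with the trapezoid rule on a sampled path. For a constant c_ref the two forms agree analytically, so delta_form here measures only floating-point discrepancy.

```python
def map_tau_to_ts(tau_mono, offset, skew):
    """Assumed affine clock model: ts = offset + (1 + skew) * tau_mono."""
    return offset + (1.0 + skew) * tau_mono


def trapezoid(ys, xs):
    """Trapezoid-rule integral of samples ys over abscissae xs."""
    if len(ys) != len(xs) or len(xs) < 2:
        raise ValueError("need at least two matching samples")
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))


def arrival_time_forms(n_eff, ell, c_ref):
    """Return both T_arr formulations and delta_form = |T_outside - T_inside|."""
    t_outside = trapezoid(n_eff, ell) / c_ref                # (1 / c_ref) * integral of n_eff
    t_inside = trapezoid([n / c_ref for n in n_eff], ell)    # integral of (n_eff / c_ref)
    return t_outside, t_inside, abs(t_outside - t_inside)


# Illustrative values only; c_ref and the path samples are assumptions.
ell = [0.0, 1.0, 2.0, 3.0]
n_eff = [1.0, 1.1, 1.2, 1.3]
t_out, t_in, delta_form = arrival_time_forms(n_eff, ell, c_ref=3.0e8)
```

The resulting delta_form is the quantity that C40-416 compares against tol_Tarr.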
- S404-8 (Marginal Goodness-of-Fit)
KS_i = sup_x | F_i(x) - F_i^{syn}(x) |; AD_i and CvM_i optional.
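When both F_i and F_i^{syn} are estimated empirically, KS_i in S404-8 becomes the two-sample Kolmogorov-Smirnov statistic. The sketch below uses right-continuous empirical CDFs; the function name is an assumption.

```python
from bisect import bisect_right


def ks_statistic(ref, syn):
    """sup_x |F_ref(x) - F_syn(x)| between two empirical CDFs."""
    ref_sorted, syn_sorted = sorted(ref), sorted(syn)
    n, m = len(ref_sorted), len(syn_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both ECDFs are step functions, so the supremum is attained at a sample point.
    for x in ref_sorted + syn_sorted:
        f_ref = bisect_right(ref_sorted, x) / n
        f_syn = bisect_right(syn_sorted, x) / m
        d = max(d, abs(f_ref - f_syn))
    return d
```

Identical samples give 0 and fully separated samples give 1, so the value can be compared directly against tol_KS_i.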
V. Metrology Flow M40-4 (Statistical/Explicit Generation)
- Ready
Select fields F_gen, constraints Constraints and Rules; lock unit/dim and time/arrival fields.
- Fit Marginals
For each x_i, fit F_i and f_i (parametric, KDE, or quantile splines); record uncertainty U_i.
- Fit Dependence
Estimate tau_K or the rank correlation matrix; fit the copula (Gaussian/t/vine), yielding psi or R.
- Conditional Models
For specified y, fit GLM or zero-inflated/hurdle models; choose link g and family; check for over-dispersion.
- Sampling & Rules
Sample u -> x, synthesize y; run the rule compiler over A x ≤ b, h(x)=0, and enumerations; use projection or minimal rejection as needed.
- Time-Base & Arrival
Generate on tau_mono, map to ts; when T_arr is present, compute both formulations and write delta_form.
- Validation & Persistence
Compute KS_i, tau_K gaps, mean/variance alignment errors, acceptance rate; execute contracts C40-41x; emit D_syn and manifest.synth.stat.
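For the Conditional Models step, the zero-inflated mixture of S404-5 can be sketched with a Poisson base distribution. The Poisson choice, the Knuth-style sampler (suitable for small mu only), and the function names are assumptions made for this example.

```python
import math
import random


def zip_pmf(y, pi0, mu):
    """p(y) = pi0 * 1[y = 0] + (1 - pi0) * Poisson(y; mu)."""
    base = math.exp(-mu) * mu ** y / math.factorial(y)
    return pi0 * (1.0 if y == 0 else 0.0) + (1.0 - pi0) * base


def sample_poisson(mu, rng):
    """Knuth's multiplication method; adequate for small mu."""
    threshold = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_zip(n, pi0, mu, seed):
    """Draw a structural zero with probability pi0, else a Poisson count."""
    rng = random.Random(seed)
    return [0 if rng.random() < pi0 else sample_poisson(mu, rng)
            for _ in range(n)]
```

Comparing the empirical zero fraction of sample_zip against zip_pmf(0, pi0, mu) is a simple calibration check before running the C40-414 contract.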
VI. Contracts & Assertions C40-41x
- C40-411 (Marginal Consistency): For all i, KS_i ≤ tol_KS_i, and mean/variance deviations stay within thresholds.
- C40-412 (Dependence Consistency): | tau_K^{real} - tau_K^{syn} | ≤ tol_tau; if using R, then || R_real - R_syn ||_F ≤ tol_R.
- C40-413 (Discrete & Range Compliance): x_j ∈ enum_j and within range_j; violation rate ≤ tol_range.
- C40-414 (GLM Calibration): cal_error ≤ tol_cal; over-dispersion ratio phi/phi_ref ≤ tol_phi.
- C40-415 (Rule Satisfaction): Constraint satisfaction ≥ sat_min; rejection rate ≤ rej_max.
- C40-416 (Time & Arrival): non_decreasing(tau_mono), J ≤ J_max, delta_form ≤ tol_Tarr.
- C40-417 (Units/Dimensions): check_dim(expr)=true; unit mappings complete.
- C40-418 (Reproducibility & Signature): seed/rng persisted; signature valid.
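To illustrate how S404-6 feeds C40-415, the sketch below covers the simplest case of a single linear inequality a·x ≤ b, for which the Euclidean projection has a closed form. General A x ≤ b with h(x) = 0 would need a quadratic-programming solver, which is not shown. The function names are assumptions.

```python
def _dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))


def project_halfspace(x0, a, b):
    """Euclidean projection of x0 onto {x : a.x <= b} (one inequality only)."""
    norm_sq = _dot(a, a)
    if norm_sq == 0.0:
        raise ValueError("constraint normal a must be non-zero")
    excess = _dot(a, x0) - b
    if excess <= 0.0:
        return list(x0)  # already feasible, leave unchanged
    step = excess / norm_sq
    return [xi - step * ai for xi, ai in zip(x0, a)]


def satisfaction_rate(rows, a, b, tol=1e-9):
    """Fraction of rows satisfying a.x <= b, for comparison against sat_min."""
    return sum(1 for x in rows if _dot(a, x) <= b + tol) / len(rows)


rows = [[1.0, 1.0], [3.0, 4.0], [0.0, 5.0], [6.0, 2.0]]
a, b = [1.0, 1.0], 5.0
rate_before = satisfaction_rate(rows, a, b)
projected = [project_halfspace(x, a, b) for x in rows]
rate_after = satisfaction_rate(projected, a, b)
```

In project mode no samples are discarded, so the rejection rate stays at zero while the satisfaction rate rises to 1 for this constraint.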
VII. Implementation Bindings I40-* (Interfaces & Invariants)
- I40-41 fit_marginals(ds, spec) -> marginals
- I40-42 fit_engine_copula(ds, family, opts) -> engine_copula (where family ∈ {gaussian, t, vine})
- I40-43 sample_copula(engine_copula, marginals, n, seed) -> df_u2x
- I40-44 fit_engine_glm(ds, formula, family, link) -> model_glm
- I40-45 sample_glm(model_glm, X_new|X_syn, seed) -> y_syn
- I40-46 compile_rules(SRef, constraints, policies) -> ruleset
- I40-47 enforce_rules(ds_syn, ruleset, mode) -> ds_syn' (where mode ∈ {project, reject})
- I40-48 timepath_align_for_synth(ds_syn, ref) -> ds_syn' (write offset/skew/J, T_arr, delta_form)
- I40-49 evaluate_stat_contracts(real, syn, rules) -> report_stat
- I40-4A emit_stat_manifest(artifacts) -> manifest.synth.stat
- Invariants: reproducible(seed); R is a valid correlation matrix (symmetric, positive semidefinite, unit diagonal, hence |R_ij| ≤ 1); sat_rate ≥ sat_min; rej_rate ≤ rej_max; delta_form ≤ tol_Tarr; unit/dimension conservation holds.
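The reproducible(seed) invariant and the seed/rng persistence required by P404-8 can be illustrated with a toy generator and a manifest record. The field names and the helper build_manifest_sketch are hypothetical; the real manifest.synth.stat schema and the emit_stat_manifest signature are not defined in this chapter.

```python
import hashlib
import json
import random


def generate(n, seed):
    """Toy generator: n uniform draws from a seeded Mersenne Twister."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def build_manifest_sketch(samples, seed):
    """Record what is needed to re-run and verify the draw (hypothetical fields)."""
    payload = json.dumps(samples).encode("utf-8")
    return {
        "seed": seed,
        "rng": "python.random (Mersenne Twister)",
        "n_rows": len(samples),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


manifest_a = build_manifest_sketch(generate(100, seed=7), seed=7)
manifest_b = build_manifest_sketch(generate(100, seed=7), seed=7)
manifest_c = build_manifest_sketch(generate(100, seed=8), seed=8)
```

Re-running with the persisted seed reproduces the same digest, so a mismatch between stored and recomputed sha256 signals a reproducibility failure.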
VIII. Cross-References
- Methods.SynthData v1.0: Chapter 5 (deep generation), Chapter 6 (physics/scene graphs) for high-dimensional or strongly constrained cases.
- Methods.Cleaning v1.0: Chapter 4 (units & dimensions), Chapter 5 (timeline), Chapter 6 (arrival time).
- Methods.CrossStats v1.0: Chapter 4 (estimation & intervals), Chapter 7 (drift & alignment) for evaluation and baseline refresh.
- Methods.Imaging v1.0: Chapter 5 (PSF/OTF/MTF) when rules involve optical and radiometric calibration.
IX. Quality Metrics & Risk Control
- Metrics
- Marginals: {KS_i, AD_i, CvM_i}, mean/variance deviations, category coverage.
- Dependence: tau_K gaps, R error, tail joint-probability error.
- Generation efficiency: accept_rate, latency_p99_ms, throughput_qps.
- Compliance module: delta_form, J, psi (for post-deployment drift monitoring).
- Risk policies
- Dependence bias: switch to vine copula or increase flexibility in rank-correlation fitting.
- Over-dispersion: switch from Poisson to NB or add random effects (see Chapter 12).
- Constraint shortfall: prefer projection mode or constructive sampling; if needed, downgrade release.
- Time/arrival anomalies: review map_tau_to_ts and medium parameters; block publication until delta_form passes.
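The over-dispersion policy can be made concrete with the Pearson dispersion estimate for a Poisson fit, phi_hat = sum((y_i - mu_i)^2 / mu_i) / (n - p). The sketch below is illustrative; the threshold phi_max and the function names are assumptions, not values from the contracts.

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by residual degrees of freedom (Poisson variance)."""
    dof = len(y) - n_params
    if dof <= 0:
        raise ValueError("need more observations than fitted parameters")
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / dof


def choose_count_family(phi_hat, phi_max=1.5):
    """Switch to NB when the estimated dispersion exceeds phi_max (assumed threshold)."""
    return "negative_binomial" if phi_hat > phi_max else "poisson"


phi_hat = pearson_dispersion([0, 4, 0, 4], [2.0, 2.0, 2.0, 2.0], n_params=1)
family = choose_count_family(phi_hat)
```

A value of phi_hat near 1 is consistent with the Poisson variance assumption; values well above 1 indicate over-dispersion.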
Summary
- This chapter delivers a closed loop for statistical & explicit generation: P404-* as guardrails, S404-* as computable bases, M40-4 as the process spine, plus C40-41x contracts and I40-* bindings.
- The outputs (engine_stat, D_syn, report_stat, and manifest.synth.stat.*) form a reproducible, auditable foundation for subsequent deep/physics methods and release freeze.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/