EFT.WP.Methods.SynthData v1.0
Chapter 3 — Data Models & Schema Binding (Schema/SRef)
I. Scope & Objects
- Goals
- Establish a unified schema SRef and a registration system schema_reg for synthetic data, completing schema binding from heterogeneous D_real to normalized D_ref.
- Define field-level and relationship-level constraints (keys, units & dimensions, nullability, enumerations, ranges, resolution, time/path semantics) to provide a single authoritative convention for generation, evaluation, and publication.
- Inputs
- Source schemas & samples: schemas_raw = {schema_i}, D_real.
- Constraints & policies: Rules, policy.units, policy.nulls, policy.enums.
- Time/path anchors: tau_mono, ts, gamma(ell).
- Outputs
- Unified SRef, alias map alias_map, unit/dimension declarations unit/dim, relationships & index set {pk,fk,idx_k}.
- Bound dataset D_ref, validation report report_schema, and manifest manifest.synth.schema.*.
- Boundaries & Assumptions
- No implicit field inference is accepted; derived fields must publish lineage and hash_sha256(blob).
- Any field involving arrival time must include both formulations and delta_form (see Chapter 2, P402-5).
II. Terms & Variables
- Schemas & registry: SRef (canonical schema), schema_reg (registry), alias_map : name_src -> name_ref.
- Field metadata: name, role, dtype, unit(x), dim(x), nullable ∈ {0,1}, enum, range = [lo,hi], resolution Δx.
- Keys & relations: pk, fk(parent.child), idx_k, cardinality ∈ {1:1, 1:N, N:M}.
- Time & path: ts, tau_mono, T_arr, gamma(ell), offset/skew/J.
- Retention & traceability: rid, sid, pid, TraceID, signature.
- Multimodal binding: bundle = {tabular, image, text, audio, graph}, view_id.
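The field metadata above can be pictured as one record per field. The following is a minimal Python sketch, not a definition from this specification: the class name FieldSpec, the method violations, and the example field temp are hypothetical. The sketch checks nullability, enumeration, and range only; resolution Δx is stored but not enforced.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """One SRef field entry (hypothetical container, not defined by the spec)."""
    name: str
    role: str
    dtype: str
    unit: str
    dim: str
    nullable: int = 0                             # nullable ∈ {0,1}
    enum: Optional[Sequence[str]] = None          # allowed values, if enumerated
    range: Optional[Tuple[float, float]] = None   # [lo, hi], inclusive
    resolution: Optional[float] = None            # Δx, stored but not checked here

    def violations(self, value) -> list:
        """Return the names of the constraints that `value` violates."""
        if value is None:
            return [] if self.nullable == 1 else ["nullable"]
        found = []
        if self.enum is not None and value not in self.enum:
            found.append("enum")
        if self.range is not None:
            lo, hi = self.range
            if not (lo <= value <= hi):
                found.append("range")
        return found


# Illustrative field; the values are made up for the example.
temp = FieldSpec(name="temp", role="measure", dtype="float64",
                 unit="K", dim="temperature", range=(0.0, 1000.0))
print(temp.violations(293.15))  # []
print(temp.violations(-5.0))    # ['range']
print(temp.violations(None))    # ['nullable']
```

Returning a list of violated constraint names, rather than raising on the first failure, is a design choice of this sketch so that a validation report can collect every problem in one pass.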
III. Axioms P403-* (Non-Negotiables for Schema/SRef)
- P403-1 (Single Source of Truth): SRef is the SSOT for fields and relations; all implementations follow it and are versioned.
- P403-2 (Keys & Relations): unique(pk) holds; foreign_key integrity holds with no orphans; cardinality is explicit and enforced.
- P403-3 (Units & Dimensions): Any field entering computation declares unit(x), dim(x) and passes check_dim(expr).
- P403-4 (Time & Arrival): Evaluate windows on tau_mono and publish on ts; fields for arrival time must expose both formulations and delta_form.
- P403-5 (Naming & Collisions): Forbidden collisions: do not mix T_fil with T_trans; strictly distinguish n from n_eff; normalize aliases via alias_map.
- P403-6 (Nullability & Missingness): Missingness must be flagged with m ∈ {0,1}; implicit fills are disallowed (see Chapter 7).
- P403-7 (Multimodal Consistency): Align view_id across views; cross-modal fields follow a shared primary key or an explicit mapping.
- P403-8 (Traceability): Publish lineage, hash_sha256(blob), and signature in the manifest.
- P403-9 (Compatibility & Closure): SRef version changes are backward-compatible or accompanied by a migration map.
- P403-10 (Privacy First): When sensitive fields exist, de-identification/minimization must be applied at the schema layer (see Chapter 10 on privacy).
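To make P403-3 concrete, the sketch below shows one possible way to represent dim(x) and to test whether terms may be added or subtracted. It is an assumption-laden illustration: the three-base-dimension model (length, mass, time), the table UNIT_DIM, and the function names dim_mul and check_dim_sum are hypothetical, and the sketch handles only sums of terms rather than arbitrary expressions as check_dim(expr) would.

```python
from typing import Dict, Tuple

# Exponents of (length, mass, time); a simplifying assumption for this sketch.
Dim = Tuple[int, int, int]

# Hypothetical unit table: unit symbol -> dimension exponents.
UNIT_DIM: Dict[str, Dim] = {
    "m": (1, 0, 0),
    "km": (1, 0, 0),
    "kg": (0, 1, 0),
    "s": (0, 0, 1),
    "m/s": (1, 0, -1),
}


def dim_mul(a: Dim, b: Dim) -> Dim:
    """Dimension of a product: exponents add component-wise."""
    return tuple(x + y for x, y in zip(a, b))


def check_dim_sum(*units: str) -> bool:
    """True if every unit is declared and all share one dimension,
    so the corresponding terms may be added or subtracted."""
    try:
        dims = {UNIT_DIM[u] for u in units}
    except KeyError:
        return False  # undeclared unit: fail closed
    return len(dims) == 1


print(check_dim_sum("m", "km"))   # True: same dimension, units still need conversion
print(check_dim_sum("m", "s"))    # False: length and time cannot be added
print(dim_mul(UNIT_DIM["m/s"], UNIT_DIM["s"]) == UNIT_DIM["m"])  # True
```

Note that a passing dimension check does not imply matching units: m and km share a dimension but still require the unit conversion of S403-2 before they are combined.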
IV. Minimal Equations S403-* (Necessary Formulae for Schema/SRef)
- S403-1 (Schema Coverage)
- cov_schema = | F_real ∩ F_sref | / | F_sref |
- cov_req = | F_required ∩ F_real | / | F_required |, with the requirement cov_req ≥ cov_req_min.
- S403-2 (Affine Unit Conversion)
- x_SI = a * x_raw + b, where a,b are determined by the map unit(x_raw) -> unit(x_SI); assert check_dim( x_SI - ( a * x_raw + b ) ) = true.
- S403-3 (Type-Cast Loss)
- loss_cast = E[ | x - cast( x ; dtype_src -> dtype_dst ) | ], requiring loss_cast ≤ tol_cast.
- S403-4 (Time Mapping & Jitter)
- ts = map_tau_to_ts( tau_mono; offset, skew ), and measure J (jitter).
- S403-5 (Dual Arrival-Form Difference)
- delta_form = | ( 1 / c_ref ) * ( ∫ n_eff d ell ) - ( ∫ ( n_eff / c_ref ) d ell ) |.
- S403-6 (Referential Integrity)
- orphan_rate = 1 - ( | join(parent.pk = child.fk) | / | child | ), requiring orphan_rate = 0.
- S403-7 (Index Selectivity)
- sel(idx_k) = 1 - ( | distinct(idx_k) | / | D_ref | ), used to assess query and streaming join costs.
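Several of the S403 formulae reduce to a few lines of code. The sketch below is illustrative only: the function names, the in-memory representations (sets of field names, lists of key values, sampled paths), and the trapezoidal quadrature are assumptions of this example, not part of the specification. The orphan_rate sketch assumes unique(pk), so each child row matches at most one parent.

```python
from typing import Iterable, Sequence, Set


def cov_schema(f_real: Set[str], f_sref: Set[str]) -> float:
    """S403-1: share of SRef fields that are present in the real data."""
    return len(f_real & f_sref) / len(f_sref)


def cov_req(f_real: Set[str], f_required: Set[str]) -> float:
    """S403-1: share of required fields that are present in the real data."""
    return len(f_required & f_real) / len(f_required)


def to_si(x_raw: float, a: float, b: float) -> float:
    """S403-2: x_SI = a * x_raw + b."""
    return a * x_raw + b


def orphan_rate(parent_pk: Iterable, child_fk: Sequence) -> float:
    """S403-6: fraction of child rows whose fk matches no parent pk."""
    keys = set(parent_pk)
    matched = sum(1 for fk in child_fk if fk in keys)
    return 1 - matched / len(child_fk)


def selectivity(idx_values: Sequence) -> float:
    """S403-7: sel = 1 - |distinct(idx_k)| / |D_ref|."""
    return 1 - len(set(idx_values)) / len(idx_values)


def trapz(y: Sequence[float], x: Sequence[float]) -> float:
    """Trapezoidal rule over sample points (x[i], y[i])."""
    return sum((x[i + 1] - x[i]) * (y[i + 1] + y[i]) / 2
               for i in range(len(x) - 1))


def delta_form(n_eff: Sequence[float], ell: Sequence[float], c_ref: float) -> float:
    """S403-5: absolute difference between the two arrival-time formulations."""
    t_factor_outside = (1 / c_ref) * trapz(n_eff, ell)
    t_factor_inside = trapz([n / c_ref for n in n_eff], ell)
    return abs(t_factor_outside - t_factor_inside)


print(cov_schema({"a", "b", "x"}, {"a", "b", "c", "d"}))  # 0.5
print(orphan_rate([1, 2, 3], [1, 1, 2, 9]))               # 0.25
print(selectivity(["a", "a", "b", "c"]))                  # 0.25
```

With a constant c_ref the two arrival-time forms are algebraically identical, so in this sketch delta_form measures only floating-point rounding; a larger value would point to the two formulations being evaluated with different c_ref values, path samples, or quadrature.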
V. Synthesis Flow M40-3 (Schema Binding)
- Readiness
- Consolidate schemas_raw and samples; freeze retained keys {rid,sid,pid,TraceID}; draft alias_map; collect unit/dim/enum/range.
- Design SRef
- For each field define {role,dtype,unit,dim,nullable,enum,range,resolution}; declare {pk,fk,idx_k,cardinality}; annotate time/path semantics and arrival-time fields.
- Binding
- Run standardize_names and repair_units; build the m mask; normalize types and units; compute ts and both T_arr formulations.
- Validation
- Evaluate cov_schema, cov_req, loss_cast, orphan_rate, delta_form, offset/skew/J; run check_dim and key/relation assertions.
- Persistence
- Produce D_ref, report_schema, and manifest.synth.schema (including versioning, indexes, time-base, arrival-time, metrics, and signature).
- Versioning & Migration
- On version upgrades, produce migrate_map(v_k -> v_{k+1}) with compatibility proof; record rollback points and audit trail.
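The Binding step of the flow above can be sketched on plain Python rows. This is a minimal illustration under stated assumptions, not the reference implementation: rows are dictionaries, standardize_names takes an alias_map directly rather than a registry, convert_units is a hypothetical stand-in for the unit normalization performed around repair_units, and the linear clock model ts = offset + (1 + skew) * tau_mono is assumed because the specification does not state the form of map_tau_to_ts.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[float]]


def standardize_names(rows: List[Row], alias_map: Dict[str, str]) -> List[Row]:
    """Rename source fields to SRef names; names absent from alias_map are kept."""
    return [{alias_map.get(k, k): v for k, v in row.items()} for row in rows]


def build_mask(rows: List[Row], fields: Sequence[str]) -> List[Dict[str, int]]:
    """Missingness mask m ∈ {0,1} per row and field; 1 marks a missing value."""
    return [{f: int(row.get(f) is None) for f in fields} for row in rows]


def convert_units(rows: List[Row],
                  unit_map: Dict[str, Tuple[float, float]]) -> List[Row]:
    """Apply x_SI = a * x_raw + b per field; missing values stay None (no implicit fill)."""
    out = []
    for row in rows:
        new = dict(row)
        for field, (a, b) in unit_map.items():
            if new.get(field) is not None:
                new[field] = a * new[field] + b
        out.append(new)
    return out


def map_tau_to_ts(tau_mono: float, offset: float, skew: float) -> float:
    """Assumed linear clock model: ts = offset + (1 + skew) * tau_mono."""
    return offset + (1.0 + skew) * tau_mono


def bind(rows, alias_map, unit_map, fields, offset, skew):
    """Names -> mask -> units -> ts, in that order; returns (rows, mask)."""
    rows = standardize_names(rows, alias_map)
    mask = build_mask(rows, fields)
    rows = convert_units(rows, unit_map)
    for row in rows:
        row["ts"] = map_tau_to_ts(row["tau_mono"], offset, skew)
    return rows, mask


# Illustrative input: Celsius readings with one missing value.
raw = [{"temp_c": 25.0, "tau_mono": 0.0}, {"temp_c": None, "tau_mono": 1.0}]
d_ref, m = bind(raw, {"temp_c": "temp"}, {"temp": (1.0, 273.15)},
                ["temp"], offset=1000.0, skew=0.0)
print(m)  # [{'temp': 0}, {'temp': 1}]
```

The mask is built before unit conversion so that a missing value is recorded as missing rather than being touched by any later step, in line with P403-6. Two source names that map to the same SRef name would silently overwrite each other in this sketch; a real binding would need to reject that case.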
VI. Contracts & Assertions C40-31x (Schema/SRef)
- C40-311 (Coverage): cov_req ≥ cov_req_min and cov_schema ≥ cov_schema_min.
- C40-312 (Uniqueness & Referential): unique(pk) = true, all foreign_key checks pass, orphan_rate = 0.
- C40-313 (Units & Dimensions): check_dim(expr) = true, unit_map complete; loss_cast ≤ tol_cast.
- C40-314 (Time/Arrival): non_decreasing(tau_mono), delta_form ≤ tol_Tarr, J ≤ J_max.
- C40-315 (Naming Conflicts): forbid_conflict_names = true, alias_map is complete and unambiguous.
- C40-316 (Multimodal Consistency): view_id aligned within the bundle; cross-modal joins pass contract tests.
- C40-317 (Traceability & Signature): manifest.synth.schema.signature is valid; hash_sha256(blob) and TraceID archived.
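The numeric contracts above lend themselves to a single gate function. The sketch below covers C40-311 through C40-314 only and is an illustration, not the normative check: the function names, the keys of the report and limits dictionaries, and all threshold values in the example are hypothetical.

```python
from typing import Dict, List, Sequence


def non_decreasing(seq: Sequence[float]) -> bool:
    """True if each element is >= its predecessor (used for tau_mono)."""
    return all(a <= b for a, b in zip(seq, seq[1:]))


def check_contracts(report: Dict[str, object], limits: Dict[str, float]) -> List[str]:
    """Return the IDs of failed contracts; an empty list means all passed."""
    failed = []
    if (report["cov_req"] < limits["cov_req_min"]
            or report["cov_schema"] < limits["cov_schema_min"]):
        failed.append("C40-311")
    if not report["unique_pk"] or report["orphan_rate"] != 0:
        failed.append("C40-312")
    if not report["check_dim"] or report["loss_cast"] > limits["tol_cast"]:
        failed.append("C40-313")
    if (not non_decreasing(report["tau_mono"])
            or report["delta_form"] > limits["tol_Tarr"]
            or report["J"] > limits["J_max"]):
        failed.append("C40-314")
    return failed


# Illustrative thresholds and report values; none come from the specification.
limits = {"cov_req_min": 1.0, "cov_schema_min": 0.8,
          "tol_cast": 1e-6, "tol_Tarr": 1e-12, "J_max": 1e-3}
report = {"cov_req": 1.0, "cov_schema": 0.9, "unique_pk": True,
          "orphan_rate": 0.0, "check_dim": True, "loss_cast": 0.0,
          "tau_mono": [0.0, 0.5, 0.5, 1.0], "delta_form": 0.0, "J": 1e-4}
print(check_contracts(report, limits))                          # []
print(check_contracts({**report, "orphan_rate": 0.02}, limits))  # ['C40-312']
```

Collecting all failed contract IDs, instead of stopping at the first, lets report_schema list every breach in a single validation run.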
VII. Implementation Bindings I40-* (Interface Prototypes & Invariants)
- I40-31 design_synth_spec(schema, goals, constraints) -> SynthSpec
- I40-32 register_schema(SRef) -> schema_id (versioned, dependency closure)
- I40-33 standardize_names(ds, registry) -> ds'
- I40-34 repair_units(ds, policy) -> report_units
- I40-35 validate_dataset(ds, SRef, rules) -> report_schema (keys/relations/units/dimensions/time/arrival/naming conflicts)
- I40-36 bind_modalities(bundle, SRef) -> bundle' (cross-modal alignment and key consistency)
- I40-37 migrate_schema(ds, from_ver, to_ver, migrate_map) -> ds'
- I40-38 emit_schema_manifest(SRef, report) -> manifest.synth.schema
- Invariants: unique(schema_id); alias_map is acyclic and injective; sum(missing_mask) = count_nullables + violations; loss_cast ≤ tol_cast; delta_form ≤ tol_Tarr.
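The invariant that alias_map is acyclic and injective can be tested mechanically. The sketch below is one possible reading, with assumptions stated: alias_map is a plain dictionary name_src -> name_ref, injective means no two source names share a target, and acyclic means that following the map repeatedly never revisits a name (so an identity entry such as a -> a counts as a cycle). The function names are hypothetical.

```python
from typing import Dict


def is_injective(alias_map: Dict[str, str]) -> bool:
    """True if no two source names map to the same reference name."""
    targets = list(alias_map.values())
    return len(targets) == len(set(targets))


def is_acyclic(alias_map: Dict[str, str]) -> bool:
    """True if following name_src -> name_ref never returns to a visited name."""
    for start in alias_map:
        seen = {start}
        cur = alias_map[start]
        while cur in alias_map:      # cur is itself aliased: keep following
            if cur in seen:
                return False         # revisited a name: cycle
            seen.add(cur)
            cur = alias_map[cur]
    return True


print(is_acyclic({"temp_c": "temp", "T": "temp_k"}))    # True
print(is_acyclic({"a": "b", "b": "a"}))                 # False
print(is_injective({"temp_c": "temp", "tmp": "temp"}))  # False
```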
VIII. Cross-References
- Methods.Cleaning v1.0: Chapter 3 (standard inputs & schema binding), Chapter 4 (units & dimensions), Chapter 5 (timeline & synchronization), Chapter 6 (paths & arrival time), Chapter 10 (release freeze).
- Methods.Imaging v1.0: Chapter 9 (geometric calibration & registration; cross-modal key consistency).
- Methods.CrossStats v1.0: Chapter 7 (drift & alignment; schema stability under enumerated/distributional drift).
- EFT.WP.Core.DataSpec v1.0 and Core.Threads v1.0: keys, indexes, and execution-graph constraints.
IX. Quality Metrics & Risk Control
- Metric set
- cov_schema, cov_req, loss_cast, orphan_rate, enum_drift, schema_bind_latency_p99, timing.{offset,skew,J}, arrival.delta_form.
- Risk policies
- Insufficient coverage: block publication; trigger field convergence and supplemental data acquisition.
- Referential failure: roll back the binding and quarantine; open orphan-repair tickets.
- Units/dimensions failure: reject downstream generation stages; require unit_map correction.
- Arrival-time breach: review gamma(ell) and the n_eff/c_ref gauges; recompute if necessary.
- Multimodal inconsistency: downgrade to single-modal release or defer publication with an alignment plan.
Summary
- This chapter delivers a closed loop for SRef design and binding: P403-* sets the constraints, S403-* defines computable metrics, M40-3 realizes the standard flow Ready → Design → Bind → Validate → Persist → Version/Migrate, C40-31x gates each release, and I40-* provides the interfaces.
- The deliverables (normalized D_ref, SRef, alias_map, report_schema, and manifest.synth.schema.*) are the inputs on which subsequent generation engines and compliant publication depend.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/