Chapter 14 Data, Pipelines & Benchmarks
I. Abstract & Scope
This chapter defines the unified specifications and release workflows (I75-* / M75-*) for data, pipelines, and benchmarks. It covers JSON schemas and registries for dataset, model, and pipeline cards; data validation and dimensional consistency; pipeline execution and reproducible environments; benchmark suites and acceptance gates; and externally released reproducible bundles with audit trails. All symbols use English notation wrapped in backticks, and units are SI. Any ToA-related field must record both forms (T_arr^A and T_arr^B) with an explicit path gamma(ell) and measure d ell.
II. Dependencies & References
- Unified symbols & units: Chapter 2 Tab. 2-1 and P12-*.
- Kinematics & channels: Chapter 3 S20-*; reconnection/shear: Chapter 4 S30-*, Chapter 5 S40-*; comparator: Chapter 6 S45-*.
- Spectrum formation & transport: Chapter 7 S50-, Chapter 8 S52-.
- Domain branches: GRB (Chapter 10 M62-*), FRB (Chapter 11 M64-*).
- Simulation stack: Chapter 12 M70-* (products & metrics).
- Inference & falsification: Chapter 13 M72-* (evidence, masks, deliverables).
III. Normative Anchors (added in this chapter, I75-* / M75-*)
- I75-0 (Card Schemas & Registry): establish JSON Schemas for the three cards (DatasetCard, ModelCard, PipelineCard); unify required fields, Unit/Dim annotations, see anchors, {code_hash, data_hash} fields, and versioning.
- I75-1 (DataSpec & Field Constraints): every numeric column must carry unit and dim; ToA fields must store T_arr^A, T_arr^B, and delta_form in parallel.
- I75-2 (PipelineSpec & DAG): pipelines are directed acyclic graphs G=(V,E) with node types {ingest, calibrate, simulate, fit, validate, export}; nodes declare inputs/outputs/env/seed.
- I75-3 (Product Layout & Naming): standard directories: products/, metrics.json, masks/, delta_form.log, repro/, cards/; filenames include {sim_id|run_id|stamp}.
- I75-4 (Interface Prototypes):
  - export_dataset_card(ds: DataSpec) -> DatasetCard
  - run_pipeline(p: PipelineCard, cfg: SimCfg) -> ArtifactBundle
  - register_benchmark(bundle) -> BenchmarkID
- M75-1 (Ingest & Validation): validate fields/units/dimensions per DataSpec; verify hashes and integrity; emit DatasetCard.
- M75-2 (Pipeline Execution & Reproducibility): lock environment (container/dep versions/RNG seeds) and execute per PipelineCard; produce ArtifactBundle and metrics.json.
- M75-3 (Benchmarks & Acceptance): apply Chapter 12 metrics/thresholds; run regression; if metrics meet gates, enqueue for release.
- M75-4 (Audit & Archival): archive {code_hash, data_hash, rng_state, SimCfg, cards, masks, delta_form}; emit an audit manifest.
- M75-5 (Release & Versioning): semantic versioning MAJOR.MINOR.PATCH; MAJOR changes ship compatibility notes and migration scripts; release bundle is repro_bundle.
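The versioning rule in M75-5 can be illustrated with a minimal sketch. The function names `parse_version` and `needs_migration` are hypothetical, and the sketch assumes plain MAJOR.MINOR.PATCH strings without pre-release or build suffixes.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into three integers."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    major, minor, patch = (int(p) for p in parts)
    return major, minor, patch


def needs_migration(previous: str, candidate: str) -> bool:
    """True when MAJOR increases, i.e. when M75-5 asks for
    compatibility notes and migration scripts."""
    return parse_version(candidate)[0] > parse_version(previous)[0]
```

A release tool could call `needs_migration("1.4.2", "2.0.0")` before packaging and refuse to proceed if the notes or scripts are absent.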
IV. Body Structure
I. DatasetCard
- Purpose & scope: describe origin, conventions, units, and covariance of raw/processed datasets.
- Required fields:
  - meta: {dataset_id, version, instrument, band, time_span}
  - spec: {columns:[{name, unit, dim, description, see}], sampling, calibration}
  - quality: {systematics, covariance, masks}
  - integrals: {path:"gamma(ell)", measure:"d ell"} (for ToA-related columns)
  - hash: {data_hash, card_hash}
  - see: anchors to volumes/sections
- Dual-form ToA: store both forms side by side, together with delta_form:
  T_arr^A = ( 1 / c_ref ) * ( ∫ n_eff d ell )
  T_arr^B = ( ∫ ( n_eff / c_ref ) d ell )
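As a rough illustration of the dual-form requirement, the sketch below evaluates both forms on a sampled path with a trapezoidal rule. The numeric value of `C_REF`, the helper names, and the `form_difference` key are assumptions made for this example; how the difference relates to the delta_form record is not specified in this chapter.

```python
from typing import Dict, Sequence

C_REF = 299_792_458.0  # m/s; assumed value for c_ref, for illustration only


def trapezoid(values: Sequence[float], ell: Sequence[float]) -> float:
    """Trapezoidal integral of sampled values along path length ell."""
    if len(values) != len(ell) or len(ell) < 2:
        raise ValueError("need at least two matching samples")
    return sum(
        0.5 * (values[i] + values[i + 1]) * (ell[i + 1] - ell[i])
        for i in range(len(ell) - 1)
    )


def toa_dual_form(n_eff: Sequence[float], ell: Sequence[float],
                  c_ref: float = C_REF) -> Dict[str, object]:
    """Compute Form A and Form B and keep path and measure explicit."""
    t_a = trapezoid(n_eff, ell) / c_ref               # (1 / c_ref) * ∫ n_eff d ell
    t_b = trapezoid([n / c_ref for n in n_eff], ell)  # ∫ (n_eff / c_ref) d ell
    return {
        "T_arr_A": t_a,
        "T_arr_B": t_b,
        "form_difference": t_a - t_b,  # hypothetical bookkeeping field
        "path": "gamma(ell)",
        "measure": "d ell",
    }
```

With a constant c_ref the two forms agree analytically, so any nonzero difference in this sketch comes from floating-point rounding.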
II. ModelCard
- Purpose & scope: describe model/parameterization and priors, versioning, and compatibility.
- Required fields:
  - model_id, version, family (S30/S40/S50/S52/…)
  - params: {name, transform, prior, bounds, unit, dim}
  - hyper: hierarchical priors and shared hyperparameters
  - channels: switches and default weights for {A_rec, A_shear, A_dsa, A_turb}
  - diagnostics: summaries of evidence and information criteria from training/fits
  - hash: {code_hash, card_hash}
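The chapter does not state how card_hash is derived. One possible approach, shown here purely as an assumption, is a SHA-256 digest over a canonical JSON encoding of the card with its own hash block excluded. The function name `card_hash` and the canonicalization choices (sorted keys, compact separators, UTF-8) are illustrative.

```python
import hashlib
import json


def card_hash(card: dict) -> str:
    """SHA-256 over canonical JSON, skipping the card's own 'hash' block
    so the digest does not depend on itself (assumed convention)."""
    body = {key: value for key, value in card.items() if key != "hash"}
    canonical = json.dumps(
        body, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys makes the digest independent of field order, so two cards with the same content produce the same hash in any environment.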
III. PipelineCard
- Purpose & scope: standardize a reproducible execution graph.
- Required fields:
  - pipeline_id, version; graph: nodes/edges
  - node[i]: {type, inputs, outputs, image/env, seed, resources}
  - acceptance: thresholds mapped to Chapter 12 gates
  - exports: {products/, metrics.json, masks/, delta_form.log, repro/}
  - provenance: {who, when, where} aligned with {code_hash, data_hash}
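The graph requirement (a directed acyclic graph over the six node types from I75-2) can be checked with the standard-library `graphlib` module, available from Python 3.9. The function name `execution_order` and the input shapes (a name-to-type mapping plus a list of edges) are assumptions for this sketch.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Tuple

NODE_TYPES = {"ingest", "calibrate", "simulate", "fit", "validate", "export"}


def execution_order(nodes: Dict[str, str],
                    edges: List[Tuple[str, str]]) -> List[str]:
    """Return a valid run order, or raise if the graph is not a DAG.

    nodes maps node name -> node type; edges lists (upstream, downstream).
    """
    for name, node_type in nodes.items():
        if node_type not in NODE_TYPES:
            raise ValueError(f"{name}: unknown node type {node_type!r}")
    sorter = TopologicalSorter({name: set() for name in nodes})
    for upstream, downstream in edges:
        if upstream not in nodes or downstream not in nodes:
            raise ValueError(f"edge references unknown node: {upstream}->{downstream}")
        sorter.add(downstream, upstream)  # downstream depends on upstream
    return list(sorter.static_order())    # raises CycleError on a cycle
```

A scheduler could run nodes in the returned order; a `CycleError` indicates that the card's graph violates the DAG requirement.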
IV. Validation, Execution & Release
- Data validation (M75-1): enforce schema and Unit/Dim audits; ensure ToA columns state path and measure explicitly.
- Pipeline execution (M75-2): fix seed and environment; produce artifacts and metrics.json; failing nodes must return a minimal replayable state.
- Acceptance (M75-3): compare against Chapter 12 metrics; produce pass/fail and diffs.
- Release (M75-5): pack ArtifactBundle, all three cards, and repro_bundle into the registry; generate indices and retrieval keys.
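The data-validation step (M75-1) might look like the following minimal sketch. It assumes the DatasetCard is a plain dictionary shaped like the required-fields list above, and that ToA columns can be recognized by a name starting with `T_arr`; both are illustrative conventions, as is the function name `validate_dataset_card`.

```python
from typing import Dict, List


def validate_dataset_card(card: Dict) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: List[str] = []
    columns = card.get("spec", {}).get("columns", [])
    for index, column in enumerate(columns):
        label = column.get("name") or f"column[{index}]"
        # Every column must carry a name, a unit, and a dimension.
        for key in ("name", "unit", "dim"):
            if not column.get(key):
                problems.append(f"{label}: missing {key}")
    # Assumed convention: ToA columns have names starting with "T_arr".
    has_toa = any(str(c.get("name", "")).startswith("T_arr") for c in columns)
    if has_toa:
        integrals = card.get("integrals", {})
        for key in ("path", "measure"):
            if not integrals.get(key):
                problems.append(f"integrals: missing {key} for ToA columns")
    return problems
```

Returning a list of problems rather than raising on the first one lets the ingest step report every schema violation in a single pass.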
V. Cross-References within/beyond this Volume
- Metrics & gates: Chapter 12 (SpecMAE/LagRMS/PA_RMS/ToAΔ).
- Evidence & masks: Chapter 13 (posterior, evidence, masks, falsification_line).
- ToA fields: Chapters 7–8 (spectrum/transport mapping); Chapters 10–11 (timebase & path corrections).
- Model families & params: Chapters 4–6 (S30/S40/S45) and Chapters 7–8 (S50/S52).
VI. Validation, Criteria & Counterexamples
- Positive criteria:
  - DatasetCard/ModelCard/PipelineCard pass schema and Unit/Dim checks.
  - All metrics.json indicators satisfy their acceptance thresholds (Tab. 14-4).
  - Reproduction in an independent environment succeeds with matching hashes.
- Negative criteria:
  - Dimensional closure fails; ToA not stored in dual form or path not explicit.
  - Regression degrades beyond thresholds versus prior release.
  - Audit manifest lacks critical {hash/seed/SimCfg} fields.
- Contrasts:
  - Minimal-change regressions for {data-card only, model-card only, pipeline-card only}.
  - Compare ToA {Form A, Form B, A+B} impacts on products and evidence.
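The audit-manifest criterion can be checked mechanically. The sketch below assumes the manifest is a flat dictionary keyed by the items listed in M75-4; that layout and the function name `missing_audit_fields` are hypothetical.

```python
from typing import Dict, List

# Items listed under M75-4 (Audit & Archival).
REQUIRED_AUDIT_KEYS = (
    "code_hash", "data_hash", "rng_state", "SimCfg",
    "cards", "masks", "delta_form",
)


def missing_audit_fields(manifest: Dict) -> List[str]:
    """Names of required audit items that are absent or set to None."""
    return [
        key for key in REQUIRED_AUDIT_KEYS
        if key not in manifest or manifest[key] is None
    ]
```

An empty result means the manifest names every archived item; it says nothing about whether the hashes themselves are valid.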
VII. Summary & Handoff
This chapter standardizes data–pipeline–benchmark schemas, execution, and release via I75-* / M75-*, ensuring dimensional consistency, verifiable gates, and full-chain reproducibility, aligned with the metrics and evidence systems of Chapters 12–13. Chapter 15 proceeds to “Implementation Bindings & APIs” (I80-*) for external interfaces and acceptance use cases.
V. Figures & Tables (this chapter)
- Tab. 14-1 Minimal required fields for the three cards
| Card | Required fields (subset) |
|---|---|
| DatasetCard | dataset_id, version, columns{name,unit,dim}, covariance, masks, data_hash, see |
| ModelCard | model_id, version, params{name,prior,bounds,unit,dim}, hyper, code_hash, family |
| PipelineCard | pipeline_id, version, graph{nodes,edges}, env, seed, acceptance, exports |
- Tab. 14-2 Pipeline node types & fields
| type | required | outputs | notes |
|---|---|---|---|
| ingest | uri, schema | staged data | validation/standardization |
| calibrate | calib, masks | calib data | systematics correction |
| simulate | SimCfg | products/ | see Chapter 12 |
| fit | ModelCard | posterior, evidence | see Chapter 13 |
| validate | thresholds | metrics.json | acceptance gates |
| export | targets | bundle | release artifacts |
- Tab. 14-3 Registry keys & audit items
| key | example | purpose |
|---|---|---|
| sim_id | ASTROACC_GRB_M_v1 | global index |
| code_hash | sha256:… | provenance |
| data_hash | sha256:… | integrity |
| rng_state | JSON | reproduction |
| delta_form | A/B | ToA form flag |
- Tab. 14-4 Acceptance thresholds (map to Chapter 12)
| Metric | Threshold | Gate |
|---|---|---|
| SpecMAE | ≤ 3% | pass/fail |
| IndexErr | ≤ 0.05 | pass/fail |
| LagRMS | ≤ 5% | pass/fail |
| PA_RMS | ≤ 3° | pass/fail |
| ToAΔ | ≤ 0.1 ms | pass/fail |
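The gates in Tab. 14-4 are all upper bounds, so acceptance reduces to a per-metric comparison. In the sketch below the key names, the unit encoding (percentages as fractions, PA_RMS in degrees, ToAΔ in seconds under the key `ToA_delta`), and the function names are assumptions; the actual metrics.json layout is defined in Chapter 12.

```python
from typing import Dict

# Upper bounds taken from Tab. 14-4; the unit encoding is assumed.
THRESHOLDS = {
    "SpecMAE": 0.03,      # 3 %, stored as a fraction
    "IndexErr": 0.05,
    "LagRMS": 0.05,       # 5 %, stored as a fraction
    "PA_RMS": 3.0,        # degrees
    "ToA_delta": 1.0e-4,  # seconds (0.1 ms)
}


def evaluate_gates(metrics: Dict[str, float]) -> Dict[str, bool]:
    """Per-metric pass/fail; a missing metric counts as a failure."""
    return {
        name: (name in metrics and metrics[name] <= limit)
        for name, limit in THRESHOLDS.items()
    }


def overall_pass(metrics: Dict[str, float]) -> bool:
    """A run passes only if every gate passes."""
    return all(evaluate_gates(metrics).values())
```

Treating a missing metric as a failure keeps an incomplete metrics.json from passing silently.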
- Tab. 14-5 Release-bundle layout
| path | content |
|---|---|
| cards/ | DatasetCard/ModelCard/PipelineCard |
| products/ | synthetic & fitted products |
| metrics.json | metrics & gate results |
| masks/ | dominant energy/time masks |
| delta_form.log | ToA dual-form records |
| repro/ | environment lock & scripts |
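A release step could verify the layout in Tab. 14-5 before packing. The following sketch checks only that the listed directories and files exist under a bundle root. The function name `missing_bundle_entries` is hypothetical, and no assumption is made about file contents.

```python
from pathlib import Path
from typing import List, Union

# Entries from Tab. 14-5, split into directories and files.
REQUIRED_DIRS = ("cards", "products", "masks", "repro")
REQUIRED_FILES = ("metrics.json", "delta_form.log")


def missing_bundle_entries(root: Union[str, Path]) -> List[str]:
    """List required bundle paths that are absent under root."""
    root = Path(root)
    missing = [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    missing += [name for name in REQUIRED_FILES if not (root / name).is_file()]
    return missing
```

An empty list means the bundle has the expected skeleton; content checks such as hashes and gate results remain separate steps.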
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/