EFT.WP.Astro.Acceleration v1.0

Chapter 14 Data, Pipelines & Benchmarks


I. Abstract & Scope
This chapter defines unified specifications and release workflows (I75-* / M75-*) for data, pipelines, and benchmarks. It covers JSON schemas and registries for dataset, model, and pipeline cards; data validation and dimensional consistency; pipeline execution and reproducible environments; benchmark suites and acceptance gates; and externally released reproducible bundles with audit trails. All symbols use English notation wrapped in backticks, and all quantities use SI units. Any ToA-related field must record both forms (T_arr^A and T_arr^B) with an explicit path gamma(ell) and measure d ell.

II. Dependencies & References

  1. Unified symbols & units: Chapter 2 Tab. 2-1 and P12-*.
  2. Kinematics & channels: Chapter 3 S20-*; reconnection/shear: Chapter 4 S30-*, Chapter 5 S40-*; comparator: Chapter 6 S45-*.
  3. Spectrum formation & transport: Chapter 7 S50-*, Chapter 8 S52-*.
  4. Domain branches: GRB (Chapter 10 M62-*), FRB (Chapter 11 M64-*).
  5. Simulation stack: Chapter 12 M70-* (products & metrics).
  6. Inference & falsification: Chapter 13 M72-* (evidence, masks, deliverables).

III. Normative Anchors (added in this chapter, I75-* / M75-*)

  1. I75-0 (Card Schemas & Registry): establish JSON Schemas for three cards—DatasetCard, ModelCard, PipelineCard; unify required fields, Unit/Dim, see: anchors, {code_hash, data_hash}, and versioning.
  2. I75-1 (DataSpec & Field Constraints): every numeric column must carry unit and dim; ToA fields must store T_arr^A, T_arr^B, and delta_form in parallel.
  3. I75-2 (PipelineSpec & DAG): pipelines are directed acyclic graphs G=(V,E) with node types {ingest, calibrate, simulate, fit, validate, export}; nodes declare inputs/outputs/env/seed.
  4. I75-3 (Product Layout & Naming): standard directories: products/, metrics.json, masks/, delta_form.log, repro/, cards/; filenames include {sim_id|run_id|stamp}.
  5. I75-4 (Interface Prototypes):
    • export_dataset_card(ds: DataSpec) -> DatasetCard
    • run_pipeline(p: PipelineCard, cfg: SimCfg) -> ArtifactBundle
    • register_benchmark(bundle) -> BenchmarkID
  6. M75-1 (Ingest & Validation): validate fields/units/dimensions per DataSpec; verify hashes and integrity; emit DatasetCard.
  7. M75-2 (Pipeline Execution & Reproducibility): lock environment (container/dep versions/RNG seeds) and execute per PipelineCard; produce ArtifactBundle and metrics.json.
  8. M75-3 (Benchmarks & Acceptance): apply Chapter 12 metrics/thresholds; run regression; if metrics meet gates, enqueue for release.
  9. M75-4 (Audit & Archival): archive {code_hash, data_hash, rng_state, SimCfg, cards, masks, delta_form}; emit an audit manifest.
  10. M75-5 (Release & Versioning): semantic versioning MAJOR.MINOR.PATCH; MAJOR changes ship compatibility notes and migration scripts; release bundle is repro_bundle.
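The versioning rule in M75-5 can be sketched as a small check. This is a minimal illustration, not the chapter's reference implementation: the function names `parse_version`, `bump_kind`, and `release_requirements`, and the artifact labels `compatibility_notes` and `migration_scripts`, are hypothetical names chosen here.

```python
from typing import List, Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse a 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    parts = text.strip().split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    major, minor, patch = (int(p) for p in parts)
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify the change from `old` to `new` as MAJOR, MINOR, or PATCH."""
    o, n = parse_version(old), parse_version(new)
    if n <= o:
        raise ValueError("new version must be greater than the old one")
    if n[0] != o[0]:
        return "MAJOR"
    if n[1] != o[1]:
        return "MINOR"
    return "PATCH"


def release_requirements(old: str, new: str) -> List[str]:
    """List what a release must ship, following M75-5.

    Every release ships the repro_bundle; a MAJOR change additionally
    ships compatibility notes and migration scripts (labels are
    hypothetical placeholders for those artifacts).
    """
    required = ["repro_bundle"]
    if bump_kind(old, new) == "MAJOR":
        required += ["compatibility_notes", "migration_scripts"]
    return required
```

A release tool could call `release_requirements` before packaging and refuse to publish a MAJOR change whose extra artifacts are absent.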

IV. Body Structure


I. DatasetCard

  1. Purpose & scope: describe origin, conventions, units, and covariance of raw/processed datasets.
  2. Required fields:
    • meta: {dataset_id, version, instrument, band, time_span}
    • spec: {columns:[{name, unit, dim, description, see}], sampling, calibration}
    • quality: {systematics, covariance, masks}
    • integrals: {path:"gamma(ell)", measure:"d ell"} (for ToA-related columns)
    • hash: {data_hash, card_hash}
    • see: anchors to volumes/sections
  3. Dual-form ToA: store side-by-side
    T_arr^A = ( 1 / c_ref ) * ( ∫ n_eff d ell ) and T_arr^B = ( ∫ ( n_eff / c_ref ) d ell ), with delta_form.
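The two ToA forms can be evaluated side by side on a sampled path. The sketch below uses a trapezoidal rule and makes several assumptions that are not stated in this chapter: `c_ref` is taken to be the vacuum speed of light, `delta_form` is treated as the difference T_arr^A - T_arr^B (the key table in this chapter lists it as an A/B form flag, so the real convention may differ), and the function names are hypothetical.

```python
from typing import Dict, Sequence

# Assumption: c_ref is the vacuum speed of light in m/s.
C_REF = 299_792_458.0


def trapezoid(y: Sequence[float], x: Sequence[float]) -> float:
    """Trapezoidal estimate of the integral of y over the samples x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i])
               for i in range(len(x) - 1))


def toa_dual_form(n_eff: Sequence[float], ell: Sequence[float],
                  c_ref: float = C_REF) -> Dict[str, object]:
    """Compute both ToA forms along a sampled path gamma(ell).

    Form A: (1 / c_ref) * integral of n_eff d ell
    Form B: integral of (n_eff / c_ref) d ell
    """
    if len(n_eff) != len(ell) or len(ell) < 2:
        raise ValueError("n_eff and ell need equal length, at least 2 samples")
    t_a = trapezoid(n_eff, ell) / c_ref
    t_b = trapezoid([n / c_ref for n in n_eff], ell)
    return {
        "T_arr_A": t_a,
        "T_arr_B": t_b,
        # Assumption: delta_form is stored as the A-minus-B difference.
        "delta_form": t_a - t_b,
        "path": "gamma(ell)",
        "measure": "d ell",
    }
```

With a constant `c_ref` the two forms agree analytically, so under this reading any nonzero `delta_form` reflects numerical round-off or a spatially varying reference speed.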

II. ModelCard

  1. Purpose & scope: describe model/parameterization and priors, versioning, and compatibility.
  2. Required fields:
    • model_id, version, family (S30/S40/S50/S52/…)
    • params: {name, transform, prior, bounds, unit, dim}
    • hyper: hierarchical priors and shared hyperparameters
    • channels: switches and default weights for {A_rec, A_shear, A_dsa, A_turb}
    • diagnostics: summaries of evidence and information criteria from training/fits
    • hash: {code_hash, card_hash}
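One way to make `card_hash` reproducible is to hash a canonical JSON serialization of the card. The sketch below is an assumption about how such a hash could be derived; the chapter does not prescribe the canonicalization, and the function name `card_hash` and the choice to exclude the `hash` field from its own digest are illustrative.

```python
import hashlib
import json
from typing import Any, Dict


def card_hash(card: Dict[str, Any]) -> str:
    """Return a 'sha256:<hex>' digest of a card, independent of key order.

    The card's own "hash" field is excluded so that storing the digest
    in the card does not change the digest (an assumed convention).
    """
    body = {key: value for key, value in card.items() if key != "hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys and fixing separators means two cards with the same content but different field order yield the same digest, which is what an independent reproduction needs in order to match hashes.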

III. PipelineCard

  1. Purpose & scope: standardize a reproducible execution graph.
  2. Required fields:
    • pipeline_id, version; graph: nodes/edges
    • node[i]: {type, inputs, outputs, image/env, seed, resources}
    • acceptance: thresholds mapped to Chapter 12 gates
    • exports: {products/, metrics.json, masks/, delta_form.log, repro/}
    • provenance: {who, when, where} aligned with {code_hash, data_hash}
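Because a PipelineCard graph must be acyclic, a loader can validate it and derive an execution order in one step. The following sketch uses the Python standard library `graphlib` module (Python 3.9+); the in-memory layout of `nodes` and `edges` and the function name `execution_order` are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Tuple

# Node types listed in I75-2.
NODE_TYPES = {"ingest", "calibrate", "simulate", "fit", "validate", "export"}


def execution_order(nodes: Dict[str, dict],
                    edges: List[Tuple[str, str]]) -> List[str]:
    """Validate a pipeline graph and return node ids in a runnable order.

    nodes maps a node id to its declaration (at least a "type");
    edges is a list of (upstream, downstream) pairs.
    """
    for node_id, node in nodes.items():
        if node.get("type") not in NODE_TYPES:
            raise ValueError(f"node {node_id!r} has unknown type {node.get('type')!r}")

    # TopologicalSorter expects: node -> set of predecessors.
    predecessors = {node_id: set() for node_id in nodes}
    for upstream, downstream in edges:
        if upstream not in nodes or downstream not in nodes:
            raise ValueError(f"edge ({upstream!r}, {downstream!r}) references an undeclared node")
        predecessors[downstream].add(upstream)

    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        raise ValueError(f"pipeline graph has a cycle: {exc.args[1]}") from exc
```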

IV. Validation, Execution & Release


V. Cross-References within/beyond this Volume


VI. Validation, Criteria & Counterexamples

  1. Positive criteria:
    • DatasetCard/ModelCard/PipelineCard pass schema and Unit/Dim checks.
    • All metrics.json indicators satisfy their gate thresholds.
    • Reproduction in an independent environment succeeds with matching hashes.
  2. Negative criteria:
    • Dimensional closure fails; ToA not stored in dual form or path not explicit.
    • Regression degrades beyond thresholds versus prior release.
    • Audit manifest lacks critical {hash/seed/SimCfg} fields.
  3. Contrasts:
    • Minimal-change regressions for {data-card only, model-card only, pipeline-card only}.
    • Compare ToA {Form A, Form B, A+B} impacts on products and evidence.
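The schema-level criteria above can be sketched as a checker that returns a list of problems for a DatasetCard. This is a simplified illustration, not a full JSON Schema validation: the column names `T_arr^A`, `T_arr^B`, and `delta_form` used to detect ToA data, and the function name `dataset_card_problems`, are assumptions.

```python
from typing import Dict, List

REQUIRED_TOP_LEVEL = ("meta", "spec", "quality", "hash", "see")
# Assumption: ToA data appear as columns with exactly these names.
TOA_COLUMNS = {"T_arr^A", "T_arr^B", "delta_form"}


def dataset_card_problems(card: Dict) -> List[str]:
    """Return human-readable problems; an empty list means the card passes."""
    problems = [f"missing field: {key}"
                for key in REQUIRED_TOP_LEVEL if key not in card]

    # Every column must carry both a unit and a dimension.
    columns = card.get("spec", {}).get("columns", [])
    for col in columns:
        for attr in ("unit", "dim"):
            if not col.get(attr):
                problems.append(f"column {col.get('name', '?')}: missing {attr}")

    # If any ToA column is present, the dual form and the path must be complete.
    names = {col.get("name") for col in columns}
    if names & TOA_COLUMNS:
        for name in sorted(TOA_COLUMNS - names):
            problems.append(f"ToA dual form incomplete: missing {name}")
        integrals = card.get("integrals", {})
        for key in ("path", "measure"):
            if not integrals.get(key):
                problems.append(f"ToA integrals: missing {key}")
    return problems
```

Returning all problems at once, rather than raising on the first, lets an ingest step report every negative criterion a card triggers in a single pass.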

VII. Summary & Handoff
This chapter standardizes data–pipeline–benchmark schemas, execution, and release via I75-* / M75-*, ensuring dimensional consistency, verifiable gates, and full-chain reproducibility, aligned with the metrics and evidence systems of Chapters 12–13. Chapter 15 proceeds to “Implementation Bindings & APIs” (I80-*) for external interfaces and acceptance use cases.

V. Figures & Tables (this chapter)

| Card | Required fields (subset) |
|---|---|
| DatasetCard | dataset_id, version, columns{name,unit,dim}, covariance, masks, data_hash, see |
| ModelCard | model_id, version, params{name,prior,bounds,unit,dim}, hyper, code_hash, family |
| PipelineCard | pipeline_id, version, graph{nodes,edges}, env, seed, acceptance, exports |

| type | required | outputs | notes |
|---|---|---|---|
| ingest | uri, schema | staged data | validation/standardization |
| calibrate | calib, masks | calib data | systematics correction |
| simulate | SimCfg | products/ | see Chapter 12 |
| fit | ModelCard | posterior, evidence | see Chapter 13 |
| validate | thresholds | metrics.json | acceptance gates |
| export | targets | bundle | release artifacts |

| key | example | purpose |
|---|---|---|
| sim_id | ASTROACC_GRB_M_v1 | global index |
| code_hash | sha256:… | provenance |
| data_hash | sha256:… | integrity |
| rng_state | JSON | reproduction |
| delta_form | A/B | ToA form flag |
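The audit step (M75-4) needs file digests in the `sha256:…` form and a completeness check on the manifest. The sketch below shows one way to do both; the manifest is assumed to be a flat dictionary keyed by the names listed in M75-4, and the function names `sha256_file` and `manifest_gaps` are hypothetical.

```python
import hashlib
from typing import Dict, List

# Keys that M75-4 requires in the archive.
REQUIRED_KEYS = ("code_hash", "data_hash", "rng_state", "SimCfg",
                 "cards", "masks", "delta_form")


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return 'sha256:<hex>'."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data products do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def manifest_gaps(manifest: Dict[str, object]) -> List[str]:
    """Return the required keys that are absent or empty in the manifest."""
    empty_values = (None, "", [], {})
    return [key for key in REQUIRED_KEYS if manifest.get(key) in empty_values]
```

A non-empty result from `manifest_gaps` corresponds to the negative criterion of an audit manifest lacking critical fields.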

| Metric | Threshold | Gate |
|---|---|---|
| SpecMAE | ≤ 3% | pass/fail |
| IndexErr | ≤ 0.05 | pass/fail |
| LagRMS | ≤ 5% | pass/fail |
| PA_RMS | ≤ 3° | pass/fail |
| ToAΔ | ≤ 0.1 ms | pass/fail |
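The gate thresholds in the table above are all upper bounds, so an acceptance check reduces to a per-metric comparison. The sketch below is illustrative only: the key names inside metrics.json, the encoding of percentages as fractions, degrees for PA_RMS, and seconds for the ToA difference are assumptions, as is the function name `evaluate_gates`.

```python
from typing import Dict, Optional

# Upper bounds taken from the gate table; units are assumed encodings.
GATES = {
    "SpecMAE": 0.03,     # 3%, stored as a fraction
    "IndexErr": 0.05,    # dimensionless
    "LagRMS": 0.05,      # 5%, stored as a fraction
    "PA_RMS": 3.0,       # degrees
    "ToA_delta": 1e-4,   # seconds (0.1 ms); hypothetical key for ToAΔ
}


def evaluate_gates(metrics: Dict[str, Optional[float]],
                   gates: Dict[str, float] = GATES) -> Dict[str, object]:
    """Compare each metric with its upper bound and summarize the result.

    A missing metric counts as a failed gate, so an incomplete
    metrics.json can never be enqueued for release.
    """
    results = {}
    for name, limit in gates.items():
        value = metrics.get(name)
        passed = value is not None and value <= limit
        results[name] = {"value": value, "threshold": limit, "pass": passed}
    release_ready = all(entry["pass"] for entry in results.values())
    return {"gates": results, "release_ready": release_ready}
```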

| path | content |
|---|---|
| cards/ | DatasetCard/ModelCard/PipelineCard |
| products/ | synthetic & fitted products |
| metrics.json | metrics & gate results |
| masks/ | dominant energy/time masks |
| delta_form.log | ToA dual-form records |
| repro/ | environment lock & scripts |
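Before release, a bundle directory can be checked against the layout above. This is a minimal sketch under the assumption that the entries sit directly under the bundle root; the function name `missing_bundle_entries` is hypothetical.

```python
from pathlib import Path
from typing import List, Union

# Layout from the table above: directories end with "/" there, files do not.
REQUIRED_DIRS = ("cards", "products", "masks", "repro")
REQUIRED_FILES = ("metrics.json", "delta_form.log")


def missing_bundle_entries(root: Union[str, Path]) -> List[str]:
    """Return the required directories and files that are absent under root."""
    root = Path(root)
    missing = [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
    missing += [name for name in REQUIRED_FILES if not (root / name).is_file()]
    return missing
```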


Copyright & License (CC BY 4.0)

Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/