EFT.WP.Data.Pipeline v1.0
Chapter 11 Versioning, Provenance & Lineage
I. Chapter Purpose & Scope
Fix the pipeline's versioning, provenance, and lineage specifications: version locking for objects and artifacts, hashing and traceability, lineage graphs and replay, change notices and compatibility policy, audit trail and export manifest; ensure consistency with data contracts, Dataset/Model Cards, the Metrology chapter, and citation anchors.
II. Terminology & Dependencies
- Terms: semver, artifact, digest/sha256, lineage.graph, provenance, repro (reproducibility), replay, dual-write, compat_mode.
- Dependencies: contracts/exports (Core.DataSpec v1.0); units/dimensions (Core.Metrology v1.0); splits/coverage & quality (DatasetCards v1.0); evaluation/feature & I/O assumptions (ModelCards v1.0).
- Math & symbols: wrap inline symbols (e.g., QPS, T_inf, ρ) in backticks; any division/integral/composite operator must use parentheses and—if path quantities are involved—declare gamma(ell) and d ell; no Chinese in formulas/symbols/definitions.
III. Fields & Structure (Normative)
versioning:
  scheme: "semver"   # vMAJOR.MINOR.PATCH
  stability_line: "v1.*"
  compat_mode: "forward|backward|both|break"
  notice:
    type: "release|correction|withdrawal"
    summary: "<text>"
    date: "<YYYY-MM-DD>"
provenance:
  sources: ["<uri-or-ref>", "..."]   # upstream references (reference-only)
  transforms: ["<stage-name>@vX.Y", "..."]
  environment:
    containers: ["<image@digest>", "..."]
    deps_lock: "locks/deps.lock.yaml"
    seeds: {global: 1701}
lineage:
  graph:
    nodes:
      - {id: "src.s3.pull", kind: "stage", version: "v1.0"}
      - {id: "schema.check", kind: "stage", version: "v1.2"}
      - {id: "feat.map", kind: "stage", version: "v1.1"}
      - {id: "train_pkg", kind: "artifact", digest: "sha256:..."}
    edges:
      - {from: "src.s3.pull", to: "schema.check"}
      - {from: "schema.check", to: "feat.map"}
      - {from: "feat.map", to: "train_pkg"}
  replay:
    enabled: true
    inputs_lock: "locks/inputs.manifest.json"   # source list + offsets/watermarks
    policy: "strict|lenient"
artifacts:
  - {path: "pipeline.yaml", sha256: "<hex>"}
  - {path: "locks/inputs.manifest.json", sha256: "<hex>"}
  - {path: "locks/deps.lock.yaml", sha256: "<hex>"}
  - {path: "outputs/train_pkg.tgz", sha256: "<hex>"}
IV. Versioning Strategy & Stability Line
- Format: vMAJOR.MINOR.PATCH; MAJOR for breaking changes, MINOR for backward-compatible additions, PATCH for fixes and doc corrections.
- Stability line: public references should target v1.*; evaluation/release materials should pin to a specific minor version.
- Compatibility mode: forward|backward|both|break aligned with Schema/contract compat_mode; break requires shadow comparison and notice.
- Notices: record notices under versioning.notice and in export_manifest.references[].
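The versioning rules above can be sketched in Python as follows. This is a minimal illustration, not part of the specification: the helper names (parse_version, on_stability_line, bump_kind, requires_break_notice) are hypothetical, and the parser is assumed to accept an omitted PATCH (as in "v1.0"), mirroring the pattern used by the VER.SEMVER lint rule.

```python
import re
from typing import NamedTuple

# Assumption: PATCH may be omitted (e.g. "v1.0"), as in the VER.SEMVER lint pattern.
SEMVER_RE = re.compile(r"^v(\d+)\.(\d+)(?:\.(\d+))?$")


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse 'vMAJOR.MINOR[.PATCH]'; a missing PATCH is read as 0."""
    match = SEMVER_RE.match(text)
    if match is None:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH version: {text!r}")
    return Version(*(int(part or 0) for part in match.groups()))


def on_stability_line(version: Version, stability_line: str = "v1.*") -> bool:
    """True if the version's MAJOR matches a 'vN.*' stability line."""
    line = re.match(r"^v(\d+)\.\*$", stability_line)
    if line is None:
        raise ValueError(f"unsupported stability line: {stability_line!r}")
    return version.major == int(line.group(1))


def bump_kind(old: Version, new: Version) -> str:
    """Classify the change between two versions as major/minor/patch/none."""
    if new.major != old.major:
        return "major"
    if new.minor != old.minor:
        return "minor"
    if new.patch != old.patch:
        return "patch"
    return "none"


def requires_break_notice(old: Version, new: Version) -> bool:
    """Assumption: a MAJOR bump is the breaking case that needs a notice."""
    return bump_kind(old, new) == "major"
```

For example, moving from v1.9.0 to v2.0.0 is classified as "major", leaves the v1.* stability line, and would therefore call for shadow comparison and a notice under the rules above.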
V. Provenance & Reproducibility
- Sources: provenance.sources[] records source identifiers (reference-only); watermarks/offsets and time ranges are locked in inputs.manifest.json.
- Environment: container images use immutable digests; deps.lock.yaml lists dependency versions and hashes.
- Randomness: fix seeds; for non-deterministic operators, declare library versions and deterministic backends.
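As an illustration of the environment and randomness rules, the sketch below flags container references that are not pinned by an immutable sha256 digest and builds a seeded random generator from provenance.environment.seeds. The function names are hypothetical, and the check assumes digests are written as 64 lowercase hex characters after "@sha256:".

```python
import random
import re

# Assumption: a pinned reference looks like "<image>@sha256:<64 lowercase hex chars>".
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def unpinned_containers(containers):
    """Return the container references that are not pinned by digest."""
    return [ref for ref in containers if not DIGEST_REF.match(ref)]


def seeded_rng(seeds):
    """Build a private generator from the 'global' seed, avoiding shared global state."""
    return random.Random(seeds["global"])


if __name__ == "__main__":
    refs = [
        "ghcr.io/eift/pipeline@sha256:" + "a" * 64,  # placeholder digest, pinned
        "ghcr.io/eift/pipeline:latest",              # mutable tag, not pinned
    ]
    print(unpinned_containers(refs))
    # Two generators built from the same seed produce the same draws.
    print(seeded_rng({"global": 1701}).random() == seeded_rng({"global": 1701}).random())
```

A seeded generator only covers Python-level randomness; non-deterministic operators in numerical libraries still need the library versions and deterministic backends declared, as the Randomness bullet states.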
VI. Lineage Graph & Replay
- Lineage graph: lineage.graph contains nodes (stages/artifacts) and directed edges; nodes carry versions/hashes; the graph must have no dangling nodes.
- Replay: when replay.enabled=true, constrain input sets and order via inputs_lock; policy:"strict" demands byte-identical results, "lenient" allows bounded non-deterministic drift with a stated tolerance.
- Dual-write window: for breaking migrations, write to both the old and new versions in parallel, compare the diffs, and cut over only once the diffs fall within the agreed thresholds.
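The graph and replay rules above can be checked along these lines. This is a sketch under stated assumptions: "connected" is read as weakly connected (edge direction ignored), a "dangling" node is one that no edge touches, strict replay compares raw output bytes, and lenient replay compares numeric outputs element by element against an absolute tolerance. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict, deque


def graph_problems(graph: dict) -> list[str]:
    """Return human-readable problems; an empty list means the graph passes."""
    node_ids = [node["id"] for node in graph["nodes"]]
    known = set(node_ids)
    neighbours = defaultdict(set)
    problems = []
    for edge in graph["edges"]:
        for end in (edge["from"], edge["to"]):
            if end not in known:
                problems.append(f"edge references unknown node: {end}")
        # Edges are treated as undirected for the (weak) connectivity check.
        neighbours[edge["from"]].add(edge["to"])
        neighbours[edge["to"]].add(edge["from"])
    # Dangling node: declared but not touched by any edge.
    for node_id in node_ids:
        if len(node_ids) > 1 and not neighbours[node_id]:
            problems.append(f"dangling node: {node_id}")
    # A breadth-first search from the first node must reach every node.
    if node_ids:
        seen, queue = {node_ids[0]}, deque([node_ids[0]])
        while queue:
            for nxt in neighbours[queue.popleft()]:
                if nxt in known and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if len(seen) != len(node_ids):
            problems.append("graph is not connected")
    return problems


def strict_replay_ok(original: bytes, replayed: bytes) -> bool:
    """Strict policy: outputs must be byte-identical (compared via sha256)."""
    return hashlib.sha256(original).digest() == hashlib.sha256(replayed).digest()


def lenient_replay_ok(original, replayed, tolerance: float) -> bool:
    """Lenient policy: numeric outputs may drift by at most `tolerance` each."""
    return len(original) == len(replayed) and all(
        abs(a - b) <= tolerance for a, b in zip(original, replayed)
    )
```

Whatever tolerance is used for the lenient posture should be the one stated alongside the policy, so that a replay report can be audited against it.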
VII. Artifact Hashing & Integrity
- Mandatory hashing: all critical artifacts (configs/locks/packages/reports) require sha256; integrity failures are blocking.
- Manifest parity: artifacts[] and export_manifest.artifacts[] must match (paths + hashes); optionally record SIZE/LASTMOD.
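A minimal sketch of the hashing and parity checks, using only the Python standard library. The function names are hypothetical; entries are assumed to be dictionaries with "path" and "sha256" keys, as in the artifacts[] lists of this chapter.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(artifacts, root=".") -> list[str]:
    """Return paths that are missing or whose on-disk hash differs from the record."""
    failures = []
    for entry in artifacts:
        target = Path(root) / entry["path"]
        if not target.is_file() or sha256_file(target) != entry["sha256"]:
            failures.append(entry["path"])
    return failures


def manifest_parity(artifacts, export_artifacts) -> list[str]:
    """Return paths present on one side only, or recorded with different hashes."""
    left = {entry["path"]: entry["sha256"] for entry in artifacts}
    right = {entry["path"]: entry["sha256"] for entry in export_artifacts}
    return sorted(
        path for path in left.keys() | right.keys()
        if left.get(path) != right.get(path)
    )
```

Because integrity failures are blocking, a non-empty result from either check would be treated as a hard stop rather than a warning.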
VIII. Metrology & Units (SI)
- Performance & resources: QPS (1/s), T_inf (ms {p50,p95,p99}), ρ (—), net_mbps, size_bytes.
- Mandatory: metrology:{units:"SI", check_dim:true}; normalize units before composition/conversion.
- Path quantities: if lineage covers arrival-time/correction chains, register delta_form, path="gamma(ell)", measure="d ell"; use:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ), or
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ),
and pass check_dim.
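The two forms of T_arr above agree when c_ref is constant along the path, which the following numerical sketch illustrates. It rests on assumptions not stated in this chapter: the path gamma(ell) is sampled at points ell (in metres), n_eff is dimensionless, c_ref is a constant in m/s, and the integral is approximated with the trapezoidal rule. Under those assumptions both forms carry the dimension of seconds.

```python
import math


def trapezoid(values, ell):
    """Trapezoidal approximation of the integral of `values` over samples `ell`."""
    return sum(
        0.5 * (values[i] + values[i + 1]) * (ell[i + 1] - ell[i])
        for i in range(len(ell) - 1)
    )


def t_arr_factored(n_eff, ell, c_ref):
    """T_arr = ( 1 / c_ref ) * ( integral of n_eff d ell )."""
    return (1.0 / c_ref) * trapezoid(n_eff, ell)


def t_arr_inside(n_eff, ell, c_ref):
    """T_arr = ( integral of ( n_eff / c_ref ) d ell )."""
    return trapezoid([n / c_ref for n in n_eff], ell)


if __name__ == "__main__":
    ell = [0.0, 1.0, 2.0]      # sample points along the path, metres (illustrative)
    n_eff = [1.0, 1.5, 1.0]    # dimensionless samples (illustrative values)
    c_ref = 299_792_458.0      # m/s, assumed constant reference speed
    a = t_arr_factored(n_eff, ell, c_ref)
    b = t_arr_inside(n_eff, ell, c_ref)
    print(math.isclose(a, b, rel_tol=1e-12))
```

If c_ref varied along the path, only the second form would remain meaningful, since the factor could no longer be pulled outside the integral.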
IX. Machine-Readable Fragment (Drop-in)
versioning:
  scheme: "semver"
  stability_line: "v1.*"
  compat_mode: "both"
  notice: {type: "release", summary: "initial stable", date: "2025-09-21"}
provenance:
  sources: ["s3://eift-data/raw/2025/09/", "contracts/raw_rows@v1.2"]
  transforms: ["schema.check@v1.2", "feat.map@v1.1"]
  environment:
    containers: ["ghcr.io/eift/pipeline@sha256:abcdef..."]
    deps_lock: "locks/deps.lock.yaml"
    seeds: {global: 1701}
lineage:
  graph:
    nodes:
      - {id: "src.s3.pull", kind: "stage", version: "v1.0"}
      - {id: "schema.check", kind: "stage", version: "v1.2"}
      - {id: "feat.map", kind: "stage", version: "v1.1"}
      - {id: "train_pkg", kind: "artifact", digest: "sha256:1234..."}
    edges:
      - {from: "src.s3.pull", to: "schema.check"}
      - {from: "schema.check", to: "feat.map"}
      - {from: "feat.map", to: "train_pkg"}
  replay: {enabled: true, inputs_lock: "locks/inputs.manifest.json", policy: "strict"}
artifacts:
  - {path: "pipeline.yaml", sha256: "..."}
  - {path: "locks/inputs.manifest.json", sha256: "..."}
  - {path: "locks/deps.lock.yaml", sha256: "..."}
  - {path: "outputs/train_pkg.tgz", sha256: "..."}
X. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: VER.SEMVER
    when: "$.versioning.scheme"
    assert: "value == 'semver' and matches($.pipeline.version, '^v\\d+\\.\\d+(\\.\\d+)?$')"
    level: error
  - id: VER.COMPAT_ALLOWED
    when: "$.versioning.compat_mode"
    assert: "value in ['forward','backward','both','break']"
    level: error
  - id: LIN.GRAPH_CONNECTED
    when: "$.lineage.graph"
    assert: "graph_is_connected(value) and no_dangling_nodes(value)"
    level: error
  - id: LIN.REPLAY_INPUTS_LOCK
    when: "$.lineage.replay.enabled"
    assert: "value == false or has_key($.lineage.replay.inputs_lock)"
    level: error
  - id: ART.SHA256_REQUIRED
    when: "$.artifacts[*]"
    assert: "has_key('sha256') and len(value.sha256) > 0"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
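To show how these rules might be applied to a parsed configuration, the sketch below hand-codes five of them in Python against a plain dictionary. It does not implement the rule expression language itself, and it assumes that a rule is skipped when its "when" path is absent. The function name lint is hypothetical, and LIN.GRAPH_CONNECTED is left out because it needs a graph traversal.

```python
import re

SEMVER_RE = re.compile(r"^v\d+\.\d+(\.\d+)?$")   # same pattern as VER.SEMVER
COMPAT_MODES = {"forward", "backward", "both", "break"}


def lint(config: dict) -> list[str]:
    """Return the ids of failed rules, in the order the rules are listed."""
    failed = []
    versioning = config.get("versioning", {})

    # VER.SEMVER: scheme must be 'semver' and pipeline.version must match the pattern.
    if "scheme" in versioning:
        version = str(config.get("pipeline", {}).get("version", ""))
        if versioning["scheme"] != "semver" or not SEMVER_RE.match(version):
            failed.append("VER.SEMVER")

    # VER.COMPAT_ALLOWED: compat_mode must be one of the four allowed values.
    if "compat_mode" in versioning and versioning["compat_mode"] not in COMPAT_MODES:
        failed.append("VER.COMPAT_ALLOWED")

    # LIN.REPLAY_INPUTS_LOCK: enabled replay requires an inputs_lock entry.
    replay = config.get("lineage", {}).get("replay", {})
    if replay.get("enabled") and "inputs_lock" not in replay:
        failed.append("LIN.REPLAY_INPUTS_LOCK")

    # ART.SHA256_REQUIRED: every artifact needs a non-empty sha256.
    if any(not entry.get("sha256") for entry in config.get("artifacts", [])):
        failed.append("ART.SHA256_REQUIRED")

    # METROLOGY.SI_AND_CHECKDIM: units must be SI and check_dim must be true.
    if "metrology" in config:
        metrology = config["metrology"]
        if metrology.get("units") != "SI" or metrology.get("check_dim") is not True:
            failed.append("METROLOGY.SI_AND_CHECKDIM")

    return failed
```

Since every rule in the excerpt has level error, any non-empty result would block the pipeline in this reading.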
XI. Export Manifest & Audit Trail
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "pipeline.yaml", sha256: "..."}
    - {path: "locks/inputs.manifest.json", sha256: "..."}
    - {path: "locks/deps.lock.yaml", sha256: "..."}
    - {path: "lineage/graph.json", sha256: "..."}
    - {path: "reports/replay.result.json", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
XII. Chapter Compliance Checklist
- Version follows semver; stability_line matches public references; compat_mode is explicit; breaking changes are shadow-compared and announced via a notice.
- Provenance complete: sources, watermarks/offsets, environment & dependency locks, random seeds; no duplication of data facts.
- Lineage graph connected with no dangling nodes; when replay is enabled, provide inputs_lock and strict/lenient posture.
- All critical artifacts carry sha256 and are registered in export_manifest; performance/resource metrology uses SI with check_dim=true.
- For path quantities T_arr, delta_form/path/measure registered and validated; frozen splits and evaluation protocol aligned with Dataset/Model Cards.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/