EFT.WP.Data.DatasetCards v1.0
Chapter 6 Provenance & Sampling
I. Chapter Purpose & Scope
Standardize recording of dataset provenance, collection methods, spatiotemporal coverage, and selection bias; fix reproducible sampling strategies and quality-control requirements. All entries use snake_case; citations use “Volume+Version+Anchor”.
II. Terminology & Dependencies
- Terminology source: General terms follow Core.Terms v1.0. This chapter only adds field names and constraints directly related to provenance/sampling. Cross-volume citations must carry version and P/S/M/I anchors.
- Dependent volumes: Data contracts & export: Core.DataSpec v1.0; units/dimensions & uncertainty: Core.Metrology v1.0; arrival-time/path-dependent equations: Core.Equations v1.1.
III. Fields & Structure (Normative)
provenance:
  collection_method: "<string>"               # e.g., beamformed-array / survey / simulation
  instruments:                                # instruments/stations/arrays & channel summary
    - {name: "<string>", station: "<string>", role: "<rx/tx/mixed>"}
  time_coverage: "<YYYY-MM-DD..YYYY-MM-DD>"   # interval; specify closed/semi-open explicitly
  spatial_coverage: "<region spec>"           # e.g., RA/Dec ranges or tile indices
  selection_bias: "<string>"                  # executable criteria, e.g., flux-limited, SNR>7
  permits: ["<license/ref>"]                  # collection permits/ethics notes (if applicable)
sampling:
  strategy: "<random|stratified|systematic|time-based|spatial-tiles>"
  strata:                                     # stratification vars & quotas (if applicable)
    - {by: "class", buckets: {"FRB": 520, "RFI": 2100, "Noise": 12380}}
  rates: {train: 0.80, validation: 0.10, test: 0.10}   # must match splits
  seed: 12345
  replacement: false
  dedup_policy: "<per-scan|per-object|per-tile>"
  representativeness: "<statement>"
  audits: ["coverage", "leakage", "class-imbalance"]
provenance and sampling are record-layer objects; their exported artifacts and references must appear in export_manifest.references[].
IV. Collection Methods & Source Recording
- collection_method: Unified enumeration (examples: beamformed-array, drift-scan, survey-aggregation, simulation).
- instruments: Include at least name and station/array ID; if metrology/calibration is involved, bind calibration method/date in optional sensor_profile.
- time_coverage / spatial_coverage: Use explicit intervals and coordinate conventions; state whether each interval is closed or semi-open, and avoid projection ambiguities. Provide units and CRS either in the metrology section or in this chapter’s fields.
- selection_bias: Express as executable criteria (thresholds/rules/whitelists) and report representational impact in the quality section.
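A minimal sketch of what "executable criteria" could look like for selection_bias, assuming a hypothetical mini-grammar of the form `<field> <op> <number>` (for example `SNR>=7`). The grammar, the function names `parse_rule` and `apply_rule`, and the sample records are illustrative assumptions; this chapter does not define a criteria language. Free-text labels such as `flux-limited` are rejected here because they cannot be evaluated without a further definition.

```python
import operator
import re

# Hypothetical mini-grammar (an assumption, not part of this chapter):
# "<field> <op> <number>", e.g. "SNR>=7".
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
_RULE = re.compile(r"\s*([A-Za-z_]\w*)\s*(>=|<=|==|>|<)\s*([-+]?\d+(?:\.\d+)?)\s*")


def parse_rule(text):
    """Turn a threshold string into (field, comparator, value)."""
    match = _RULE.fullmatch(text)
    if match is None:
        raise ValueError(f"not an executable criterion: {text!r}")
    field, op, value = match.groups()
    return field, _OPS[op], float(value)


def apply_rule(records, text):
    """Return the records that satisfy the rule and the fraction kept."""
    field, compare, value = parse_rule(text)
    kept = [rec for rec in records if compare(rec[field], value)]
    fraction = len(kept) / len(records) if records else 0.0
    return kept, fraction


# Illustrative records only; the values are made up for the example.
records = [
    {"id": "a", "SNR": 6.5},
    {"id": "b", "SNR": 7.0},
    {"id": "c", "SNR": 12.3},
]
kept, fraction = apply_rule(records, "SNR>=7")
print([rec["id"] for rec in kept], round(fraction, 3))
```

The kept fraction is one simple number that could feed the representational-impact report in the quality section.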
V. Sampling Strategy & Implementation Constraints
- strategy:
  - random: global uniform sampling;
  - stratified: by class/region/SNR, etc.;
  - systematic: fixed-step/rule-based;
  - time-based: windows/periodic;
  - spatial-tiles: spatial tiling.
- strata: Declare buckets and quotas explicitly; align with quality.coverage class frequencies and coverage metrics.
- seed / replacement / dedup_policy: Fix the RNG seed; state whether sampling is with or without replacement; set the de-duplication policy (per-scan, i.e. per observation; per-object; or per-tile).
- audits: At minimum include coverage, leakage (the same object appearing in more than one split), and class imbalance; push audit results into quality.
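The constraints above can be sketched as a seeded, stratified split without replacement and with per-object de-duplication. This is a minimal illustration under stated assumptions: the record keys `object_id` and `class`, the function name `stratified_split`, and the rounding rule (cumulative rounding, with the last split taking the remainder) are all hypothetical choices, not requirements of this chapter.

```python
import random
from collections import defaultdict


def stratified_split(records, rates, seed, by="class", object_key="object_id"):
    """De-duplicate per object, then split each stratum by `rates`."""
    # dedup_policy "per-object": keep the first record seen for each object.
    unique = {}
    for rec in records:
        unique.setdefault(rec[object_key], rec)

    strata = defaultdict(list)
    for rec in unique.values():
        strata[rec[by]].append(rec)

    rng = random.Random(seed)          # fixed seed gives the same split every run
    names = list(rates)
    splits = {name: [] for name in names}
    for label in sorted(strata):       # sorted so iteration order is stable
        bucket = strata[label]
        rng.shuffle(bucket)            # a shuffle samples without replacement
        start, cumulative = 0, 0.0
        for i, name in enumerate(names):
            cumulative += rates[name]
            if i == len(names) - 1:
                end = len(bucket)      # last split takes the remainder
            else:
                end = round(cumulative * len(bucket))
            splits[name].extend(bucket[start:end])
            start = end
    return splits


# Illustrative data: 20 FRB objects, 10 RFI objects, and one duplicate.
records = (
    [{"object_id": f"frb{i}", "class": "FRB"} for i in range(20)]
    + [{"object_id": f"rfi{i}", "class": "RFI"} for i in range(10)]
    + [{"object_id": "frb0", "class": "FRB"}]  # dropped by de-duplication
)
rates = {"train": 0.80, "validation": 0.10, "test": 0.10}
splits = stratified_split(records, rates, seed=1701)
print({name: len(rows) for name, rows in splits.items()})
```

Because each stratum is split separately, class proportions are approximately preserved in every split, and because de-duplication happens before splitting, an object cannot land in two splits.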
VI. Consistency with splits & Leakage Prevention
- sampling.rates must match splits.{train,validation,test}.ratio, with a difference of at most 1e-6 for each split.
- For object/sequence data, the same object or adjacent time windows must not appear across split sets (leakage prevention).
- Record audit outcomes (coverage, leakage, class balance) in quality.coverage and encode thresholds/pass criteria in quality.gates.
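The first two rules in this section can be checked mechanically. The sketch below is illustrative: the helper names `rates_match` and `leaked_objects`, the `object_id` key, and the in-memory layout of `splits` are assumptions. It covers object-level leakage only; a check for adjacent time windows would need a window definition that this chapter does not give.

```python
def rates_match(sampling_rates, split_ratios, tol=1e-6):
    """True when both mappings name the same splits and agree within tol."""
    if set(sampling_rates) != set(split_ratios):
        return False
    return all(
        abs(sampling_rates[name] - split_ratios[name]) <= tol
        for name in sampling_rates
    )


def leaked_objects(splits, object_key="object_id"):
    """Return the object IDs that occur in more than one split."""
    first_seen = {}
    leaked = set()
    for name, rows in splits.items():
        for rec in rows:
            owner = first_seen.setdefault(rec[object_key], name)
            if owner != name:
                leaked.add(rec[object_key])
    return leaked


# Assumed layout mirroring splits.{train,validation,test}.ratio.
sampling_rates = {"train": 0.80, "validation": 0.10, "test": 0.10}
splits_spec = {
    "train": {"ratio": 0.8},
    "validation": {"ratio": 0.1},
    "test": {"ratio": 0.1},
}
split_ratios = {name: spec["ratio"] for name, spec in splits_spec.items()}
ok = rates_match(sampling_rates, split_ratios)

# Illustrative membership with one deliberate leak (object "B").
members = {
    "train": [{"object_id": "A"}, {"object_id": "B"}],
    "validation": [{"object_id": "C"}],
    "test": [{"object_id": "B"}],
}
leaks = leaked_objects(members)
print(ok, sorted(leaks))
```

A non-empty result from `leaked_objects` is the kind of outcome that would be recorded in quality.coverage and compared against a pass criterion in quality.gates.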
VII. Metrology, Units & Path Dependence (if applicable)
If provenance/sampling involves path-dependent quantities (e.g., T_arr), also register:
- delta_form, path="gamma(ell)", measure="d ell";
- Two equivalent expressions coexist:
  - T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
  - T_arr = ( ∫ ( n_eff / c_ref ) d ell );
- The dimensional/unit consistency check check_dim must pass.
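As a numerical illustration, the two expressions for T_arr agree when c_ref is constant along the path, since a constant factor can be moved in or out of the integral. The sketch below assumes a constant c_ref (the value used is only a placeholder), made-up path samples, and a simple trapezoidal rule; none of these come from Core.Equations. With n_eff dimensionless, ell in metres, and c_ref in metres per second, both forms yield seconds.

```python
def trapezoid(values, ell):
    """Trapezoidal approximation of the integral of `values` along `ell`."""
    total = 0.0
    for i in range(1, len(ell)):
        total += 0.5 * (values[i] + values[i - 1]) * (ell[i] - ell[i - 1])
    return total


# Assumed constant reference speed in m/s (placeholder value for the example).
C_REF = 299_792_458.0

# Illustrative path samples: arc length in metres, dimensionless n_eff.
ell = [0.0, 1.0e3, 2.5e3, 4.0e3]
n_eff = [1.0000, 1.0003, 1.0002, 1.0001]

# Form 1: T_arr = (1 / c_ref) * integral of n_eff d ell
t_factored = (1.0 / C_REF) * trapezoid(n_eff, ell)

# Form 2: T_arr = integral of (n_eff / c_ref) d ell
t_inside = trapezoid([n / C_REF for n in n_eff], ell)

rel_diff = abs(t_factored - t_inside) / t_factored
print(t_factored, t_inside, rel_diff)
```

The two results differ only by floating-point rounding. If c_ref varied along the path, only the second form would remain meaningful, which is one reason to record which form a dataset actually used.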
VIII. Quality Control & Representativeness
- Coverage: Report sample counts, spatiotemporal coverage, class/modality distributions, and confidence intervals.
- Representativeness: Evaluate bias against target distributions (physical classes, scenes, environments) and provide correction or weighting schemes.
- Sampling error: Register systematic/random components and combination rule (e.g., rss) under the uncertainty extension.
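A small sketch of two of the quantities named above: a confidence interval for each class fraction and an rss (root-sum-square) combination of uncertainty components. It assumes the bucket numbers from this chapter's example are realized sample counts, uses the Wilson score interval as one reasonable choice of interval (the chapter does not prescribe one), and uses made-up component values of 0.03 and 0.04. The rss rule is appropriate only when the components are independent.

```python
import math


def rss(*components):
    """Root-sum-square combination of independent uncertainty components."""
    return math.sqrt(sum(c * c for c in components))


def wilson_interval(count, total, z=1.96):
    """Wilson score interval for a fraction count/total (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = count / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Bucket numbers from the example fragment, treated here as sample counts.
buckets = {"FRB": 520, "RFI": 2100, "Noise": 12380}
total = sum(buckets.values())
intervals = {label: wilson_interval(n, total) for label, n in buckets.items()}

# Illustrative systematic and random components (assumed values).
combined = rss(0.03, 0.04)

for label, (low, high) in intervals.items():
    print(f"{label}: {buckets[label] / total:.4f} in [{low:.4f}, {high:.4f}]")
print(f"combined uncertainty: {combined:.4f}")
```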
IX. Compliance & Permissions (if applicable)
For human/geosensitive data, record permits, de-identification strategies, and usage restrictions; align with privacy/ethics extensions and reflect in export_manifest.references[].
X. Example Fragment (drop-in)
provenance:
  collection_method: "survey-aggregation"
  instruments: [{name: "LOFAR", station: "DE601", role: "rx"}]
  time_coverage: "2019-01-01..2024-12-31"
  spatial_coverage: "RA[120..240],Dec[-30..+30]"
  selection_bias: "flux-limited, SNR>=7"
sampling:
  strategy: "stratified"
  strata:
    - {by: "class", buckets: {"FRB": 520, "RFI": 2100, "Noise": 12380}}
  rates: {train: 0.80, validation: 0.10, test: 0.10}
  seed: 1701
  replacement: false
  dedup_policy: "per-object"
  audits: ["coverage", "leakage", "class-imbalance"]
(When exporting, add to export_manifest.references[]: "EFT.WP.Core.DataSpec v1.0:EXPORT", "EFT.WP.Core.Metrology v1.0:check_dim", "EFT.WP.Core.Equations v1.1:S20-1".)
XI. Chapter Compliance Checklist
- provenance/sampling fields exist and satisfy this chapter’s schema; sampling.rates matches splits; leakage audits are completed and recorded.
- All cross-volume citations use "Volume vX.Y:Anchor" and are present in export_manifest.references[]; no shortcodes or missing versions.
- For path-dependent quantities, delta_form, path, and measure are registered and check_dim passes.
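The citation rule in this checklist lends itself to an automated check. The sketch below assumes a regular expression for "Volume vX.Y:Anchor" inferred from the three example references in this chapter; the exact alphabet allowed in volume names and anchors is a guess, and the helper name `invalid_citations` is hypothetical.

```python
import re

# Assumed pattern for "Volume vX.Y:Anchor". The allowed characters are
# inferred from this chapter's examples and may need widening.
CITATION = re.compile(r"[A-Za-z][\w.]*\sv\d+\.\d+:[A-Za-z][\w-]*")


def invalid_citations(references):
    """Return the references that do not match the assumed citation pattern."""
    return [ref for ref in references if CITATION.fullmatch(ref) is None]


references = [
    "EFT.WP.Core.DataSpec v1.0:EXPORT",
    "EFT.WP.Core.Metrology v1.0:check_dim",
    "EFT.WP.Core.Equations v1.1:S20-1",
    "Core.Metrology:check_dim",   # missing version
    "DataSpec v1.0",              # missing anchor
]
bad = invalid_citations(references)
print(bad)
```

A pattern check only catches malformed strings; confirming that each cited anchor actually exists in the referenced volume would need access to that volume's anchor list.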
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/