EFT.WP.Data.DatasetCards v1.0

Chapter 6 Provenance & Sampling


I. Chapter Purpose & Scope

Standardize how dataset provenance, collection methods, spatiotemporal coverage, and selection bias are recorded, and pin down reproducible sampling strategies and quality-control requirements. All field names use snake_case; citations use the “Volume+Version+Anchor” form.

II. Terminology & Dependencies


III. Fields & Structure (Normative)

provenance:
  collection_method: "<string>"              # e.g., beamformed-array / survey / simulation
  instruments:                               # instruments/stations/arrays & channel summary
    - {name: "<string>", station: "<string>", role: "<rx/tx/mixed>"}
  time_coverage: "<YYYY-MM-DD..YYYY-MM-DD>"  # interval; specify closed/semi-open explicitly
  spatial_coverage: "<region spec>"          # e.g., RA/Dec ranges or tile indices
  selection_bias: "<string>"                 # executable criteria, e.g., flux-limited, SNR>7
  permits: ["<license/ref>"]                 # collection permits/ethics notes (if applicable)
sampling:
  strategy: "<random|stratified|systematic|time-based|spatial-tiles>"
  strata:                                    # stratification vars & quotas (if applicable)
    - {by: "class", buckets: {"FRB": 520, "RFI": 2100, "Noise": 12380}}
  rates: {train: 0.80, validation: 0.10, test: 0.10}  # must match splits
  seed: 12345
  replacement: false
  dedup_policy: "<per-scan|per-object|per-tile>"
  representativeness: "<statement>"
  audits: ["coverage", "leakage", "class-imbalance"]

provenance and sampling are record-layer objects; their exported artifacts and references must appear in export_manifest.references[].
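The field constraints in this section can be checked mechanically. The following is a minimal sketch, assuming the sampling block has already been loaded into a Python dict. The function name validate_sampling and the rule that rates must sum to 1.0 are assumptions made for illustration; the allowed values and the minimum audit set are taken from this chapter.

```python
import math

ALLOWED_STRATEGIES = {"random", "stratified", "systematic", "time-based", "spatial-tiles"}
ALLOWED_DEDUP = {"per-scan", "per-object", "per-tile"}
REQUIRED_AUDITS = {"coverage", "leakage", "class-imbalance"}


def validate_sampling(sampling: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    strategy = sampling.get("strategy")
    if strategy not in ALLOWED_STRATEGIES:
        problems.append(f"unknown strategy: {strategy!r}")
    if strategy == "stratified" and not sampling.get("strata"):
        problems.append("stratified strategy requires strata")
    # Assumption: split rates are fractions of the whole and should sum to 1.0.
    rates = sampling.get("rates") or {}
    if not rates or not math.isclose(sum(rates.values()), 1.0, abs_tol=1e-9):
        problems.append("rates must be present and sum to 1.0")
    seed = sampling.get("seed")
    if not isinstance(seed, int) or isinstance(seed, bool):
        problems.append("seed must be a fixed integer")
    if not isinstance(sampling.get("replacement"), bool):
        problems.append("replacement must be true or false")
    if sampling.get("dedup_policy") not in ALLOWED_DEDUP:
        problems.append(f"unknown dedup_policy: {sampling.get('dedup_policy')!r}")
    missing = REQUIRED_AUDITS - set(sampling.get("audits", []))
    if missing:
        problems.append(f"missing audits: {sorted(missing)}")
    return problems
```

A card that returns an empty list here still needs the cross-checks against splits and quality, which this sketch does not cover.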

IV. Collection Methods & Source Recording


V. Sampling Strategy & Implementation Constraints

  1. strategy:
    • random: global uniform sampling;
    • stratified: by class/region/SNR, etc.;
    • systematic: fixed-step/rule-based;
    • time-based: windows/periodic;
    • spatial-tiles: spatial tiling.
  2. strata: Declare buckets and quotas explicitly; align them with the quality.coverage class frequencies and coverage metrics.
  3. seed / replacement / dedup_policy: Fix the RNG seed; state whether sampling is with or without replacement; set the de-duplication policy (by scan/object/tile, matching the dedup_policy values).
  4. audits: Include at minimum coverage, leakage (object leakage across splits), and class-imbalance; write the audit results to quality.
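The constraints above can be sketched as a seeded, stratified split without replacement, followed by a leakage audit. This is an illustrative sketch only: the record keys "class" and "object_id" and the function names stratified_split and leakage are hypothetical, and per-object de-duplication is interpreted here as keeping the first record seen for each object.

```python
import random
from collections import defaultdict


def stratified_split(records, rates, seed, key="class", object_key="object_id"):
    """De-duplicate per object, then split each stratum by `rates` with a fixed seed."""
    # Per-object de-duplication: keep the first record seen for each object.
    seen, unique = set(), []
    for rec in records:
        if rec[object_key] not in seen:
            seen.add(rec[object_key])
            unique.append(rec)

    strata = defaultdict(list)
    for rec in unique:
        strata[rec[key]].append(rec)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    names = list(rates)
    splits = {name: [] for name in names}
    for label in sorted(strata):  # sorted so stratum order is deterministic
        bucket = strata[label]
        rng.shuffle(bucket)  # shuffle + partition = sampling without replacement
        start = 0
        for i, name in enumerate(names):
            # The last split takes the remainder so no record is dropped.
            if i == len(names) - 1:
                end = len(bucket)
            else:
                end = start + round(rates[name] * len(bucket))
            splits[name].extend(bucket[start:end])
            start = end
    return splits


def leakage(splits, object_key="object_id"):
    """Return the object ids that appear in more than one split."""
    owner, leaked = {}, set()
    for name, recs in splits.items():
        for rec in recs:
            oid = rec[object_key]
            if owner.setdefault(oid, name) != name:
                leaked.add(oid)
    return leaked
```

Calling stratified_split twice with the same records, rates, and seed yields identical splits, and leakage should return an empty set when de-duplication is per object.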

VI. Consistency with splits & Leakage Prevention


VII. Metrology, Units & Path Dependence (if applicable)

If provenance/sampling involves path-dependent quantities (e.g., T_arr), also register:
  1. delta_form, path="gamma(ell)", measure="d ell";
  2. Two equivalent expressions coexist, and both must pass the dimensional/unit consistency check check_dim:
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell )
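The equivalence of the two T_arr forms can be illustrated numerically. The sketch below assumes c_ref is constant along the path (which is what allows it to be moved outside the integral), that n_eff is dimensionless, and that the path is sampled at discrete points integrated with the trapezoidal rule. The function names and the small dimension bookkeeping are hypothetical stand-ins, not the check_dim defined in EFT.WP.Core.Metrology.

```python
import math


def trapezoid(values, ell):
    """Trapezoidal-rule integral of `values` over the sample points `ell`."""
    return sum(
        0.5 * (values[i] + values[i + 1]) * (ell[i + 1] - ell[i])
        for i in range(len(ell) - 1)
    )


def t_arr_factored(n_eff, ell, c_ref):
    """T_arr = (1 / c_ref) * integral(n_eff d ell)."""
    return trapezoid(n_eff, ell) / c_ref


def t_arr_inline(n_eff, ell, c_ref):
    """T_arr = integral((n_eff / c_ref) d ell)."""
    return trapezoid([n / c_ref for n in n_eff], ell)


# Dimensions as (length, time) exponents; an assumed, simplified stand-in check.
DIM_N_EFF = (0, 0)   # assumed dimensionless
DIM_ELL = (1, 0)     # length
DIM_C_REF = (1, -1)  # length / time


def dim_mul(a, b):
    return (a[0] + b[0], a[1] + b[1])


def dim_div(a, b):
    return (a[0] - b[0], a[1] - b[1])


dim_factored = dim_div(dim_mul(DIM_N_EFF, DIM_ELL), DIM_C_REF)
dim_inline = dim_mul(dim_div(DIM_N_EFF, DIM_C_REF), DIM_ELL)
assert dim_factored == dim_inline == (0, 1)  # both forms reduce to time

# Illustrative sample values, not taken from any dataset.
n_eff = [1.0, 1.1, 1.2]
ell = [0.0, 10.0, 20.0]
c_ref = 299_792_458.0
assert math.isclose(
    t_arr_factored(n_eff, ell, c_ref),
    t_arr_inline(n_eff, ell, c_ref),
    rel_tol=1e-12,
)
```

The two forms agree only up to floating-point rounding, so a tolerance-based comparison is used rather than exact equality.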

VIII. Quality Control & Representativeness


IX. Compliance & Permissions (if applicable)

For human/geosensitive data, record permits, de-identification strategies, and usage restrictions; align with privacy/ethics extensions and reflect in export_manifest.references[].

X. Example Fragment (drop-in)

provenance:
  collection_method: "survey-aggregation"
  instruments: [{name: "LOFAR", station: "DE601", role: "rx"}]
  time_coverage: "2019-01-01..2024-12-31"
  spatial_coverage: "RA[120..240],Dec[-30..+30]"
  selection_bias: "flux-limited, SNR>=7"
sampling:
  strategy: "stratified"
  strata:
    - {by: "class", buckets: {"FRB": 520, "RFI": 2100, "Noise": 12380}}
  rates: {train: 0.80, validation: 0.10, test: 0.10}
  seed: 1701
  replacement: false
  dedup_policy: "per-object"
  audits: ["coverage", "leakage", "class-imbalance"]

(When exporting, add to export_manifest.references[]: "EFT.WP.Core.DataSpec v1.0:EXPORT", "EFT.WP.Core.Metrology v1.0:check_dim", "EFT.WP.Core.Equations v1.1:S20-1".)
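As a quick sanity check on the example fragment, the per-split counts implied by its quotas and rates can be derived as follows. This is a sketch under stated assumptions: the helper split_counts is hypothetical, counts are rounded to whole records, and the last split takes the remainder so the per-class totals stay equal to the declared quotas.

```python
def split_counts(buckets, rates):
    """Per-class record counts for each split; the last split takes the remainder."""
    names = list(rates)
    counts = {}
    for label, quota in buckets.items():
        remaining = quota
        per_split = {}
        for name in names[:-1]:
            per_split[name] = round(quota * rates[name])
            remaining -= per_split[name]
        per_split[names[-1]] = remaining
        counts[label] = per_split
    return counts


buckets = {"FRB": 520, "RFI": 2100, "Noise": 12380}
rates = {"train": 0.80, "validation": 0.10, "test": 0.10}

counts = split_counts(buckets, rates)
totals = {name: sum(c[name] for c in counts.values()) for name in rates}
imbalance = max(buckets.values()) / min(buckets.values())

print(counts["FRB"])        # {'train': 416, 'validation': 52, 'test': 52}
print(totals)               # {'train': 12000, 'validation': 1500, 'test': 1500}
print(round(imbalance, 1))  # 23.8 (largest class quota / smallest class quota)
```

The imbalance ratio is one simple input to a class-imbalance audit; which metric the audit should report is not specified in this chapter.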


XI. Chapter Compliance Checklist


Copyright & License (CC BY 4.0)

Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/