EFT.WP.Data.Pipeline v1.0
Chapter 8 Feature Pipelines & Reuse
I. Chapter Purpose & Scope
This chapter fixes the feature pipeline specifications: feature extraction/aggregation/alignment, dictionary & embedding management, materialization & caching, cross-task/multi-modal reuse, versioning & dependency mapping. It also ensures consistency with data contracts, Model Card feature space & task I/O, the Metrology chapter, and citation anchors.
II. Terminology & Dependencies
- Terms: feature_space, dict_ref, embedding_store, materialize, cache, ttl, point-in-time (PIT alignment), surrogate_key.
- Dependencies: contracts/exports (Core.DataSpec v1.0); units/dimensions (Core.Metrology v1.0); training data & splits (DatasetCards v1.0); feature & I/O assumptions (ModelCards v1.0, Ch.6 & Ch.9).
- Math & symbols: wrap inline symbols (e.g., x_t, μ, σ, Δt, T_arr) in backticks; any division/integral/composite operator must use parentheses and—if path quantities are involved—declare gamma(ell) and d ell; no Chinese in formulas/symbols/definitions.
III. Fields & Structure (Normative)
stage:
  name: "<feat.map|feat.aggregate|feat.join|feat.encode|feat.embed|feat.materialize>"
  type: "feature.<op>"
  impl: "I16-4.<impl_id>"
  inputs: ["<Σ_in>"]
  outputs: ["<Σ_out>"]
  params:
    key: ["<entity_id>", "<ts?>"]
    point_in_time:
      enabled: true
      lookback: "P7D|P30D|N/A"
      tolerance: "PT5M"
    dict_ref: "dicts/<name>@vX.Y"
    embed:
      store: "faiss|annoy|milvus|custom"
      dim: 768
      metric: "cosine|l2"
      index_ref: "embeddings/<name>@vX.Y"
    aggregate:
      window: "PT1H|P1D"
      funcs: ["mean", "max", "count", "std"]
      fillna: {"method": "pad|zero|drop"}
    join:
      on: ["<entity_id>", "<ts?>"]
      how: "left|inner|asof"
    materialize:
      mode: "none|cache|persist"
      cache: {ttl: "P7D", max_gb: 128}
  idempotent: true
  schema_ref: "contracts/feat_<name>@vX.Y"
  feature_space:
    type: "<tabular|sequence|image|audio_spec|embedding>"
    shape: "<(…)>"
    dtype: "<float32|int32|...>"
    normalization: "<zscore|minmax|robust|unit-norm|none>"
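As a rough illustration of what the aggregate block (window, funcs) describes, the following sketch groups rows into tumbling windows using only the Python standard library. The function name tumbling_aggregate, the half-open window convention, the use of epoch seconds, and the choice of population standard deviation are assumptions made for this example; the chapter does not prescribe them.

```python
from collections import defaultdict
from statistics import mean, pstdev


def tumbling_aggregate(rows, window_s, funcs=("mean", "max", "count")):
    """Group (entity_id, ts, value) rows into fixed, non-overlapping windows.

    Windows are half-open: [start, start + window_s). `ts` is epoch seconds.
    """
    buckets = defaultdict(list)
    for entity_id, ts, value in rows:
        start = (ts // window_s) * window_s  # left edge of the window
        buckets[(entity_id, start)].append(value)

    ops = {
        "mean": mean,
        "max": max,
        "count": len,
        "std": pstdev,  # assumption: population std; the spec does not say which
    }
    out = {}
    for key in sorted(buckets):
        values = buckets[key]
        out[key] = {name: ops[name](values) for name in funcs}
    return out


# Hypothetical sample rows: (entity_id, ts, value)
rows = [("a", 0, 1.0), ("a", 1800, 3.0), ("a", 3600, 5.0), ("b", 10, 2.0)]
result = tumbling_aggregate(rows, window_s=3600)  # window "PT1H"
```

Here entity "a" yields two windows (starting at 0 and 3600) and entity "b" yields one. A fillna policy would be applied before or after this step depending on the pipeline; it is left out to keep the sketch short.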
IV. Feature Operators & Postures
- Mapping/Construction (feat.map): declare feature function/kernel & hyperparameters; emit feature_space with units/dimensions.
- Aggregation (feat.aggregate): sliding/tumbling windows, boundary inclusion, missing policy (fillna); list parameters for multi-window setups.
- Joins (feat.join|feat.asof): keys and temporal alignment; for asof, provide tolerance and direction (backward|forward).
- Encoding (feat.encode): dict_ref version hash, unk/pad policy, OOV handling, and sparse/dense representation.
- Embeddings (feat.embed): vector dimension/metric, index build params & index_ref; include latency/throughput in metrology.
- Materialization (feat.materialize): when mode:"cache|persist", declare store location, ttl, eviction & consistency (full/incremental/bypass).
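The as-of join posture above (tolerance plus direction) can be sketched as follows for the backward direction. This is a minimal standard-library illustration, not the I16-4 implementation; the function name asof_join_backward and the tuple layouts are hypothetical, and timestamps are assumed to be epoch seconds.

```python
from bisect import bisect_right


def asof_join_backward(left, right, tolerance_s):
    """For each left (entity_id, ts), attach the latest right value with
    right_ts <= ts and ts - right_ts <= tolerance_s; otherwise None."""
    by_entity = {}
    for entity_id, ts, value in sorted(right, key=lambda r: (r[0], r[1])):
        times, values = by_entity.setdefault(entity_id, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for entity_id, ts in left:
        times, values = by_entity.get(entity_id, ([], []))
        i = bisect_right(times, ts) - 1  # last index with times[i] <= ts
        if i >= 0 and ts - times[i] <= tolerance_s:
            joined.append((entity_id, ts, values[i]))
        else:
            joined.append((entity_id, ts, None))  # nothing within tolerance
    return joined


# Hypothetical sample data: right side carries feature values.
right = [("a", 100, 1.0), ("a", 200, 2.0), ("b", 50, 9.0)]
left = [("a", 250), ("a", 90), ("a", 700), ("b", 50)]
joined = asof_join_backward(left, right, tolerance_s=300)  # tolerance "PT5M"
```

Because only rows at or before the left timestamp are eligible, the backward direction never reads future values, which is why it is the usual choice for point-in-time features.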
V. Reuse & Dependency Mapping
- Cross-task reuse: define reusable sets via feature_view, record consumers[] and version constraints; set compat_mode:"forward|backward|both|break".
- Multi-modal reuse: maintain per-modality feature_view and feature_space; register artifacts separately in exports.
- Change impact: breaking changes require shadow comparison & dual-write window; threshold breaches trigger rollback or a backward-compat layer.
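One possible reading of the shadow comparison step is a mismatch rate between the current and candidate feature versions, checked against a threshold. The sketch below assumes scalar float features keyed by entity; the names shadow_mismatch_rate and decide, the tolerance, and the threshold value are all hypothetical and not defined by this chapter.

```python
import math


def shadow_mismatch_rate(current, candidate, rel_tol=1e-6):
    """Fraction of keys whose values differ between two feature versions.

    Keys missing from either side count as mismatches.
    """
    keys = set(current) | set(candidate)
    if not keys:
        return 0.0
    mismatches = 0
    for key in keys:
        if key not in current or key not in candidate:
            mismatches += 1
        elif not math.isclose(current[key], candidate[key], rel_tol=rel_tol):
            mismatches += 1
    return mismatches / len(keys)


def decide(rate, threshold):
    """Assumed policy: a rate above the threshold triggers a rollback."""
    return "rollback" if rate > threshold else "promote"


# Hypothetical values written by both versions during the dual-write window.
current = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
candidate = {"a": 1.0, "b": 2.5, "c": 3.0, "d": 4.0}
rate = shadow_mismatch_rate(current, candidate)
action = decide(rate, threshold=0.01)
```

In this example one of four keys differs, so the rate exceeds the assumed threshold and the sketch chooses rollback; a backward-compat layer would be the alternative response named above.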
VI. Consistency & Point-in-Time (PIT) Alignment
- PIT: historical replay and online inference use the same alignment rules; lookback/tolerance are identical and locked in configs.
- Split consistency: features emitted for training/evaluation must use Dataset Card frozen splits; leakage audits (object/timewindow) are blocking.
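The two leakage audits named above can be sketched as simple checks, under the assumption that "object" leakage means the same entity appearing in more than one split and "timewindow" leakage means a feature timestamp later than its label timestamp. Function names and record fields are hypothetical.

```python
def object_leakage(splits):
    """Return entity ids that appear in more than one split."""
    seen = {}
    leaked = set()
    for split_name, entity_ids in splits.items():
        for entity_id in entity_ids:
            # setdefault records the first split an entity was seen in.
            if seen.setdefault(entity_id, split_name) != split_name:
                leaked.add(entity_id)
    return leaked


def time_window_leakage(rows):
    """Return rows whose feature timestamp is later than the label timestamp."""
    return [r for r in rows if r["feature_ts"] > r["label_ts"]]


# Hypothetical frozen splits and feature rows.
splits = {"train": ["a", "b"], "test": ["b", "c"]}
rows = [
    {"entity_id": "a", "feature_ts": 100, "label_ts": 200},
    {"entity_id": "c", "feature_ts": 300, "label_ts": 250},
]
leaked_entities = object_leakage(splits)
leaked_rows = time_window_leakage(rows)
```

Since the audits are blocking, a non-empty result from either check would stop the export in a pipeline that follows this rule.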
VII. Dictionary & Embedding Management
- Dictionaries: dict_ref must carry version & hash; the admission policy for new categories is one of unknown|reject|map-to-other; provide stable alias keys for frequently changing vocabularies.
- Embeddings: version index_ref and record training-data time span; if vectors have physical meaning, declare units in feature_space and pass check_dim.
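A minimal sketch of dictionary hashing and the three admission policies follows. The canonical-JSON hashing scheme, the "<OTHER>" token, and the function names dict_hash and encode are assumptions for illustration; the chapter only requires that dict_ref carry a version and a hash.

```python
import hashlib
import json


def dict_hash(vocab):
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    payload = json.dumps(vocab, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def encode(token, vocab, policy="unknown", unk="<UNK>", other="<OTHER>"):
    """Map a token to its id, applying the admission policy to new categories."""
    if token in vocab:
        return vocab[token]
    if policy == "unknown":
        return vocab[unk]
    if policy == "map-to-other":
        return vocab[other]
    if policy == "reject":
        raise KeyError(f"token not admitted: {token!r}")
    raise ValueError(f"unsupported policy: {policy!r}")


# Hypothetical vocabulary; "<OTHER>" is an assumed bucket for map-to-other.
vocab = {"<PAD>": 0, "<UNK>": 1, "<OTHER>": 2, "books": 3, "music": 4}
known_id = encode("books", vocab)
unknown_id = encode("games", vocab)
other_id = encode("games", vocab, policy="map-to-other")
```

Hashing a canonical serialization means two copies of the same dictionary produce the same digest regardless of insertion order, which keeps the recorded hash stable across rebuilds.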
VIII. Metrology & Units (SI)
- Performance: QPS (1/s), T_inf (ms {p50,p95,p99}), ρ (—); bandwidth net_mbps; storage/index volume size_bytes.
- metrology:{units:"SI", check_dim:true} is mandatory; normalize units before composition/aggregation.
- For path-quantity features (e.g., T_arr), register delta_form, path="gamma(ell)", measure="d ell", use one of the equivalences below, and pass check_dim:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ).
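The two forms above agree when c_ref is constant along the path, because a constant factor can be moved inside or outside the integral. The sketch below checks this numerically with a trapezoid rule. The value used for c_ref, the sampled path, and the n_eff profile are made-up inputs for illustration only.

```python
import math


def trapezoid(ys, xs):
    """Composite trapezoid rule for samples ys taken at positions xs."""
    total = 0.0
    for i in range(1, len(xs)):
        total += 0.5 * (ys[i] + ys[i - 1]) * (xs[i] - xs[i - 1])
    return total


C_REF = 299_792_458.0  # m/s; assumed value for c_ref, not fixed by this chapter
ell = [i * 100.0 for i in range(11)]  # path samples along gamma(ell), in m
n_eff = [1.0 + 1e-4 * i for i in range(11)]  # dimensionless, made-up profile

# Form 1: T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
t_arr_outside = (1.0 / C_REF) * trapezoid(n_eff, ell)

# Form 2: T_arr = ( ∫ ( n_eff / c_ref ) d ell )
t_arr_inside = trapezoid([n / C_REF for n in n_eff], ell)

forms_agree = math.isclose(t_arr_outside, t_arr_inside, rel_tol=1e-12)
```

Dimensionally, n_eff is unitless, d ell carries metres, and c_ref carries metres per second, so both forms yield seconds, which is what a check_dim pass would need to confirm.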
IX. Machine-Readable Fragment (Drop-in)
layers:
  - name: "feature"
    stages:
      - name: "feat.map.stats"
        type: "feature.map"
        impl: "I16-4.feature_map"
        inputs: ["std_rows"]
        outputs: ["feat_rows"]
        params:
          key: ["entity_id", "ts"]
          point_in_time: {enabled: true, lookback: "P30D", tolerance: "PT5M"}
          aggregate: {window: "P1D", funcs: ["mean", "std", "count"], fillna: {method: "pad"}}
        idempotent: true
        schema_ref: "contracts/feat_stats@v1.1"
        feature_space: {type: "tabular", shape: "(N,D)", dtype: "float32", normalization: "zscore"}
      - name: "feat.encode.cat"
        type: "feature.encode"
        impl: "I16-4.encode"
        inputs: ["feat_rows"]
        outputs: ["feat_enc"]
        params:
          dict_ref: "dicts/category_voc@v2.0"
          encode: {vocab_ref: "dicts/category_voc@v2.0", unk: "<UNK>", pad: "<PAD>"}
        idempotent: true
        schema_ref: "contracts/feat_enc@v1.0"
      - name: "feat.materialize"
        type: "feature.materialize"
        impl: "I16-4.materialize"
        inputs: ["feat_enc"]
        outputs: ["feat_pkg"]
        params:
          materialize: {mode: "cache", cache: {ttl: "P7D", max_gb: 256}}
        idempotent: true
        schema_ref: "contracts/feat_pkg@v1.0"
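The lookback, tolerance, window, and ttl values in the fragment are ISO 8601 durations. A consumer needs to turn them into time spans before it can evaluate cache expiry, and the sketch below shows one way to do that for the day/hour/minute/second subset used in this chapter. The names parse_duration and is_expired are hypothetical, and rejecting anything outside that subset (including "N/A") is a choice made for this example.

```python
import re
from datetime import datetime, timedelta

_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Parse the day/hour/minute/second subset of ISO 8601 durations."""
    match = _DURATION.match(text)
    # Reject strings with no components, such as "P", "PT", or "P1DT".
    if not match or text in ("P", "PT") or text.endswith("T"):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    return timedelta(**parts)


def is_expired(written_at, now, ttl):
    """True once the entry has lived for at least the ttl duration."""
    return now - written_at >= parse_duration(ttl)


ttl = parse_duration("P7D")
tolerance = parse_duration("PT5M")
expired = is_expired(datetime(2025, 1, 1), datetime(2025, 1, 8), "P7D")
```

Note that day counts belong before the T designator, so a seven-day lookback is written "P7D" rather than "PT7D".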
X. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: FEAT.FS_REQUIRED
    when: "$.layers[*].stages[?(@.type^='feature.')]"
    assert: "has_key('feature_space')"
    level: error
  - id: FEAT.DICT_VERSIONED
    when: "$.layers[*].stages[?(@.type=='feature.encode')].params.dict_ref"
    assert: "matches('^dicts/[a-z0-9_\\-]+@v\\d+\\.\\d+$')"
    level: error
  - id: FEAT.PIT_PARAMS
    when: "$.layers[*].stages[*].params.point_in_time"
    assert: "value.enabled == true -> (has_key('lookback') and has_key('tolerance'))"
    level: error
  - id: FEAT.MATERIALIZE_POLICY
    when: "$.layers[*].stages[?(@.type=='feature.materialize')].params.materialize"
    assert: "value.mode in ['none','cache','persist']"
    level: error
  - id: FEAT.UNITS_CHECKDIM
    when: "$.pipeline.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
  - id: FEAT.LEAKAGE_GUARDS_FOR_TRAIN_EXPORT
    when: "$.layers[*].stages[*].outputs"
    assert: "produces_train_eval(outputs) -> has_leakage_guards()"
    level: error
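To make the intent of the first four rules concrete, the sketch below applies them to a single stage represented as a Python dict. It is not the normative lint engine: the function name lint_stage is hypothetical, the path expressions are hand-translated into plain dictionary lookups, and the two sample stages are invented for the example.

```python
import re

DICT_REF = re.compile(r"^dicts/[a-z0-9_\-]+@v\d+\.\d+$")
MODES = {"none", "cache", "persist"}


def lint_stage(stage):
    """Return the ids of the rules that one stage dict violates."""
    findings = []
    stage_type = stage.get("type", "")
    params = stage.get("params", {})

    # FEAT.FS_REQUIRED: every feature.* stage declares feature_space.
    if stage_type.startswith("feature.") and "feature_space" not in stage:
        findings.append("FEAT.FS_REQUIRED")

    # FEAT.DICT_VERSIONED: dict_ref on encode stages must look like dicts/<name>@vX.Y.
    if stage_type == "feature.encode" and "dict_ref" in params:
        if not DICT_REF.match(params["dict_ref"]):
            findings.append("FEAT.DICT_VERSIONED")

    # FEAT.PIT_PARAMS: enabled PIT needs both lookback and tolerance.
    pit = params.get("point_in_time")
    if pit is not None and pit.get("enabled") is True:
        if not {"lookback", "tolerance"}.issubset(pit):
            findings.append("FEAT.PIT_PARAMS")

    # FEAT.MATERIALIZE_POLICY: mode must be one of the allowed values.
    if stage_type == "feature.materialize" and "materialize" in params:
        if params["materialize"].get("mode") not in MODES:
            findings.append("FEAT.MATERIALIZE_POLICY")

    return findings


# Hypothetical stages used only to exercise the checks.
bad_stage = {
    "type": "feature.encode",
    "params": {
        "dict_ref": "dicts/Category@2",
        "point_in_time": {"enabled": True, "lookback": "P30D"},
    },
}
good_stage = {
    "type": "feature.encode",
    "params": {"dict_ref": "dicts/category_voc@v2.0"},
    "feature_space": {"type": "tabular"},
}
bad_findings = lint_stage(bad_stage)
good_findings = lint_stage(good_stage)
```

The bad stage trips three rules at once (missing feature_space, an unversioned dict_ref, and PIT without tolerance), while the good stage passes with no findings.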
XI. Export Manifest & Audit
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "features/feat_view.yaml", sha256: "..."}
    - {path: "features/dict_category_v2.hash", sha256: "..."}
    - {path: "features/feat_pkg.manifest.json", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.ModelCards v1.0:Ch.6"
    - "EFT.WP.Data.ModelCards v1.0:Ch.9"
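The sha256 fields in the manifest can be filled by hashing each artifact file. The sketch below uses only hashlib and pathlib; the function names sha256_of and build_artifacts are hypothetical, as is the assumption that artifact paths are relative to a single export root.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_artifacts(root, relative_paths):
    """Return manifest artifact entries for files located under `root`."""
    root = Path(root)
    return [
        {"path": rel, "sha256": sha256_of(root / rel)}
        for rel in relative_paths
    ]
```

A call such as build_artifacts("export", ["features/feat_view.yaml"]) would return one entry per file in the same path/sha256 shape as the manifest, ready to be serialized.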
XII. Chapter Compliance Checklist
- Feature operators type/impl/params complete; feature_space declared and aligned with Model Cards; PIT alignment reproducible.
- dict_ref/index_ref versioned & hashed; OOV/admission policies clear; materialization cache defines ttl and size cap.
- For training/evaluation outputs, splits match Dataset Card frozen splits; leakage guardrails active.
- Performance & units use SI with check_dim=true; if using path quantities T_arr, delta_form/path/measure registered & validated.
- export_manifest lists feature view/dictionaries/materialized package artifacts and anchors with sha256, satisfying release gates.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/