45-EFT.WP.Data.Pipeline v1.0 | Chapter 8 Feature Pipelines & Reuse


Chapter 8 Feature Pipelines & Reuse

I. Chapter Purpose & Scope

This chapter fixes the feature pipeline specifications: feature extraction, aggregation, and alignment; dictionary and embedding management; materialization and caching; cross-task and multi-modal reuse; and versioning and dependency mapping. It also ensures consistency with data contracts, the Model Card feature space and task I/O, the Metrology chapter, and citation anchors.

II. Terminology & Dependencies

Terms: feature_space, dict_ref, embedding_store, materialize, cache, ttl, point-in-time (PIT alignment), surrogate_key.
Dependencies: contracts/exports (Core.DataSpec v1.0); units/dimensions (Core.Metrology v1.0); training data & splits (DatasetCards v1.0); feature & I/O assumptions (ModelCards v1.0, Ch.6 & Ch.9).
Math & symbols: wrap inline symbols (e.g., x_t, μ, σ, Δt, T_arr) in backticks; any division/integral/composite operator must use parentheses and—if path quantities are involved—declare gamma(ell) and d ell; no Chinese in formulas/symbols/definitions.

III. Fields & Structure (Normative)

stage:
  type: "feature.<op>"
  impl: "I16-4.<impl_id>"
  inputs: ["<Σ_in>"]
  outputs: ["<Σ_out>"]
  params:
    key: ["<entity_id>", "<ts?>"]
    point_in_time:
      enabled: true
      lookback: "P7D|P30D|N/A"
      tolerance: "PT5M"
    dict_ref: "dicts/<name>@vX.Y"
    embed:
      store: "faiss|annoy|milvus|custom"
      dim: 768
      metric: "cosine|l2"
      index_ref: "embeddings/<name>@vX.Y"
    aggregate:
      window: "PT1H|P1D"
      funcs: ["mean", "max", "count", "std"]
      fillna: {"method": "pad|zero|drop"}
    join:
      on: ["<entity_id>", "<ts?>"]
      how: "left|inner|asof"
    materialize:
      mode: "none|cache|persist"
      cache: {ttl: "P7D", max_gb: 128}
  idempotent: true
  schema_ref: "contracts/feat_<name>@vX.Y"
  feature_space:
    type: "<tabular|sequence|image|audio_spec|embedding>"
    shape: "<(…)>"
    dtype: "<float32|int32|...>"
    normalization: "<zscore|minmax|robust|unit-norm|none>"

IV. Feature Operators & Postures

Mapping/Construction (feat.map): declare feature function/kernel & hyperparameters; emit feature_space with units/dimensions.
Aggregation (feat.aggregate): sliding/tumbling windows, boundary inclusion, missing policy (fillna); list parameters for multi-window setups.
Joins (feat.join|feat.asof): keys and temporal alignment; for asof, provide tolerance and direction (backward|forward).
Encoding (feat.encode): dict_ref version hash, unk/pad policy, OOV handling, and sparse/dense representation.
Embeddings (feat.embed): vector dimension/metric, index build params & index_ref; include latency/throughput in metrology.
Materialization (feat.materialize): when mode:"cache|persist", declare store location, ttl, eviction & consistency (full/incremental/bypass).
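
The aggregation posture above can be illustrated with a minimal, standard-library sketch of a tumbling window with the `mean`, `max`, `count`, and `std` functions and the `pad|zero|drop` missing policy. The function name `tumbling_aggregate`, the half-open boundary `[t, t + window)`, the use of the sample standard deviation, and the choice to report `count: 0` for padded windows are all assumptions made for illustration, not part of the specification.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev

def tumbling_aggregate(rows, window, start, end, fillna="pad"):
    """Aggregate (ts, value) rows into tumbling windows [t, t + window).

    Assumption: windows are left-inclusive and right-exclusive.
    fillna: "pad" repeats the last non-empty window's stats (with count 0),
    "zero" emits zeros, "drop" skips empty windows.
    """
    out = []
    prev = None  # stats of the last non-empty window
    t = start
    while t < end:
        vals = [v for ts, v in rows if t <= ts < t + window]
        if vals:
            stats = {
                "mean": mean(vals),
                "max": max(vals),
                "count": len(vals),
                # sample std is undefined for a single point; report 0.0
                "std": stdev(vals) if len(vals) > 1 else 0.0,
            }
            prev = stats
        elif fillna == "pad" and prev is not None:
            stats = dict(prev, count=0)
        elif fillna == "zero":
            stats = {"mean": 0.0, "max": 0.0, "count": 0, "std": 0.0}
        else:
            # "drop", or "pad" with nothing to carry forward yet
            t += window
            continue
        out.append((t, stats))
        t += window
    return out

rows = [
    (datetime(2025, 1, 1, 0, 10), 1.0),
    (datetime(2025, 1, 1, 0, 50), 3.0),
    (datetime(2025, 1, 1, 2, 5), 5.0),
]
result = tumbling_aggregate(
    rows, timedelta(hours=1), datetime(2025, 1, 1, 0), datetime(2025, 1, 1, 3)
)
```

In this example the 01:00 window has no rows, so with `fillna="pad"` it repeats the 00:00 window's statistics; with `fillna="drop"` it would be omitted. Whichever boundary convention a pipeline adopts should be declared in the stage parameters so replay and online paths agree.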

V. Reuse & Dependency Mapping

Cross-task reuse: define reusable sets via feature_view, record consumers[] and version constraints; set compat_mode:"forward|backward|both|break".
Multi-modal reuse: maintain per-modality feature_view and feature_space; register artifacts separately in exports.
Change impact: breaking changes require shadow comparison & dual-write window; threshold breaches trigger rollback or a backward-compat layer.

VI. Consistency & Point-in-Time (PIT) Alignment

PIT: historical replay and online inference use the same alignment rules; lookback/tolerance are identical and locked in configs.
Split consistency: features emitted for training/evaluation must use the frozen splits from the Dataset Card; leakage audits (by object and by time window) are blocking.
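
One way to read the PIT rule is as a backward as-of lookup: for each query time, take the latest feature row at or before that time, and reject it if it is older than `tolerance`. The sketch below assumes that reading. The function name `asof_backward` and the `PIT_CONFIG` dictionary are hypothetical, and `lookback` is not modeled here. The point is that historical replay and online inference call the same function with the same locked configuration.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Locked PIT settings shared by the replay and online paths.
# PT5M mirrors the tolerance used in this chapter's fragments.
PIT_CONFIG = {"tolerance": timedelta(minutes=5)}

def asof_backward(ts_sorted, values, t_query, tolerance):
    """Return the value of the latest row with ts <= t_query, or None.

    ts_sorted must be in ascending order. A match older than `tolerance`
    is rejected. Rows with ts > t_query are never visible, which is the
    leakage guard.
    """
    i = bisect_right(ts_sorted, t_query) - 1
    if i < 0:
        return None  # no row at or before the query time
    if t_query - ts_sorted[i] > tolerance:
        return None  # latest row is too stale
    return values[i]

ts = [
    datetime(2025, 1, 1, 12, 0),
    datetime(2025, 1, 1, 12, 4),
    datetime(2025, 1, 1, 12, 30),
]
vals = [0.1, 0.2, 0.9]

tol = PIT_CONFIG["tolerance"]
fresh = asof_backward(ts, vals, datetime(2025, 1, 1, 12, 6), tol)   # matches the 12:04 row
stale = asof_backward(ts, vals, datetime(2025, 1, 1, 12, 20), tol)  # 12:04 row is 16 min old
```

Because the 12:30 row lies in the future of both queries, it can never be returned, regardless of tolerance.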

VII. Dictionary & Embedding Management

Dictionaries: dict_ref must carry a version and a hash; declare an admission policy for new categories (unknown|reject|map-to-other); provide stable alias keys for frequently changing vocabularies.
Embeddings: version index_ref and record training-data time span; if vectors have physical meaning, declare units in feature_space and pass check_dim.
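
The dictionary rules above can be sketched as a content hash over the vocabulary plus an encoder that applies the admission policy. This is a minimal illustration under stated assumptions: the canonical form used for hashing (a compact JSON array of tokens in order), the function names `vocab_hash` and `encode`, and the `<OTHER>` bucket token are hypothetical. Only `<UNK>` and `<PAD>` appear in this chapter's fragments.

```python
import hashlib
import json

def vocab_hash(vocab):
    """SHA-256 over the ordered token list (assumed canonical form: compact JSON array)."""
    payload = json.dumps(vocab, ensure_ascii=False, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def encode(tokens, vocab, policy="unknown", unk="<UNK>", other="<OTHER>"):
    """Map tokens to integer ids, applying the admission policy to unseen tokens."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = []
    for tok in tokens:
        if tok in index:
            ids.append(index[tok])
        elif policy == "unknown":
            ids.append(index[unk])
        elif policy == "map-to-other":
            ids.append(index[other])
        elif policy == "reject":
            raise ValueError(f"token not admitted by dictionary: {tok!r}")
        else:
            raise ValueError(f"unsupported admission policy: {policy!r}")
    return ids

vocab = ["<PAD>", "<UNK>", "<OTHER>", "red", "green"]
digest = vocab_hash(vocab)
ids_unknown = encode(["red", "blue"], vocab, policy="unknown")
ids_other = encode(["red", "blue"], vocab, policy="map-to-other")
```

Hashing the ordered list means that reordering tokens changes the hash, which is desirable here because reordering changes the emitted ids.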

VIII. Metrology & Units (SI)

Performance: QPS (1/s), T_inf (ms {p50,p95,p99}), ρ (—); bandwidth net_mbps; storage/index volume size_bytes.
metrology:{units:"SI", check_dim:true} is mandatory; normalize units before composition or aggregation.
For path-quantity features (e.g., T_arr), register delta_form, path="gamma(ell)", measure="d ell", use one of the equivalences below, and pass check_dim:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ).
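
The two forms agree when `c_ref` is constant along the path, because a constant factor can be moved across the integral sign. The following numerical sketch checks this with a trapezoidal rule. The `n_eff` profile, the numeric value assigned to `C_REF`, the path length, and the helper `trapezoid` are assumptions chosen for illustration only. With `ell` in metres and `C_REF` in m/s, both results are in seconds.

```python
import math

C_REF = 299_792_458.0   # m/s; illustrative value, assumed constant along the path
PATH_LENGTH_M = 5000.0  # hypothetical path length in metres

def n_eff(ell):
    """Hypothetical dimensionless effective index along the path."""
    return 1.0 + 0.1 * math.sin(ell / 1000.0)

def trapezoid(f, a, b, n=10_000):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h

# Form 1: T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
t_factored = (1.0 / C_REF) * trapezoid(n_eff, 0.0, PATH_LENGTH_M)

# Form 2: T_arr = ( ∫ ( n_eff / c_ref ) d ell )
t_inside = trapezoid(lambda ell: n_eff(ell) / C_REF, 0.0, PATH_LENGTH_M)

same = math.isclose(t_factored, t_inside, rel_tol=1e-9)
```

If `c_ref` varied with `ell`, only the second form would remain valid, so a dimension check alone would not detect the misuse of the first form.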

IX. Machine-Readable Fragment (Drop-in)

layers:
  - name: "feature"
    stages:
      - name: "feat.map.stats"
        type: "feature.map"
        impl: "I16-4.feature_map"
        inputs: ["std_rows"]
        outputs: ["feat_rows"]
        params:
          key: ["entity_id", "ts"]
          point_in_time: {enabled: true, lookback: "P30D", tolerance: "PT5M"}
          aggregate: {window: "P1D", funcs: ["mean", "std", "count"], fillna: {method: "pad"}}
        idempotent: true
        schema_ref: "contracts/feat_stats@v1.1"
        feature_space: {type: "tabular", shape: "(N,D)", dtype: "float32", normalization: "zscore"}
      - name: "feat.encode.cat"
        type: "feature.encode"
        impl: "I16-4.encode"
        inputs: ["feat_rows"]
        outputs: ["feat_enc"]
        params:
          dict_ref: "dicts/category_voc@v2.0"
          encode: {vocab_ref: "dicts/category_voc@v2.0", unk: "<UNK>", pad: "<PAD>"}
        idempotent: true
        schema_ref: "contracts/feat_enc@v1.0"
      - name: "feat.materialize"
        type: "feature.materialize"
        impl: "I16-4.materialize"
        inputs: ["feat_enc"]
        outputs: ["feat_pkg"]
        params:
          materialize: {mode: "cache", cache: {ttl: "P7D", max_gb: 256}}
        idempotent: true
        schema_ref: "contracts/feat_pkg@v1.0"

X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: FEAT.FS_REQUIRED
    when: "$.layers[*].stages[?(@.type^='feature.')]"
    assert: "has_key('feature_space')"
    level: error
  - id: FEAT.DICT_VERSIONED
    when: "$.layers[*].stages[?(@.type=='feature.encode')].params.dict_ref"
    assert: "matches('^dicts/[a-z0-9_\\-]+@v\\d+\\.\\d+$')"
    level: error
  - id: FEAT.PIT_PARAMS
    when: "$.layers[*].stages[*].params.point_in_time"
    assert: "value.enabled == true -> (has_key('lookback') and has_key('tolerance'))"
    level: error
  - id: FEAT.MATERIALIZE_POLICY
    when: "$.layers[*].stages[?(@.type=='feature.materialize')].params.materialize"
    assert: "value.mode in ['none','cache','persist']"
    level: error
  - id: FEAT.UNITS_CHECKDIM
    when: "$.pipeline.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
  - id: FEAT.LEAKAGE_GUARDS_FOR_TRAIN_EXPORT
    when: "$.layers[*].stages[*].outputs"
    assert: "produces_train_eval(outputs) -> has_leakage_guards()"
    level: error
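
Three of the rules above (FEAT.DICT_VERSIONED, FEAT.PIT_PARAMS, FEAT.MATERIALIZE_POLICY) can be sketched as plain Python checks over an already-parsed pipeline dictionary. This is not the rule engine itself: the `lint` function, the hand-written traversal in place of the path selectors, and the sample pipeline with its stage names are all hypothetical. Only the regular expression and the allowed mode values are taken from the rule text.

```python
import re

# Pattern from FEAT.DICT_VERSIONED, with the YAML string escapes resolved.
DICT_REF_RE = re.compile(r"^dicts/[a-z0-9_\-]+@v\d+\.\d+$")

def lint(pipeline):
    """Return a list of (rule_id, stage_name) violations."""
    errors = []
    for layer in pipeline.get("layers", []):
        for stage in layer.get("stages", []):
            name = stage.get("name", "<unnamed>")
            params = stage.get("params", {})

            # FEAT.DICT_VERSIONED: applies only where dict_ref is present.
            if stage.get("type") == "feature.encode":
                ref = params.get("dict_ref")
                if ref is not None and not DICT_REF_RE.match(ref):
                    errors.append(("FEAT.DICT_VERSIONED", name))

            # FEAT.PIT_PARAMS: enabled implies lookback and tolerance.
            pit = params.get("point_in_time")
            if pit and pit.get("enabled") is True:
                if "lookback" not in pit or "tolerance" not in pit:
                    errors.append(("FEAT.PIT_PARAMS", name))

            # FEAT.MATERIALIZE_POLICY: mode must be one of the allowed values.
            mat = params.get("materialize")
            if stage.get("type") == "feature.materialize" and mat is not None:
                if mat.get("mode") not in ("none", "cache", "persist"):
                    errors.append(("FEAT.MATERIALIZE_POLICY", name))
    return errors

pipeline = {"layers": [{"name": "feature", "stages": [
    {"name": "ok", "type": "feature.encode",
     "params": {"dict_ref": "dicts/category_voc@v2.0"}},
    {"name": "bad_ref", "type": "feature.encode",
     "params": {"dict_ref": "dicts/Category@latest"}},
    {"name": "bad_pit", "type": "feature.map",
     "params": {"point_in_time": {"enabled": True, "lookback": "P30D"}}},
]}]}
violations = lint(pipeline)
```

Here `bad_ref` fails because of the uppercase letter and the non-numeric version, and `bad_pit` fails because `tolerance` is missing while `enabled` is true.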

XI. Export Manifest & Audit

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "features/feat_view.yaml", sha256: "..."}
    - {path: "features/dict_category_v2.hash", sha256: "..."}
    - {path: "features/feat_pkg.manifest.json", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.ModelCards v1.0:Ch.6"
    - "EFT.WP.Data.ModelCards v1.0:Ch.9"
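
The `sha256` entries in a manifest of this shape could be produced and audited along the following lines. This is a sketch using only the Python standard library; the helper names `sha256_file`, `build_artifacts`, and `verify_artifacts` are hypothetical, and the demo writes a throwaway file into a temporary directory rather than touching real artifacts.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks so large artifacts fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_artifacts(root, rel_paths):
    """Build the artifacts list: one {path, sha256} entry per relative path."""
    root = Path(root)
    return [{"path": p, "sha256": sha256_file(root / p)} for p in rel_paths]

def verify_artifacts(root, artifacts):
    """Return the paths whose current hash no longer matches the manifest."""
    root = Path(root)
    return [a["path"] for a in artifacts
            if sha256_file(root / a["path"]) != a["sha256"]]

with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "features" / "feat_view.yaml"
    target.parent.mkdir(parents=True)
    target.write_text("feature_view: demo\n", encoding="utf-8")

    artifacts = build_artifacts(root, ["features/feat_view.yaml"])
    clean = verify_artifacts(root, artifacts)    # nothing has changed yet

    target.write_text("feature_view: tampered\n", encoding="utf-8")
    drifted = verify_artifacts(root, artifacts)  # the edited file is reported
```

An audit step in a release gate could fail whenever `verify_artifacts` returns a non-empty list.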

XII. Chapter Compliance Checklist

Feature operators type/impl/params complete; feature_space declared and aligned with Model Cards; PIT alignment reproducible.
dict_ref/index_ref versioned & hashed; OOV/admission policies clear; materialization cache defines ttl and size cap.
For training/evaluation outputs, splits match Dataset Card frozen splits; leakage guardrails active.
Performance & units use SI with check_dim=true; if using path quantities T_arr, delta_form/path/measure registered & validated.
export_manifest lists feature view/dictionaries/materialized package artifacts and anchors with sha256, satisfying release gates.

Copyright & License: Unless otherwise stated, the copyright of “Energy Filament Theory” (including text, charts, illustrations, symbols, and formulas) is held by the author (屠广林).
License (CC BY 4.0): With attribution to the author and source, you may copy, repost, excerpt, adapt, and redistribute.
Attribution (recommended): Author: 屠广林｜Work: “Energy Filament Theory”｜Source: energyfilament.org｜License: CC BY 4.0
Version info: First published: 2025-11-11 ｜ Current version: v6.0+5.05