HomeDocs-Technical WhitePaper06-EFT.WP.Core.DataSpec v1.0

Chapter 6 — Partitioning, Indexing, and Query


I. Scope and Objects


II. Terminology and Objectives

  1. Objective: minimize scan volume and I/O amplification, maximize pruning and hit-rate, and guarantee determinism for pk and contract-bound queries.
  2. Terms:
    • K = [k1,k2,...,kd] (partition key order), B_i = card(partition on ki).
    • sel(P) = |{ r ∈ R : P(r) }| / N (selectivity).
    • C_proj (projected column set), alpha = |C_proj| / |Fields_total| (projection ratio).

III. Postulates (P66-*)


IV. Partitioning Strategies and Naming


V. Cost Model for Partition-Key Selection (S66-1)

  1. With top-to-depth d partition counts B = [B1,...,Bd] and predicate match fractions f = [f1,...,fd],
  2. Under approximate independence (declare approx independence):
    • E[parts(P)] approx prod_{i=1..d} ceil( fi * Bi ).
    • V_scan approx E[parts(P)] * avg_bytes_per_part * ( alpha / ratio_compress ).
    • L approx seek_cost * E[parts(P)] + V_scan / throughput_io.
  3. Key selection rule: over candidates, minimize E[parts] and L, while constraining each B_i not to exceed the target metadata overhead.

VI. Time-Series Partitioning


VII. Space and Path Partitioning

  1. Space:
    • gh = geohash(lon,lat,p) with tile angular scale tile_deg ≈ 180 / 2^p; choose p from target spatial selectivity and hotspot distribution.
    • Declare CRS and use float64 for lon/lat/alt; normalize across CRS before pushing predicates.
  2. Path gamma(ell):
    • Partition with K = [pid, ell_bucket], and align row groups to contiguous ell intervals;
    • For T_arr datasets, retain both formulations and the delta_form field, enabling segment-level aggregation.

VIII. Directory and File Naming (Aligned with manifest)


IX. Index Types and Storage Bindings

  1. Primary index:
    Clustered pk (ordered by pk or hash-bucketed), guaranteeing O(log N) or O(1) bucket location for equality lookups.
  2. Secondary indexes (idx_k):
    • B+tree: range queries and ordered scans (ts, ell, numeric intervals).
    • hash: high-cardinality equality (uid, rid).
    • bitmap: low-cardinality dims (sid, quality labels, enums); supports bitwise composite predicates.
    • inverted: text and multivalue tags (tags[]).
  3. Columnar auxiliaries:
    • Zone maps: [min_i,max_i]; prune if disjoint with predicate intervals.
    • Bloom filters: enable on hot columns; false-positive rate p_fp approx ( 1 - exp( -k * n_elem / m_bits ) )^k.
    • Dictionary & RLE: amplify bitmap and range-pruning effectiveness.

X. Composite Indexes and Prefix Rules (S66-2)


XI. Query Specification and Pushdown Order

  1. Predicate normal form: P = CNF( conj_i disj_{i,j} pred_{i,j} ); normalize field and constant units via unit(dim).
  2. Pushdown order:
    • Partition pruning (directory keys & manifest.stats);
    • Column segment pruning (zone maps & Bloom);
    • Secondary index lookups (idx_k hits);
    • Row filtering and applying m=1 mask;
    • Projection C_proj and aggregation.
  3. Quality filter convention: WHERE m = 1 AND q_score >= q_min.

XII. Canonical Query Templates

  1. pk equality: SELECT * WHERE pk = rid → idx(pk) locate → single-row return.
  2. Time window: SELECT C_proj WHERE ts ∈ [t0,t1) → partition prune date/hour → zone map → row filter.
  3. Spatial window: SELECT * WHERE gh IN {tiles} AND CRS = ref → tile-set pruning → secondary index or row filter.
  4. Path segment: SELECT C_proj WHERE pid = p AND ell ∈ [a,b) → prune with K=[pid,ell_bucket] → B+tree (pid,ell) scan.
  5. T_arr segment aggregation:
    • T_arr(factored) = ( 1 / c_ref ) * ( ∑ over rows n_eff * Δell );
    • T_arr(general) = ( ∑ over rows ( n_eff / c_ref ) * Δell );
    • delta_form = | T_arr(factored) - T_arr(general) | and assert delta_form ≤ tol_Tarr.

XIII. Consistency and Contract Checks


XIV. Workflow Mx-4 (Partitioning and Index Construction)


XV. Runtime Parameters and Baseline Suggestions


XVI. Implementation Bindings (Aligned with I60)


XVII. Manifest Extension Fields (Partitioning & Indexing)


XVIII. Change and Compatibility


Copyright & License (CC BY 4.0)

Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11|Current version:v5.1
License link:https://creativecommons.org/licenses/by/4.0/