Home / Docs-Technical WhitePaper / 16-EFT.WP.Methods.Cleaning v1.0
Chapter 9 Deduplication, Association, and Referential Integrity
One-Sentence Goal
Perform deduplication, entity association, and referential-integrity checks on SRef standard data, and output a duplicate-free, orphan-free, traceable, and auditable D_link together with manifest.relation.
I. Scope & Objects
- Processing targets
- Input: dataset D_clean.pre_link after Chapters 3–8, including key and index fields pk, rid, uid, sid, tid, pid, idx_k, business keys k_biz, foreign-key set FK = {fk_i}.
- Time & path: ts, tau_mono, ell; quality & uncertainty: q_score, u(x); arrival-time conventions: T_arr, delta_form.
- Target artifacts
Output: D_link = D_clean.pre_link ⊕ {equiv_map, fk_repair, assoc}, plus report_link and manifest.relation (policies, thresholds, and blast radius).
II. Terms & Variables (Memory Anchors)
- Records & keys
- Primary key: pk; business key: k_biz; candidate keys: K_cand = {k_1,...,k_m}; foreign key: fk : child → parent.pk.
- Blocking key: b(x) (rules for forming candidate pairs).
- Equivalence & representatives
- Similarity: sim(x_i, x_j) ∈ [0,1]; threshold: tau_sim ∈ (0,1].
- Equivalence class: E = {x | x ≡ x*}; representative: rep(E); merge operator: merge(E, policy).
- Equivalence map: equiv_map : pk_old → pk_new = rep(E).pk.
- Associations & referential links
- Association table: assoc(A,B) (many-to-many bridge); primary key pk_assoc = hash_sha256( A.pk || B.pk || role ).
- Orphan: orphan(child) = 1_{ ¬∃ parent : child.fk = parent.pk }.
III. Axioms (P109-*)
- P109-01 Uniqueness
unique(pk) must hold; after deduplication, unique(k_biz) must hold within its declared scope. - P109-02 No silent drops
All deduplication and referential repairs must be explicitly recorded in equiv_map and fk_repair; unrecorded deletions are forbidden. - P109-03 Reproducibility & determinism
Given the same input and policy, outputs of deduplicate and rekey_children are deterministic. - P109-04 Time-base & path conservation
Merges must not violate non_decreasing(ts) or non_decreasing(ell); arrival-time conventions remain consistent, delta_form is re-evaluated post-merge and thresholds are not relaxed. - P109-05 Dimension & unit conservation
Post-merge metric fields satisfy check_dim( y - f(x) ); units must not change implicitly.
IV. Minimal Equations (S109-*)
- S109-01 Candidate-pair generation (blocking)
pair_cand = { (i,j) | b(x_i) = b(x_j) ∧ |tau_mono_i - tau_mono_j| ≤ Delta_t } - S109-02 Weighted similarity
sim_w(i,j) = ( sum_f w_f * s_f(i,j) ) / ( sum_f w_f )
where:- String fields: s_str = Jaccard( tokens(i), tokens(j) ) or s_str = 1 - Hamming / len
- Numeric fields: s_num = max( 0 , 1 - |x_i - x_j| / tol_f )
- Categorical fields: s_cat = 1_{x_i = x_j}
- S109-03 Duplicate decision
dup(i,j) = ( sim_w(i,j) ≥ tau_sim ) ∧ key_compat(i,j)
key_compat requires no conflicts on K_cand (e.g., same source sid, same role). - S109-04 Equivalence classes & representatives
Obtain E_k via connected components on the duplicate graph; representative selection:
rep(E_k) = argmax_{x ∈ E_k} ( q_score(x) , -u(x) , recency(ts) ) (lexicographic). - S109-05 Field-merge policies
- Numeric: merge_num = mean_w( {x_t}, w_t = 1 / u(x_t)^2 ) or median (robust).
- Text: merge_str = longest_nonempty or tfidf_merge.
- Time: ts_min = min(ts_t); ts_max = max(ts_t); publish ts = ts_max.
- Quality: q_score' = max( q_score_t ) * penalty(conflict).
- S109-06 Foreign-key remapping
fk_new(child) = equiv_map( fk_old(child) ) (if fk_old was merged).
Referential assertion: ∀ child, ∃ parent : fk_new(child) = parent.pk. - S109-07 Orphan detection
orphan_rate = mean( 1_{ orphan(child)=true } ); orphan = ( fk_new ∉ Parent.pk ). - S109-08 Bridge-table deduplication
dup_assoc = 1_{ set( A.pk , B.pk , role ) 已出现 }; keep the earliest tau_mono or the highest q_score.
V. Cleaning Process (M10-9 Deduplication, Association, and Referential Integrity)
- Load constraints & policies
From DataSpec, load keyspaces, FK topology, and declared k_biz-uniqueness domains; set b(x), Delta_t, tau_sim, and merge.policy. - Generate candidates & scores
Form pair_cand using b(x) and Delta_t; compute sim_w and the dup(i,j) flags. - Build equivalence classes & pick representatives
Compute connected components {E_k} on the duplicate graph; select rep(E_k) per S109-04; form equiv_map. - Merge & field resolution
For each E_k, merge per S109-05; preserve tau_mono monotonicity; update q_score' and propagate uncertainties. - FK remap & orphan handling
policy = {drop|quarantine|link_placeholder}; if link_placeholder, create parent_stub with m=1 and down-weight q_score.Apply S109-06 across all children; compute orphans: - Bridge and many-to-many dedup
On assoc(A,B), deduplicate per S109-08; verify symmetry and forbid self-loops as required. - Contract checks & rollback
Enforce unique(pk), unique(k_biz), and assert_fk; on failure, roll back to prior snapshot or a degraded policy (tag-only, no publish). - Persistence & audit
Write manifest.relation = { b, Delta_t, tau_sim, merge.policy, equiv_map_digest, fk_repair_stats }; emit report_link, signature, and hash_sha256(blob).
VI. Contracts & Assertions
- Uniqueness
unique(pk) = true; within the declared domain, unique(k_biz) = true. - Referential integrity
fk_valid = mean( 1_{ fk_new ∈ Parent.pk } ) = 1; orphan_rate = 0 (or ≤ tol_orphan with quarantine). - Dedup consistency
∀ pk_old ∈ equiv_map, equiv_map(pk_old) ≠ pk_old and the mapping is acyclic. - Dimension conservation
check_dim( merged_fields ) = true. - Time & path conservation
non_decreasing(ts) and non_decreasing(ell) still hold; after recomputation, delta_form ≤ tol_Tarr. - Audit completeness
exists(manifest.relation) and it includes equiv_map_digest and fk_repair_stats.
VII. Implementation Binding (I10-9)
- Interface prototypes
- deduplicate(ds, keys, semantics) -> ds', equiv_map, report
- resolve_conflicts(ds, equiv_map, policy) -> ds''
- rekey_children(ds_child, equiv_map, fk) -> ds_child'
- assert_fk(ds_parent, ds_child, fk) -> report_fk
- drop_orphan(ds_child, fk, policy) -> ds_child', orphan_report
- dedup_assoc(assoc, rule) -> assoc'
- Preconditions
Chapter 3 schema binding complete; Chapter 4 dimension consistency; Chapter 5 time-base alignment; Chapters 7–8 explicit missingness and anomaly tagging. - Invariants & postconditions
Do not change the type or semantics of pk; equiv_map and fk_repair are replayable; child tables preserve ts/ell monotonicity. - Failure semantics
E_FK_CYCLE (FK rewrite introduces a cycle), E_UNIQUE_BREACH, E_POLICY_CONFLICT, E_PLACEHOLDER_FORBIDDEN.
VIII. Cross-References
- Schema & keyspace: Chapter 3.
- Unit and dimension conservation: Chapter 4.
- Time-base and arrival-time checks: Chapters 5–6.
- Down-weighted merges for missingness and anomalies: Chapters 7–8.
- Release gate & contracts: Chapter 10.
- Streaming idempotence & back-pressure: Chapter 11.
IX. Quality Metrics & Risk Control
- Indicators
- dup_rate = 1 - |unique(pk_before)| / |pk_before|
- biz_dup_rate = violation_rate(k_biz)
- orphan_rate, fk_repair_share
- mapping_entropy (distribution of equivalence-class sizes)
- record_churn = |changed_rows| / |rows|
- P99(rekey_latency) and idempotency_fail_rate
- Alert guardrails
- If dup_rate > tol_dup → tighten tau_sim or strengthen b(x)
- If orphan_rate > 0 → escalate assert_fk alarms and roll back
- If record_churn > tol_churn → policy audit and human review
X. Boundaries & Special Cases
- Multi-keys for the same entity across sources
Use composite key pk_src = hash_sha256( source || pk_native ), then deduplicate at the aggregation tier into a global pk. - Late arrivals & reordering
late_record triggers incremental dedup: update equiv_map only within sliding window W_inc and version the map. - Temporal foreign keys
If an FK has validity interval [t_from, t_to), the referential assertion is
fk_valid = 1_{ ∃ parent : fk=parent.pk ∧ ts ∈ [t_from,t_to) }. - Legitimate duplicates
For k_biz domains declared as allowing duplicates, tag only: tag = legit_duplicate.
XI. Audit & Panel Fields
- dup_rate, biz_dup_rate, equiv_classes_{p50,p95,max}
- fk_violation_count, orphan_rate, fk_repair_share
- record_churn, rekey_latency_p99, idempotency_fail_rate
- equiv_map_digest, policy_version, signature, hash_sha256(blob)
Summary
This chapter establishes a unified framework for deduplication, association, and referential integrity: use blocking and similarity to construct equivalence classes; select representatives by quality and uncertainty; perform reproducible merges; deterministically remap foreign keys and handle orphans; finally enforce contracts and panel metrics to guarantee uniqueness, referential completeness, and auditable replayability—providing compliant inputs for Chapter 10’s release freeze.
Copyright & License (CC BY 4.0)
Copyright: Unless otherwise noted, the copyright of “Energy Filament Theory” (text, charts, illustrations, symbols, and formulas) belongs to the author “Guanglin Tu”.
License: This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, excerpt, adapt, and share for commercial or non‑commercial purposes with proper attribution.
Suggested attribution: Author: “Guanglin Tu”; Work: “Energy Filament Theory”; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11|Current version:v5.1
License link:https://creativecommons.org/licenses/by/4.0/