Data Engineering | Minimalist Innovation LLC

Bar chart comparing energy consumption: deterministic PPRL pipeline at 11.11 Joules for 10,000 records versus a single Language Model query at 1.55 Joules versus Neural network pairwise matching at 15,480 Joules for the same workload. The generative approach uses about 1,400 times more energy.

Sustainable Entity Resolution: Profiling and Energy Measurement

This fourth post in the entity resolution series closes on sustainability. A deterministic PPRL pipeline using NFKC normalization, HMAC-SHA256, and Jaccard similarity over 1,000-bit Bloom filters measures at 0.000147 gCO₂e per 1,000 records matched—1,400× less carbon than a single LLM query. Profiled with Criterion.rs and Intel RAPL. Rare earth mining, e-waste, and Bornean orangutans make this more than an engineering argument. Write code as if paying for it in joules.
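The three building blocks named above (NFKC normalization, HMAC-SHA256, and Jaccard similarity over 1,000-bit Bloom filters) can be sketched in a few lines of Python. This is a minimal illustration, not the post's implementation: the q-gram size, the number of hash positions per q-gram, the padding character, and the key are all assumptions made for the example.

```python
import hashlib
import hmac
import unicodedata

BLOOM_BITS = 1000               # filter length taken from the summary above
NUM_HASHES = 4                  # assumption: hash positions per q-gram
SECRET_KEY = b"shared-secret"   # assumption: key agreed between the parties


def qgrams(text: str, q: int = 2) -> set[str]:
    """NFKC-normalize, lowercase, pad, and split into q-grams."""
    norm = unicodedata.normalize("NFKC", text).lower()
    padded = f"_{norm}_"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def encode(text: str) -> set[int]:
    """Return the Bloom filter as the set of bit positions that are set."""
    bits: set[int] = set()
    for gram in qgrams(text):
        for k in range(NUM_HASHES):
            message = f"{k}:{gram}".encode("utf-8")
            digest = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % BLOOM_BITS)
    return bits


def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity of two filters: |A and B| / |A or B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# The same name in composed and decomposed form encodes identically
# because normalization happens before hashing.
same = jaccard(encode("G\u00e9ron"), encode("Ge\u0301ron"))
```

Every step here is a fixed amount of hashing and set arithmetic per record, which is the kind of deterministic workload that can be profiled and metered directly.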

Gandhinath Swaminathan

Mar 27 · 8 min read

An infographic showing a secure data flow from three sources: 'Hospital: Disease' (blue), 'Pharmacy: Medication' (green), and 'Lab: Test Results' (purple). The streams of binary data and points, linked to 'Name and DOB' with padlocks, converge into a central 'Honest Broker' server rack. A unified data stream then flows from the server to a 'Unified Analysis' digital dashboard displaying charts and graphs, viewed by a glowing human silhouette on the right.

Privacy-Preserving Record Linkage: Cryptography, Unicode, and Matching In the Dark

Privacy-Preserving Record Linkage (PPRL) lets two organizations determine which records refer to the same person — without either party seeing the other's data. Healthcare networks, judicial agencies, and government registries link records using cryptographic Bloom filters and HMAC-keyed hashing. But if strings are not Unicode-normalized before encoding, the hash diverges and matches fail silently. This post shows the full pipeline: q-grams, Bloom filters, CLKs, and normaliza

Gandhinath Swaminathan

Mar 24 · 9 min read

Screenshot of a bookstore website search for "aurellion guereon scikit". The search returns four versions of the book "Hands-On Machine Learning". Beneath the covers, the author's name, Aurélien Géron, is listed with different text formatting issues across the results, including "Aurelien Geron" (missing accents) and "G?ron Aur?lien" (showing question marks instead of accented characters)

The hidden complexity of text: A close look at Unicode normalization for entity resolution

Entity resolution runs on joins, comparisons, and blocking keys that assume “same-looking text” means “same bytes.” That assumption is wrong once you move outside ASCII. To solve this encoding chaos, data architectures must standardize text at the lowest possible layer: the Unicode byte structure. A normalization strategy that's too weak misses matches; one that's too aggressive destroys signal and creates false positives.
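A short Python sketch of the "too weak" versus "too aggressive" trade-off, using only the standard `unicodedata` module. The `strip_accents` helper is a hypothetical name introduced here to illustrate over-normalization; it is not taken from the post.

```python
import unicodedata

composed = "Aur\u00e9lien G\u00e9ron"       # é stored as one code point
decomposed = "Aure\u0301lien Ge\u0301ron"   # e followed by a combining acute accent

# Same-looking text, different code points: a raw equality join misses this pair.
raw_equal = composed == decomposed

# NFC composes both spellings to the same sequence, so the join succeeds.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# NFC is too weak for compatibility characters such as the "fi" ligature;
# NFKC folds them to their plain equivalents.
ligature = "\ufb01le"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)


def strip_accents(text: str) -> str:
    """Hypothetical aggressive normalizer: drop every combining mark."""
    split = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in split if not unicodedata.combining(ch))


# Too aggressive: two distinct words collapse into one key.
collision = strip_accents("r\u00e9sum\u00e9") == strip_accents("resume")
```

Here `raw_equal` is false and `nfc_equal` is true, while `collision` being true shows how accent stripping can manufacture a false positive.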

Gandhinath Swaminathan

Mar 13 · 11 min read

Wide warehouse aisle with a wrapped pallet in the center and three glowing tags above it showing “ORG HUMMUS” lots with different expiration dates (May 20, June 5, and July 10).

The Invisible Wall Blocking Your FEFO (First Expired, First Out) Strategy

FEFO (First Expired, First Out) only works when your systems recognize one product as one product. In CPG networks, the same SKU can arrive from plants, co-packers, and distributors under different item codes—creating phantom inventory, missed rotation, and expired stock. The result is predictable: retailer rejections, chargebacks, higher freight, and write-offs. Stabilize product identity with governed data and high-volume entity resolution.

Gandhinath Swaminathan

Feb 10 · 5 min read

Orchestration of Identity: Turning Algorithms into a Well-tuned Arrangement

Your data thinks Coca-Cola Zero Sugar 12oz and Coke Zero 12 Pack are different products. Healthcare systems can't tell if two patient records refer to the same person. Banks miss money laundering patterns hidden in ownership networks. The algorithms exist—BM25, HNSW, SPLADE, Graph Transformers—but knowing when to use them is the hard part. This framework shows you how to sequence matching algorithms into production entity resolution systems tailored to your domain's risk prof

Gandhinath Swaminathan

Jan 26 · 8 min read

Why Probabilistic Record Linkage Still Matters

Probabilistic record linkage still matters because identity data is messy and match decisions carry real financial and compliance risk. This article explains the intuition behind Fellegi–Sunter and Bayesian record linkage, shows how they control false merges and splits across noisy customer and product records, and points to modern tools and books that help you put these ideas into practice.
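The core of the Fellegi–Sunter intuition fits in a few lines: each field contributes a positive weight when it agrees and a negative weight when it disagrees, based on how often that field agrees among true matches (m) versus non-matches (u). The sketch below uses made-up m and u values and hypothetical field names purely for illustration.

```python
import math

# Hypothetical per-field probabilities, for illustration only:
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
FIELDS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "postcode": (0.90, 0.02),
}


def match_weight(agreements: dict[str, bool]) -> float:
    """Sum the log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


all_agree = match_weight({"surname": True, "birth_year": True, "postcode": True})
one_miss = match_weight({"surname": True, "birth_year": True, "postcode": False})
none_agree = match_weight({"surname": False, "birth_year": False, "postcode": False})
```

Comparing the total against an upper and a lower threshold yields the three classic outcomes: link, send to clerical review, or non-link. Where those thresholds sit is how false merges and false splits are traded off.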

Gandhinath Swaminathan

Jan 22 · 5 min read

Heterogeneous knowledge graph diagram showing product entity resolution with typed nodes (mentions, organizations, models, attributes) connected by colored relationship edges (madeby, hasmodel, hasattr). Multiple convergent paths highlighted between two mentions, illustrating multi-hop reasoning for entity matching.

Heterogeneous Knowledge Graphs: Multi-Hop Reasoning Beyond Pairwise Matching

Pairwise matching treats each comparison as a one-off. A persistent knowledge graph turns product mentions, manufacturers, model numbers, attributes, and price bins into typed nodes and relations. Matching becomes neighborhood comparison: multi-hop paths (convergent evidence) can beat any single similarity score.

Gandhinath Swaminathan

Jan 22 · 7 min read

Feature illustration of a Sony PS‑LX350H turntable with SPLADE token weights on the left and a token‑to‑token attention graph on the right, showing sparse retrieval turning into an entity-resolution decision.

From Inverted Index to Attention Graph: Turning SPLADE Tokens Into ER Decisions

False entity merges don’t just dirty data. They distort inventory, pricing, and forecasts, then every model and report built on top. Learned sparse retrieval improves recall, but it can still treat records like unordered tokens. This post adds token-to-token attention as a structural check so near-duplicates pass and lookalikes fail, with a trail you can audit.

Gandhinath Swaminathan

Jan 21 · 3 min read

Warehouse worker scanning package labels with a handheld barcode scanner.

When “Almost” Isn’t Good Enough: Why Top Engineers Still Rely On BM25

BM25 looks old on paper, but it still decides which records are worth comparing when identifiers can’t afford to be “almost” right. This post walks through the TF‑IDF roots of BM25, how k1 and b shape the scoring curve, and why Lucene, Elasticsearch, and OpenSearch still rely on it. You’ll see how term statistics, not embeddings, keep product codes, SKUs, and customer records anchored during entity resolution.
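For a concrete feel of how k1 and b enter the score, here is a minimal textbook BM25 in Python. It is a sketch under stated assumptions: whitespace tokenization, the commonly cited defaults k1 = 1.2 and b = 0.75, and a smoothed IDF of the form ln(1 + (N - n + 0.5) / (n + 0.5)). Production engines differ in details, and the sample records are invented.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 caps term-frequency saturation; b scales length normalization.
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


records = [
    "sku ab-1234 widget",
    "sku ab-1235 widget",
    "blue widget large pack",
]
scores = bm25_scores("ab-1234", records)
```

Only the first record scores above zero: the near-identical code `ab-1235` shares no term with the query, which is the exact-match behavior the post credits for keeping identifiers anchored.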

Gandhinath Swaminathan

Jan 8 · 5 min read

Domain Modeling for Agentic AI: Customer 360 as a Semantic Problem
