Entity Resolution | Minimalist Innovation LLC

Bar chart comparing energy consumption: a deterministic PPRL pipeline at 11.11 joules for 10,000 records, a single language model query at 1.55 joules, and neural network pairwise matching at 15,480 joules for the same workload. The generative approach uses about 1,400 times more energy.

Sustainable Entity Resolution: Profiling and Energy Measurement

This fourth post in the entity resolution series closes on sustainability. A deterministic PPRL pipeline using NFKC normalization, HMAC-SHA256, and Jaccard similarity over 1,000-bit Bloom filters measures at 0.000147 gCO₂e per 1,000 records matched—1,400× less carbon than a single LLM query. Profiled with Criterion.rs and Intel RAPL. Rare earth mining, e-waste, and Bornean orangutans make this more than an engineering argument. Write code as if paying for it in joules.

Gandhinath Swaminathan

Mar 27 · 8 min read

Futuristic digital illustration. At the center, a stylized globe of Earth is formed by interconnected data nodes and streams. The text "ICU" and "I18N" in glowing neon letters is prominently displayed in the center. Below the "I18N" text is the bold main title: "UNLOCKING GLOBAL MARKETS." Below that, a smaller subtitle reads: "The Essential Guide to International Components for Unicode & Internationalization." A larger glowing ring of multicolored text containing diverse world scripts (such as Arabic, Chinese, Japanese, and Cyrillic) and the Unicode symbol encircles the globe. The background features faint outlines of a world map and server racks in a data center, all in a blue, purple, and orange color palette.

The infrastructure behind global text: I18N, ICU, and why Rust does it differently

Encoding fragmentation breaks entity resolution pipelines. A sixty-year-old Unicode anomaly fractures entity graphs and drops database joins. We examine the root cause of these byte-level mismatches. We detail the mechanics of internationalization (i18n) engineering, the evolution of the ICU library family, and how Rust's ICU4X datagen architecture handles multilingual text with zero-copy deserialization and compiler-enforced memory safety.

Gandhinath Swaminathan

Mar 25 · 7 min read

An infographic showing a secure data flow from three sources: 'Hospital: Disease' (blue), 'Pharmacy: Medication' (green), and 'Lab: Test Results' (purple). The streams of binary data and points, linked to 'Name and DOB' with padlocks, converge into a central 'Honest Broker' server rack. A unified data stream then flows from the server to a 'Unified Analysis' digital dashboard displaying charts and graphs, viewed by a glowing human silhouette on the right.

Privacy-Preserving Record Linkage: Cryptography, Unicode, and Matching In the Dark

Privacy-Preserving Record Linkage (PPRL) lets two organizations determine which records refer to the same person — without either party seeing the other's data. Healthcare networks, judicial agencies, and government registries link records using cryptographic Bloom filters and HMAC-keyed hashing. But if strings are not Unicode-normalized before encoding, the hash diverges and matches fail silently. This post shows the full pipeline: q-grams, Bloom filters, CLKs, and normaliza

Gandhinath Swaminathan

Mar 24 · 9 min read
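The excerpt above names the moving parts of a PPRL encoding: q-grams, HMAC-keyed hashing, Bloom filters, and Unicode normalization before encoding. A minimal Python sketch of how they might fit together follows. It is not the post's implementation: the number of hash functions, the bigram padding character, the case folding step, and the function names are all assumptions made for illustration. Only NFKC, HMAC-SHA256, Jaccard similarity, and the 1,000-bit filter length come from the series summaries on this page.

```python
import hashlib
import hmac
import unicodedata

FILTER_BITS = 1000  # filter length mentioned in the series summary
NUM_HASHES = 4      # assumption: hash functions per q-gram (illustrative)


def qgrams(text, q=2):
    """Normalize to NFKC, case-fold, pad, and split into a set of q-grams."""
    norm = unicodedata.normalize("NFKC", text).casefold()
    padded = f"_{norm}_"  # assumption: underscore padding marks word edges
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def encode(text, key):
    """Return the set of Bloom filter bit positions set for `text`."""
    bits = set()
    for gram in qgrams(text):
        for i in range(NUM_HASHES):
            msg = f"{i}:{gram}".encode("utf-8")
            digest = hmac.new(key, msg, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return bits


def jaccard(a, b):
    """Jaccard similarity of two sets of bit positions."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


key = b"shared-secret"                # hypothetical key shared by both parties
composed = encode("G\u00e9ron", key)  # é as a single code point
decomposed = encode("Ge\u0301ron", key)  # e followed by a combining accent
print(jaccard(composed, decomposed))  # 1.0, because NFKC unifies both forms
```

Removing the `unicodedata.normalize` call makes the two spellings hash to different q-grams, which is the silent mismatch the excerpt describes.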

Screenshot of a bookstore website search for "aurellion guereon scikit". The search returns four versions of the book "Hands-On Machine Learning". Beneath the covers, the author's name, Aurélien Géron, is listed with different text formatting issues across the results, including "Aurelien Geron" (missing accents) and "G?ron Aur?lien" (showing question marks instead of accented characters).

The hidden complexity of text: A close look at Unicode normalization for entity resolution

Entity resolution runs on joins, comparisons, and blocking keys that assume “same-looking text” means “same bytes.” That assumption is wrong once you move outside ASCII. To solve this encoding chaos, data architectures must standardize text at the lowest possible layer: the Unicode byte structure. A normalization strategy that's too weak misses matches; one that's too aggressive destroys signal and creates false positives.

Gandhinath Swaminathan

Mar 13 · 11 min read
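The trade-off in the excerpt above, between normalization that is too weak and normalization that is too aggressive, can be seen with Python's standard `unicodedata` module. The sketch below is illustrative only: the helper names are hypothetical, and accent stripping stands in for "too aggressive" as an assumption about what the post means.

```python
import unicodedata

composed = "Aur\u00e9lien G\u00e9ron"      # é stored as one code point (U+00E9)
decomposed = "Aure\u0301lien Ge\u0301ron"  # e plus combining acute (U+0301)


def nfc(text):
    """Canonical composition: same-looking text becomes the same code points."""
    return unicodedata.normalize("NFC", text)


def strip_accents(text):
    """Aggressive option: decompose, then drop every combining mark."""
    parts = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in parts if not unicodedata.combining(ch))


# No normalization: the strings render identically but differ byte for byte.
print(composed == decomposed)            # False
# NFC: both spellings now compare equal, so a join on this key succeeds.
print(nfc(composed) == nfc(decomposed))  # True
# Accent stripping also matches them, but it merges distinct words as well.
print(strip_accents("r\u00e9sum\u00e9") == strip_accents("resume"))  # True
```

The last line is the false-positive risk: once accents are removed, "résumé" and "resume" can no longer be told apart.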

Enterprise data architecture diagram illustrating the distinct operational layers of Entity Resolution, Identity Resolution, and Identity Management transforming fragmented data into a unified golden record for agentic workflows.

The Strategic Framework for Modern Identity: Decoding ER, IR, and IM for the Enterprise

Accelerating toward agentic workflows unlocks unprecedented enterprise speed—but demands a flawless foundation of identity correctness. To engineer an architecture where automation acts with absolute confidence, technical leaders must master the precise distinctions between Entity Resolution, Identity Resolution, and Identity Management. Discover the definitive taxonomic framework to transform ambiguous data into proactive, self-sustaining identity intelligence for your compo

Gandhinath Swaminathan

Mar 7 · 7 min read

Wide warehouse aisle with a wrapped pallet in the center and three glowing tags above it showing “ORG HUMMUS” lots with different expiration dates (May 20, June 5, and July 10).

The Invisible Wall Blocking Your FEFO (First Expired, First Out) Strategy

FEFO (First Expired, First Out) only works when your systems recognize one product as one product. In CPG networks, the same SKU can arrive from plants, co-packers, and distributors under different item codes—creating phantom inventory, missed rotation, and expired stock. The result is predictable: retailer rejections, chargebacks, higher freight, and write-offs. Stabilize product identity with governed data and high-volume entity resolution.

Gandhinath Swaminathan

Feb 10 · 5 min read

Illustration comparing a neat prototype entity resolution model with a complex, messy production data graph.

Benchmarking & Datasets for Entity Resolution

A practical guide to benchmarking entity resolution (ER) systems. It covers commonly used public datasets, explains which evaluation metrics are informative in ER (and why accuracy can mislead), and outlines how to design a domain-specific test set so results are meaningful for production decisions.

Gandhinath Swaminathan

Jan 26 · 8 min read

Orchestration of Identity: Turning Algorithms into a Well-tuned Arrangement

Your data thinks Coca-Cola Zero Sugar 12oz and Coke Zero 12 Pack are different products. Healthcare systems can't tell if two patient records refer to the same person. Banks miss money laundering patterns hidden in ownership networks. The algorithms exist—BM25, HNSW, SPLADE, Graph Transformers—but knowing when to use them is the hard part. This framework shows you how to sequence matching algorithms into production entity resolution systems tailored to your domain's risk prof

Gandhinath Swaminathan

Jan 26 · 8 min read

Why Probabilistic Record Linkage Still Matters

Probabilistic record linkage still matters because identity data is messy and match decisions carry real financial and compliance risk. This article explains the intuition behind Fellegi–Sunter and Bayesian record linkage, shows how they control false merges and splits across noisy customer and product records, and points to modern tools and books that help you put these ideas into practice.

Gandhinath Swaminathan

Jan 22 · 5 min read

Heterogeneous knowledge graph diagram showing product entity resolution with typed nodes (mentions, organizations, models, attributes) connected by colored relationship edges (madeby, hasmodel, hasattr). Multiple convergent paths highlighted between two mentions, illustrating multi-hop reasoning for entity matching.

Heterogeneous Knowledge Graphs: Multi-Hop Reasoning Beyond Pairwise Matching

Pairwise matching treats each comparison as a one-off. A persistent knowledge graph turns product mentions, manufacturers, model numbers, attributes, and price bins into typed nodes and relations. Matching becomes neighborhood comparison: multi-hop paths (convergent evidence) can beat any single similarity score.

Gandhinath Swaminathan

Jan 22 · 7 min read

Feature illustration of a Sony PS‑LX350H turntable with SPLADE token weights on the left and a token‑to‑token attention graph on the right, showing sparse retrieval turning into an entity-resolution decision.

From Inverted Index to Attention Graph: Turning SPLADE Tokens Into ER Decisions

False entity merges don’t just dirty data. They distort inventory, pricing, and forecasts, then every model and report built on top. Learned sparse retrieval improves recall, but it can still treat records like unordered tokens. This post adds token-to-token attention as a structural check so near-duplicates pass and lookalikes fail, with a trail you can audit.

Gandhinath Swaminathan

Jan 21 · 3 min read

The Best of Both Worlds: Learned Sparse Retrieval (SPLADE) For Entity Resolution

Entity resolution breaks when exact matching is too brittle and dense vectors blur identities. This post introduces SPLADE, a learned sparse retrieval model that keeps inverted indexes and token-level explainability while adding transformer-powered expansion and reweighting. We walk through where SPLADE beats BM25 and dense search, where it can fail on SKUs and over-expansion, and how to run it in Postgres/ParadeDB for large-scale product, customer, or patient identity.

Gandhinath Swaminathan

Jan 21 · 10 min read

Abstract visualization showing geometric lexical data patterns merging with flowing semantic vector networks, with particles fusing at the convergence point, representing hybrid search combining BM25 and vector similarity.

Hybrid Search and Reciprocal Rank Fusion: Building the Bridge Between Lexical and Semantic

Entity resolution struggles when systems must choose between the rigid precision of BM25 and the fuzzy flexibility of Vector Search. Part 4 reveals why simple linear weighting fails and introduces Reciprocal Rank Fusion (RRF) as the superior alternative. We explore the architectural shift to Hybrid Search, demonstrating how to merge rank positions rather than raw scores using Spring Boot and ParadeDB.

Gandhinath Swaminathan

Jan 14 · 7 min read
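Merging rank positions rather than raw scores, as the excerpt above puts it, fits in a few lines. The post itself works in Spring Boot and ParadeDB; the Python sketch below is a language-neutral illustration, not the post's code. The function name and sample record IDs are hypothetical, and `k=60` is the constant commonly used with RRF, assumed here rather than taken from the post.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["sku-1", "sku-2", "sku-3"]    # hypothetical lexical ranking
vector_hits = ["sku-2", "sku-4", "sku-1"]  # hypothetical semantic ranking

print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['sku-2', 'sku-1', 'sku-4', 'sku-3']
```

Because only ranks enter the formula, BM25 scores and vector distances never need to be rescaled onto a common range, which is where simple linear weighting tends to break down.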

Warehouse worker scanning package labels with a handheld barcode scanner.

When “Almost” Isn’t Good Enough: Why Top Engineers Still Rely On BM25

BM25 looks old on paper, but it still decides which records are worth comparing when identifiers can’t afford to be “almost” right. This post walks through the TF‑IDF roots of BM25, how k1 and b shape the scoring curve, and why Lucene, Elasticsearch, and OpenSearch still rely on it. You’ll see how term statistics, not embeddings, keep product codes, SKUs, and customer records anchored during entity resolution.

Gandhinath Swaminathan

Jan 8 · 5 min read
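To make the roles of k1 and b in the excerpt above concrete, here is a small Python sketch of BM25 scoring. It assumes whitespace tokenization, a non-empty document list, the Lucene-style IDF `ln(1 + (N - n + 0.5) / (n + 0.5))`, and the widely used defaults `k1=1.2` and `b=0.75`. The function name and sample records are hypothetical, and production engines add analyzers and index structures that are omitted here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 caps term-frequency saturation; b scales length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


records = [
    "sony ps-lx350h turntable",
    "sony headphones",
    "turntable belt replacement",
]
print(bm25_scores("ps-lx350h", records))
```

Only the first record receives a non-zero score for the model-number query: an exact token either matches or it does not, which is why term statistics keep identifiers anchored.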

Infographic timeline showing the evolution from Linked Lists to Skip Lists, Small World Graphs, and HNSW, highlighting the progression in search efficiency and method.

How Data Structures Build the Bridge from Exact Matching to Semantic Search

Exact match is easy. Similarity is hard. This post climbs the ladder of structures that make vector lookups fast: linked lists (slow scans), skip lists (express lanes), small-world graphs, and HNSW. Then it shows how pgvector brings HNSW into PostgreSQL so entity resolution can happen where your records already live.

Gandhinath Swaminathan

Jan 5 · 8 min read

Diagram showing a single Sony turntable model with three conflicting names and SKU codes as it appears across CRM, inventory management, and pricing systems, illustrating how product fragmentation creates mismatched records.

How One Invisible Data Problem Quietly Destroys Your Churn Models, Your Pricing, and Your AI Agents

Healthcare providers track the same patient under five name variations. Retailers can't tell when the same SKU is under two different codes. CPG companies buy demand data showing one product with three different names across channels. Supply chains have suppliers that are actually the same company. Every week. Same problem. Different domain. Your data doesn't know what it's describing.

Gandhinath Swaminathan

Jan 2 · 6 min read