The Invisible Wall Blocking Your FEFO (First Expired, First Out) Strategy
FEFO (First Expired, First Out) only works when your systems recognize one product as one product. In CPG networks, the same SKU can arrive from plants, co-packers, and distributors under different item codes—creating phantom inventory, missed rotation, and expired stock. The result is predictable: retailer rejections, chargebacks, higher freight, and write-offs. Stabilize product identity with governed data and high-volume entity resolution.

Gandhinath Swaminathan
Feb 10 · 5 min read


Benchmarking & Datasets for Entity Resolution
A practical guide to benchmarking entity resolution (ER) systems. It covers commonly used public datasets, explains which evaluation metrics are informative in ER (and why accuracy can mislead), and outlines how to design a domain-specific test set so results are meaningful for production decisions.

Gandhinath Swaminathan
Jan 26 · 8 min read
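
As a rough illustration of why accuracy can mislead in ER, the sketch below computes pairwise metrics for two matchers on a heavily imbalanced set of candidate pairs. The counts and the `pair_metrics` helper are hypothetical and are not taken from the post or from any public dataset.

```python
def pair_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from pairwise confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical workload: 1,000,000 candidate pairs, only 100 of them true matches.

# A matcher that predicts "no match" for every pair.
never_match = pair_metrics(tp=0, fp=0, fn=100, tn=999_900)

# A matcher that finds 80 of the 100 matches and makes 20 false merges.
useful = pair_metrics(tp=80, fp=20, fn=20, tn=999_880)

for name, metrics in [("never_match", never_match), ("useful", useful)]:
    print(name, {key: round(value, 5) for key, value in metrics.items()})
```

Both matchers score above 99.9% accuracy because non-matching pairs dominate, yet the first one finds nothing. Precision, recall, and F1 separate the two.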


Orchestration of Identity: Turning Algorithms into a Well-tuned Arrangement
Your data thinks Coca-Cola Zero Sugar 12oz and Coke Zero 12 Pack are different products. Healthcare systems can't tell if two patient records refer to the same person. Banks miss money laundering patterns hidden in ownership networks. The algorithms exist—BM25, HNSW, SPLADE, Graph Transformers—but knowing when to use them is the hard part. This framework shows you how to sequence matching algorithms into production entity resolution systems tailored to your domain's risk prof

Gandhinath Swaminathan
Jan 26 · 8 min read


Why Probabilistic Record Linkage Still Matters
Probabilistic record linkage still matters because identity data is messy and match decisions carry real financial and compliance risk. This article explains the intuition behind Fellegi–Sunter and Bayesian record linkage, shows how they control false merges and splits across noisy customer and product records, and points to modern tools and books that help you put these ideas into practice.

Gandhinath Swaminathan
Jan 22 · 5 min read


Heterogeneous Knowledge Graphs: Multi-Hop Reasoning Beyond Pairwise Matching
Pairwise matching treats each comparison as a one-off. A persistent knowledge graph turns product mentions, manufacturers, model numbers, attributes, and price bins into typed nodes and relations. Matching becomes neighborhood comparison: multi-hop paths (convergent evidence) can beat any single similarity score.

Gandhinath Swaminathan
Jan 22 · 7 min read


From Inverted Index to Attention Graph: Turning SPLADE Tokens Into ER Decisions
False entity merges don’t just dirty data. They distort inventory, pricing, and forecasts, then every model and report built on top. Learned sparse retrieval improves recall, but it can still treat records like unordered tokens. This post adds token-to-token attention as a structural check so near-duplicates pass and lookalikes fail, with a trail you can audit.

Gandhinath Swaminathan
Jan 21 · 3 min read


The Best of Both Worlds: Learned Sparse Retrieval (SPLADE) For Entity Resolution
Entity resolution breaks when exact matching is too brittle and dense vectors blur identities. This post introduces SPLADE, a learned sparse retrieval model that keeps inverted indexes and token-level explainability while adding transformer-powered expansion and reweighting. We walk through where SPLADE beats BM25 and dense search, where it can fail on SKUs and over-expansion, and how to run it in Postgres/ParadeDB for large-scale product, customer, or patient identity.

Gandhinath Swaminathan
Jan 21 · 10 min read


Hybrid Search and Reciprocal Rank Fusion: Building the Bridge Between Lexical and Semantic
Entity resolution struggles when systems must choose between the rigid precision of BM25 and the fuzzy flexibility of Vector Search. Part 4 reveals why simple linear weighting fails and introduces Reciprocal Rank Fusion (RRF) as the superior alternative. We explore the architectural shift to Hybrid Search, demonstrating how to merge rank positions rather than raw scores using Spring Boot and ParadeDB.

Gandhinath Swaminathan
Jan 14 · 7 min read
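
To make "merge rank positions rather than raw scores" concrete, here is a minimal Python sketch of Reciprocal Rank Fusion. It is not the post's Spring Boot and ParadeDB implementation. The record IDs are made up, and `k=60` is a commonly used constant that is assumed here, not confirmed by the post.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs: each list adds 1 / (k + rank) per ID, rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from a lexical (BM25) query and a vector query.
bm25_ranking = ["sku-101", "sku-205", "sku-309"]
vector_ranking = ["sku-205", "sku-412", "sku-101"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Only rank positions enter the sum, so the BM25 and vector scores never need to share a scale. Records that rank well in both lists rise to the top.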


When “Almost” Isn’t Good Enough: Why Top Engineers Still Rely On BM25
BM25 looks old on paper, but it still decides which records are worth comparing when identifiers can’t afford to be “almost” right. This post walks through the TF‑IDF roots of BM25, how k1 and b shape the scoring curve, and why Lucene, Elasticsearch, and OpenSearch still rely on it. You’ll see how term statistics, not embeddings, keep product codes, SKUs, and customer records anchored during entity resolution.

Gandhinath Swaminathan
Jan 8 · 5 min read
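
The effect of k1 and b can be seen in a small sketch of the per-term BM25 score. It uses the standard formula with the Lucene-style IDF. The corpus statistics are invented for illustration, and the function names are hypothetical, not taken from the post.

```python
import math


def bm25_idf(num_docs, doc_freq):
    """Lucene-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document.

    k1 controls how quickly repeated occurrences saturate.
    b controls how strongly long documents are penalized (0 disables it).
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return bm25_idf(num_docs, doc_freq) * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


# Assumed corpus statistics: 10,000 records, the term appears in 50 of them.
stats = dict(num_docs=10_000, doc_freq=50, avg_doc_len=100)

# Term-frequency saturation: each doubling of tf adds less than the previous one.
saturation = [bm25_term_score(tf, doc_len=100, **stats) for tf in (1, 2, 4, 8)]

# Length normalization: the same tf scores lower in a record twice the average length.
short_doc = bm25_term_score(1, doc_len=100, **stats)
long_doc = bm25_term_score(1, doc_len=200, **stats)
long_doc_b0 = bm25_term_score(1, doc_len=200, b=0.0, **stats)

print([round(score, 3) for score in saturation])
print(round(short_doc, 3), round(long_doc, 3), round(long_doc_b0, 3))
```

Raising k1 lets repeated terms keep adding to the score for longer. Setting b to 0 removes the length penalty, which can be worth testing on short, code-like fields such as SKUs.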


How Data Structures Build the Bridge from Exact Matching to Semantic Search
Exact match is easy. Similarity is hard. This post climbs the ladder of structures that make vector lookups fast: linked lists (slow scans), skip lists (express lanes), small-world graphs, and HNSW. Then it shows how pgvector brings HNSW into PostgreSQL so entity resolution can happen where your records already live.

Gandhinath Swaminathan
Jan 5 · 8 min read


How One Invisible Data Problem Quietly Destroys Your Churn Models, Your Pricing, and Your AI Agents
Healthcare providers track the same patient under five name variations. Retailers can't tell when the same SKU is under two different codes. CPG companies buy demand data showing one product with three different names across channels. Supply chains have suppliers that are actually the same company. Every week. Same problem. Different domain. Your data doesn't know what it's describing.

Gandhinath Swaminathan
Jan 2 · 6 min read