
Privacy-Preserving Record Linkage: Cryptography, Unicode, and Matching in the Dark

  • Writer: Gandhinath Swaminathan

If you are trying to link sensitive data across organizational boundaries, standard hashing and basic encryption are no longer enough to protect you from compliance risks or guarantee accurate matches. 


To solve this, you need Privacy-Preserving Record Linkage (PPRL), and this post breaks down exactly how to architect it to match normalized strings across two organizations without either side seeing the other's data.


This Post At a Glance

The Thread:
This post is part of our four-part series on Unicode, I18N, and Entity Resolution.
Part 1, "The hidden complexity of text: A close look at Unicode normalization for entity resolution," established that strings are not reliable until normalized at the byte layer.
Part 2, "The infrastructure behind global text: I18N, ICU, and why Rust does it differently," explained how ICU4X enforces those rules in production with zero-copy internationalization.

The problem: Healthcare networks, judicial agencies, and government registries hold records about the same people. Linking those records drives research, policy, and safety outcomes. Sharing the raw names and identifiers to do so is illegal under HIPAA and GDPR and creates irreversible privacy risk.

The solution class: Privacy-Preserving Record Linkage (PPRL) allows two parties to determine which of their records refer to the same real-world entity while learning nothing about each other's data beyond the linkage result.

The mechanism: Encode each identifier into a cryptographic bit vector — a Bloom filter — using HMAC-keyed hash functions. Exchange bit vectors. Compute similarity. Never exchange plaintext.

The catch: If the strings entering that cryptographic pipeline are not byte-identical due to Unicode normalization differences, the hash diverges completely. Everything we covered in parts 1 and 2 of this series is a prerequisite for what follows here.
An infographic showing a secure data flow from three sources: 'Hospital: Disease' (blue), 'Pharmacy: Medication' (green), and 'Lab: Test Results' (purple). The streams of binary data and points, linked to 'Name and DOB' with padlocks, converge into a central 'Honest Broker' server rack. A unified data stream then flows from the server to a 'Unified Analysis' digital dashboard displaying charts and graphs, viewed by a glowing human silhouette on the right.
An Honest Broker securely integrates sensitive data from many healthcare sectors to create a unified dashboard.

Why Privacy-Preserving Record Linkage (PPRL) Exists

Two organizations could link records by exchanging names, birthdates, and addresses directly. This works technically, but fails legally.


Data integration often happens across organizational boundaries governed by strict data protection laws. Sharing plaintext PII across those boundaries violates HIPAA, GDPR, and a range of sector-specific regulations.


The goal of PPRL is formally stated by Vatsalan, Christen, and Verykios in their landmark taxonomy: link records from multiple organizations without any organization learning anything about the other's data beyond the linkage result. This is an instance of the Secure Multi-Party Computation problem — computing a shared function over private inputs while revealing nothing except the output.


Data quality compounds the challenge. The same person's name may appear as "João Silva," "Joao Silva," and "J. Silva" depending on the system, encoding, and input locale. Marks, case, and normalization vary across languages, making byte-level equality an unreliable identity test. A system that solves both problems simultaneously — cryptographic privacy and approximate string matching — is what PPRL delivers.


Healthcare: Longitudinal Patient Discovery

Hospital networks often need to combine patient histories across multiple sites for clinical trials. Sharing raw patient rosters violates HIPAA's Minimum Necessary standard. PPRL removes that barrier.


Each site independently encodes patient quasi-identifiers — first name, surname, date of birth, ZIP code — into cryptographic bit vectors. No plaintext is exchanged. Both sites send their encoded vectors to an independent Honest Broker. The Honest Broker compares the vectors to find matches but lacks the cryptographic keys to decode the underlying identifiers. It returns a list of matched record pairs without ever seeing patient identities.


Major organizations — including the NIH N3C, the CDC, and PCORnet — use this exact model to securely link millions of patient records across institutions.


The Public Sector: Cross-Jurisdictional Records and Judicial Systems

Public agencies face high stakes when sharing data. State correctional systems and county courts may want to study shared populations together, but legal mandates strictly prevent them from sharing raw defendant or inmate records.


PPRL removes that bottleneck. Both agencies independently encode their records into secure tokens. By exchanging those tokens, they identify overlapping individuals and build a linked analytical dataset without any raw PII crossing an organizational boundary.


The same architecture supports watchlist screening and daily judicial operations. Agencies can verify identities, coordinate parole supervision, and run warrant checks across jurisdictions without exposing their underlying databases to each other.


Probabilistic Data Structures: Q-grams and Bloom Filters

In the previous post, we introduced hashing as a mechanism for masking string content. PPRL takes that concept and makes it the computational foundation for approximate matching across encrypted space.

Consider the surname "smith" padded with boundary markers to "_smith_", yielding the bigram set {_s, sm, mi, it, th, h_}. A data entry error producing "smyth" yields {_s, sm, my, yt, th, h_}. Four bigrams are shared. The similarity score remains high. The match survives the error.

1. Q-grams (Breaking Down the Identifier)

The PPRL system first breaks each identifier into overlapping substrings called q-grams — typically bigrams of length q=2. The q-gram design gives the system resilience against real-world data corruption.

  • A monolithic string comparison fails the moment a character is transposed or deleted.

  • A q-gram set comparison degrades gracefully.
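As a concrete illustration, the padding-and-tokenizing step might look like this in Python (a minimal sketch; the function name qgrams is illustrative, not from the post's codebase):

```python
def qgrams(s: str, q: int = 2, pad: str = "_") -> set[str]:
    """Split an identifier into overlapping q-grams with boundary padding."""
    padded = f"{pad}{s}{pad}"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

# "smith" and the misspelling "smyth" still share most of their bigrams,
# so a set comparison degrades gracefully where exact equality fails.
a, b = qgrams("smith"), qgrams("smyth")
print(sorted(a))      # the six bigrams of "_smith_"
print(len(a & b))     # 4 shared bigrams
```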


2. Bloom Filters (Building the Vector)

Next, a Bloom filter encodes these q-grams into a fixed-length bit vector. The computational steps are straightforward:

  • Each q-gram is passed through k hash functions.

  • Each function returns a bit position in the vector, which is set to 1.


The final outcome is a compact binary fingerprint of the original identifier — one that supports approximate similarity computation without revealing the identifier itself.
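The two steps above can be sketched as follows. This toy version uses unkeyed stdlib hashes purely for illustration (production PPRL keys every hash with HMAC, as covered next); deriving k hash functions from two digests via h1 + i*h2 is the standard double-hashing construction:

```python
import hashlib

def bloom_encode(grams: set[str], m: int = 1000, k: int = 20) -> list[int]:
    """Encode a q-gram set into an m-bit Bloom filter using k hash functions."""
    bits = [0] * m
    for gram in grams:
        # Two independent digests combined as h1 + i*h2 simulate k hash
        # functions (double hashing). Unkeyed here for illustration only.
        h1 = int.from_bytes(hashlib.sha256(gram.encode()).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(gram.encode()).digest(), "big")
        for i in range(k):
            bits[(h1 + i * h2) % m] = 1
    return bits

grams = {"_s", "sm", "mi", "it", "th", "h_"}
vector = bloom_encode(grams)
print(sum(vector))   # number of set bits, at most len(grams) * k = 120
```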

Here is the standard PPRL workflow:

Infographic illustrating the privacy-preserving linkage workflow using Bloom filters. It depicts tokenizing a string into q-grams, hashing with keyed HMAC-SHA, and setting bits in a vector. It shows secure exchange between organizations for Jaccard similarity comparison, and provides examples of exact and approximate matches, emphasizing consistent Unicode normalization.
The PPRL workflow using Bloom filters: strings are hashed into bit vectors and compared with Jaccard similarity.

Keyed-Hash Message Authentication Code

The Structural Flaw: Frequency Attacks

Standard non-cryptographic hash functions optimize for throughput. They are publicly known and reversible through dictionary attacks. An adversary with a common-name reference database can pre-compute Bloom filters for every name and compare them against received vectors — this is the frequency attack.


The solution is HMAC: Hash-based Message Authentication Code. HMAC wraps a cryptographic hash function — SHA-256 in production — with a secret key shared only between the participating data custodians.

  • The same q-gram encoded with different HMAC keys produces entirely different bit positions.

  • An adversary without the shared key cannot pre-compute any reference Bloom filter.


The security protocol: The HMAC key must carry a minimum entropy of 256 bits and must never be hardcoded. Key exchange between custodians occurs over a secure channel before any linkage operation. The Honest Broker — the party computing similarity — never holds this key. Without it, the bit vectors it processes are computationally opaque.
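A minimal sketch of the keyed construction, assuming HMAC-SHA256 with two distinct labels to derive independent hash streams (the function name and labels are illustrative, not from the post):

```python
import hashlib
import hmac

def hmac_positions(gram: str, key: bytes, m: int = 1000, k: int = 20) -> list[int]:
    """Map one q-gram to k bit positions via HMAC-SHA256 double hashing."""
    # Distinct labels ("h1:", "h2:") give two independent keyed digests.
    h1 = int.from_bytes(hmac.new(key, b"h1:" + gram.encode(), hashlib.sha256).digest(), "big")
    h2 = int.from_bytes(hmac.new(key, b"h2:" + gram.encode(), hashlib.sha256).digest(), "big")
    return [(h1 + i * h2) % m for i in range(k)]

# The same bigram under different keys lands on unrelated bit positions,
# which is what defeats precomputed dictionary attacks.
print(hmac_positions("sm", b"custodian-shared-key")[:3])
print(hmac_positions("sm", b"some-other-key")[:3])
```

In a real deployment the key would come from an HSM or key-management service, never a literal in source code.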


The Two Encoding Architectures

A single-attribute Bloom filter encodes one field in isolation. The security vulnerability is significant: a common surname like "Smith" produces nearly identical bit vectors across hundreds of records, giving an adversary a frequency-alignment attack surface.


Two design models address this.

| Architecture | Mechanism | Privacy Level | Matching Quality |
| --- | --- | --- | --- |
| Cryptographic Linkage Key | All fields combined into a single filter | High | Handles missing values |
| Record-level Bloom Filter | Per-attribute filters generated, then sampled into one vector | Superior | High; prevents cross-attribute bigram collisions |

Cryptographic Linkage Keys (CLK)

The CLK encodes multiple quasi-identifiers — given name, surname, date of birth, ZIP code — simultaneously into a single 1,000-bit vector. Each field is allocated a distinct number of hash functions that controls how many bits it influences. Surname receives more hash functions than ZIP code because it carries more discriminatory power. This weighting directly controls each field's contribution to the final Jaccard score.
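A simplified Python sketch of CLK encoding with per-field hash-function counts. The name encode_clk mirrors the function shown in the post's Rust screenshot, but this body is my own approximation; note that because all fields share one hash construction, the cross-attribute collision discussed under RBF below can occur:

```python
import hashlib
import hmac

def encode_clk(record, key, m=1000, k_per_field=None):
    """Encode several quasi-identifiers into one m-bit CLK.

    Fields assigned more hash functions set more bits and therefore
    weigh more heavily in the final Jaccard score.
    """
    k_per_field = k_per_field or {}
    bits = [0] * m
    for field, value in record.items():
        k = k_per_field.get(field, 10)
        padded = f"_{value}_"
        for i in range(len(padded) - 1):          # bigrams (q = 2)
            gram = padded[i:i + 2].encode()
            # One shared keyed hash for every field: this is what lets a
            # bigram like "ma" from two different fields collide in a CLK.
            h1 = int.from_bytes(hmac.new(key, b"h1:" + gram, hashlib.sha256).digest(), "big")
            h2 = int.from_bytes(hmac.new(key, b"h2:" + gram, hashlib.sha256).digest(), "big")
            for j in range(k):
                bits[(h1 + j * h2) % m] = 1
    return bits

# Surname gets the most hash functions because it discriminates best.
clk = encode_clk({"surname": "silva", "given": "joao", "zip": "90210"},
                 key=b"shared-custodian-key",
                 k_per_field={"surname": 20, "given": 15, "zip": 5})
print(sum(clk), "bits set in a", len(clk), "-bit CLK")
```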


Record-level Bloom Filters (RBF)

The RBF architecture prevents an artifact present in CLK encoding. The bigram "ma" appearing in both the first name "Mary" and the city "Omaha" maps to the same bit positions in a CLK, artificially inflating the similarity score between records that share that coincidence. RBF isolates each attribute's bit space before sampling and permutation, eliminating cross-attribute contamination and producing a cleaner similarity signal.

Rust code snippet showing the 'encode_clk' function. It initializes a Bloom filter, iterates over record fields, generates field-specific keyed-hash keys, normalizes the text, tokenizes into q-grams, and encodes them into a single composite bit vector.
An implementation for generating CLK. It tokenizes fields and multiplexes them into a composite filter.

Measuring Similarity in the Obfuscated Space

The encoded vectors travel to the Honest Broker, which computes pairwise Jaccard similarity.

J(A, B) = |A ∩ B| / |A ∪ B|

where |A ∩ B| counts the bit positions set to 1 in both vectors and |A ∪ B| counts those set in either. A score exceeding a chosen threshold declares a match.


The Sørensen–Dice coefficient, D(A, B) = 2|A ∩ B| / (|A| + |B|), provides an alternative metric that gives double weight to the shared bits.
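Both metrics reduce to simple bit arithmetic over equal-length vectors. A minimal sketch (helper names are illustrative):

```python
def jaccard(a: list[int], b: list[int]) -> float:
    """Jaccard similarity over two equal-length bit vectors."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0

def dice(a: list[int], b: list[int]) -> float:
    """Sørensen–Dice coefficient: counts shared bits twice."""
    inter = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * inter / total if total else 0.0

a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 0, 1, 1, 0]
print(jaccard(a, b))   # 3 shared / 5 in the union = 0.6
print(dice(a, b))      # 2*3 / (4 + 4) = 0.75
```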


Attacks and Hardening

Bloom filters are not cryptographically secure in the classical sense.

Three attack classes define the threat landscape:

  1. Frequency attacks exploit Zipf's law — a small number of surnames appear with very high frequency, allowing an adversary with a plaintext name table to align frequency distributions against intercepted bit patterns without ever possessing the HMAC key.

  2. Pattern-mining attacks, demonstrated by Christen, Schnell, Vatsalan, and Ranbaduge (2017), recover underlying q-grams by identifying co-occurring bit positions across the encoded database using frequent itemset mining.

  3. Graph matching attacks align the encoded database as a graph against a plaintext reference graph; research shows this requires greater than 80% overlap between the two databases to succeed, limiting its practical scope in real deployments.


Three hardening techniques form current best practice:

  1. Diffusion layers (BFD scheme) apply a linear transformation to Bloom filter output before transmission, breaking the direct q-gram-to-bit mapping that all three attacks exploit.

  2. XOR folding generates new bit values from neighboring bit positions, obscuring bit structure without significant linkage quality loss.

  3. BLIP (BLoom-and-flIP) introduces calibrated differential privacy noise by flipping bit values — reference-based BLIP ensures noise is correlated across similar records, preserving the distance structure needed for accurate matching.
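Of the three, XOR folding is the simplest to illustrate. A minimal sketch, assuming the common fold-in-half variant:

```python
def xor_fold(bits: list[int]) -> list[int]:
    """Fold a bit vector in half with XOR, halving its length.

    Each output bit mixes two distant positions, so no single output bit
    corresponds to one q-gram — the mapping the attacks above exploit.
    """
    half = len(bits) // 2
    return [bits[i] ^ bits[i + half] for i in range(half)]

v = [1, 0, 1, 1, 0, 1, 0, 0]
print(xor_fold(v))   # [1, 1, 1, 1]
```

Because XOR folding is deterministic, both custodians can apply it independently and Jaccard comparison still works on the folded vectors, at a small cost in linkage quality.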


Each of these mechanisms, the attacks that motivate them, and their formal security proofs will be covered in depth in the next post.


Regulatory Compliance

PPRL was designed for the regulatory environment, not retrofitted into it.


HIPAA Compliance by Design

PPRL eliminates the need to share direct PII for record linkage, making it HIPAA-compliant by design.

  • The Proof at Scale: PCORnet, NIH N3C, and the CDC's COVID-19 data linkage project demonstrate this at scale.


GDPR and Pseudonymization (Article 4(5))

Under GDPR Article 4(5), pseudonymization is defined as processing personal data so it can no longer be attributed to a specific data subject without additional information held separately. PPRL using HMAC-keyed Bloom filters meets this definition.

  • The Key: The HMAC secret key is the pseudonymization key.

  • The Risk: Its compromise reduces the entire system to a frequency attack.

  • The Mandate (EDPB Guidelines 01/2025): Effective pseudonymization requires secure key management, strict access controls, and separate storage of mapping data. HMAC keys must be stored in hardware security modules (HSMs) and subject to documented rotation schedules.


Data Protection by Design (GDPR Article 25)

PPRL embodies this principle. However, there is a critical operational requirement to maintain it.

The Accountability Catch: System specifiers must document Unicode normalization consistency across all data custodians as part of the PPRL system specification to satisfy GDPR accountability requirements.

| Mechanism | Reversibility | GDPR Status | PPRL Role |
| --- | --- | --- | --- |
| HMAC-keyed Bloom filter | Yes (with key) | Pseudonymization | Primary encoding |
| Cryptographic hash (no key) | No | Context-dependent | Deterministic blocking only |
| FHE-encrypted embedding | Yes (with key) | Pseudonymization | High-security matching |
| Differential privacy output | No | Anonymization (if ε low) | Aggregate statistics |
| CLK (one-way) | No | Anonymization | Deployment-specific |


References and credits

Primary Text

  1. Christen, P., Ranbaduge, T., and Schnell, R., 2020. Linking Sensitive Data: Methods and Techniques for Practical Privacy-Preserving Information Sharing. Springer.

Standards and Specifications

  2. Unicode Consortium. Unicode Standard Annex #15: Unicode Normalization Forms. https://unicode.org/reports/tr15/

  3. Unicode Consortium. Unicode Technical Standard #39: Unicode Security Mechanisms. https://www.unicode.org/reports/tr39/

Foundational Research

  4. Vatsalan, D., Christen, P., and Verykios, V.S., 2013. A taxonomy of privacy-preserving record linkage techniques. Information Systems, 38.

  5. Schnell, R., Bachteler, T., and Reiher, J., 2009. Privacy-preserving record linkage using Bloom filters. BMC Medical Informatics and Decision Making, 9.

  6. Christen, P., Schnell, R., Vatsalan, D., and Ranbaduge, T., 2017. Pattern-mining based cryptanalysis of Bloom filters for privacy-preserving record linkage. Proceedings of the IEEE International Conference on Data Mining.

Privacy Threats and Defenses

  1. Heng, L., Armknecht, F., Chen, L., and Schnell, R., 2022. On the effectiveness of graph matching threats against privacy-preserving record linkage.

  2. Ranbaduge, T., and Schnell, R., 2020. Strengthening privacy-preserving record linkage using diffusion. Proceedings on Privacy Enhancing Technologies, 2023. https://petsymposium.org/popets/2023/popets-2023-0054.pdf

  3. Vaiwsri, S., and Christen, P., 2018. Reference values based hardening for Bloom filters based privacy-preserving record linkage. https://users.cecs.anu.edu.au/~christen/publications/vaiwsri2018blip.pdf

  4. Lee, J., et al., 2025. Zero-relationship encoding for privacy-preserving record linkage. Validated on the Titanic and North Carolina Voter Registration datasets.


Cryptographic Privacy-Preserving Linkage

  1. Randall, S., et al., 2015. Privacy-preserving record linkage using homomorphic encryption.

  2. Mainzelliste SecureEpiLinker: Privacy-preserving record linkage using secure multi-party computation.

  3. Vatsalan, D., et al., 2016. Scalable privacy-preserving record linkage for many databases. Proceedings of the ACM Conference on Information and Knowledge Management.


Statistical Anonymization

  1. Sweeney, L., 2002. k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems.

  2. Machanavajjhala, A., et al., 2007. l-Diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data.

  3. Dwork, C., 2006. Differential privacy. Proceedings of the International Colloquium on Automata, Languages and Programming.


Internationalization and Unicode Format

  1. ICU4X Documentation. https://docs.rs/icu

  2. ICU4X 2.0 Release. The Unicode Blog. http://blog.unicode.org/2025/05/icu4x-20-released.html

  3. icu_normalizer crate documentation. https://docs.rs/icu_normalizer


Regulatory

  1. European Data Protection Board. Guidelines 01/2025 on Pseudonymisation.

  2. National Institute of Standards and Technology SP 800-57 Part 1 Rev. 5: Recommendation for Key Management. https://csrc.nist.gov/pubs/sp/800/57/pt1/r5/final

  3. EU Patient Identity Management, configurable privacy-preserving record linkage in federated health data spaces. Frontiers in Digital Health, 2026. https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2026.1751234/full


Modern Advances

  1. CampER framework. ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023.

  2. Multimodal fully homomorphic encryption entity resolution. ICASSP, 2026.

  3. OneFlorida clinical research network hash-based privacy-preserving linkage tool. PubMed Central. https://pmc.ncbi.nlm.nih.gov/articles/PMC6994009/


GeCo and Benchmark Datasets

  1. Christen, P., and Pudjijono, A., 2009. GeCo: An online personal data generator and corruptor. Proceedings of the Australasian Database Conference. https://dmm.anu.edu.au/geco/

  2. North Carolina Voter Registration Database benchmark. https://dmm.anu.edu.au/lsdbook2020/index.php


