A detailed description of a futuristic digital illustration. At the center, a stylized globe of Earth is formed by interconnected data nodes and streams. The text "ICU" and "I18N" in glowing neon letters are prominently displayed in the center. Below the "I18N" text is the bold main title: "UNLOCKING GLOBAL MARKETS." Below that, a smaller subtitle reads: "The Essential Guide to International Components for Unicode & Internationalization." A larger glowing ring of multicolored text containing diverse world scripts (like Arabic, Chinese, Japanese, and Cyrillic) and the Unicode symbol encircles the globe. The background features faint outlines of a world map and server racks in a data center, all in a blue, purple, and orange color palette.

The infrastructure behind global text: I18N, ICU, and why Rust does it differently

Encoding fragmentation breaks entity resolution pipelines. A sixty-year-old Unicode anomaly fractures entity graphs and drops database joins. We examine the root cause of these byte-level mismatches. We detail the mechanics of internationalization (i18n) engineering, the evolution of the ICU library family, and how Rust's ICU4X datagen architecture handles multilingual text with zero-copy deserialization and compiler-enforced memory safety.

Gandhinath Swaminathan

Mar 25 · 7 min read

An infographic showing a secure data flow from three sources: 'Hospital: Disease' (blue), 'Pharmacy: Medication' (green), and 'Lab: Test Results' (purple). The streams of binary data and points, linked to 'Name and DOB' with padlocks, converge into a central 'Honest Broker' server rack. A unified data stream then flows from the server to a 'Unified Analysis' digital dashboard displaying charts and graphs, viewed by a glowing human silhouette on the right.

Privacy-Preserving Record Linkage: Cryptography, Unicode, and Matching In the Dark

Privacy-Preserving Record Linkage (PPRL) lets two organizations determine which records refer to the same person — without either party seeing the other's data. Healthcare networks, judicial agencies, and government registries link records using cryptographic Bloom filters and HMAC-keyed hashing. But if strings are not Unicode-normalized before encoding, the hash diverges and matches fail silently. This post shows the full pipeline: q-grams, Bloom filters, CLKs, and normaliza

Gandhinath Swaminathan

Mar 24 · 9 min read
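The silent-failure mode described in the PPRL summary above can be sketched in a few lines of Python. This is a simplified illustration, not the pipeline from the post: the helper names (`qgrams`, `encode_clk`, `dice`), the filter size, the number of hash functions, and the shared key are all assumptions chosen for readability, and production CLK schemes typically use more careful hashing constructions.

```python
import hashlib
import hmac
import unicodedata

def qgrams(text: str, q: int = 2) -> set[str]:
    """Split a padded string into overlapping q-grams."""
    padded = f"_{text}_"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def encode_clk(text: str, key: bytes, size: int = 1024, k: int = 2,
               normalize: bool = True) -> int:
    """Hypothetical Bloom-filter encoding: set k HMAC-derived bits per q-gram."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
    bits = 0
    for gram in qgrams(text):
        for i in range(k):
            # Prefix a counter byte so each of the k hashes is distinct.
            digest = hmac.new(key, bytes([i]) + gram.encode("utf-8"),
                              hashlib.sha256).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % size)
    return bits

def dice(a: int, b: int) -> float:
    """Dice coefficient between two bit sets stored as integers."""
    common = bin(a & b).count("1")
    return 2 * common / (bin(a).count("1") + bin(b).count("1"))

key = b"shared-secret"              # assumed key shared by both parties
composed = "G\u00e9ron"             # é as a single code point (U+00E9)
decomposed = "Ge\u0301ron"          # e followed by combining acute (U+0301)

with_nfc = dice(encode_clk(composed, key), encode_clk(decomposed, key))
without = dice(encode_clk(composed, key, normalize=False),
               encode_clk(decomposed, key, normalize=False))
```

With normalization both parties encode the same q-grams, so the filters are identical. Without it, the two spellings produce different q-gram sets, and the similarity score drops even though the names look the same on screen.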

Screenshot of a bookstore website search for "aurellion guereon scikit". The search returns four versions of the book "Hands-On Machine Learning". Beneath the covers, the author's name, Aurélien Géron, is listed with different text formatting issues across the results, including "Aurelien Geron" (missing accents) and "G?ron Aur?lien" (showing question marks instead of accented characters)

The hidden complexity of text: A close look at Unicode normalization for entity resolution

Entity resolution runs on joins, comparisons, and blocking keys that assume “same-looking text” means “same bytes.” That assumption is wrong once you move outside ASCII. To solve this encoding chaos, data architectures must standardize text at the lowest possible layer: the Unicode byte structure. A normalization strategy that's too weak misses matches; one that's too aggressive destroys signal and creates false positives.
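A minimal Python sketch of the trade-off, using only the standard library's `unicodedata` module. The function names `strict_key` and `loose_key` are hypothetical, and the two strategies shown are examples of a weak and an aggressive choice rather than the ones the post recommends.

```python
import unicodedata

composed = "Aur\u00e9lien G\u00e9ron"       # é stored as one code point (U+00E9)
decomposed = "Aure\u0301lien Ge\u0301ron"   # e followed by combining acute (U+0301)

def strict_key(text: str) -> bytes:
    """Canonical normalization only: NFC, then UTF-8 bytes."""
    return unicodedata.normalize("NFC", text).encode("utf-8")

def loose_key(text: str) -> str:
    """Aggressive normalization: NFKD, drop combining marks, casefold."""
    expanded = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in expanded if not unicodedata.combining(ch))
    return stripped.casefold()

# The two spellings render identically but differ at the byte level.
raw_bytes_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# NFC makes canonically equivalent strings byte-identical.
strict_equal = strict_key(composed) == strict_key(decomposed)

# The loose key also matches the unaccented spelling, which may or may not
# be the same entity: this is where false positives can creep in.
loose_equal = loose_key(composed) == loose_key("Aurelien Geron")
```

Here `raw_bytes_equal` is false while `strict_equal` is true, which is the join failure and its fix. `loose_equal` is also true, showing how stripping accents widens matching at the cost of discarding distinctions that some datasets rely on.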

Gandhinath Swaminathan

Mar 13 · 11 min read

Sustainable Entity Resolution: Profiling and Energy Measurement