
The infrastructure behind global text: I18N, ICU, and why Rust does it differently

  • Writer: Gandhinath Swaminathan
  • 1 day ago
  • 7 min read
This Post at a Glance
The thread: This blog post is part 2 of a 4-part series on Unicode, I18N, and entity resolution, covering the root cause and what the industry built to fix it.
Read part 1 of the series, "The hidden complexity of text: A close look at Unicode normalization for entity resolution," which covers the symptoms.

The root cause: the encoding fragmentation your pipeline inherits today started in 1963 with a 7-bit standard never designed to leave America. Unicode fixed the legacy constraints but introduced a new anomaly: many valid byte representations for the same visible character.

The engineering discipline: internationalization (I18N) is the infrastructure work that ensures your data system handles any language correctly without rewriting code per locale. It is more than translation: it defines the architecture.

The library family: every major OS, database, and browser calls International Components for Unicode, or ICU, when processing text. ICU4C handles native/OS workloads. ICU4J handles Java Virtual Machine pipelines. ICU4X, written in Rust, takes a different approach to data management.

The architecture break: the `icu4x-datagen` tool decouples locale data from logic, compiling only the locales and components your app uses directly into the binary. The result: zero cold-start overhead, 10x faster table lookups via zero-copy deserialization, and memory safety enforced by the compiler.
Illustration: a globe of interconnected data nodes labeled "ICU" and "I18N," encircled by a ring of world scripts, under the title "Unlocking Global Markets: The Essential Guide to International Components for Unicode & Internationalization."
From 60 years of encoding problems to a single normalized stream: the infrastructure debt every global data pipeline inherits.

The root cause dates back 60 years

The industry published ASCII in 1963 as a 7-bit encoding standard built for English. Its 128-character limit reflected a strict hardware constraint of the 1960s. The standard handled basic English text well, but as computing spread internationally, the monoculture fractured.


The eighth, unused bit became a free-for-all. IBM mainframes ran EBCDIC, entirely incompatible with ASCII. European systems extended the 128-slot table in many conflicting ways to encode accented characters. Asian countries faced tens of thousands of Chinese, Japanese, and Korean characters and built their own multi-byte encodings: Shift JIS (Japanese Industrial Standards), GB2312, and Extended Unix Code (EUC). By the time the internet arrived, an invoice written on a Windows-1252 system would corrupt on a machine expecting ISO 8859-5 for Cyrillic. Having hundreds of encoding standards proved worse than having only one.


Unicode fixed the standard but created an anomaly

The industry conceived Unicode in 1987 specifically to replace this mess. One encoding for every writing system on earth. Today it covers 144,000+ characters across 159 scripts.


Yet, this design created a new category of failure modes.


Unicode designers maintained backward compatibility with legacy systems. The standard permits many valid byte representations for the same visible character. For example, the é in Aurélien Géron can live as a single precomposed code point, U+00E9. It can also exist as two code points: the base letter e, U+0065, followed by a combining acute accent, U+0301. Both forms are valid Unicode. They look identical on screen. But the bytes differ.


A hash function sees two different byte arrays (as illustrated in the infographic below). This one-to-many mapping is the exact mechanism that breaks entity resolution. If your architecture lacks a strict normalization contract at ingestion, it interprets one physical entity as separate database records.


A diagram demonstrating how different Unicode representations of the letter 'é' (precomposed vs. decomposed) result in different database hashes, breaking entity resolution. It then shows a Rust code solution using NFC normalization to convert both forms into identical UTF-8 bytes, ensuring they resolve to a single database record.
Unicode: the é problem
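The mismatch is easy to reproduce in plain Rust with no ICU dependency; a minimal sketch (the name string is illustrative):

```rust
fn main() {
    // Precomposed form: é as the single code point U+00E9.
    let precomposed = "Aur\u{00E9}lien";
    // Decomposed form: base 'e' (U+0065) + combining acute (U+0301).
    let decomposed = "Aure\u{0301}lien";

    // The two strings render identically, but the UTF-8 bytes differ,
    // so naive equality (and any hash over the bytes) treats them as
    // two different entities.
    assert_ne!(precomposed, decomposed);

    println!("{:x?}", "\u{00E9}".as_bytes());  // [c3, a9]
    println!("{:x?}", "e\u{0301}".as_bytes()); // [65, cc, 81]
}
```

Running both forms through an NFC normalizer (ICU4X's icu_normalizer crate provides one) collapses them to the same precomposed bytes, which is the normalization contract the pipeline needs at ingestion.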

Defining internationalization (I18N)

Most people hear "internationalization" and picture translation. That is localization (l10n): the downstream content work of adapting a product for a specific language and region. I18N is the precondition. You can't localize a product without internationalizing it first. For data engineering, that distinction matters.


I18N represents the engineering discipline of designing software so it handles text from any language correctly, without requiring structural code changes per locale. The W3C and Microsoft define it the same way:

Make the codebase locale-agnostic so that localization becomes a data and configuration concern, not a code concern.

For an entity resolution engine, I18N means:

  1. Character encoding: UTF-8 across every layer so no script gets corrupted in transit.

  2. Collation: sorting strings in the correct order for the language. For example, Swedish treats ä as a distinct letter after z, not a variant of a.

  3. Case folding: not just lowercasing, but handling Turkish İ/ı pairs, German ß expanding to SS, and similar rules that break naive .lower() calls.

  4. Number and date formats: 1.000,00 means one thousand in Germany, not one.

  5. Plural rules: Arabic has six grammatical number forms; Japanese has one.

  6. Bidirectional text: Arabic and Hebrew flow right to left.
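
The case-folding rules in item 3 are visible even in Rust's standard library, which implements Unicode's full (but deliberately locale-independent) case mappings; a small sketch:

```rust
fn main() {
    // German ß uppercases to "SS": one character becomes two, which
    // breaks any code assuming case changes preserve string length.
    assert_eq!("straße".to_uppercase(), "STRASSE");

    // Turkish dotted capital İ (U+0130) lowercases to 'i' plus a
    // combining dot above (U+0307): one code point becomes two.
    assert_eq!("İ".to_lowercase(), "i\u{307}");

    // The std mapping is locale-independent: it cannot know that in
    // Turkish, plain 'I' should lowercase to dotless 'ı'. Locale-aware
    // case mapping (ICU's casemap services) is needed for that.
    assert_eq!("I".to_lowercase(), "i"); // wrong for Turkish text
}
```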


For entity resolution specifically, the two I18N dimensions that matter are encoding and collation/normalization. Encoding addresses the byte mismatch. Collation/normalization handles the comparison logic. These are the axes where unnormalized pipelines fail with a missed join.


ICU: What runs in production

The industry relies on the International Components for Unicode (ICU). IBM originally developed these base libraries in 1999. The Unicode Consortium now maintains ICU, which provides code page conversion, normalization, and collation.


When a database like PostgreSQL or an OS like Windows runs string collation or date formatting, it executes logic wrapped around the ICU libraries.


ICU's core services span the full I18N technical scope:

  • Unicode normalization: NFC, NFD, NFKC, NFKD per Unicode Standard Annex #15.

  • Collation via the Unicode Collation Algorithm using locale-specific rules from the Common Locale Data Repository (CLDR).

  • Case folding including Turkish dotted-I rules and German ß to SS expansion.

  • Code page conversion between Unicode and hundreds of legacy encodings.

  • Text segmentation across character, word, sentence, and line boundaries.

  • Bidirectional text rendering.

  • Date, time, number, currency, and message formatting for 900+ locales.


ICU4J, ICU4C, and ICU4X: The family tree

The ICU project provides three major implementations. Each targets a different environment and reflects architectural tradeoffs that made sense when it was developed.

  • Language: ICU4C is C/C++; ICU4J is Java; ICU4X is Rust (with FFI bindings to C++ and JS).

  • Primary users: ICU4C serves operating systems, databases, and browsers; ICU4J serves Java Virtual Machine apps and Android; ICU4X targets embedded, WASM, mobile, and other constrained environments.

  • Data loading: ICU4C loads .dat files/shared libraries at runtime; ICU4J loads JAR resource bundles at runtime; ICU4X bakes data in at compile time or uses pluggable runtime providers.

  • Memory model: ICU4C is heap-allocated with internal caches; ICU4J is GC-managed with internal caches; ICU4X uses zero-copy deserialization with zero allocations for data loading.

  • Binary size: ICU4C is large (roughly 30 MB with full data); ICU4J is large; ICU4X is tree-shaken down to only the components and locales used.

  • Portability: ICU4C is native-platform only; ICU4J is Java Virtual Machine only; ICU4X is #![no_std] Rust and runs on wearable devices, embedded hardware, and WASM.

  • Memory safety: ICU4C is C++ with a history of Common Vulnerabilities and Exposures (CVEs); ICU4J relies on the Java Virtual Machine sandbox; ICU4X relies on the Rust ownership model, memory-safe at compile time.

The primary limitation of the ICU4J and ICU4C libraries is rule dependency.


Internationalization algorithms require many static locale rules from the Common Locale Data Repository. ICU4C and ICU4J bundle these rules into monolithic package files or JAR resource files.


Loading the rules requires pushing a large payload into memory at runtime. If your Apache Spark job needs to deploy a custom string comparison function across a thousand ephemeral worker nodes, bundling a full ICU4C shared library introduces severe bloat and latency.


The API logic is coupled to this monolithic design. If you import ICU4C to perform NFKC normalization and case folding for an entity-matching routine, you pull in dependencies for every locale feature. You receive currency format objects and calendar symbols you never use.


This architecture prevents compiling legacy ICU into WebAssembly (WASM), deploying it to edge devices, or running it in memory-constrained microservices.


The Rust departure: ICU4X and modularity

Developers wrote ICU4X in Rust from scratch. The "X" in ICU4X stands for cross-environment portability. It favors modularity, memory safety, and pluggable data loading.

ICU4X abandons the monolithic design. Unlike ICU4C, ICU4X uses distinct, feature-specific crates:

  • icu_normalizer: NFKC, NFC, NFD, NFKD

  • icu_collator: locale-aware string comparison

  • icu_segmenter: text boundary detection

  • icu_datetime: date and time formatting

  • icu_casemap: case folding

  • icu_calendar, icu_plurals, icu_displaynames, and more
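
For a Cargo-based consumer, this modularity shows up directly in the manifest: you depend only on the crates you call. A hypothetical sketch (version numbers are illustrative):

```toml
[dependencies]
# Only what the entity-matching path needs:
icu_normalizer = "1"  # NFC/NFD/NFKC/NFKD
icu_collator = "1"    # locale-aware string comparison
# icu_datetime, icu_plurals, etc. are added only if actually used.
```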


The base library compiles without relying on the Rust standard library by using the #![no_std] attribute. It uses only the base language and heap allocation. This design choice permits your entity-resolution logic to execute in embedded environments, WebAssembly (WASM) runtime environments, or custom database kernels that lack heavy operating system (OS) facilities.


Why Rust handles it differently: The datagen architecture

The icu4x-datagen utility changes the deployment pattern.


Instead of shipping a monolithic file, you use icu4x-datagen to generate exactly the rules your app needs. You specify the target locales and the required components at build time. If your app only evaluates English, German, and Japanese, you generate rules for only those three locales. The utility outputs three primary formats:

  1. --format baked: generates Rust source code, compiled directly into the binary at build time, zero runtime loading, zero deserialization

  2. --format blob: generates a Postcard-format binary file for zero-copy runtime loading

  3. --format dir: generates a directory of individual locale data files


Memory handling: Zero-copy and targeted lifetime erasure

High-throughput entity-resolution engines compare millions of records per second. Traditional loading patterns allocate heap memory, parse incoming bytes, and copy the data field-by-field into app structs. At that volume, this constant allocation cycle burns CPU and creates garbage collection pressure that degrades throughput.


ICU4X bypasses this using zero-copy deserialization, powered by the zerovec crate. Instead of copying bytes, a zero-copy parser validates the byte array's structure and constructs structs holding direct memory pointers into the source bytes. The locale data is immediately ready for use.

A memory management schematic contrasting standard heap allocation, which copies data, with zero-copy deserialization, which uses direct memory pointers to reference the source buffer.
Standard parsing allocates heap memory and copies data field-by-field. Zero-copy deserialization constructs Rust structs that point directly into the source byte buffer, dropping allocation overhead to zero.

In an entity-resolution pipeline processing hundreds of millions of records, those savings could exceed 10x per table lookup.


Yet, zero-copy introduced a strict architectural constraint.


In Rust, a struct holding a pointer into a byte buffer must carry a lifetime parameter. That parameter proves the struct cannot outlive the buffer it points into. The guarantee is exactly what you want. Yet, the parameter spreads across every function signature and cache that touches the struct. To overcome this, ICU4X maintainers used the yoke crate.
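
The constraint can be sketched in plain Rust, without the real zerovec types (LocaleRecord and parse are made-up names for illustration):

```rust
// A zero-copy view: instead of owning a String, the struct borrows
// directly from the source buffer, so it must carry a lifetime 'a.
struct LocaleRecord<'a> {
    name: &'a str,
}

// The lifetime is contagious: every function that returns or stores
// a LocaleRecord has to thread 'a through its own signature.
fn parse<'a>(buffer: &'a [u8]) -> LocaleRecord<'a> {
    // Real zero-copy deserialization validates structure; here we just
    // reinterpret the bytes as UTF-8 without copying them.
    LocaleRecord {
        name: std::str::from_utf8(buffer).expect("valid UTF-8"),
    }
}

fn main() {
    let buffer = b"de-DE".to_vec();
    let record = parse(&buffer);
    // record points into buffer: the compiler rejects any use of
    // record after buffer is dropped.
    assert_eq!(record.name, "de-DE");
}
```

This 'a is exactly the lifetime that yoke erases from the public API.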


The yoke crate binds the zero-copy deserialized object directly to its backing byte buffer using a Yoke<Y, C> struct. Internally, it uses a hidden 'static lifetime to convince the compiler the data lives forever, erasing the contagious lifetime from the public API boundary.


Developers can't remove the zero-copy data from the Yoke directly. Accessing it requires a getter method that reifies the erased lifetime into a short-lived local reference, scoped to the Yoke's own borrow.


Postscript

Encoding fragmentation persists. It remains baked into 60 years of decisions no one on your team made. Yet, with the right library and the right architecture, your normalization layer stops being a source of silent failures. It becomes a predictable, verifiable part of your match specification.



