Text is complicated. Here's a visual guide to how character encoding became a mess, and how we got out of it.
A brief history of character encoding chaos
Before Unicode, every country and software vendor invented their own systems — and they didn't agree with each other.
1960s
ASCII — 7-bit English only
128 characters: English letters, digits, punctuation. Codes 0–127. Worked fine if you only wrote in English.
1970s–80s
OEM free-for-all (128–255)
Everyone used the spare high bit differently. Code 130 = é in France, ג in Israel. Résumés arrived as rגsumגs.
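The clash is easy to reproduce today: those old OEM code pages survive as codec names in Python, so a short sketch shows the exact garbling described above.

```python
# The same byte means different letters under different OEM code pages:
# 0x82 (decimal 130) is é in CP437 (US/Western) but ג in CP862 (Hebrew).
print(b"\x82".decode("cp437"))  # é
print(b"\x82".decode("cp862"))  # ג

# Encode with one page, decode with the other -> the garbled résumé.
print("Résumés".encode("cp437").decode("cp862"))  # Rגsumגs
```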
1980s
ANSI code pages — standardised mess
Below 128: agreed. Above 128: hundreds of "code pages" (CP862 for Hebrew, 737 for Greek…). Hebrew + Greek on one screen? Impossible.
1980s (Asia)
DBCS — double-byte hacks
Asian scripts have thousands of characters. Some letters used 1 byte, some 2. Moving backwards through a string required special OS functions.
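Shift JIS, one of these double-byte encodings (still available as a Python codec), makes the bookkeeping problem concrete:

```python
# In Shift JIS, ASCII letters take 1 byte and Japanese characters take 2,
# so byte offsets and character offsets diverge.
text = "A日本"                  # 1 ASCII letter + 2 kanji
data = text.encode("shift_jis")
print(len(text))                # 3 characters
print(len(data))                # 5 bytes: 1 + 2 + 2

# Slicing the byte string mid-character yields garbage or an error,
# which is why stepping backwards needed special OS functions.
```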
1991
Unicode — one ring to rule them all
A single universal standard. Every character in every writing system (and, unofficially, even Klingon) gets a unique code point. No more conflicts.
The internet exposed the mess: once you started sending text between computers, all those conflicting systems broke down.
How Unicode thinks about characters
Unicode separates two ideas that used to be mixed up: what a character is vs. how it's stored in memory.
Step 1 — Platonic ideal (abstract identity)
Every letter is an abstract concept, independent of font, style, or how it looks. A in Times New Roman = A in Arial. They're the same character.
Step 2 — Code point (a unique number)
Each abstract character gets a unique number, written U+XXXX.
A = U+0041 · é = U+00E9 · ع = U+0639 · 😀 = U+1F600
There are 1,114,112 possible code points (U+0000 through U+10FFFF), far more than 65,536, so the "Unicode is just 16 bits" myth is wrong.
Step 3 — Encoding (how it's stored)
Code points are abstract. Encoding is how you write them to disk or memory as actual bytes. This is a separate decision — and there are several options.
"Hello" as code points: U+0048 U+0065 U+006C U+006C U+006F — five numbers. Not yet bytes. Not yet stored. Just identity.
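Python exposes code points directly through `ord()` and `chr()`, which makes the identity-versus-storage split concrete:

```python
# Code points are just numbers; no bytes exist yet.
print([f"U+{ord(c):04X}" for c in "Hello"])
# ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']

# chr() goes the other way: number -> abstract character.
print(chr(0x1F600))  # 😀, whatever font happens to render it
```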
Unicode encodings compared
Same code points, different bytes on disk. Each encoding has trade-offs.
Encoding
How it works
Best for
UTF-8 (recommended)
1 byte for ASCII (0–127), 2–4 bytes for everything else
Web, files, APIs. ASCII-compatible — English text looks identical to ASCII.
UTF-16
2 bytes for most characters, 4 (a surrogate pair) for rarer ones. Usually prefixed with a byte-order mark (BOM) to signal endianness
Windows internals, Java, JavaScript strings
UTF-32 / UCS-4
Always 4 bytes per character — fixed width
Internal processing where random access matters. Very space-inefficient.
UCS-2
Always 2 bytes, max 65,536 chars. Can't represent rare characters.
Legacy Windows, older COM/VB programs
Latin-1 / ISO-8859-1 (limited)
1 byte, 256 characters. Western European only.
Legacy Western European documents. Russian? Chinese? → ???
UTF-8 storing "Hello": 48 65 6C 6C 6F — identical to ASCII. Americans won't even notice the difference.
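The trade-offs in the table are easy to measure, since each encoding is a Python codec name:

```python
s = "Hello"
print(s.encode("utf-8").hex(" "))   # 48 65 6c 6c 6f -- identical to ASCII
print(len(s.encode("utf-16-le")))   # 10 bytes: 2 per character
print(len(s.encode("utf-32-le")))   # 20 bytes: 4 per character

# Outside ASCII the picture flips: UTF-8 needs up to 4 bytes.
print(len("😀".encode("utf-8")))    # 4
```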
Byte order mark problem (UTF-16)
"H" in big-endian UTF-16: 00 48
"H" in little-endian UTF-16: 48 00
Both are valid! A byte order mark at the start (FE FF for big-endian, FF FE for little-endian) tells the reader which byte order was used.
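Python's UTF-16 codecs show all three possibilities. Note that the plain "utf-16" codec picks the machine's native byte order, so the BOM you get can differ between platforms:

```python
print("H".encode("utf-16-be").hex(" "))  # 00 48
print("H".encode("utf-16-le").hex(" "))  # 48 00

# The plain codec prepends a BOM in the machine's native order
# (ff fe on little-endian machines, fe ff on big-endian ones).
print("H".encode("utf-16").hex(" "))
```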
The single most important rule
After all the history and encoding details, this is the one thing that matters.
A string without a known encoding is meaningless. There is no such thing as "plain text."
Why this breaks websites
If a browser doesn't know the encoding of a page, it guesses based on byte frequency patterns. Sometimes it guesses wrong — a Bulgarian page shows up as Korean. The user sees gibberish.
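The classic symptom is easy to produce yourself: take UTF-8 bytes and decode them with a wrong guess (Latin-1 here):

```python
raw = "Résumé".encode("utf-8")  # é becomes the two bytes C3 A9
print(raw.decode("latin-1"))    # RÃ©sumÃ© -- each byte read as its own character
print(raw.decode("utf-8"))      # Résumé  -- correct, given the right encoding
```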
The fix for HTML
Put this as the very first tag inside <head>:
<meta charset="UTF-8">
Or the server sends: Content-Type: text/html; charset=utf-8
The fix for email
Email headers must declare: Content-Type: text/plain; charset="UTF-8"
Without this, your accented characters may arrive as ????.
Practical advice for developers
Always use UTF-8 for new projects — files, databases, APIs, HTML.
When reading old data, find out its encoding before touching it.
Never assume that text is ASCII or that "it looks fine on my machine."
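In Python, the advice above boils down to always passing `encoding=` explicitly instead of trusting the platform default (the filenames here are just examples):

```python
# Write and read with an explicit encoding.
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("naïve café")

with open("notes.txt", "r", encoding="utf-8") as f:
    print(f.read())  # naïve café

# Reading legacy data? Name its real encoding, don't guess:
# open("old.txt", encoding="cp1252")
```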