Text is complicated. Here's a visual guide to how character encoding became a mess, and how we got out of it.
A brief history of character encoding chaos
Before Unicode, every country and software vendor invented their own systems — and they didn't agree with each other.
1960s
ASCII — 7-bit English only
128 characters: English letters, digits, punctuation. Codes 0–127. Worked fine if you only wrote in English.
1970s–80s
OEM free-for-all (128–255)
Everyone used the spare high bit differently. Code 130 = é in France, ג in Israel. Résumés arrived as rגsumגs.
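The clash is easy to reproduce today: those old OEM code pages survive as codec names in Python, so a short sketch shows the exact garbling described above.

```python
# The same byte means different letters under different OEM code pages:
# 0x82 (decimal 130) is é in CP437 (US/Western) but ג in CP862 (Hebrew).
print(b"\x82".decode("cp437"))  # é
print(b"\x82".decode("cp862"))  # ג

# Encode with one page, decode with the other -> the garbled résumé.
print("Résumés".encode("cp437").decode("cp862"))  # Rגsumגs
```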
1980s
ANSI code pages — standardised mess
Below 128: agreed. Above 128: hundreds of "code pages" (CP862 for Hebrew, 737 for Greek…). Hebrew + Greek on one screen? Impossible.
1980s (Asia)
DBCS — double-byte hacks
Asian scripts have thousands of characters. Some letters used 1 byte, some 2. Moving backwards through a string required special OS functions.
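Shift JIS, one of these double-byte encodings (still available as a Python codec), makes the bookkeeping problem concrete:

```python
# In Shift JIS, ASCII letters take 1 byte and Japanese characters take 2,
# so byte offsets and character offsets diverge.
text = "A日本"                  # 1 ASCII letter + 2 kanji
data = text.encode("shift_jis")
print(len(text))                # 3 characters
print(len(data))                # 5 bytes: 1 + 2 + 2

# Slicing the byte string mid-character yields garbage or an error,
# which is why stepping backwards needed special OS functions.
```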
1991
Unicode — one ring to rule them all
A single universal standard. Every character in every writing system (and, unofficially, even Klingon) gets a unique code point. No more conflicts.
The internet exposed the mess: once you started sending text between computers, all those conflicting systems broke down.
How Unicode thinks about characters
Unicode separates two ideas that used to be mixed up: what a character is vs. how it's stored in memory.
Step 1 — Platonic ideal (abstract identity)
Every letter is an abstract concept, independent of font, style, or how it looks. A in Times New Roman = A in Arial. They're the same character.
Step 2 — Code point (a unique number)
Each abstract character gets a unique number, written U+XXXX.
A = U+0041 · é = U+00E9 · ع = U+0639 · 😀 = U+1F600
There are 1,114,112 possible code points (U+0000 through U+10FFFF), far more than 65,536, so the "Unicode is just 16 bits" myth is wrong.
Step 3 — Encoding (how it's stored)
Code points are abstract. Encoding is how you write them to disk or memory as actual bytes. This is a separate decision — and there are several options.
"Hello" as code points: U+0048 U+0065 U+006C U+006C U+006F — five numbers. Not yet bytes. Not yet stored. Just identity.
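Python exposes code points directly through `ord()` and `chr()`, which makes the identity-versus-storage split concrete:

```python
# Code points are just numbers; no bytes exist yet.
print([f"U+{ord(c):04X}" for c in "Hello"])
# ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']

# chr() goes the other way: number -> abstract character.
print(chr(0x1F600))  # 😀, whatever font happens to render it
```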
Unicode encodings compared
Same code points, different bytes on disk. Each encoding has trade-offs.
Encoding
How it works
Best for
UTF-8 (recommended)
1 byte for ASCII (0–127), 2–4 bytes for everything else
Web, files, APIs. ASCII-compatible — English text looks identical to ASCII.
UTF-16
2 bytes for most characters, 4 (a surrogate pair) for rarer ones. Usually prefixed with a byte-order mark (BOM) to signal endianness
Windows internals, Java, JavaScript strings
UTF-32 / UCS-4
Always 4 bytes per character — fixed width
Internal processing where random access matters. Very space-inefficient.
UCS-2
Always 2 bytes, max 65,536 chars. Can't represent rare characters.
Legacy Windows, older COM/VB programs
Latin-1 / ISO-8859-1 (limited)
1 byte, 256 characters. Western European only.
Legacy Western European documents. Russian? Chinese? → ???
UTF-8 storing "Hello": 48 65 6C 6C 6F — identical to ASCII. Americans won't even notice the difference.
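The trade-offs in the table are easy to measure, since each encoding is a Python codec name:

```python
s = "Hello"
print(s.encode("utf-8").hex(" "))   # 48 65 6c 6c 6f -- identical to ASCII
print(len(s.encode("utf-16-le")))   # 10 bytes: 2 per character
print(len(s.encode("utf-32-le")))   # 20 bytes: 4 per character

# Outside ASCII the picture flips: UTF-8 needs up to 4 bytes.
print(len("😀".encode("utf-8")))    # 4
```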
Byte order mark problem (UTF-16)
"H" in big-endian UTF-16: 00 48
"H" in little-endian UTF-16: 48 00
Both are valid! A byte order mark at the start (FE FF for big-endian, FF FE for little-endian) tells the reader which byte order was used.
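Python's UTF-16 codecs show all three possibilities. Note that the plain "utf-16" codec picks the machine's native byte order, so the BOM you get can differ between platforms:

```python
print("H".encode("utf-16-be").hex(" "))  # 00 48
print("H".encode("utf-16-le").hex(" "))  # 48 00

# The plain codec prepends a BOM in the machine's native order
# (ff fe on little-endian machines, fe ff on big-endian ones).
print("H".encode("utf-16").hex(" "))
```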
The single most important rule
After all the history and encoding details, this is the one thing that matters.
A string without a known encoding is meaningless. There is no such thing as "plain text."
Why this breaks websites
If a browser doesn't know the encoding of a page, it guesses based on byte frequency patterns. Sometimes it guesses wrong — a Bulgarian page shows up as Korean. The user sees gibberish.
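The classic symptom is easy to produce yourself: take UTF-8 bytes and decode them with a wrong guess (Latin-1 here):

```python
raw = "Résumé".encode("utf-8")  # é becomes the two bytes C3 A9
print(raw.decode("latin-1"))    # RÃ©sumÃ© -- each byte read as its own character
print(raw.decode("utf-8"))      # Résumé  -- correct, given the right encoding
```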
The fix for HTML
Put this as the very first tag inside <head>:
<meta charset="UTF-8">
Or the server sends: Content-Type: text/html; charset=utf-8
The fix for email
Email headers must declare: Content-Type: text/plain; charset="UTF-8"
Without this, your accented characters may arrive as ????.
Practical advice for developers
Always use UTF-8 for new projects — files, databases, APIs, HTML.
When reading old data, find out its encoding before touching it.
Never assume that text is ASCII or that "it looks fine on my machine."
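In Python, the advice above boils down to always passing `encoding=` explicitly instead of trusting the platform default (the filenames here are just examples):

```python
# Write and read with an explicit encoding.
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("naïve café")

with open("notes.txt", "r", encoding="utf-8") as f:
    print(f.read())  # naïve café

# Reading legacy data? Name its real encoding, don't guess:
# open("old.txt", encoding="cp1252")
```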