Unicode maps each character to a corresponding code point: a numeric value that identifies that character. A character encoding scheme then dictates how each code point is represented as a sequence of bits so that it can be stored in memory or on disk. UTF-16 and UTF-8 are the most commonly used encoding schemes for Unicode character data.
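Python exposes both halves of this pipeline directly: `ord()` returns a character's code point, and `str.encode()` applies an encoding scheme. A minimal sketch (the hex formatting is just for display):

```python
ch = "é"                              # one Unicode character

code_point = ord(ch)                  # the abstract numeric value: 0xE9
utf16 = ch.encode("utf-16-be")        # big-endian UTF-16, without a byte-order mark
utf8 = ch.encode("utf-8")

print(f"U+{code_point:04X}")          # U+00E9
print(utf16.hex(" ").upper())         # 00 E9
print(utf8.hex(" ").upper())          # C3 A9
```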
Below are some examples of how various characters are encoded in UTF-16 and UTF-8 (UTF-16 byte sequences are shown in big-endian order).
- Latin capital ‘A’, code point U+0041
  - UTF-16: 2 bytes, 00 41 (hex)
  - UTF-8: 1 byte, 41 (hex)
- Latin small letter ‘é’ (e with acute accent), code point U+00E9
  - UTF-16: 2 bytes, 00 E9 (hex)
  - UTF-8: 2 bytes, C3 A9 (hex) [110x xxxx 10xx xxxx]
- Mongolian letter A, code point U+1820
  - UTF-16: 2 bytes, 18 20 (hex)
  - UTF-8: 3 bytes, E1 A0 A0 (hex) [1110 xxxx 10xx xxxx 10xx xxxx]
- Ace of Spades playing card character, code point U+1F0A1
  - UTF-16: 4 bytes, D8 3C DC A1 (hex), encoded as a surrogate pair because the code point lies above U+FFFF (see the sketch after this list)
  - UTF-8: 4 bytes, F0 9F 82 A1 (hex) [1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx]
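To make the bracketed bit patterns and the UTF-16 surrogate-pair arithmetic concrete, here is a hand-rolled encoder sketch in Python. It is for illustration only: real code should use the built-in codecs, and this version skips validation (e.g. it does not reject lone surrogates or out-of-range values):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point using the UTF-8 bit patterns shown above."""
    if cp < 0x80:                         # 0xxx xxxx
        return bytes([cp])
    if cp < 0x800:                        # 110x xxxx 10xx xxxx
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    if cp < 0x10000:                      # 1110 xxxx 10xx xxxx 10xx xxxx
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,        # 1111 0xxx, then three continuation bytes
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

def utf16_encode(cp: int) -> bytes:
    """Encode a single code point as big-endian UTF-16."""
    if cp < 0x10000:                      # fits in a single 16-bit code unit
        return cp.to_bytes(2, "big")
    cp -= 0x10000                         # remaining 20 bits split across a surrogate pair
    high = 0xD800 | cp >> 10              # high surrogate carries the top 10 bits
    low = 0xDC00 | cp & 0x3FF             # low surrogate carries the bottom 10 bits
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")

for cp in (0x0041, 0x00E9, 0x1820, 0x1F0A1):
    print(f"U+{cp:04X}  UTF-16: {utf16_encode(cp).hex(' ').upper()}"
          f"  UTF-8: {utf8_encode(cp).hex(' ').upper()}")
```

Running this reproduces the byte sequences listed above; for U+1F0A1, the surrogate-pair arithmetic yields 0xD83C and 0xDCA1, i.e. the bytes D8 3C DC A1.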