Getting Length of String that Contains Surrogate Pairs

You can use the string.Length property to get the length of a string.  Length reports the number of char values (UTF-16 code units), however, so it matches the number of characters only when every code point is no larger than U+FFFF.  This set of code points is known as the Basic Multilingual Plane (BMP).
Unicode code points outside of the BMP are represented in UTF-16 using a 4-byte surrogate pair (two 2-byte code units), rather than a single 2-byte code unit.
To correctly count the number of characters in a string that may contain code points higher than U+FFFF, you can use the StringInfo class (from System.Globalization).  Its LengthInTextElements property counts text elements, so a surrogate pair is counted once.
using System;
using System.Globalization;

// 3 Latin (ASCII) characters
string simple = "abc";

// 3 character string where one character
//  is a surrogate pair
string containsSurrogatePair = "A𠈓C";

// Length=3 (correct)
Console.WriteLine("Length 1 = {0}", simple.Length);

// Length=4 (counts UTF-16 code units, so the
//  surrogate pair is counted twice)
Console.WriteLine("Length 2 = {0}", containsSurrogatePair.Length);

// Better, reports Length=3
StringInfo si = new StringInfo(containsSurrogatePair);
Console.WriteLine("Length 3 = {0}", si.LengthInTextElements);
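
If what you want is the number of Unicode code points rather than text elements, one possible approach is to walk the string and treat each surrogate pair as a single unit. The sketch below is only an illustration: CodePointCounter and CountCodePoints are hypothetical names, not part of the .NET API. It relies on char.IsSurrogatePair(string, int) to detect a high surrogate followed by a low surrogate. The second test string, an "e" followed by a combining acute accent, is an added example showing that code points and text elements are not always the same count.

```csharp
using System;
using System.Globalization;

static class CodePointCounter
{
    // Hypothetical helper (not a .NET API): counts Unicode code points
    // by treating each valid surrogate pair as one unit.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A high surrogate followed by a low surrogate is one
            // code point, so skip over the second half of the pair.
            if (char.IsSurrogatePair(s, i))
            {
                i++;
            }
            count++;
        }
        return count;
    }

    static void Main()
    {
        string containsSurrogatePair = "A𠈓C";
        // 4 code units, 3 code points, 3 text elements
        Console.WriteLine("Code units    = {0}", containsSurrogatePair.Length);
        Console.WriteLine("Code points   = {0}", CountCodePoints(containsSurrogatePair));
        Console.WriteLine("Text elements = {0}",
            new StringInfo(containsSurrogatePair).LengthInTextElements);

        // "e" + combining acute accent: 2 code points, 1 text element
        string combining = "e\u0301";
        Console.WriteLine("Code points   = {0}", CountCodePoints(combining));
        Console.WriteLine("Text elements = {0}",
            new StringInfo(combining).LengthInTextElements);
    }
}
```

For the surrogate pair example the two counts agree, but for the combining sequence StringInfo reports 1 while the code point count is 2, so pick whichever matches what "character" means in your application.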