Accessing Underlying Bytes in a String for Different Encodings

Strings in .NET are stored in memory as Unicode character data using the UTF-16 encoding: 2 bytes per character for most characters, or 4 bytes (a surrogate pair) for characters outside the Basic Multilingual Plane.
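As a minimal sketch of what this means in practice, the snippet below inspects the ideograph used later in this post. It assumes C# 9 or later (top-level statements); the variable names are illustrative only.

```csharp
using System;

string ideograph = "𠈓";  // a character outside the Basic Multilingual Plane

int codeUnits = ideograph.Length;                  // counts UTF-16 code units, not characters
bool isPair = char.IsSurrogatePair(ideograph, 0);  // do the code units at index 0 and 1 form a pair?
int bytesInMemory = codeUnits * sizeof(char);      // sizeof(char) is 2 in C#

Console.WriteLine($"code units: {codeUnits}, surrogate pair: {isPair}, bytes: {bytesInMemory}");
```

Note that Length reports 2 for this single character, because it counts UTF-16 code units rather than Unicode code points.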

If you want access to the underlying data for a string, you can call one of the methods listed below, which convert the string to a byte array in the encoding you choose. With Encoding.Unicode (UTF-16, little-endian), you get the same UTF-16 data that the String type holds in memory, as it is laid out on a little-endian platform.

System.Text.Encoding.Unicode.GetBytes – UTF-16
System.Text.Encoding.UTF8.GetBytes – UTF-8
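The following is a small sketch, not a definitive recipe, showing that both methods are reversible with the matching GetString call and that the byte counts differ by encoding. The sample string is a hypothetical value chosen for illustration, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text;

string text = "h\u00E9llo";  // hypothetical sample: "héllo", with one non-ASCII character

// Encode the same string in two different encodings
byte[] utf16 = Encoding.Unicode.GetBytes(text);
byte[] utf8 = Encoding.UTF8.GetBytes(text);

// Decode with the matching encoding to recover the original string
string fromUtf16 = Encoding.Unicode.GetString(utf16);
string fromUtf8 = Encoding.UTF8.GetString(utf8);

Console.WriteLine($"UTF-16: {utf16.Length} bytes, UTF-8: {utf8.Length} bytes");
Console.WriteLine(fromUtf16 == text && fromUtf8 == text);
```

Here UTF-16 uses 2 bytes for each of the five characters, while UTF-8 uses 1 byte for each ASCII letter and 2 bytes for the accented one.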

In the example below, notice the different byte sequences used to encode the CJK character.

using System.Text;

string ideograph = "𠈓";
byte[] utf16 = Encoding.Unicode.GetBytes(ideograph);  // UTF-16 (little-endian) bytes
byte[] utf8 = Encoding.UTF8.GetBytes(ideograph);      // UTF-8 bytes
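To actually see the two byte sequences, one option is to print them as hex. The sketch below uses BitConverter.ToString for that; the variable names utf16Hex and utf8Hex are illustrative, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text;

string ideograph = "𠈓";
byte[] utf16 = Encoding.Unicode.GetBytes(ideograph);
byte[] utf8 = Encoding.UTF8.GetBytes(ideograph);

// BitConverter.ToString formats each byte as two hex digits, separated by hyphens
string utf16Hex = BitConverter.ToString(utf16);
string utf8Hex = BitConverter.ToString(utf8);

Console.WriteLine($"UTF-16 ({utf16.Length} bytes): {utf16Hex}");
Console.WriteLine($"UTF-8  ({utf8.Length} bytes): {utf8Hex}");
```

Both arrays are 4 bytes long for this character, but the byte values differ: UTF-16 stores a surrogate pair, while UTF-8 stores a single 4-byte sequence.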
