How to convert a UTF-8 byte[] to a string in C#?

To convert a UTF-8 encoded byte array to a string in C#, use the Encoding.UTF8 property of the Encoding class in the System.Text namespace. The steps and examples below cover the common cases:

Step-by-Step Explanation

  1. Use Encoding.UTF8.GetString()
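    Decode the byte array directly to a string with the UTF-8 encoding.
  2. Handle Edge Cases
    Check for a null byte array, which makes GetString() throw ArgumentNullException. An empty array is safe and returns an empty string.
  3. Optional BOM (Byte Order Mark)
    UTF-8 data may begin with an optional BOM (0xEF, 0xBB, 0xBF). GetString() does not remove it: the BOM decodes to U+FEFF at the start of the string, so strip it yourself if needed.
  4. Error Handling
    Decide whether invalid UTF-8 sequences should be replaced with U+FFFD (the default) or should throw an exception.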

Examples

1. Basic Conversion

using System;
using System.Text;

byte[] utf8Bytes = { 72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100 }; // "Hello World"
string result = Encoding.UTF8.GetString(utf8Bytes);
Console.WriteLine(result); // Output: Hello World

2. Handle Null/Empty Arrays

byte[] utf8Bytes = null;

// GetString throws ArgumentNullException for null; an empty array simply yields ""
string result = utf8Bytes != null ? Encoding.UTF8.GetString(utf8Bytes) : string.Empty;
Console.WriteLine(result);
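
If the null check is needed in several places, it can be wrapped in a small helper. The following is a minimal sketch; the class name Utf8NullSafe and the method DecodeOrEmpty are hypothetical names chosen for illustration, not part of the .NET API.

```csharp
using System.Text;

public static class Utf8NullSafe
{
    // Hypothetical helper: returns "" for a null or empty array instead of throwing.
    public static string DecodeOrEmpty(byte[] bytes)
    {
        // Encoding.UTF8.GetString throws ArgumentNullException on null input.
        if (bytes == null || bytes.Length == 0)
        {
            return string.Empty;
        }
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Whether null should map to an empty string or be reported to the caller depends on the application; returning string.Empty is only one possible convention.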

3. With BOM (Byte Order Mark)

// Byte array with BOM: 0xEF, 0xBB, 0xBF
byte[] utf8BytesWithBOM = { 0xEF, 0xBB, 0xBF, 72, 101, 108, 108, 111 }; // BOM + "Hello"

// GetString does NOT strip the BOM: it decodes to U+FEFF at index 0
string result = Encoding.UTF8.GetString(utf8BytesWithBOM);
Console.WriteLine(result.Length); // Output: 6 (U+FEFF plus "Hello")

// To skip the BOM explicitly (if present), decode from offset 3:
if (utf8BytesWithBOM.Length >= 3 && 
    utf8BytesWithBOM[0] == 0xEF && 
    utf8BytesWithBOM[1] == 0xBB && 
    utf8BytesWithBOM[2] == 0xBF)
{
    result = Encoding.UTF8.GetString(utf8BytesWithBOM, 3, utf8BytesWithBOM.Length - 3);
}
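
The BOM check can also be written against Encoding.UTF8.GetPreamble(), which returns the three BOM bytes, so the magic numbers are not repeated. This is a sketch only; Utf8Bom and DecodeWithoutBom are hypothetical names, and the helper assumes a non-null input array.

```csharp
using System.Text;

public static class Utf8Bom
{
    // Hypothetical helper: decodes UTF-8 and drops a leading BOM if one is present.
    // Assumes bytes is not null.
    public static string DecodeWithoutBom(byte[] bytes)
    {
        // For Encoding.UTF8, GetPreamble() returns the BOM bytes EF BB BF.
        byte[] bom = Encoding.UTF8.GetPreamble();

        // The input has a BOM only if it is long enough and every BOM byte matches.
        bool hasBom = bytes.Length >= bom.Length;
        for (int i = 0; hasBom && i < bom.Length; i++)
        {
            hasBom = bytes[i] == bom[i];
        }

        // Decode from just past the BOM, or from the start if there is none.
        int offset = hasBom ? bom.Length : 0;
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }
}
```

Called with the eight bytes from the example above, the helper returns a five-character string; called with bytes that have no BOM, it decodes them unchanged.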

4. Handle Invalid UTF-8 Sequences

By default, invalid bytes are replaced with the Unicode replacement character (). To throw an error instead, use a custom encoder:

// Create a UTF-8 Encoding that throws on invalid bytes instead of substituting U+FFFD
var encoder = Encoding.GetEncoding(
    "UTF-8",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback()
);

byte[] invalidBytes = { 72, 101, 0xFF, 108, 111 }; // Invalid byte 0xFF

try
{
    string result = encoder.GetString(invalidBytes);
}
catch (DecoderFallbackException ex)
{
    Console.WriteLine("Invalid UTF-8 sequence: " + ex.Message);
}
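
An alternative way to get the same throwing behavior is the UTF8Encoding constructor, whose throwOnInvalidBytes argument enables exceptions on malformed input. The sketch below wraps it in a Try-style method so callers can validate input without writing their own try/catch; Utf8Strict and TryDecode are hypothetical names.

```csharp
using System.Text;

public static class Utf8Strict
{
    // throwOnInvalidBytes: true makes GetString throw instead of emitting U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: returns false instead of throwing on malformed UTF-8.
    public static bool TryDecode(byte[] bytes, out string text)
    {
        try
        {
            text = Strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // Malformed input: report failure and hand back an empty string.
            text = string.Empty;
            return false;
        }
    }
}
```

With the invalid byte array from the example above, TryDecode returns false; with well-formed UTF-8 it returns true and sets text to the decoded string.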

Key Considerations

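  • Namespace: Encoding lives in System.Text, so add using System.Text;.
  • BOM Handling: The BOM is optional in UTF-8 and many systems omit it, but some tools still write one. GetString() keeps it as U+FEFF, so strip it when the source may include one.
  • Performance: Encoding.UTF8.GetString() decodes the whole array in one call. To decode only part of an array, use the GetString(bytes, index, count) overload rather than copying the bytes first.
  • Error Behavior: By default, invalid bytes are replaced with U+FFFD. Use a DecoderFallback such as DecoderExceptionFallback to customize this behavior.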

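Alternative: Using ReadOnlySpan<byte> (C# 7.2+, .NET Core 2.1+)

To decode bytes that are already in a span, without copying them into a new array first (this overload is not available on .NET Framework):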

byte[] utf8Bytes = { 72, 101, 108, 108, 111 };
ReadOnlySpan<byte> byteSpan = utf8Bytes;
string result = Encoding.UTF8.GetString(byteSpan);
Console.WriteLine(result); // Output: Hello
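
The span overload is most useful when only part of a larger buffer holds the text. The following sketch slices the buffer with AsSpan before decoding; Utf8Span and DecodeSlice are hypothetical names, and the example assumes a runtime that provides the ReadOnlySpan<byte> overload of GetString.

```csharp
using System;
using System.Text;

public static class Utf8Span
{
    // Hypothetical helper: decodes count bytes starting at offset without copying them.
    public static string DecodeSlice(byte[] buffer, int offset, int count)
    {
        // AsSpan creates a view over the existing array; no bytes are copied.
        ReadOnlySpan<byte> slice = buffer.AsSpan(offset, count);
        return Encoding.UTF8.GetString(slice);
    }
}
```

For plain arrays, Encoding.UTF8.GetString(buffer, offset, count) gives the same result; the span form matters mainly when the data already arrives as a span or comes from stack or pooled memory.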

Summary

  • Use Encoding.UTF8.GetString(byteArray) for basic conversions.
  • Check for null arrays; empty arrays are safe and return an empty string.
  • Strip the BOM yourself if the data may include one.
  • Customize error handling with DecoderFallback.

This approach ensures proper handling of UTF-8 byte arrays in C#.
