Overview - Char type and Unicode behavior

What is it?

The Char type in C# (the char keyword, an alias for System.Char) stores a 16-bit value that is one UTF-16 code unit. A single Char can represent any character in the Basic Multilingual Plane (BMP), which covers the letters and symbols of most modern languages. Characters outside the BMP, including most emojis, need two Char values. Understanding how Char works with Unicode helps you handle text correctly in programs.

Why it matters

Without understanding Char and Unicode, programs can misinterpret characters, causing bugs like wrong text display or data corruption. Since computers store text as numbers, knowing how characters map to numbers ensures your program reads, writes, and processes text reliably across languages and platforms. This is crucial for apps that handle international text or special symbols.

Where it fits

Before learning Char and Unicode, you should know basic data types and how computers store numbers. After this, you can learn about strings, text encoding, and globalization in programming. This knowledge builds a foundation for working with text input, output, and storage in real-world applications.

Mental Model

Core Idea

A Char in C# is a 16-bit number representing a single UTF-16 code unit, which maps to a character or part of a character in Unicode.

Think of it like...

Think of Char as a single tile in a mosaic. Each tile has a color code (number) that tells you what part of the picture it shows. Sometimes one tile shows a whole small image (a simple character), but sometimes you need two tiles together to see a full picture (a complex character).

┌───────────────┐
│   Char (16b)  │
├───────────────┤
│ UTF-16 code   │
│ unit number   │
└──────┬────────┘
       │
       ▼
┌───────────────┐   ┌────────────────┐
│ Simple char   │   │ Surrogate pair │
│ (one code)    │   │ (two codes)    │
└───────────────┘   └────────────────┘

Build-Up - 7 Steps

1

Foundation: What is the Char type in C#

Concept: Char stores a single 16-bit UTF-16 code unit, which for most common characters is the whole character.

In C#, the Char type holds one UTF-16 code unit. It is a value type and uses 2 bytes of memory. You can declare a Char like this: char letter = 'A'; This stores the value 65 (0x0041), which is the Unicode code point of 'A'.

Result

You have a variable that holds one character, like 'A', stored as a number internally.

Understanding Char as a 16-bit number is key to knowing how characters are stored and manipulated in C#.
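
To make the "character stored as a number" idea concrete, here is a minimal sketch. It assumes a project that allows C# top-level statements (C# 9 or later); the variable names are only illustrative.

```csharp
using System;

char letter = 'A';

// A char converts implicitly to int, exposing its UTF-16 code unit value.
int codeUnit = letter;

// Arithmetic produces an int; cast back to char to treat the result as text.
char next = (char)(letter + 1);

Console.WriteLine($"{letter} = {codeUnit} (0x{codeUnit:X4})"); // A = 65 (0x0041)
Console.WriteLine(next);                                        // B
Console.WriteLine(sizeof(char));                                // 2 (bytes)
```

The cast back to char is needed because letter + 1 is evaluated as an int.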

2

Foundation: Unicode and UTF-16 basics

3

Intermediate: Simple characters vs surrogate pairs

4

Intermediate: Char operations and Unicode values

5

Intermediate: Limitations of Char with full Unicode text

6

Advanced: Working with surrogate pairs in C#

7

Expert: Internal representation and performance trade-offs

Under the Hood

Char stores a 16-bit unsigned integer representing a UTF-16 code unit. For characters in the Basic Multilingual Plane (BMP), this 16-bit value is the Unicode code point itself. For characters outside the BMP, two Char values form a surrogate pair: a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF). A string is an immutable sequence of Char values, so indexing and Length work in code units, not full Unicode characters. Methods like Char.IsSurrogate, Char.IsHighSurrogate, and Char.IsLowSurrogate detect surrogate halves.
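
The following sketch shows what indexing by code unit looks like in practice. It assumes C# top-level statements, and the emoji U+1F60A is written as its two escaped code units so the example does not depend on the source file's encoding.

```csharp
using System;

// 'a', then U+1F60A as a high/low surrogate pair, then 'b'.
string text = "a\uD83D\uDE0Ab";

Console.WriteLine(text.Length);                   // 4 code units for 3 code points
Console.WriteLine(char.IsSurrogate(text[0]));     // False: 'a' is a BMP character
Console.WriteLine(char.IsHighSurrogate(text[1])); // True
Console.WriteLine(char.IsLowSurrogate(text[2]));  // True
Console.WriteLine(char.IsSurrogatePair(text, 1)); // True: a valid pair starts at index 1
```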

Why designed this way?

C# and .NET chose UTF-16 and Char to align with Windows APIs and Unicode standards at the time, balancing memory use and compatibility. UTF-16 was widely adopted for international text, supporting most characters in one unit and allowing surrogate pairs for others. This design simplifies many operations but requires care with surrogate pairs. Alternatives like UTF-8 were less common in Windows environments when .NET was designed.

┌───────────────┐
│   Char (16b)  │
├───────────────┤
│ 0x0000-0xD7FF │  ← BMP characters (single Char)
│ 0xD800-0xDBFF │  ← High surrogate (first half)
│ 0xDC00-0xDFFF │  ← Low surrogate (second half)
│ 0xE000-0xFFFF │  ← BMP characters (single Char)
└──────┬────────┘
       │
       ▼
┌────────────────────────────────┐
│ Surrogate pair (two Char)      │
│ High surrogate + Low surrogate │
│ represent one Unicode char     │
└────────────────────────────────┘
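
The ranges in the diagram imply a simple piece of arithmetic: each surrogate contributes 10 bits of the code point's offset above 0x10000. The sketch below works through that calculation for U+1F60A and cross-checks it against char.ConvertToUtf32. It assumes C# top-level statements; the variable names are illustrative.

```csharp
using System;

char high = '\uD83D'; // in the range 0xD800-0xDBFF
char low = '\uDE0A';  // in the range 0xDC00-0xDFFF

// Strip the range bases, then place the high 10 bits above the low 10 bits.
int codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

Console.WriteLine($"U+{codePoint:X}");                          // U+1F60A
Console.WriteLine(codePoint == char.ConvertToUtf32(high, low)); // True
```

In real code the library call is usually the better choice, since it also validates that both halves are in range.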

Myth Busters - 4 Common Misconceptions

Quick: Does one Char always equal one visible character? Commit yes or no.

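Common Belief: One Char always represents one full character visible to the user.

Reality: A Char is one UTF-16 code unit. Characters outside the Basic Multilingual Plane take two Char values, and a single visible character can also be built from several code points.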

Quick: Can you safely index a string by position to get full characters? Commit yes or no.

Common Belief: Indexing a string by position always returns a complete character.

Reality: Indexing returns a single UTF-16 code unit, so it can return one half of a surrogate pair.

Quick: Is UTF-16 the only way to encode Unicode in memory? Commit yes or no.

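Common Belief: UTF-16 and Char are the only ways to represent Unicode characters in memory.

Reality: No. UTF-16 is the encoding .NET strings use, but the same text can also be held in other encodings, such as UTF-8 encoded byte arrays.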
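
As a rough comparison of encodings, this sketch counts how many bytes the same three code points occupy in UTF-16, UTF-8, and UTF-32 using the standard System.Text.Encoding classes. It assumes C# top-level statements; the sample string is arbitrary.

```csharp
using System;
using System.Text;

// 'A' (U+0041), 'é' (U+00E9), and U+1F60A as a surrogate pair.
string text = "A\u00E9\uD83D\uDE0A";

int utf16Bytes = Encoding.Unicode.GetByteCount(text); // 2 + 2 + 4
int utf8Bytes = Encoding.UTF8.GetByteCount(text);     // 1 + 2 + 4
int utf32Bytes = Encoding.UTF32.GetByteCount(text);   // 4 + 4 + 4

Console.WriteLine($"UTF-16: {utf16Bytes}, UTF-8: {utf8Bytes}, UTF-32: {utf32Bytes}");
// UTF-16: 8, UTF-8: 7, UTF-32: 12
```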

Quick: Does casting Char to int always give the Unicode code point? Commit yes or no.

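Common Belief: Casting a Char to int always gives the full Unicode code point of the character.

Reality: The cast gives the UTF-16 code unit value. That equals the code point only for BMP characters; for a surrogate it is just half of a pair.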

Expert Zone

1

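Surrogate pairs must always appear in high-low order. Reversing them produces ill-formed UTF-16: a .NET string will still store it without complaint, but APIs such as char.ConvertToUtf32 throw on it, and encoders typically substitute the replacement character U+FFFD.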

2

Some user-perceived characters (grapheme clusters) are formed by combining multiple code points, which a single Char cannot represent.

3

String normalization affects how characters are stored and compared, impacting Char-level operations.
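
The three points above can be sketched in a few lines. This is an illustration rather than a complete treatment: it assumes C# top-level statements and a runtime with normalization data available (the default on most platforms), and the variable names are made up for the example.

```csharp
using System;
using System.Text;

// Point 1: a low surrogate followed by a high surrogate is not a valid pair.
string reversed = "\uDE0A\uD83D";
Console.WriteLine(char.IsSurrogatePair(reversed, 0)); // False

// Point 2: one visible character built from two code points.
string precomposed = "\u00E9"; // é as a single code point
string decomposed = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT
Console.WriteLine(precomposed.Length);                // 1
Console.WriteLine(decomposed.Length);                 // 2

// Point 3: == on strings compares code units, so the two forms differ
// until both are brought to the same normalization form.
Console.WriteLine(precomposed == decomposed);         // False
string recomposed = decomposed.Normalize(NormalizationForm.FormC);
Console.WriteLine(precomposed == recomposed);         // True
```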

When NOT to use

Char is not suitable for holding a full Unicode character that may require multiple code units, such as an emoji or a combined character. Instead, use string-level APIs like System.Globalization.StringInfo, the System.Text.Rune type (available from .NET Core 3.0) for whole code points, or libraries that handle grapheme clusters. For performance-sensitive text processing, consider UTF-8 encoded byte arrays with specialized parsers.
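
To see how the different levels disagree about "how many characters" a string has, here is a small sketch that counts the same text three ways. It assumes .NET Core 3.0 or later (for System.Text.Rune and string.EnumerateRunes) and C# top-level statements.

```csharp
using System;
using System.Globalization;
using System.Text;

// 'e' + combining acute accent (one visible character), then U+1F60A.
string text = "e\u0301\uD83D\uDE0A";

int codeUnits = text.Length; // counts Char values

int codePoints = 0;
foreach (Rune rune in text.EnumerateRunes())
{
    codePoints++; // each Rune is one whole Unicode scalar value
}

int textElements = new StringInfo(text).LengthInTextElements; // user-perceived characters

Console.WriteLine($"{codeUnits} code units, {codePoints} code points, {textElements} text elements");
// 4 code units, 3 code points, 2 text elements
```

Which count is the right one depends on the task: buffer sizes need code units, code point logic needs runes, and cursor movement or truncation for display needs text elements.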

Production Patterns

In production, Char is used for low-level text processing, parsing, and validation where single code units suffice. For user-facing text, developers use string APIs that handle surrogate pairs and normalization. Libraries and frameworks often abstract away Char details, but understanding it helps debug encoding issues and optimize performance.
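
As an example of low-level validation where single code units are enough, the sketch below checks whether a token consists only of ASCII hexadecimal digits. IsAsciiHexToken is a hypothetical helper written for this illustration, not a framework method; it assumes C# top-level statements.

```csharp
using System;

// Hypothetical helper: every accepted character is ASCII, so comparing
// individual code units is safe and no surrogate handling is needed.
static bool IsAsciiHexToken(string token)
{
    if (token.Length == 0) return false;

    foreach (char c in token)
    {
        bool isHexDigit = (c >= '0' && c <= '9')
                       || (c >= 'a' && c <= 'f')
                       || (c >= 'A' && c <= 'F');
        if (!isHexDigit) return false;
    }
    return true;
}

Console.WriteLine(IsAsciiHexToken("1F60A"));        // True
Console.WriteLine(IsAsciiHexToken("1F60G"));        // False
Console.WriteLine(IsAsciiHexToken("\uD83D\uDE0A")); // False: surrogates are outside the accepted ranges
```

Any surrogate half simply fails the range checks, which is why this kind of code does not need to reason about pairs at all.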

Connections

String encoding and decoding

Builds-on

Understanding Char and UTF-16 is essential to grasp how strings encode text and how encoding affects storage and transmission.

Internationalization and globalization

Builds-on

Knowing Char and Unicode behavior helps handle multilingual text correctly, a core need in global software.

Digital image pixels

Analogy in data representation

Just as pixels combine to form images, Char units combine to form characters; understanding this helps appreciate data granularity in different fields.

Common Pitfalls

#1 Assuming one Char equals one character and indexing strings by Char position to get characters.

Wrong approach: char c = myString[5]; // assumes c is a full character

Correct approach: var info = new System.Globalization.StringInfo(myString); string character = info.SubstringByTextElements(5, 1); // sixth text element, returned as a string

Root cause: Forgetting that a string is a sequence of UTF-16 code units, not full characters, so an index can land in the middle of a surrogate pair.
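
A short sketch of pitfall #1, using a sample string chosen so that the emoji sits at position 2. It assumes C# top-level statements; myString mirrors the name used above, and the other names are illustrative.

```csharp
using System;
using System.Globalization;

// "ab", then U+1F60A as a surrogate pair, then "cd".
string myString = "ab\uD83D\uDE0Acd";

// Indexing by code unit returns only the first half of the emoji.
char half = myString[2];

// Indexing by text element returns the whole emoji as a string.
string whole = new StringInfo(myString).SubstringByTextElements(2, 1);

Console.WriteLine(char.IsHighSurrogate(half)); // True
Console.WriteLine(whole.Length);               // 2: both code units of the pair
Console.WriteLine(whole == "\uD83D\uDE0A");    // True
```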

#2 Casting Char to int and treating it as a full Unicode code point.

Wrong approach: int codePoint = (int)myChar; // may be half of a surrogate pair

Correct approach: Use char.IsSurrogate(myChar) to detect surrogates, and char.ConvertToUtf32(myString, index) to combine a pair into its full code point.

Root cause: Ignoring surrogate pairs and treating UTF-16 code units as full Unicode code points.
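
One way to apply the correct approach across a whole string is to walk it by code point, advancing two positions whenever a pair is found. This is a sketch that assumes well-formed input and C# top-level statements; char.ConvertToUtf32 throws an ArgumentException if it meets an unpaired surrogate.

```csharp
using System;
using System.Collections.Generic;

// 'A' followed by U+1F60A as a surrogate pair.
string text = "A\uD83D\uDE0A";
var codePoints = new List<int>();

for (int i = 0; i < text.Length; i++)
{
    // Reads one code unit, or a full pair when a valid one starts at i.
    codePoints.Add(char.ConvertToUtf32(text, i));

    // Skip the low surrogate that was just consumed as part of the pair.
    if (char.IsSurrogatePair(text, i)) i++;
}

foreach (int cp in codePoints)
{
    Console.WriteLine($"U+{cp:X4}");
}
// U+0041
// U+1F60A
```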

#3 Using Char to store emojis or complex characters directly.

Wrong approach: char emoji = '😊'; // does not compile: the emoji needs a surrogate pair and cannot fit in one Char

Correct approach: string emoji = "😊"; // string holds the surrogate pair correctly

Root cause: Not recognizing that some characters require two Char units, so Char alone is insufficient.
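
When the emoji is known by its code point rather than typed as a literal, char.ConvertFromUtf32 builds the string for you. The sketch below assumes C# top-level statements and prints the two code units that make up U+1F60A.

```csharp
using System;

// ConvertFromUtf32 returns a string, because the result may need two Char values.
string emoji = char.ConvertFromUtf32(0x1F60A);

char[] units = emoji.ToCharArray();

Console.WriteLine(emoji.Length);                             // 2
Console.WriteLine($"{(int)units[0]:X4} {(int)units[1]:X4}"); // D83D DE0A
```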

Key Takeaways

Char in C# stores a 16-bit UTF-16 code unit, not always a full character.

Unicode characters outside the Basic Multilingual Plane require two Char values called surrogate pairs.

Indexing strings by Char position may split surrogate pairs, so use specialized APIs for full characters.

Casting Char to int gives the UTF-16 code unit value, which may be only part of a character.

Understanding Char and Unicode behavior is essential for correct text processing and internationalization.