Intro to Computing · Fundamentals · ~15 mins

How text is stored (ASCII, Unicode) in Intro to Computing - Mechanics & Internals

Overview - How text is stored (ASCII, Unicode)
What is it?
Text in computers is stored as numbers that represent letters, symbols, and characters. ASCII and Unicode are systems that assign these numbers to characters so computers can understand and display text. ASCII uses numbers from 0 to 127 for basic English characters, while Unicode covers almost all characters from all languages worldwide. This allows computers to show text correctly no matter the language or symbol.
Why it matters
Without a standard way to store text, computers would not understand each other or display words correctly. Imagine sending a message where letters turn into strange symbols or question marks. ASCII and Unicode solve this by giving every character a unique number, making communication and reading on computers reliable and universal. This is why you can read emails, websites, and documents in many languages on any device.
Where it fits
Before learning this, you should understand basic computer data like bits and bytes. After this, you can learn about text encoding formats like UTF-8 and UTF-16, which are ways to save Unicode characters efficiently. This topic fits into the broader study of how computers handle data and communicate.
Mental Model
Core Idea
Text is stored as numbers where each number stands for a specific character, and ASCII and Unicode are the main systems that map characters to these numbers.
Think of it like...
Think of text storage like a library catalog where each book (character) has a unique number (code). ASCII is a small catalog for English books, while Unicode is a huge catalog covering books from every language and symbol you can imagine.
┌───────────────┐
│ Character Set │
├───────────────┤
│ ASCII (0-127) │───> Basic English letters, digits, symbols
│ Unicode       │───> All world languages, emojis, symbols
└───────────────┘

Character → Number → Stored as bits in computer memory
Build-Up - 7 Steps
1
Foundation · What is text in computers
Concept: Text is stored as numbers inside computers because computers only understand numbers.
Every letter, number, or symbol you see on a screen is actually stored as a number. For example, the letter 'A' is stored as the number 65. Computers use these numbers to show the correct characters on your screen.
Result
You understand that text is not stored as letters but as numbers that represent letters.
Knowing that text is stored as numbers helps you understand why we need systems like ASCII and Unicode to map characters to numbers.
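The character-to-number mapping above can be seen directly in Python using the built-in `ord` and `chr` functions (a minimal sketch; the word "Hi" is just an illustrative example):

```python
# ord() gives the number behind a character; chr() goes the other way.
word = "Hi"
codes = [ord(c) for c in word]
print(codes)                            # [72, 105]
print(''.join(chr(n) for n in codes))   # Hi
```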
2
Foundation · Bits and bytes basics
Concept: Computers store data in bits and bytes, which are groups of bits.
A bit is a tiny piece of data that can be 0 or 1. Eight bits make a byte. Each character is stored as one or more bytes. For example, ASCII uses one byte per character.
Result
You see how characters are stored as bytes made of bits.
Understanding bits and bytes is essential because text encoding assigns numbers that fit into these bytes.
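One way to see a character's byte as bits in Python (a small sketch using the built-in `format` with the `'08b'` binary spec):

```python
# One byte = 8 bits. format(n, '08b') shows the 8-bit pattern of a number 0-255.
n = ord('A')                # 65
bits = format(n, '08b')
print(bits)                 # 01000001
print(len(bits))            # 8 -- eight bits make one byte
print(int(bits, 2))         # 65 -- the bits decode back to the same number
```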
3
Intermediate · ASCII: The original text code
🤔Before reading on: do you think ASCII can represent characters from all languages or just English? Commit to your answer.
Concept: ASCII is a system that assigns numbers from 0 to 127 to English letters, digits, and some symbols.
ASCII stands for American Standard Code for Information Interchange. It uses 7 bits to represent characters like A-Z, a-z, digits 0-9, and some punctuation. For example, 'A' is 65, 'a' is 97, and '0' is 48.
Result
You learn that ASCII is limited to basic English characters and some control codes.
Knowing ASCII's limits explains why a bigger system like Unicode was needed for global text.
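The ASCII values named in this step can be checked in Python (a quick sketch; the sample sentence is just an illustration):

```python
# The ASCII codes mentioned above: 'A' is 65, 'a' is 97, '0' is 48.
for ch in ['A', 'a', '0']:
    print(ch, ord(ch))

# Every character of plain English text stays inside ASCII's 0-127 range.
print(max(ord(c) for c in "Hello, world!") < 128)   # True
```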
4
Intermediate · Unicode: Universal character set
🤔Before reading on: do you think Unicode uses fixed or variable length codes for characters? Commit to your answer.
Concept: Unicode assigns a unique number to every character from almost all languages and symbols worldwide.
Unicode can represent over a million characters. It includes alphabets, emojis, symbols, and scripts from all languages. Unicode numbers are called code points and look like U+0041 for 'A'. Unicode can be stored in different ways like UTF-8 or UTF-16, which use variable bytes per character.
Result
You understand that Unicode solves the problem of representing global text in computers.
Understanding Unicode's vast range and flexibility is key to handling modern text data correctly.
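Code points in the U+XXXX notation described above can be printed for any character in Python (a minimal sketch; the sample characters are illustrative):

```python
# ord() returns the Unicode code point; format it in the standard U+XXXX style.
for ch in ['A', 'é', '中', '😀']:
    print(ch, f"U+{ord(ch):04X}")
# A U+0041, é U+00E9, 中 U+4E2D, 😀 U+1F600
```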
5
Intermediate · Encoding formats: UTF-8 and UTF-16
🤔Before reading on: do you think UTF-8 uses the same number of bytes for all characters? Commit to your answer.
Concept: UTF-8 and UTF-16 are ways to save Unicode characters using different numbers of bytes depending on the character.
UTF-8 uses 1 to 4 bytes per character and is backward compatible with ASCII. UTF-16 uses 2 or 4 bytes per character. These formats help save space and support all Unicode characters. For example, English letters use 1 byte in UTF-8, but emojis use 4 bytes.
Result
You learn how Unicode characters are stored efficiently in memory and files.
Knowing encoding formats helps you understand file sizes and why some text files look strange if opened with the wrong encoding.
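The variable byte counts described above can be measured directly with Python's `str.encode` (a small sketch; `utf-16-le` is used to count bytes without the byte-order mark):

```python
# Bytes per character under UTF-8 and UTF-16 for the same four characters.
for ch in ['A', 'é', '中', '😀']:
    print(ch, len(ch.encode('utf-8')), len(ch.encode('utf-16-le')))
# A: 1 and 2 bytes; é: 2 and 2; 中: 3 and 2; 😀: 4 and 4
```

Note how ASCII characters cost a single byte in UTF-8 but two in UTF-16, which is one reason UTF-8 dominates for mostly-English text.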
6
Advanced · Why ASCII alone is not enough
🤔Before reading on: do you think ASCII can represent emojis or Chinese characters? Commit to your answer.
Concept: ASCII cannot represent characters beyond basic English, so it fails for global text and modern symbols.
ASCII only covers 128 characters, missing accented letters, symbols, and non-English alphabets. This causes problems like garbled text or question marks when showing other languages. Unicode was created to fix this by including all characters.
Result
You see the real-world limitations of ASCII and why Unicode is essential.
Understanding ASCII's limits prevents errors in software that only supports ASCII and helps appreciate Unicode's role.
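ASCII's hard limit shows up immediately if you try to encode non-English text as ASCII in Python (a minimal sketch; the word "café" is just an example):

```python
# 'é' has no ASCII code, so encoding to ASCII fails outright.
try:
    'café'.encode('ascii')
except UnicodeEncodeError as e:
    print('cannot encode:', e)

# UTF-8 handles the same text without trouble.
print('café'.encode('utf-8'))   # b'caf\xc3\xa9'
```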
7
Expert · Unicode normalization and challenges
🤔Before reading on: do you think the same character can have multiple Unicode codes? Commit to your answer.
Concept: Unicode allows multiple ways to represent the same character, which can cause confusion and requires normalization.
Some characters can be written as a single code point or as a combination of base character plus accents. For example, 'é' can be U+00E9 or 'e' (U+0065) plus an accent mark (U+0301). Software must normalize text to compare or search correctly. This adds complexity to text processing.
Result
You understand a subtle but important challenge in Unicode text handling.
Knowing normalization helps avoid bugs in text comparison, searching, and sorting in multilingual applications.
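The two representations of 'é' described above can be compared in Python with the standard-library `unicodedata` module (a minimal sketch of NFC normalization):

```python
import unicodedata

# Two ways to write 'é': one code point vs. 'e' plus a combining accent.
single   = '\u00E9'          # é as a single code point
combined = 'e\u0301'         # 'e' followed by combining acute accent U+0301

print(single == combined)    # False: different code point sequences
print(unicodedata.normalize('NFC', combined) == single)  # True after normalizing
```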
Under the Hood
Computers store text as sequences of bits. Each character is assigned a number called a code point. ASCII uses 7 bits per character, which fits in one byte. Unicode code points are stored in one to four bytes depending on the encoding: UTF-8 uses 1 to 4 bytes, UTF-16 uses 2 or 4. When reading or writing text, software converts between characters and their numeric codes, then to bits stored in memory or files.
Why designed this way?
ASCII was designed in the 1960s for English text with limited memory and simple hardware, so it used 7 bits. As computing globalized, ASCII's limits became clear, leading to Unicode's creation in the 1990s to support all languages and symbols. Unicode's design balances backward compatibility, extensibility, and efficient storage with variable-length encodings.
┌───────────────┐
│ Character 'A' │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ ASCII Code 65 │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Bits 01000001 │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Stored in RAM │
│ as 1 byte     │
└───────────────┘
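The pipeline in the diagram can be traced step by step in Python (a minimal sketch for the character 'A'):

```python
# 'A' -> code point 65 -> bits 01000001 -> one byte actually stored.
ch = 'A'
code = ord(ch)                  # 65, the code point
print(format(code, '08b'))      # 01000001, its bit pattern
raw = ch.encode('utf-8')        # the byte written to memory or a file
print(raw, len(raw))            # b'A' 1
```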
Myth Busters - 4 Common Misconceptions
Quick: Does ASCII support accented letters like 'é'? Commit to yes or no.
Common Belief: ASCII can represent all letters, including accented ones.
Reality: ASCII only supports basic English letters without accents.
Why it matters: Using ASCII for accented letters causes wrong characters or errors in text display.
Quick: Is Unicode a single fixed-length code per character? Commit to yes or no.
Common Belief: Unicode assigns one fixed-length code to every character.
Reality: Each character maps to a single code point, but how many bytes that code point occupies depends on the encoding; UTF-8 and UTF-16 are variable-length.
Why it matters: Assuming fixed length leads to bugs in text processing and to miscalculated storage sizes.
Quick: Can the same visible character have different Unicode codes? Commit to yes or no.
Common Belief: Each character has only one unique Unicode code point.
Reality: Some characters can be represented by multiple code point sequences, which is why normalization exists.
Why it matters: Ignoring this causes errors in searching, sorting, and comparing text.
Quick: Does UTF-8 only work for English text? Commit to yes or no.
Common Belief: UTF-8 is only for English or ASCII characters.
Reality: UTF-8 can encode every Unicode character and is the dominant encoding worldwide.
Why it matters: Misunderstanding UTF-8's range leads to wrong assumptions about file compatibility.
Expert Zone
1
Unicode includes private use areas where companies can define their own characters, which can cause compatibility issues.
2
UTF-8's design allows ASCII characters to be stored as single bytes, making it backward compatible and efficient for English text.
3
Normalization forms (NFC, NFD) affect how text is stored and compared, impacting database indexing and search accuracy.
When NOT to use
ASCII should not be used when working with non-English text or symbols; instead, use Unicode with UTF-8 encoding. For legacy systems limited to ASCII, conversion or transliteration may be necessary. Avoid fixed-width Unicode encodings like UTF-32 for storage due to inefficiency unless random access to characters is critical.
Production Patterns
In real-world systems, UTF-8 is the standard encoding for web pages, databases, and APIs because it balances compatibility and efficiency. Software often normalizes Unicode text before processing to avoid subtle bugs. Legacy ASCII data is converted to Unicode for internationalization. Developers must handle encoding explicitly when reading or writing files to prevent mojibake (garbled text).
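Mojibake, as mentioned above, is easy to reproduce in Python by decoding UTF-8 bytes with the wrong codec (a minimal sketch; Latin-1 stands in for "some other legacy encoding"):

```python
# UTF-8 bytes for 'café'; the 'é' takes two bytes.
data = 'café'.encode('utf-8')    # b'caf\xc3\xa9'

print(data.decode('latin-1'))    # cafÃ©  <- garbled: each byte read as one character
print(data.decode('utf-8'))      # café   <- correct decoding
```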
Connections
Data Compression
Encoding text efficiently relates to compressing data by reducing storage size.
Understanding variable-length encodings like UTF-8 helps grasp how compression algorithms save space by using shorter codes for common data.
Human Languages and Linguistics
Unicode supports scripts and symbols from all human languages, connecting computing to linguistics.
Knowing Unicode's role reveals how computing adapts to human diversity and language complexity.
Library Cataloging Systems
Assigning unique codes to characters is like cataloging books with unique IDs.
This cross-domain link shows how organizing information with unique identifiers is a universal problem solved similarly in different fields.
Common Pitfalls
#1: Assuming all text files use ASCII encoding.
Wrong approach: Opening a UTF-8 encoded file as ASCII and reading bytes directly without decoding.
Correct approach: Always specify UTF-8 encoding when reading or writing text files that may contain non-ASCII characters.
Root cause: Not realizing that ASCII covers only a small subset of Unicode, and that UTF-8 is the common encoding for Unicode text.
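The correct approach can be sketched in Python, where `open` takes an explicit `encoding=` argument (the file name `notes.txt` is a hypothetical example):

```python
import os
import tempfile

# Hypothetical file path for the demo.
path = os.path.join(tempfile.mkdtemp(), 'notes.txt')

# Always state the encoding explicitly when writing and reading text.
with open(path, 'w', encoding='utf-8') as f:
    f.write('naïve café')

with open(path, 'r', encoding='utf-8') as f:
    print(f.read())              # naïve café -- read back intact
```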
#2: Comparing Unicode strings without normalization.
Wrong approach: if (string1 == string2) { /* equal */ } without normalizing the strings first.
Correct approach: Normalize both strings using NFC or NFD before comparison to ensure equivalence.
Root cause: Ignoring that the same character can have multiple Unicode representations.
#3: Using a fixed-width encoding like UTF-32 for all text storage.
Wrong approach: Storing all text in UTF-32 to simplify indexing, without considering the size cost.
Correct approach: Use UTF-8 for storage, and UTF-32 only when fixed-width access is necessary and the storage cost is acceptable.
Root cause: Not balancing storage efficiency with access needs.
Key Takeaways
Text in computers is stored as numbers representing characters using systems like ASCII and Unicode.
ASCII covers basic English characters using 7 bits, but it cannot represent global languages or symbols.
Unicode assigns unique codes to characters from almost all languages and symbols, enabling universal text representation.
Encoding formats like UTF-8 store Unicode characters efficiently using variable bytes per character.
Handling Unicode correctly requires understanding normalization and encoding to avoid bugs in text processing.