
Handling encoding issues in Pandas - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
What is the value of result after reading a CSV with invalid UTF-8 bytes?

The CSV data below contains a byte that is not valid UTF-8. What is the value of result after the following snippet runs?

Pandas
import pandas as pd
from io import BytesIO

bytes_data = b'name,age\nAlice,30\nBob,25\nJos\xe9,22'

try:
    df = pd.read_csv(BytesIO(bytes_data), encoding='utf-8')
    result = df.to_dict()
except Exception as e:
    result = str(type(e))
A. {'name': {0: 'Alice', 1: 'Bob', 2: 'José'}, 'age': {0: 30, 1: 25, 2: 22}}
B. {'name': {0: 'Alice', 1: 'Bob', 2: 'Jos\xE9'}, 'age': {0: 30, 1: 25, 2: 22}}
C. <class 'ParserError'>
D. <class 'UnicodeDecodeError'>
💡 Hint

Think about what happens when pandas tries to decode bytes that are not valid UTF-8.
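The decoding step can be tried in isolation with plain Python, without pandas. The sketch below is only an illustration: it decodes the last row of the sample data with the default strict error handling and records what happens. The names `raw` and `outcome` are made up for this example.

```python
# Last row of the sample CSV data; 0xE9 is how Latin-1 encodes "é".
raw = b"Jos\xe9,22"

try:
    raw.decode("utf-8")  # strict error handling is the default
    outcome = "decoded"
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte that could not be decoded
    outcome = f"failed at byte offset {exc.start}"

print(outcome)
```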

Data Output
intermediate
What is the content of the DataFrame after reading with encoding_errors='replace'?

Given the same CSV data with an invalid UTF-8 byte, what will the DataFrame contain if we pass encoding_errors='replace' to read_csv?

Pandas
import pandas as pd
from io import BytesIO

bytes_data = b'name,age\nAlice,30\nBob,25\nJos\xe9,22'

df = pd.read_csv(BytesIO(bytes_data), encoding='utf-8', encoding_errors='replace')
result = df.to_dict()
A. {'name': {0: 'Alice', 1: 'Bob', 2: 'Jos�'}, 'age': {0: 30, 1: 25, 2: 22}}
B. {'name': {0: 'Alice', 1: 'Bob', 2: 'Jos\xE9'}, 'age': {0: 30, 1: 25, 2: 22}}
C. <class 'UnicodeDecodeError'>
D. {'name': {0: 'Alice', 1: 'Bob', 2: 'José'}, 'age': {0: 30, 1: 25, 2: 22}}
💡 Hint

Passing encoding_errors='replace' substitutes the Unicode replacement character (U+FFFD) for each byte that cannot be decoded.
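The same error handler exists on Python's own `bytes.decode`, so its effect can be inspected without a DataFrame. This is a minimal sketch on one row of the sample data; `raw`, `text`, and `has_marker` are illustrative names, not part of the problem.

```python
# One row of the sample data with a byte that is invalid in UTF-8.
raw = b"Jos\xe9,22"

# errors="replace" swaps each undecodable byte for U+FFFD
# instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="replace")

# Check for the replacement character explicitly, since it can be
# hard to spot in printed output.
has_marker = "\ufffd" in text

print(repr(text), has_marker)
```

Note that the original byte is lost after replacement, so this option suits cases where a damaged character is acceptable.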

🔧 Debug
advanced
Why does this code raise a UnicodeDecodeError?

Examine the code below. Why does it raise a UnicodeDecodeError?

Pandas
import pandas as pd
from io import BytesIO

bytes_data = b'name,age\nAlice,30\nBob,25\nJos\xe9,22'

try:
    df = pd.read_csv(BytesIO(bytes_data), encoding='utf-8')
    result = df.to_dict()
except Exception as e:
    result = str(type(e))
A. Because the BytesIO object is not supported by pandas
B. Because the byte \xe9 is not valid UTF-8 on its own
C. Because the CSV header is missing
D. Because the data contains a missing value
💡 Hint

Check the encoding of the byte \xe9 in UTF-8.
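One way to follow this hint is to encode the character "é" (U+00E9) under both encodings and compare the bytes. The snippet below is a small sketch using only built-in string methods; the variable names are chosen for illustration.

```python
# "\u00e9" is the character "é", written as an escape so the snippet
# does not depend on the source file's own encoding.
e_acute = "\u00e9"

utf8_bytes = e_acute.encode("utf-8")      # multi-byte sequence
latin1_bytes = e_acute.encode("latin-1")  # single byte

print(utf8_bytes, latin1_bytes)
```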

🚀 Application
advanced
How to correctly read a CSV with Latin-1 encoding containing special characters?

You have a CSV file encoded in Latin-1 with names containing accented characters. Which code snippet correctly reads it into a DataFrame preserving the characters?

A. pd.read_csv('file.csv', encoding='latin-1')
B. pd.read_csv('file.csv', encoding='utf-8')
C. pd.read_csv('file.csv', encoding='ascii')
D. pd.read_csv('file.csv', encoding='utf-16')
💡 Hint

Use the encoding that matches the file's actual encoding.
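To see what "matching the file's actual encoding" means in practice, the sketch below builds a small Latin-1 encoded CSV in memory and reads it with the standard library's `csv` module rather than pandas, so it runs without extra dependencies. The sample data and variable names are assumptions made for this example; the principle of passing the encoding the bytes were written with is the same one the `encoding` argument of `read_csv` relies on.

```python
import csv
import io

# Hypothetical file content, encoded as Latin-1 (so "é" is the single byte 0xE9).
latin1_csv = "name,age\nJos\u00e9,22\n".encode("latin-1")

# Decode with the encoding the bytes were actually written in.
with io.TextIOWrapper(io.BytesIO(latin1_csv), encoding="latin-1", newline="") as fh:
    rows = list(csv.DictReader(fh))

print(rows)
```

The csv module returns every field as a string, so the age stays '22' here, whereas pandas would infer an integer column.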

🧠 Conceptual
expert
What is the effect of using encoding='utf-8-sig' when reading a CSV file?

When reading a CSV file that starts with a Byte Order Mark (BOM), what does specifying encoding='utf-8-sig' do?

A. It causes a UnicodeDecodeError because BOM is not supported
B. It treats the file as ASCII encoding ignoring UTF-8 characters
C. It removes the BOM from the start of the file so it does not appear in the data
D. It converts all characters to uppercase
💡 Hint

Think about what BOM means and how utf-8-sig handles it.
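The difference between the two codecs is easiest to see on the first header field. The following sketch prepends a UTF-8 BOM to a tiny CSV payload and decodes it both ways with plain Python; the payload and variable names are illustrative assumptions.

```python
import codecs

# A UTF-8 BOM (bytes EF BB BF) followed by hypothetical CSV content.
raw = codecs.BOM_UTF8 + b"name,age\nAlice,30\n"

as_utf8 = raw.decode("utf-8")      # keeps the BOM as the character U+FEFF
as_sig = raw.decode("utf-8-sig")   # drops a leading BOM if one is present

# Compare the first header field under each codec.
first_header_utf8 = as_utf8.split(",")[0]
first_header_sig = as_sig.split(",")[0]

print(repr(first_header_utf8), repr(first_header_sig))
```

A stray U+FEFF in the first column name is usually invisible when printed, which is why lookups such as `df['name']` can fail unexpectedly on files exported with a BOM.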