Consider a CSV file with some invalid UTF-8 bytes. What will be the output of the following code snippet?
import pandas as pd from io import BytesIO bytes_data = b'name,age\nAlice,30\nBob,25\nJos\xe9,22' try: df = pd.read_csv(BytesIO(bytes_data), encoding='utf-8') result = df.to_dict() except Exception as e: result = str(type(e))
Think about what happens when pandas tries to decode bytes that are not valid UTF-8.
The string contains an invalid UTF-8 byte sequence (\xE9 alone is not valid UTF-8). When pandas tries to read it with encoding='utf-8', it raises a UnicodeDecodeError.
Given the same CSV data with invalid UTF-8 bytes, what will be the DataFrame content if we use errors='replace' in read_csv?
import pandas as pd from io import BytesIO bytes_data = b'name,age\nAlice,30\nBob,25\nJos\xe9,22' df = pd.read_csv(BytesIO(bytes_data), encoding='utf-8', encoding_errors='replace') result = df.to_dict()
Using errors='replace' replaces invalid bytes with a special character.
The invalid byte is replaced by the Unicode replacement character � (U+FFFD), so the name becomes 'Jos�'.
Examine the code below. Why does it raise a UnicodeDecodeError?
import pandas as pd from io import BytesIO bytes_data = b'name,age\nAlice,30\nBob,25\nJos\xe9,22' try: df = pd.read_csv(BytesIO(bytes_data), encoding='utf-8') result = df.to_dict() except Exception as e: result = str(type(e))
Check the encoding of the byte \xe9 in UTF-8.
The byte \xe9 alone is not a valid UTF-8 character. UTF-8 encodes 'é' as two bytes: \xc3\xa9. Hence, decoding fails.
You have a CSV file encoded in Latin-1 with names containing accented characters. Which code snippet correctly reads it into a DataFrame preserving the characters?
Use the encoding that matches the file's actual encoding.
Since the file is encoded in Latin-1, specifying encoding='latin-1' correctly decodes special characters.
When reading a CSV file that starts with a Byte Order Mark (BOM), what does specifying encoding='utf-8-sig' do?
Think about what BOM means and how utf-8-sig handles it.
The utf-8-sig encoding detects and removes the BOM if present, preventing it from appearing as part of the data.