What is the output of the following Python code that optimizes a DataFrame column by converting it to a categorical type?
import pandas as pd
import numpy as np

# Create a DataFrame with 100000 rows
np.random.seed(0)
data = pd.DataFrame({
    'color': np.random.choice(['red', 'green', 'blue'], size=100000)
})

# Memory usage before conversion
mem_before = data.memory_usage(deep=True).sum()

# Convert 'color' column to categorical
data['color'] = data['color'].astype('category')

# Memory usage after conversion
mem_after = data.memory_usage(deep=True).sum()

print(round(mem_before), round(mem_after))
Think about how categorical data stores repeated values efficiently compared to strings.
Converting a string column to categorical reduces memory usage significantly because it stores unique values once and uses integer codes for each row.
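As a small illustration of the "unique values once, integer codes per row" idea, the sketch below inspects a tiny categorical Series. The variable names and the five sample values are made up for the example and are not part of the quiz code.

```python
import pandas as pd

# Hypothetical sample data, much smaller than the quiz DataFrame
colors = pd.Series(["red", "green", "blue", "red", "red"]).astype("category")

# Each unique value is stored once; inferred categories are sorted
categories = list(colors.cat.categories)

# Every row stores only a small integer index into the categories
codes = colors.cat.codes.tolist()

print(categories)  # ['blue', 'green', 'red']
print(codes)       # [2, 1, 0, 2, 2]
```

The repeated string 'red' appears once in the categories, and each of its three rows holds only the code 2.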
Given a DataFrame with a 'city' column converted to categorical, what is the number of unique categories stored?
import pandas as pd

cities = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'New York', 'Chicago']
data = pd.DataFrame({'city': cities})
data['city'] = data['city'].astype('category')

num_categories = len(data['city'].cat.categories)
print(num_categories)
Count unique city names ignoring duplicates.
The unique cities are New York, Los Angeles, Chicago, Houston, and Phoenix, totaling 5 categories.
What error does the following code produce?
import pandas as pd

values = ['apple', 'banana', 'apple', 'orange']
data = pd.DataFrame({'fruit': values})
data['fruit'] = data['fruit'].astype('category', categories=['apple', 'banana'])
Check the correct way to specify categories when converting to categorical.
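In current pandas versions, astype() does not accept a 'categories' argument, so this call raises a TypeError. Categories must be specified by passing a pd.CategoricalDtype to astype(), or by using pd.Categorical or cat.set_categories().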
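One way to restrict the categories during conversion is to pass a pd.CategoricalDtype to astype(). The sketch below reuses the quiz data; the name fruit_dtype is chosen only for this example. Note that values outside the listed categories become missing values.

```python
import pandas as pd

values = ["apple", "banana", "apple", "orange"]
data = pd.DataFrame({"fruit": values})

# Describe the allowed categories in a dtype object, then convert
fruit_dtype = pd.CategoricalDtype(categories=["apple", "banana"])
data["fruit"] = data["fruit"].astype(fruit_dtype)

# 'orange' is not a listed category, so it is converted to NaN
missing = int(data["fruit"].isna().sum())

print(list(data["fruit"].cat.categories))  # ['apple', 'banana']
print(missing)                             # 1
```

If silently turning unknown values into NaN is not acceptable, checking the column against the category list before converting is a reasonable precaution.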
You have a dataset with a column 'status' containing 3 unique values repeated millions of times. Which approach optimizes memory best?
Think about how categorical dtype stores repeated values efficiently.
Category dtype stores unique values once and uses integer codes internally, saving memory for repeated values.
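A rough sketch of this comparison follows, assuming three made-up status labels and one million rows. Exact byte counts depend on the pandas version and its default string storage, so none are shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical 'status' column: 3 unique values repeated many times
rng = np.random.default_rng(0)
status = pd.Series(rng.choice(["active", "inactive", "pending"], size=1_000_000))

as_category = status.astype("category")

# deep=True counts the string payload, not just the pointers
mem_plain = status.memory_usage(deep=True)
mem_category = as_category.memory_usage(deep=True)

# With only 3 categories, one byte per row is enough for the codes
print(as_category.cat.codes.dtype)  # int8
print(mem_plain, mem_category)
```

With so few categories the codes fit in int8, so the categorical column needs about one byte per row plus the three stored strings.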
How does setting the ordered=True parameter in a categorical column affect sorting performance in pandas?
Consider how ordered categories allow direct integer comparison during sorting.
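A categorical column is sorted by comparing its integer codes, which is typically faster than comparing the strings themselves. Setting ordered=True gives the category order a logical meaning, so the sorted result follows that order and comparisons such as < and min()/max() become available; the codes are used for sorting whether or not the flag is set.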
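To make the role of category order concrete, here is a minimal sketch with an ordered categorical. The labels 'low', 'medium', and 'high' and the variable names are assumptions made for the example.

```python
import pandas as pd

# Declare the logical order explicitly: low < medium < high
size_dtype = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
sizes = pd.Series(["medium", "high", "low", "medium"], dtype=size_dtype)

# Sorting follows the category order, not alphabetical order
sorted_sizes = sizes.sort_values().tolist()

# Ordered categoricals also support comparisons and min()/max()
above_low = (sizes > "low").tolist()
largest = sizes.max()

print(sorted_sizes)  # ['low', 'medium', 'medium', 'high']
print(above_low)     # [True, True, False, True]
print(largest)       # high
```

An alphabetical sort of the same strings would put 'high' first, so the result shows the sort is driven by the codes behind the declared order.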