
info() for column types and nulls in Pandas - Deep Dive

Overview - info() for column types and nulls
What is it?
The info() method in pandas prints a concise summary of a DataFrame. It shows the index range and row count, the number of columns, the data type of each column, how many non-null values each column has, and an estimate of memory usage. This helps you understand the structure and completeness of your data at a glance.
Why it matters
Without info(), you might waste time guessing what your data looks like or miss important details like missing values or wrong data types. This function helps you spot problems early, so you can clean and prepare your data correctly before analysis. It saves time and prevents errors in your work.
Where it fits
Before using info(), you should know how to load data into pandas DataFrames. After learning info(), you can move on to handling missing data, converting data types, and exploring data with other pandas functions.
Mental Model
Core Idea
info() is like a quick health check that tells you the shape, type, and completeness of your data columns.
Think of it like...
Imagine info() as a doctor’s quick checkup report for your dataset, showing which parts are healthy (complete) and which parts need attention (missing or wrong types).
┌───────────────────────────────┐
│ DataFrame info() summary      │
├─────────────┬───────────────┤
│ Column Name │ Data Type     │
├─────────────┼───────────────┤
│ col1        │ int64         │
│ col2        │ float64       │
│ col3        │ object (text) │
├─────────────┴───────────────┤
│ Non-null counts per column    │
│ Total rows: 1000              │
│ col1: 1000 non-null           │
│ col2: 950 non-null            │
│ col3: 1000 non-null           │
└───────────────────────────────┘
Build-Up - 6 Steps
1. Foundation: What info() Shows by Default
Concept: Learn the basic output of info() including row count, column count, data types, and non-null counts.
When you call df.info() on a DataFrame, it prints the number of rows and columns, lists each column with its data type, and shows how many non-null values each column has. This helps you quickly see if any columns have missing data and what type of data each column holds.
Result
A summary printout showing total rows, columns, each column's data type, and non-null counts.
Understanding the default info() output is the first step to quickly assessing your dataset’s structure and completeness.
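As a minimal sketch of the default output, the example below builds a small DataFrame and captures what info() prints. The column names and values are made up for illustration; the buf parameter is used only so the text can be stored in a variable as well as printed.

```python
import io
import pandas as pd

# Hypothetical sample data: 'score' has one missing value.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [9.5, None, 7.0, 8.25],
    "name": ["Ana", "Ben", "Cy", "Dee"],
})

# info() prints to stdout and returns None; passing buf captures the text instead.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

In the printed summary, the 'score' row should report 3 non-null values against 4 entries in the index, which is how a missing value shows up.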
2. Foundation: Data Types and Null Counts Explained
Concept: Understand what data types and non-null counts mean in the context of a DataFrame.
Data types tell you what kind of data is stored in each column, like numbers or text. Non-null counts tell you how many values are present and not missing. For example, if a column has 1000 rows but only 950 non-null, it means 50 values are missing.
Result
Clear understanding of what data types and null counts represent in your data.
Knowing these basics helps you identify columns that need cleaning or type conversion.
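The arithmetic above (rows minus non-null values equals missing values) can be sketched directly. The data here is invented for illustration; df.count() returns the same per-column non-null counts that info() displays.

```python
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "age": [34, 41, None, 29, None],
    "city": ["Oslo", None, "Lima", "Pune", "Kyiv"],
})

non_null = df.count()            # values present per column, as info() reports
missing = len(df) - non_null     # rows minus non-null = missing values

print(df.dtypes)
print(missing)
```

The same missing counts can be obtained in one step with df.isna().sum().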
3. Intermediate: Using info() with Memory Usage Details
Before reading on: do you think info() shows memory usage by default or do you need to ask for it? Commit to your answer.
Concept: Learn how to get memory usage details from info() to understand your DataFrame’s size in memory.
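By default, info() prints one total memory usage line. For object (text) columns this default figure counts only the references to the values, not the values themselves, so it is a lower bound. Calling df.info(memory_usage='deep') inspects the stored objects and reports a more accurate total. info() does not break memory down by column; for that, use df.memory_usage(deep=True). This helps when working with large datasets to optimize memory.
Result
info() output now reports a more accurate total memory figure that includes the contents of text columns.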
Knowing memory usage helps you manage resources and optimize performance when working with big data.
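A rough sketch of the difference between the default and deep memory figures follows. The helper info_text is a hypothetical convenience defined here, not part of pandas, and the exact byte counts depend on the pandas version and how it stores text.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "n": range(1000),
    "label": ["row-" + str(i) for i in range(1000)],
})

def info_text(frame, **kwargs):
    """Hypothetical helper: return info() output as a string."""
    buf = io.StringIO()
    frame.info(buf=buf, **kwargs)
    return buf.getvalue()

# The last line of each summary is the total memory usage line.
print(info_text(df).splitlines()[-1])
print(info_text(df, memory_usage="deep").splitlines()[-1])

# Per-column figures come from DataFrame.memory_usage, not from info().
shallow = df.memory_usage(deep=False)
deep = df.memory_usage(deep=True)
print(deep)
```

The numeric column should report the same size either way, while the text column is where a deep measurement can differ.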
4. Intermediate: info() with Verbose and Null Counts Options
Before reading on: do you think info() always shows all columns or only some? Commit to your answer.
Concept: Discover how to control info() output with the verbose and show_counts parameters.
By default, info() switches to a short summary without the per-column listing when the DataFrame has more columns than the display.max_info_columns option (100 by default). Setting verbose=True lists every column. Non-null counts are controlled by show_counts=True; older pandas versions called this parameter null_counts, which was deprecated in pandas 1.2 and removed in 2.0. These options give you more control over the summary details.
Result
Full detailed info() output showing all columns and their non-null counts.
Customizing info() output helps you get exactly the summary you need for your data.
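The sketch below illustrates the effect of these options on a deliberately wide frame. It assumes the default value of display.max_info_columns (100) and a pandas version that has show_counts (1.2 or later); info_text is a hypothetical helper defined here.

```python
import io
import pandas as pd

# Hypothetical wide frame: 150 columns exceeds the default
# display.max_info_columns (100), so plain info() gives a short summary.
wide = pd.DataFrame({f"c{i}": [1.0, None, 3.0] for i in range(150)})

def info_text(frame, **kwargs):
    """Hypothetical helper: return info() output as a string."""
    buf = io.StringIO()
    frame.info(buf=buf, **kwargs)
    return buf.getvalue()

short = info_text(wide)
full = info_text(wide, verbose=True, show_counts=True)

print(len(short.splitlines()), "lines by default")
print(len(full.splitlines()), "lines with verbose=True, show_counts=True")
```

Passing show_counts=True alongside verbose=True matters here, because on a frame this wide pandas would otherwise list the columns without their non-null counts.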
5. Advanced: Interpreting info() for Mixed Data Types
Before reading on: do you think a column with mixed types shows a single data type or multiple? Commit to your answer.
Concept: Learn how info() reports columns with mixed or unexpected data types and what that means for your data quality.
If a column has mixed types (e.g., numbers and text), info() shows it as 'object'. Pure text columns are also commonly reported as 'object', so the dtype alone does not prove a mix; treat it as a signal to inspect the column further. Recognizing this helps you decide if you need to convert or clean that column.
Result
info() output shows 'object' type for mixed columns, signaling potential data issues.
Spotting mixed types early prevents bugs and errors in analysis caused by inconsistent data.
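Since info() reports only 'object' for such a column, one way to look inside is to count the Python type of each value. This is a sketch with an invented column; dtype=object is set explicitly so the example behaves the same regardless of how a given pandas version infers strings, and pd.to_numeric is shown as one possible cleanup, not the only one.

```python
import pandas as pd

# Hypothetical column mixing integers and text.
df = pd.DataFrame({"amount": pd.Series([10, "20", 30, "n/a"], dtype=object)})

print(df.dtypes["amount"])  # object

# Count the Python type name of each value to expose the mix.
type_counts = df["amount"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# One possible cleanup: coerce to numbers, turning unparseable entries into NaN.
cleaned = pd.to_numeric(df["amount"], errors="coerce")
print(cleaned)
```

After coercion the column becomes float64, and the entry that could not be parsed shows up as a missing value that info() would then count.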
6. Expert: info() Internals and Performance Considerations
Before reading on: do you think info() reads all data or just metadata to produce its summary? Commit to your answer.
Concept: Understand which parts of the info() summary come from stored metadata and which are computed from the data.
Column data types and the index length are stored with the DataFrame, so info() reads them without touching the values. Non-null counts are different: pandas does not keep a running tally, so info() counts the non-null values in each column every time it is called. That is a single vectorized pass and is normally fast. For very large frames (above the display.max_info_rows option), pandas skips the counts unless you pass show_counts=True.
Result
You know that dtypes are cheap to report, while non-null counts cost one pass over the data.
Understanding info() internals helps you predict its cost on large data and trust that its counts always reflect the DataFrame's current state.
Under the Hood
info() reads the index size and column dtypes from the DataFrame's internal structures, and computes non-null counts by counting the values in each column (the same work df.count() does). With the default memory_usage setting, object columns are measured shallowly. When memory_usage='deep' is requested, info() inspects the objects themselves to estimate memory more accurately, which is slower.
Why designed this way?
info() was designed to give a compact, readable summary in one call. To keep it practical on large datasets, pandas limits the expensive parts: the per-column listing is dropped above display.max_info_columns, non-null counts are skipped above display.max_info_rows, and deep memory inspection is opt-in.
┌───────────────────────────────┐
│ pandas DataFrame object       │
├───────────────┬───────────────┤
│ Metadata      │ Data Storage  │
│ - index size  │ - column data │
│ - dtypes      │ - values      │
├───────────────┴───────────────┤
│ info() reads dtypes and size  │
│ from metadata, then counts    │
│ non-null values per column    │
│ If memory_usage='deep'        │
│   scans object columns deeply │
└───────────────────────────────┘
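A small experiment, assuming nothing beyond pandas and NumPy, suggests that non-null counts are computed when info() is called and so follow in-place changes immediately. The helper info_text is hypothetical and exists only to capture the printed text.

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(5, dtype="float64")})

def info_text(frame):
    """Hypothetical helper: return info() output as a string."""
    buf = io.StringIO()
    frame.info(buf=buf)
    return buf.getvalue()

before = info_text(df)

# Introduce two missing values in place, then ask again.
df.loc[[1, 3], "x"] = np.nan
after = info_text(df)

print("5 non-null" in before, "3 non-null" in after)
```

No refresh or reload step is needed between the two calls; the second summary already reflects the modified column.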
Myth Busters - 4 Common Misconceptions
Quick: Does info() show the exact number of missing values by default? Commit to yes or no.
Common Belief: info() shows the exact count of missing values for each column by default.
Reality: info() shows the count of non-null values, not missing values directly. You must subtract non-null count from total rows to get missing counts.
Why it matters: Assuming info() shows missing counts directly can lead to misunderstanding how much data is missing and cause wrong cleaning decisions.
Quick: Do you think info() inspects every data value to determine data types? Commit to yes or no.
Common Belief: info() scans all data values to determine the data type of each column.
Reality: Each column's dtype is stored with the column, so info() reads it directly. The non-null counts, however, are computed by counting values when info() is called.
Why it matters: Knowing which parts are looked up and which are computed explains why info() is usually fast, why its counts always reflect the current data, and why pandas skips the counts on very large frames.
Quick: Does info() always show all columns regardless of DataFrame size? Commit to yes or no.
Common Belief: info() always displays every column in the DataFrame no matter how many there are.
Reality: By default, info() switches to a truncated summary when the DataFrame has more columns than the display.max_info_columns option (100 by default). You can set verbose=True to see all columns.
Why it matters: Not knowing this can cause you to miss columns in the summary and overlook important data issues.
Quick: Can info() detect mixed data types within a single column? Commit to yes or no.
Common Belief: info() clearly shows if a column has mixed data types by listing all types.
Reality: info() shows such columns as 'object' type without detailing the mix, so you need further checks to find mixed types.
Why it matters: Assuming info() reveals mixed types fully can lead to missed data quality problems.
Expert Zone
1. info() computes non-null counts at call time rather than reading a cached tally, so its output always reflects the current state of the DataFrame. The cost is a pass over the data, which is why pandas skips the counts by default for frames above display.max_info_rows.
2. Memory usage estimation with memory_usage='deep' can be expensive on large object columns, so use it selectively when you need detailed memory profiling.
3. info() output format and parameters have evolved across pandas versions, so knowing your pandas version helps interpret info() results correctly.
When NOT to use
info() is not suitable when you need detailed statistics like exact missing value counts, unique values, or distribution summaries. Use df.describe(), df.isnull().sum(), or df.value_counts() for those tasks instead.
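For comparison, here is a brief sketch of the alternatives named above on an invented two-column frame. df.nunique() is added as one more option for counting distinct values; it is not mentioned in the text above.

```python
import pandas as pd

# Hypothetical data with one gap in each column.
df = pd.DataFrame({
    "price": [10.0, 12.5, None, 12.5],
    "color": ["red", "blue", "red", None],
})

missing = df.isna().sum()             # exact missing count per column
stats = df.describe()                 # count, mean, std, quartiles for numeric columns
colors = df["color"].value_counts()   # frequency of each non-null value
distinct = df.nunique()               # distinct non-null values per column

print(missing, stats, colors, distinct, sep="\n\n")
```

Each call returns a pandas object that can be tested or stored, unlike info(), which only prints.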
Production Patterns
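In real-world data pipelines, an info()-style check is often run at the start of a job to confirm data integrity before processing. Because info() prints its summary and returns None, scripts typically capture the text with the buf parameter for logging, and test dtypes and missing counts programmatically (for example with df.dtypes and df.isna().sum()) to alert teams about missing data or unexpected data types early in the workflow.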
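One way such a validation step might look is sketched below. The function check_frame, the logger name, and the expected schema are all hypothetical, chosen only to illustrate the pattern; a real pipeline would define its own rules and thresholds.

```python
import io
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("validation")  # hypothetical logger name

def check_frame(df, expected_dtypes, max_missing=0):
    """Hypothetical check: return a list of human-readable problems."""
    problems = []
    for col, dtype in expected_dtypes.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    for col, n in df.isna().sum().items():
        if n > max_missing:
            problems.append(f"{col}: {n} missing values")
    return problems

df = pd.DataFrame({
    "id": pd.Series([1, 2, 3], dtype="int64"),
    "amount": [5.0, None, 7.5],
})

# info() returns None, so capture its text through buf to send it to a log.
buf = io.StringIO()
df.info(buf=buf)
log.info("Frame summary:\n%s", buf.getvalue())

expected = {"id": "int64", "amount": "float64", "ts": "datetime64[ns]"}
problems = check_frame(df, expected)
for problem in problems:
    log.warning(problem)
```

The info() text goes to the log for humans to read, while the pass or fail decision rests on values the script can compare directly.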
Connections
Data Cleaning
info() output guides data cleaning by revealing missing values and data types.
Knowing how info() highlights nulls and types helps you decide which columns need cleaning or type conversion.
Database Schema Inspection
Both info() and database schema tools summarize data structure and types.
Understanding info() helps you grasp how databases describe tables, aiding smoother data integration.
System Health Monitoring
info() acts like a health check for data, similar to how system monitors check server status.
Seeing info() as a health check helps prioritize fixing data issues like missing values, just as sysadmins fix server alerts.
Common Pitfalls
#1 Assuming info() shows missing values directly.
Wrong approach:
df.info()  # User reads non-null counts as missing counts
Correct approach:
df.info()
missing_counts = len(df) - df.count()
print(missing_counts)
Root cause: Confusing non-null counts with missing counts leads to wrong assumptions about data completeness.
#2 Expecting info() to show all columns when DataFrame is wide.
Wrong approach:
df.info()  # User misses columns because output is truncated
Correct approach:
df.info(verbose=True)  # Shows all columns regardless of number
Root cause: Not knowing about verbose option causes missing important column info.
#3 Using info() to check detailed data quality like unique values or distributions.
Wrong approach:
df.info()  # User expects detailed stats from info()
Correct approach:
df.describe()
df.value_counts()  # Use these for detailed statistics
Root cause: Misunderstanding info() as a full data profiling tool rather than a summary.
Key Takeaways
info() provides a fast summary of a DataFrame’s shape, data types, and non-null counts to quickly assess data health.
Dtypes and index size come from stored metadata, while non-null counts are computed in one vectorized pass, which keeps info() fast on most datasets.
info() shows non-null counts, so you must subtract from total rows to find missing values.
Customizing info() with parameters like verbose and memory_usage helps tailor the summary to your needs.
Understanding info() output guides data cleaning, type conversion, and memory optimization in real-world data science.