
Selecting multiple columns in Pandas - Deep Dive

Overview - Selecting multiple columns
What is it?
Selecting multiple columns means choosing more than one column from a table of data to work with. In pandas, a popular tool for data analysis in Python, data is stored in tables called DataFrames. Picking multiple columns helps you focus on just the parts of the data you need for your task.
Why it matters
Without the ability to select multiple columns, you would have to work with the entire dataset every time, which can be slow and confusing. Selecting only the columns you need makes your work faster, clearer, and less error-prone. It helps you answer questions like 'What are the sales and profit numbers?' without extra clutter.
Where it fits
Before learning this, you should know how to create and understand DataFrames in pandas. After this, you can learn how to filter rows, perform calculations on columns, and visualize data based on selected columns.
Mental Model
Core Idea
Selecting multiple columns is like picking specific ingredients from a kitchen shelf to make a recipe, focusing only on what you need.
Think of it like...
Imagine a grocery store aisle with many shelves full of items (columns). Selecting multiple columns is like grabbing a few specific items from different shelves to prepare a meal, instead of taking everything.
DataFrame
┌─────────────┬─────────────┬─────────────┬─────────────┐
│ Column A   │ Column B   │ Column C   │ Column D   │
├─────────────┼─────────────┼─────────────┼─────────────┤
│    1       │    4       │    7       │    10      │
│    2       │    5       │    8       │    11      │
│    3       │    6       │    9       │    12      │
└─────────────┴─────────────┴─────────────┴─────────────┘

Selecting Columns B and D:

┌─────────────┬─────────────┐
│ Column B   │ Column D   │
├─────────────┼─────────────┤
│    4       │    10      │
│    5       │    11      │
│    6       │    12      │
└─────────────┴─────────────┘
Build-Up - 7 Steps
1
Foundation: Understanding DataFrame columns
Concept: Learn what columns are in a pandas DataFrame and how they hold data.
A DataFrame is like a table with rows and columns. Each column has a name and contains data of a certain type, like numbers or words. You can see all column names by using df.columns in pandas.
Result
You can list all column names, for example: ['Name', 'Age', 'City']
Knowing what columns are helps you understand how data is organized and why selecting columns is useful.
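As a minimal sketch of this step, the snippet below builds a small DataFrame and lists its column names. The sample data and the variable name column_names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Age": [31, 25, 40],
    "City": ["Lima", "Oslo", "Pune"],
})

# df.columns is a pandas Index; list() turns it into plain Python strings.
column_names = list(df.columns)
print(column_names)
```

Wrapping df.columns in list() is optional; it is done here only so the names print as an ordinary Python list.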
2
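Foundation: Selecting a single column
Concept: Learn how to pick one column from a DataFrame.
To select one column, use df['ColumnName']. The attribute form df.ColumnName also works, but only when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. Both forms give you a Series, which is like a single column of data.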
Result
Selecting df['Age'] returns the 'Age' column as a Series.
Mastering single column selection is the first step before selecting multiple columns.
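The following sketch compares bracket access and attribute access on a single column. The column names, including 'Home City', are hypothetical and chosen to show a label that attribute access cannot reach.

```python
import pandas as pd

# Hypothetical sample data; "Home City" contains a space on purpose.
df = pd.DataFrame({
    "Name": ["Ana", "Ben"],
    "Age": [31, 25],
    "Home City": ["Lima", "Oslo"],
})

ages = df["Age"]          # bracket access works for any column label
same_ages = df.Age        # attribute access works because "Age" is a valid identifier
cities = df["Home City"]  # a label with a space is only reachable with brackets

print(type(ages).__name__)
```

Bracket access is the safer default, since it behaves the same for every label.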
3
Intermediate: Selecting multiple columns with a list
Before reading on: do you think selecting multiple columns uses parentheses or square brackets? Commit to your answer.
Concept: You can select multiple columns by passing a list of column names inside double square brackets.
Use df[['Column1', 'Column2']] to get a new DataFrame with only those columns. For example, df[['Age', 'City']] returns a DataFrame with just 'Age' and 'City'.
Result
A smaller DataFrame with only the chosen columns appears.
Understanding that double brackets and a list are needed prevents common syntax errors.
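A short sketch of list-based selection follows, using made-up data. It also shows that the list can be stored in a variable (here called wanted, a name chosen for illustration) and that the result follows the order of the list.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Age": [31, 25, 40],
    "City": ["Lima", "Oslo", "Pune"],
})

# The outer brackets index the DataFrame; the inner brackets are a Python list.
subset = df[["Age", "City"]]

# The list can live in a variable, and its order sets the column order.
wanted = ["City", "Name"]
reordered = df[wanted]

# A one-element list still returns a DataFrame, not a Series.
one_column_frame = df[["Age"]]
```

Seeing the list as a separate object makes the "double brackets" less mysterious: there is only one indexing operation, and its argument happens to be a list.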
4
Intermediate: Selecting columns using .loc and .iloc
Before reading on: do you think .loc selects columns by position or by label? Commit to your answer.
Concept: .loc selects columns by their names (labels), while .iloc selects by their position (index).
df.loc[:, ['Column1', 'Column2']] selects columns by name. df.iloc[:, [0, 2]] selects columns by position (first and third columns). The ':' means all rows.
Result
You get a DataFrame with the specified columns selected either by name or position.
Knowing both label and position selection methods gives flexibility in different situations.
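The sketch below selects the same two columns once by label and once by position, assuming a small made-up DataFrame whose first and third columns are 'Name' and 'City'.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Age": [31, 25, 40],
    "City": ["Lima", "Oslo", "Pune"],
})

# ":" keeps every row; the list after the comma picks the columns.
by_label = df.loc[:, ["Name", "City"]]   # labels
by_position = df.iloc[:, [0, 2]]         # zero-based positions

print(by_label.equals(by_position))
```

Label-based selection keeps working if columns are reordered, while position-based selection is handy when names are unknown or auto-generated.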
5
Intermediate: Selecting columns with conditions
Before reading on: can you select columns based on their data type? Commit to your answer.
Concept: You can select columns based on conditions like their data type or name patterns.
Use df.select_dtypes(include=['number']) to select only numeric columns. To select by name pattern, build a list of matching names with a list comprehension and pass that list to df[]: df[[col for col in df.columns if col.startswith('A')]] keeps the columns whose names start with 'A'. The comprehension on its own only produces a list of names, not a DataFrame.
Result
A DataFrame with columns matching the condition is returned.
Selecting columns by condition helps automate data selection in large datasets.
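Here is a minimal sketch of both kinds of conditional selection. The data is hypothetical, and the name-based filter assumes all column labels are strings.

```python
import pandas as pd

# Hypothetical sample data: two numeric columns and one text column.
df = pd.DataFrame({
    "Age": [31, 25],
    "Address": ["1 Main St", "2 Oak Ave"],
    "Salary": [52000.0, 48000.0],
})

# Select by data type: keeps integer and float columns.
numeric = df.select_dtypes(include=["number"])

# Select by name: the comprehension yields a list of labels,
# which is then passed to df[...] to get a DataFrame.
a_names = [col for col in df.columns if col.startswith("A")]
starts_with_a = df[a_names]
```

Building the list of names first also makes it easy to inspect or log which columns matched before selecting them.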
6
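Advanced: Selecting columns with callable functions
Before reading on: do you think pandas allows functions to select columns dynamically? Commit to your answer.
Concept: You can pass a callable to .loc, or a name pattern to .filter, to select columns dynamically based on logic.
For example, df.filter(like='Age') selects columns whose names contain 'Age', and df.filter(regex='^Age') selects columns whose names match a regular expression. Note that the items parameter of .filter expects a list of names, not a function. With .loc, df.loc[:, lambda df: df.columns.str.contains('Age')] does the same job: pandas calls the function with the DataFrame and uses the returned boolean mask to pick columns.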
Result
Columns matching the function's logic are selected.
Using functions for selection enables powerful, flexible data manipulation.
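The sketch below shows three dynamic selections side by side. DataFrame.filter accepts items (a list of names), like (a substring), or regex (a regular expression), while .loc accepts a callable that receives the DataFrame. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "Age": [31, 25],
    "AgeGroup": ["30s", "20s"],
    "MinAge": [18, 18],
    "City": ["Lima", "Oslo"],
})

# Substring match on column names.
via_like = df.filter(like="Age")

# Regular expression: only names that start with "Age".
via_regex = df.filter(regex="^Age")

# Callable passed to .loc: it receives the DataFrame and returns a boolean mask.
via_callable = df.loc[:, lambda frame: frame.columns.str.contains("Age")]
```

The callable form is most useful in method chains, where the intermediate DataFrame has no variable name to refer to.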
7
Expert: Performance and memory considerations in selection
Before reading on: does selecting columns create a copy or a view of the data? Commit to your answer.
Concept: Selecting columns may return a view or a copy, affecting memory and performance.
In pandas, selecting multiple columns with a list usually returns a copy, meaning changes to the selection do not affect the original DataFrame. This can use more memory but prevents accidental data changes. In older pandas versions, assigning into such a selection can also raise a SettingWithCopyWarning, because pandas cannot tell whether you meant to change the original. When you need a guaranteed independent object, call .copy() explicitly. Understanding this helps avoid bugs and optimize performance.
Result
You know when changes to selected columns affect the original data or not.
Understanding copy vs view behavior prevents subtle bugs and helps write efficient code.
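As a small sketch of the copy behavior, the snippet below modifies a selection and checks that the original is untouched. The explicit .copy() is an addition for illustration: it makes the intent unambiguous across pandas versions.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"Age": [31, 25], "City": ["Lima", "Oslo"]})

# Explicit copy: the subset owns its data regardless of pandas version.
subset = df[["Age", "City"]].copy()
subset["Age"] = subset["Age"] + 1

original_ages = df["Age"].tolist()
subset_ages = subset["Age"].tolist()
print(original_ages, subset_ages)
```

If the goal is to change the original instead, assign to df directly rather than to the selection.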
Under the Hood
When you select multiple columns using df[['col1', 'col2']], pandas creates a new DataFrame object containing copies of the selected columns. Internally, pandas stores data in blocks grouped by data type. Selecting columns extracts these blocks or parts of them to form the new DataFrame. This copying ensures the original data stays safe from unintended changes.
Why designed this way?
Pandas was designed to balance ease of use and safety. Returning copies when selecting multiple columns avoids accidental changes to the original data, which can cause hard-to-find bugs. Alternatives like always returning views would be faster but risk data corruption. This design choice favors reliability over raw speed.
DataFrame (original)
┌─────────────┬─────────────┬─────────────┐
│ col1 (int) │ col2 (str) │ col3 (int) │
├─────────────┼─────────────┼─────────────┤
│    10      │   'a'      │    100     │
│    20      │   'b'      │    200     │
└─────────────┴─────────────┴─────────────┘

Selecting ['col1', 'col3']
  ↓ pandas copies data blocks for col1 and col3

New DataFrame (copy)
┌─────────────┬─────────────┐
│ col1 (int) │ col3 (int) │
├─────────────┼─────────────┤
│    10      │    100     │
│    20      │    200     │
└─────────────┴─────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does df['col1', 'col2'] select multiple columns? Commit to yes or no.
Common Belief: Using df['col1', 'col2'] selects multiple columns at once.
Reality: Python passes 'col1', 'col2' to df[] as a single tuple, so pandas looks for one column labeled ('col1', 'col2') and raises a KeyError when no such column exists. To select multiple columns, you must pass a list inside df[], like df[['col1', 'col2']].
Why it matters: Using the wrong syntax leads to errors and confusion, wasting time and blocking progress.
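The failure can be reproduced with a short sketch like the one below, which assumes a made-up DataFrame with ordinary (non-MultiIndex) columns. The flag raised_key_error is a hypothetical name used only to record the outcome.

```python
import pandas as pd

# Hypothetical sample data with plain string column labels.
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

raised_key_error = False
try:
    df["col1", "col2"]  # the key is the tuple ("col1", "col2")
except KeyError:
    raised_key_error = True

# The list form selects both columns.
both = df[["col1", "col2"]]
print(raised_key_error, list(both.columns))
```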
Quick: Does selecting columns always return a view of the original data? Commit to yes or no.
Common Belief: Selecting columns returns a view, so changes affect the original DataFrame.
Reality: Selecting multiple columns usually returns a copy, so changes to the selection do not affect the original DataFrame.
Why it matters: Assuming a view can cause unexpected bugs when changes don't reflect back, or vice versa.
Quick: Can you select columns by position using df[['0', '1']]? Commit to yes or no.
Common Belief: You can select columns by their position using their index as strings in a list.
Reality: Column selection by position requires .iloc with integer indices, not string names. df[['0', '1']] looks for columns named '0' and '1', which usually don't exist, so it raises a KeyError.
Why it matters: Confusing labels and positions leads to wrong data being selected or errors.
Quick: Does df.loc[:, 'col1':'col3'] select columns col1, col2, and col3? Commit to yes or no.
Common Belief: Using df.loc with a slice of column names selects all columns between col1 and col3 inclusive.
Reality: Yes, .loc supports label slicing and includes both ends. The slice follows the order of the columns in the DataFrame, not alphabetical order, so it returns whichever columns sit between col1 and col3. If the column labels are not sorted, both endpoint labels must exist and be unique.
Why it matters: Misunderstanding slicing behavior can cause missing or extra columns in selection.
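The sketch below contrasts inclusive label slicing with end-exclusive position slicing, and shows that a label slice follows column order rather than alphabetical order. Both DataFrames are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with four columns in a fixed order.
df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3], "col4": [4]})

label_slice = df.loc[:, "col1":"col3"]   # both endpoints are included
position_slice = df.iloc[:, 0:2]         # end position is excluded, as in Python

# Unsorted but unique labels: the slice walks the columns in stored order.
unsorted_df = pd.DataFrame({"b": [1], "z": [2], "a": [3], "c": [4]})
between_z_and_a = unsorted_df.loc[:, "z":"a"]
```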
Expert Zone
1
Selecting columns by label with .loc is inclusive of the end label, unlike Python slicing which excludes the end.
2
When selecting columns with mixed data types, pandas may create multiple internal blocks, affecting performance and memory.
3
Using .filter with regex patterns can select columns dynamically but may be slower on very large DataFrames.
When NOT to use
Column selection alone is not the right tool when you need to pick rows based on conditions; use boolean indexing for that. For datasets too large to fit in memory, consider libraries like Dask or Vaex, which are built to handle larger-than-memory data.
Production Patterns
In real-world projects, selecting multiple columns is often combined with chaining methods like filtering rows, applying functions, and grouping data. It is common to select columns early to reduce data size and improve performance before heavy computations.
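A chained pipeline of this kind might look like the sketch below. The sales data, the column names, and the threshold of 100 are all hypothetical; the point is only that columns are narrowed first and heavier steps follow.

```python
import pandas as pd

# Hypothetical sales data, used only for illustration.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "sales": [120, 90, 200, 150],
    "profit": [30, 10, 55, 40],
    "notes": ["a", "b", "c", "d"],
})

summary = (
    sales[["region", "sales", "profit"]]   # drop unneeded columns early
    .query("sales > 100")                  # then filter rows
    .groupby("region", as_index=False)     # then aggregate per region
    .sum()
)
print(summary)
```

Selecting columns at the start of the chain means every later step works on less data and cannot accidentally aggregate a column such as notes.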
Connections
SQL SELECT statement
Similar pattern of choosing specific columns from a table.
Understanding pandas column selection helps grasp SQL SELECT queries, as both focus on extracting relevant data fields.
Relational database schema design
Selecting columns relates to understanding table structure and relationships.
Knowing how columns represent attributes in databases aids in designing efficient data queries and selections.
User interface design - filtering options
Selecting columns is like choosing filters or options in a UI to customize views.
Recognizing this connection helps appreciate how data selection improves user experience by focusing on relevant information.
Common Pitfalls
#1 Using single brackets with multiple column names causes errors.
Wrong approach: df['Age', 'City']
Correct approach: df[['Age', 'City']]
Root cause: Misunderstanding that df[] expects one argument; multiple columns must be passed as a list inside df[].
#2 Trying to select columns by position using string indices.
Wrong approach: df[['0', '1']]
Correct approach: df.iloc[:, [0, 1]]
Root cause: Confusing column labels (names) with their integer positions.
#3 Assuming changes to selected columns affect original DataFrame.
Wrong approach:
subset = df[['Age', 'City']]
subset['Age'] = subset['Age'] + 1
Correct approach:
# To change the original, modify df directly
df.loc[:, 'Age'] = df['Age'] + 1
Root cause: Not realizing that selecting multiple columns returns a copy, so changes to the copy don't affect the original.
Key Takeaways
Selecting multiple columns in pandas is done by passing a list of column names inside double square brackets.
You can select columns by label using .loc or by position using .iloc for more control.
Selecting columns usually returns a copy, protecting the original data from unintended changes.
Selecting columns by condition or with functions allows flexible and dynamic data manipulation.
Understanding the difference between labels and positions prevents common selection errors.