Overview - Selecting rows (loc, iloc)

What is it?

Selecting rows in data analysis means choosing specific rows from a table of data. In Python's pandas library, two common ways to do this are loc and iloc. loc selects rows by their labels or names, while iloc selects rows by their position or number. This helps you focus on the data you want to analyze or change.

Why it matters

Without ways to select rows easily, working with large tables would be slow and confusing. You might have to look at all data even if you need just a few rows. loc and iloc let you quickly pick exactly what you want, saving time and avoiding mistakes. This makes data analysis faster and more accurate.

Where it fits

Before learning loc and iloc, you should know what a DataFrame is and how data is organized in rows and columns. After mastering row selection, you can learn about selecting columns, filtering data with conditions, and modifying data. This is a key step in exploring and cleaning data.

Mental Model

Core Idea

loc selects rows by their labels; iloc selects rows by their integer position.

Think of it like...

Imagine a book with pages numbered from 1 to 100. iloc is like picking pages by their page number, while loc is like picking pages by the chapter title or section name written on the page.

DataFrame rows:
┌─────────────┐
│ Index Label │  <- loc uses these labels
├─────────────┤
│     A       │
│     B       │
│     C       │
└─────────────┘

Positions:
┌─────────────┐
│ Position #  │  <- iloc uses these numbers
├─────────────┤
│     0       │
│     1       │
│     2       │
└─────────────┘
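A minimal sketch of the two diagrams above, assuming a hypothetical three-row DataFrame with labels A, B, C and a made-up `score` column. It shows the same row being reached by label and by position.

```python
import pandas as pd

# Hypothetical three-row table using the labels from the diagram.
df = pd.DataFrame({"score": [10, 20, 30]}, index=["A", "B", "C"])

by_label = df.loc["B"]      # row whose index label is "B"
by_position = df.iloc[1]    # row at position 1 (the second row)

# Both lookups return the same row.
print(by_label.equals(by_position))
```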

Build-Up - 7 Steps

1

Foundation - Understanding DataFrame Rows and Index

Concept: Learn what rows and index labels are in a DataFrame.

A DataFrame is like a table with rows and columns. Each row has a label called an index. By default, these labels are numbers starting from 0, but they can be any names or numbers. The index helps us find rows easily.

Result

You can identify rows by their position (0,1,2...) or by their label (like 'A', 'B', 'C').

Knowing that rows have labels (index) and positions helps you understand why there are two ways to select rows.
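As an illustration, the sketch below builds the same data twice, once with the default index and once with custom labels. The column names and values are invented for the example.

```python
import pandas as pd

# Made-up data for illustration.
data = {"name": ["Ana", "Ben", "Cy"], "age": [28, 35, 41]}

default_df = pd.DataFrame(data)                         # labels default to 0, 1, 2
labeled_df = pd.DataFrame(data, index=["A", "B", "C"])  # custom labels

print(list(default_df.index))
print(list(labeled_df.index))

# Relabeling does not move rows: position 0 holds the same data in both frames.
print(default_df.iloc[0]["name"], labeled_df.iloc[0]["name"])
```

The labels differ between the two frames, but the positions do not, which is why pandas offers one indexer for each.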

2

Foundation - Basic Row Selection with iloc

3

Intermediate - Basic Row Selection with loc

4

Intermediate - Differences Between loc and iloc

5

Intermediate - Selecting Rows with Lists and Conditions

6

Advanced - Handling Missing or Non-Unique Indexes

7

Expert - Performance and Internals of loc and iloc

Under the Hood

iloc accesses rows directly by their integer position, much like indexing an array. loc first looks up the label in the DataFrame's index, typically through a hash table (or a binary search when the index is sorted), to find the matching row positions, and then retrieves those rows. This lookup adds some overhead but allows flexible label-based access.
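The two-step idea can be imitated by hand. This is a conceptual sketch, not a description of how pandas implements loc internally: `Index.get_loc` resolves a label to a position, and iloc then fetches that position. The DataFrame is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({"score": [10, 20, 30]}, index=["A", "B", "C"])

# Step 1: resolve the label to an integer position through the index.
position = df.index.get_loc("B")

# Step 2: fetch the row at that position.
row_via_position = df.iloc[position]

print(position)
print(row_via_position.equals(df.loc["B"]))
```

With a unique index, `get_loc` returns a single integer; with duplicate labels it can return a slice or a boolean mask instead.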

Why designed this way?

pandas was designed to handle both positional and label-based indexing because data often has meaningful labels but also needs fast numeric access. The dual system balances ease of use and performance. Alternatives like only numeric or only label indexing would limit flexibility or speed.

DataFrame
┌───────────────────────────────┐
│ Index (labels) ──────────────┐│
│ ┌───────────────┐            ││
│ │ 'A' 'B' 'C'   │            ││
│ └───────────────┘            ││
│                              ││
│ Positions (0,1,2) ──────────>││
│                              ││
│ loc: label lookup ──────────>││
│ iloc: direct position access ││
└───────────────────────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Does df.loc[0] always select the first row? Commit to yes or no.

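Common Belief: df.loc[0] selects the first row, just like df.iloc[0].

Reality: df.loc[0] selects the row whose label is 0. That is the first row only when the index is the default 0, 1, 2, and so on; with any other index it may be a different row, or raise a KeyError if no row has that label.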
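A quick way to test this belief is a frame whose integer labels are not in positional order. The sketch below uses a hypothetical `city` column and labels in reverse order.

```python
import pandas as pd

# Hypothetical frame whose integer labels run in reverse order.
df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"]}, index=[2, 1, 0])

print(df.loc[0]["city"])    # row labeled 0, which is the last row here
print(df.iloc[0]["city"])   # row at position 0, the first row
```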

Quick: Does slicing with loc exclude the last label like normal Python slices? Commit to yes or no.

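Common Belief: Slicing with loc excludes the last label, just like normal Python slices.

Reality: loc slices include the end label, so df.loc['A':'C'] returns rows 'A', 'B', and 'C'. Only iloc follows the normal Python rule of excluding the end position.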

Quick: Can iloc accept boolean conditions like loc? Commit to yes or no.

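Common Belief: iloc can select rows using boolean conditions just like loc.

Reality: Not directly. iloc raises an error when given a boolean Series such as df['Age'] > 30. It does accept a plain boolean array or list, so either use loc or convert the mask first with .to_numpy().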

Quick: If a DataFrame has duplicate labels, does loc return only the first matching row? Commit to yes or no.

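Common Belief: loc returns only the first row when labels are duplicated.

Reality: loc returns every row that carries the label, so a single label can yield a DataFrame with several rows instead of one Series.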
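The duplicate-label case is easy to check with a small made-up frame in which the label "A" appears twice.

```python
import pandas as pd

# Made-up frame with a repeated label.
df = pd.DataFrame({"amount": [5, 7, 9]}, index=["A", "B", "A"])

matches = df.loc["A"]        # a DataFrame holding every row labeled "A"

print(len(matches))
print(df.index.is_unique)    # a cheap check before relying on single-row lookups
```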

Expert Zone

1

When the index is a MultiIndex (multiple levels), loc can select rows using tuples of labels, but iloc still uses simple integer positions.

2

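Label slices with loc depend on row order. On a unique but unsorted index, the slice follows the stored order rather than the sorted order, and on an unsorted index with duplicate labels it can raise a KeyError; sorting the index first is recommended.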

3

Boolean indexing with loc creates a copy of the data, which can affect memory usage and performance in large datasets.
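The three points above can be sketched in one short script. All frames, column names, and values here are invented for illustration, and the behavior shown is what current pandas versions are expected to do rather than a guarantee for every release.

```python
import pandas as pd

# 1. MultiIndex: loc takes a tuple of labels, iloc takes a plain position.
index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1")],
    names=["year", "quarter"],
)
sales = pd.DataFrame({"sales": [100, 120, 90]}, index=index)
by_labels = sales.loc[("2023", "Q2"), :]
by_position = sales.iloc[1]

# 2. Unsorted index: a label slice follows stored order; duplicates can make it fail.
unsorted_df = pd.DataFrame({"value": [3, 1, 2]}, index=["c", "a", "b"])
stored_order = list(unsorted_df.loc["c":"a"].index)
sorted_order = list(unsorted_df.sort_index().loc["a":"b"].index)

dupes = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["b", "a", "b", "a"])
try:
    dupes.loc["a":"b"]
    slice_failed = False
except KeyError:
    slice_failed = True
after_sort = dupes.sort_index().loc["a":"b"]

# 3. Boolean indexing returns a copy: later edits to the original do not reach it.
ages = pd.DataFrame({"Age": [25, 35, 45]}, index=["A", "B", "C"])
over_30 = ages.loc[ages["Age"] > 30]
ages.loc["B", "Age"] = 99

print(by_labels["sales"], by_position["sales"])
print(stored_order, sorted_order, slice_failed, len(after_sort))
print(over_30.loc["B", "Age"], ages.loc["B", "Age"])
```

Writing the MultiIndex lookup as `sales.loc[(...), :]` keeps the tuple from being read as a row and column pair.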

When NOT to use

Avoid using loc when your index labels are not unique or not meaningful; instead, reset the index or use iloc for position-based selection. For very large datasets where speed is critical, prefer iloc or specialized libraries like Dask.

Production Patterns

In real-world data pipelines, loc is often used for filtering data by meaningful keys like dates or IDs, while iloc is used in loops or algorithms needing fast positional access. Combining loc with boolean conditions is common for data cleaning and feature engineering.
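A small sketch of these patterns, assuming hypothetical daily readings keyed by date. The column name and values are made up.

```python
import pandas as pd

# Hypothetical daily readings keyed by date.
dates = pd.date_range("2024-01-01", periods=5, freq="D")
df = pd.DataFrame({"reading": [1.0, 2.5, 3.0, 4.5, 5.0]}, index=dates)

# Label-based: select a date range (both end dates are included).
window = df.loc["2024-01-02":"2024-01-03"]

# Positional: walk the rows in order, for example to compare neighbors.
diffs = [
    df.iloc[i]["reading"] - df.iloc[i - 1]["reading"]
    for i in range(1, len(df))
]

# Condition-based: a typical cleaning filter.
high = df.loc[df["reading"] > 3.0]

print(len(window), diffs, len(high))
```

The explicit loop is only there to show positional access; for real workloads a vectorized call such as `df["reading"].diff()` is usually the better choice.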

Connections

SQL WHERE clause

loc with boolean conditions works like SQL's WHERE clause to filter rows.

Understanding loc filtering helps grasp how databases select rows, bridging pandas and SQL skills.
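To make the parallel concrete, here is a sketch with a made-up `people` table. The SQL in the comment is an approximate equivalent, not something pandas executes.

```python
import pandas as pd

# Made-up table for illustration.
people = pd.DataFrame(
    {
        "Name": ["Ana", "Ben", "Cy"],
        "Age": [28, 35, 41],
        "City": ["Oslo", "Lima", "Oslo"],
    }
)

# Roughly: SELECT Name, Age FROM people WHERE Age > 30 AND City = 'Oslo'
result = people.loc[
    (people["Age"] > 30) & (people["City"] == "Oslo"),
    ["Name", "Age"],
]

print(result)
```

Each condition needs its own parentheses because `&` binds more tightly than the comparison operators in Python.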

Array indexing in NumPy

iloc is similar to NumPy array indexing by position.

Knowing NumPy indexing makes iloc intuitive, as both use zero-based integer positions.
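The similarity shows up when the same positions are used on an array and on a DataFrame built from it. The labels A to D are arbitrary and play no part in the iloc lookups.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40])
df = pd.DataFrame({"value": arr}, index=["A", "B", "C", "D"])

print(arr[0], df.iloc[0]["value"])                 # first element / first row
print(arr[-1], df.iloc[-1]["value"])               # negative positions count from the end
print(arr[1:3], df.iloc[1:3]["value"].to_numpy())  # end position excluded in both
```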

File directory navigation

Selecting rows by label (loc) is like choosing folders by name, while selecting by position (iloc) is like picking folders by their order in a list.

This connection helps understand why both label and position selection are useful in organizing and accessing data.

Common Pitfalls

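#1 Using loc with integer positions, assuming it behaves like iloc.

Wrong approach: df.loc[0:2] expecting the first two rows

Correct approach: df.iloc[0:2]

Root cause: Confusing loc (label-based) with iloc (position-based) leads to wrong rows being selected. With the default index, df.loc[0:2] returns three rows (labels 0, 1, and 2) because label slices include the end; with any other index, the labels 0 and 2 may refer to different rows or not exist at all.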
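The difference is visible even with the default index. The sketch below uses a made-up four-row frame.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40]})   # default labels 0, 1, 2, 3

label_slice = df.loc[0:2]       # labels 0 through 2, end included
position_slice = df.iloc[0:2]   # positions 0 and 1, end excluded

print(len(label_slice), len(position_slice))
```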

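#2 Passing a boolean Series to iloc.

Wrong approach: df.iloc[df['Age'] > 30]

Correct approach: df.loc[df['Age'] > 30]

Root cause: iloc rejects a boolean Series, because a Series carries labels and iloc works purely by position. Use loc or direct DataFrame filtering; if positional selection is required, pass a plain boolean array, for example df.iloc[(df['Age'] > 30).to_numpy()].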
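A sketch of this pitfall on a made-up `Age` column. The exception type raised by iloc depends on the index type of the mask, so the example catches both of the types pandas is known to use here.

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 35, 45]})
mask = df["Age"] > 30            # a boolean Series

try:
    df.iloc[mask]
    iloc_accepted_series = True
except (ValueError, NotImplementedError):
    iloc_accepted_series = False

with_loc = df.loc[mask]                  # label-aligned filtering
with_iloc = df.iloc[mask.to_numpy()]     # a plain boolean array is accepted

print(iloc_accepted_series, with_loc.equals(with_iloc))
```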

#3 Assuming a loc slice excludes the last label.

Wrong approach: df.loc['A':'C'] expecting rows 'A' and 'B' only

Correct approach: df.loc['A':'C'] includes rows 'A', 'B', and 'C'

Root cause: Not knowing that a loc slice includes the end label causes off-by-one errors.
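Comparing the two slice styles side by side on a made-up three-row frame makes the off-by-one visible.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]}, index=["A", "B", "C"])

label_rows = list(df.loc["A":"C"].index)     # end label included
position_rows = list(df.iloc[0:2].index)     # end position excluded

print(label_rows)
print(position_rows)
```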

Key Takeaways

loc selects rows by their labels; iloc selects rows by their integer positions.

loc includes the last label in slices, while iloc excludes the last position like normal Python slicing.

A boolean Series produced by a condition works with loc; iloc accepts only a plain boolean array or list, not a Series.

Duplicate or missing labels affect loc but not iloc, so choose based on your data's index.

Understanding these differences helps avoid common bugs and write clearer, faster data selection code.