How do I select rows from a DataFrame based on column values?

In pandas, you can select rows from a DataFrame based on column values with Boolean indexing, loc, the query() method, or helper methods such as isin() and between(). The most common approaches are shown below with examples:

1. Boolean Indexing

Use logical conditions to create a Boolean mask and filter rows.
Syntax:

python

df[df['column'] > value]  # any comparison works here: ==, !=, <, <=, >, >=

Examples:

python

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'SF', 'LA', 'TX']
}
df = pd.DataFrame(data)

# Select rows where Age > 30
result = df[df['Age'] > 30]
print(result)
# Output:
#      Name  Age City
# 2  Charlie   35   LA
# 3    David   40   TX

# Combine conditions with & (and), | (or), ~ (not)
result = df[(df['Age'] > 25) & (df['City'] == 'SF')]
# Output:
#   Name  Age City
# 1  Bob   30   SF
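
The comment above also mentions | and ~, which the example does not exercise. A minimal sketch of both follows, reusing the same sample data. Note that each comparison needs its own parentheses, because & and | bind more tightly than comparison operators in Python. The variable names either and not_ny are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'SF', 'LA', 'TX'],
})

# OR: rows where Age is below 30 or City is 'TX'
either = df[(df['Age'] < 30) | (df['City'] == 'TX')]

# NOT: rows where City is anything other than 'NY'
not_ny = df[~(df['City'] == 'NY')]

print(either['Name'].tolist())  # ['Alice', 'David']
print(not_ny['Name'].tolist())  # ['Bob', 'Charlie', 'David']
```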

2. Using loc

Explicitly filter rows (and optionally select columns) using loc:

python

# Select rows where City is 'LA' and show the 'Name' column
result = df.loc[df['City'] == 'LA', 'Name']
# Output:
# 2    Charlie
# Name: Name, dtype: object

# Multiple conditions
result = df.loc[(df['Age'] < 40) & (df['Name'].str.startswith('C'))]
# Output:
#      Name  Age City
# 2  Charlie   35   LA
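
loc also accepts a list of column labels as its second argument, so a row filter and a column selection can be done in one step. A short sketch with the same sample data (the name subset is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'SF', 'LA', 'TX'],
})

# Rows where Age is at least 30, keeping only the Name and City columns
subset = df.loc[df['Age'] >= 30, ['Name', 'City']]
print(subset)
#       Name City
# 1      Bob   SF
# 2  Charlie   LA
# 3    David   TX
```

Passing a list of labels returns a DataFrame, while passing a single label (as in the 'Name' example above) returns a Series.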

3. query() Method

Write SQL-like syntax for readability (especially for complex conditions):

python

result = df.query("Age > 30 and City in ['LA', 'TX']")
# Output:
#      Name  Age City
# 2  Charlie   35   LA
# 3    David   40   TX
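
Inside a query() string, a local Python variable can be referenced by prefixing its name with @, which avoids building the expression through string formatting. A minimal sketch, where min_age and cities are illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'SF', 'LA', 'TX'],
})

min_age = 30
cities = ['LA', 'TX']

# @name pulls the value from the surrounding Python scope
result = df.query("Age > @min_age and City in @cities")
print(result['Name'].tolist())  # ['Charlie', 'David']
```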

4. Filter with isin()

Select rows where a column value is in a list:

python

cities = ['NY', 'LA']
result = df[df['City'].isin(cities)]
# Output:
#      Name  Age City
# 0   Alice   25   NY
# 2  Charlie   35   LA
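
To select the complement, that is, rows whose value is not in the list, the isin() mask can be negated with ~. A short sketch on the same data (the name excluded is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'SF', 'LA', 'TX'],
})

cities = ['NY', 'LA']

# Rows whose City is NOT one of the listed values
excluded = df[~df['City'].isin(cities)]
print(excluded['Name'].tolist())  # ['Bob', 'David']
```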

5. String Operations

Use string methods for text filtering:

python

# Select rows where Name contains a lowercase 'a' (str.contains is case-sensitive by default)
result = df[df['Name'].str.contains('a')]
# Output ('Alice' is excluded because its only 'A' is uppercase):
#      Name  Age City
# 2  Charlie   35   LA
# 3    David   40   TX
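
Two optional arguments of str.contains are often useful in practice: case=False makes the match case-insensitive, and na=False treats missing values as non-matches so the mask stays purely Boolean. The sketch below uses a small illustrative frame with one missing name, which is an assumption and not part of the sample data above.

```python
import pandas as pd

# Illustrative data with a missing Name
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, 30, 35, 40],
})

# case=False matches 'A' as well as 'a'; na=False maps missing names to False
mask = df['Name'].str.contains('a', case=False, na=False)
result = df[mask]
print(result['Name'].tolist())  # ['Alice', 'David']
```

Without na=False, the missing entry would produce a missing value in the mask, and indexing with that mask can raise an error.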

6. between() for Ranges

Filter rows where a value falls within a range:

python

result = df[df['Age'].between(30, 35, inclusive='both')]
# Output:
#      Name  Age City
# 1     Bob   30   SF
# 2  Charlie   35   LA
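
The inclusive argument controls which endpoints count as part of the range. Besides 'both', it accepts 'left', 'right', and 'neither'. This sketch assumes pandas 1.3 or later, where the string values were introduced:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'SF', 'LA', 'TX'],
})

# 30 <= Age < 35: the left endpoint is included, the right one is not
left_closed = df[df['Age'].between(30, 35, inclusive='left')]

# 25 < Age < 40: both endpoints are excluded
open_range = df[df['Age'].between(25, 40, inclusive='neither')]

print(left_closed['Name'].tolist())  # ['Bob']
print(open_range['Name'].tolist())   # ['Bob', 'Charlie']
```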

Key Tips:

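  • Avoid Chained Indexing: When assigning to a filtered subset, use a single df.loc[mask, 'other_column'] = value instead of df[df.column > x]['other_column'] = value. The chained form may write to a temporary copy and trigger SettingWithCopyWarning.
  • Performance: Boolean indexing is vectorized, so it avoids Python-level loops over rows. query() can be faster on large DataFrames when the optional numexpr package is installed; on small frames, the cost of parsing the expression usually outweighs any gain.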
  • Null Values: Use df[df['column'].notna()] or df[df['column'].isna()] for handling NaN.
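
A minimal sketch of the first and third tips together. The frame below is illustrative: it reuses the sample columns but sets one Age to NaN, and the replacement value 'Other' is an arbitrary placeholder.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'City': ['NY', 'SF', 'LA', 'TX'],
})

# Null values: keep only rows where Age is present
known_age = df[df['Age'].notna()]
print(known_age['Name'].tolist())  # ['Alice', 'Charlie', 'David']

# Assignment: one loc call with a row mask and a column label,
# rather than df[df['Age'] > 30]['City'] = 'Other'
df.loc[df['Age'] > 30, 'City'] = 'Other'
print(df['City'].tolist())  # ['NY', 'SF', 'Other', 'Other']
```

Comparisons against NaN evaluate to False, so Bob's row is left untouched by the assignment.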

Summary

Method           | Use Case
Boolean Indexing | Simple to moderate conditions.
loc/iloc         | Explicit row/column selection.
query()          | Readable syntax for complex conditions.
isin()/between() | Filtering against lists or ranges.

Example for Multiple Conditions:

python

result = df.loc[
    (df['Age'] > 25) & 
    ~df['City'].isin(['TX']) & 
    df['Name'].str.contains('e')
]
# Output:
#      Name  Age City
# 2  Charlie   35   LA
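
When a filter grows beyond two or three conditions, one option is to give each mask a name and combine them at the end. This is a stylistic sketch of the same filter, and the mask names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'SF', 'LA', 'TX'],
})

# Each condition is a Boolean Series aligned to df's index
older_than_25 = df['Age'] > 25
outside_tx = ~df['City'].isin(['TX'])
name_has_e = df['Name'].str.contains('e')

result = df.loc[older_than_25 & outside_tx & name_has_e]
print(result['Name'].tolist())  # ['Charlie']
```

Named masks can be inspected or reused individually, which tends to make a long filter easier to debug.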

Choose the method that best fits your code’s readability and performance needs!
