How to filter Pandas dataframe using ‘in’ and ‘not in’ like in SQL?

To filter a pandas DataFrame using “IN” and “NOT IN” logic (similar to SQL), you can use the isin() method combined with boolean indexing. Here’s how:

1. Filter Using IN (Select Rows Where Column Value is in a List)

Use df[df['column'].isin(list_of_values)]:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 40, 45]
})

# Values to filter
selected_names = ['Alice', 'Charlie', 'Eve']

# Filter rows where 'name' is in the list
filtered_df = df[df['name'].isin(selected_names)]

Output:

      name  age
0    Alice   25
2  Charlie   35
4      Eve   45

2. Filter Using NOT IN (Exclude Rows Where Column Value is in a List)

Use the ~ operator to negate isin():

# Exclude rows where 'name' is in the list
excluded_df = df[~df['name'].isin(selected_names)]

Output:

    name  age
1    Bob   30
3  David   40

3. Filter with Multiple Columns

If you need to check values across multiple columns, combine conditions:

# Values to filter
ages_to_exclude = [25, 40]

# Filter where 'name' is in selected_names AND 'age' not in ages_to_exclude
filtered_df = df[
    df['name'].isin(selected_names) & 
    ~df['age'].isin(ages_to_exclude)
]

Output:

      name  age
2  Charlie   35
4      Eve   45

4. Using query() for SQL-Like Syntax

For a more SQL-like approach, use df.query():

# Filter using "IN"
filtered_df = df.query("name in @selected_names")

# Filter using "NOT IN"
excluded_df = df.query("name not in @selected_names")

Key Notes:

  • isin() works with lists, NumPy arrays, or pandas Series.
  • Use ~ to invert the condition (equivalent to NOT IN).
  • For large DataFrames, isin() is faster than using apply or loops.
  • Combine multiple conditions with & (AND) or | (OR) and wrap each condition in parentheses.

Example with Multiple Conditions:

# Filter rows where 'name' is in selected_names OR 'age' > 35
filtered_df = df[
    df['name'].isin(selected_names) | 
    (df['age'] > 35)
]

Output:

      name  age
0    Alice   25
2  Charlie   35
3    David   40
4      Eve   45

This approach mirrors SQL’s IN and NOT IN functionality while leveraging pandas’ vectorized operations for efficiency.

Leave a Reply

Your email address will not be published. Required fields are marked *