To drop rows in a Pandas DataFrame where a specific column contains NaN (missing values), you can use the dropna() method with the subset parameter. Below are detailed examples and explanations:
1. Basic Example: Drop Rows with NaN in a Single Column
Sample DataFrame
import pandas as pd
import numpy as np
data = {
"Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
"Age": [25, np.nan, 35, 45, np.nan],
"Salary": [50000, 60000, np.nan, 80000, 90000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Name Age Salary
0 Alice 25.0 50000.0
1 Bob NaN 60000.0
2 Charlie 35.0 NaN
3 David 45.0 80000.0
4 Eva NaN 90000.0
Drop Rows Where “Age” is NaN
df_clean = df.dropna(subset=["Age"])
print("\nDataFrame After Dropping NaN in 'Age':")
print(df_clean)
Output:
Name Age Salary
0 Alice 25.0 50000.0
2 Charlie 35.0 NaN
3 David 45.0 80000.0
2. Drop Rows with NaN in Multiple Columns
Use subset with a list of columns to drop rows where any of the specified columns have NaN:
df_clean = df.dropna(subset=["Age", "Salary"])
print(df_clean)
Output:
Name Age Salary
0 Alice 25.0 50000.0
3 David 45.0 80000.0
3. Modify the DataFrame In-Place
Use inplace=True to modify the original DataFrame instead of creating a new one:
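df_inplace = df.copy()  # work on a copy so the later examples still see the full sample data
df_inplace.dropna(subset=["Salary"], inplace=True)  # modifies df_inplace itself
print(df_inplace)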
Output:
Name Age Salary
0 Alice 25.0 50000.0
1 Bob NaN 60000.0
3 David 45.0 80000.0
4 Eva NaN 90000.0
4. Drop Rows Based on Threshold (thresh)
Keep rows with at least N non-NaN values in the specified subset:
# Keep rows with at least 2 non-NaN values in the subset ["Age", "Salary"]
df_clean = df.dropna(subset=["Age", "Salary"], thresh=2)
print(df_clean)
Output:
Name Age Salary
0 Alice 25.0 50000.0
3 David 45.0 80000.0
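With two subset columns, thresh=2 gives the same result as the default how='any', so a lower threshold is needed to see what thresh adds. The sketch below is illustrative only: it reuses the sample data and adds a hypothetical sixth row ("Frank", not part of the original example) in which both Age and Salary are missing. The names df_demo, at_least_one, and both_present are placeholders.

```python
import numpy as np
import pandas as pd

# Sample data from above plus one hypothetical row ("Frank") with both values missing.
df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank"],
    "Age": [25, np.nan, 35, 45, np.nan, np.nan],
    "Salary": [50000, 60000, np.nan, 80000, 90000, np.nan],
})

# thresh=1: keep rows where at least one of Age/Salary is present.
at_least_one = df_demo.dropna(subset=["Age", "Salary"], thresh=1)

# thresh=2: keep rows where both Age and Salary are present.
both_present = df_demo.dropna(subset=["Age", "Salary"], thresh=2)

print(at_least_one["Name"].tolist())  # ['Alice', 'Bob', 'Charlie', 'David', 'Eva']
print(both_present["Name"].tolist())  # ['Alice', 'David']
```

Only the hypothetical "Frank" row is removed with thresh=1, while thresh=2 also removes every row that is missing either value.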
5. Alternative: Boolean Indexing
Filter rows using notna():
df_clean = df[df["Age"].notna()]
print(df_clean)
Output:
Name Age Salary
0 Alice 25.0 50000.0
2 Charlie 35.0 NaN
3 David 45.0 80000.0
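Boolean indexing also extends to several columns by combining the per-column masks. The following is a minimal sketch, assuming the same sample data as above; the names df_sample, mask, by_mask, and by_dropna are illustrative. It checks that the mask-based filter matches dropna() with subset.

```python
import numpy as np
import pandas as pd

# Same sample data as in the basic example.
df_sample = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 35, 45, np.nan],
    "Salary": [50000, 60000, np.nan, 80000, 90000],
})

# True only for rows where both Age and Salary are present.
mask = df_sample[["Age", "Salary"]].notna().all(axis=1)
by_mask = df_sample[mask]

# The dropna() form of the same filter.
by_dropna = df_sample.dropna(subset=["Age", "Salary"])

print(by_mask.equals(by_dropna))  # True
```

The mask form is handy when the NaN check has to be combined with other conditions, for example mask & (df_sample["Age"] > 30).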
Key Parameters of dropna()
Parameter | Description |
---|---|
subset | Columns to check for NaN (e.g., subset=["Age", "Salary"]). |
how | how='any' (default): drop rows if any subset column has NaN. how='all': drop rows only if all subset columns have NaN. |
thresh | Keep rows with at least thresh non-NaN values in the subset. Cannot be combined with how. |
inplace | Modify the DataFrame in-place instead of returning a new DataFrame. |
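
The how parameter is not exercised in the numbered examples, so here is a short sketch of the difference between 'any' and 'all'. It assumes a small hypothetical frame (the "Frank" row, with both values missing, is invented for illustration), and the variable names are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Frank" is missing both Age and Salary.
df_how = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Frank"],
    "Age": [25, np.nan, 35, np.nan],
    "Salary": [50000, 60000, np.nan, np.nan],
})

# how="any" (default): drop a row if either subset column is NaN.
drop_any = df_how.dropna(subset=["Age", "Salary"], how="any")

# how="all": drop a row only if every subset column is NaN.
drop_all = df_how.dropna(subset=["Age", "Salary"], how="all")

print(drop_any["Name"].tolist())  # ['Alice']
print(drop_all["Name"].tolist())  # ['Alice', 'Bob', 'Charlie']
```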
Common Mistakes
- Forgetting subset:
# This drops rows with NaN in ANY column (not just "Age"):
df.dropna() # Incorrect if you only want to target "Age"
- Not assigning the result (or omitting inplace=True):
# This does NOT modify the original DataFrame:
df.dropna(subset=["Age"])
# Correct approach:
df = df.dropna(subset=["Age"]) # Or use inplace=True
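The second mistake is easy to verify by comparing row counts. This is a minimal sketch on a small hypothetical frame; df_check, rows_before, and unchanged are illustrative names.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with one missing Age.
df_check = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, np.nan, 35],
})

rows_before = len(df_check)

# The result is discarded, so df_check still contains the NaN row.
df_check.dropna(subset=["Age"])
unchanged = len(df_check) == rows_before

# Reassigning keeps the filtered result.
df_check = df_check.dropna(subset=["Age"])

print(unchanged)                   # True
print(df_check["Name"].tolist())   # ['Alice', 'Charlie']
```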
Summary
- Use df.dropna(subset=["column"]) to drop rows where "column" has NaN.
- Combine subset with thresh to enforce a minimum number of valid values.
- To keep the change, reassign the result (df = df.dropna(...)) or pass inplace=True.
By mastering these methods, you can efficiently clean your DataFrames!