How to create a new column where the values are selected based on an existing column?

To create a new column in a pandas DataFrame whose values are derived conditionally from an existing column, you can use np.where(), np.select(), Series.apply(), boolean indexing with .loc, or Series.map(), depending on how complex your conditions are. Here are the common approaches:

1. Simple Binary Conditions: np.where()

Use numpy.where(condition, value_if_true, value_if_false) for straightforward if-else logic.

import pandas as pd
import numpy as np

# Sample DataFrame
df = pd.DataFrame({'score': [85, 45, 72, 90, 60]})

# Create a new column: 'pass' if score >= 60, else 'fail'
df['result'] = np.where(df['score'] >= 60, 'pass', 'fail')

Output:

   score result
0     85   pass
1     45   fail
2     72   pass
3     90   pass
4     60   pass
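
The condition passed to np.where() can also combine several comparisons. The following is a minimal sketch, assuming the same sample scores; the column name 'band' and the 60 to 89 range are made up for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [85, 45, 72, 90, 60]})

# Each comparison needs its own parentheses because & binds tighter than >= and <
in_mid_band = (df['score'] >= 60) & (df['score'] < 90)

# 'band' is a hypothetical column name used only in this example
df['band'] = np.where(in_mid_band, 'mid', 'other')
print(df)
```

Use & (and), | (or), and ~ (not) to combine masks; Python's own and/or keywords do not work element-wise on a Series.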

2. Complex Logic: apply() with a Custom Function

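Use apply() when the logic is easier to express as an ordinary Python function. Series.apply() calls the function once per value, so it is flexible but not vectorized.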

# Categorize scores into grades
def assign_grade(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    else:
        return 'F'

df['grade'] = df['score'].apply(assign_grade)

Output:

   score result grade
0     85   pass     B
1     45   fail     F
2     72   pass     C
3     90   pass     A
4     60   pass     F
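
When the function only sorts numbers into ranges, as assign_grade does, pd.cut() is one possible vectorized alternative. This is a sketch under the assumption that the bin edges below mirror the thresholds in assign_grade; the column name 'grade_cut' is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [85, 45, 72, 90, 60]})

# Bins: [-inf, 70) -> F, [70, 80) -> C, [80, 90) -> B, [90, inf) -> A
edges = [-np.inf, 70, 80, 90, np.inf]
labels = ['F', 'C', 'B', 'A']

# right=False makes each bin include its left edge and exclude its right edge,
# which matches the ">=" comparisons in assign_grade
df['grade_cut'] = pd.cut(df['score'], bins=edges, labels=labels, right=False)
print(df)
```

Note that pd.cut() returns a categorical column rather than plain strings; call .astype(str) on it if you need ordinary string values.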

3. Multiple Conditions: np.select()

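Use numpy.select() to pair a list of conditions with a list of outputs. Conditions are checked in order and the first one that matches wins, so list the most restrictive condition first. Rows that match no condition get the default value.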

# Define conditions and choices
conditions = [
    df['score'] >= 90,
    df['score'] >= 80,
    df['score'] >= 60,
    df['score'] < 60
]

choices = ['Excellent', 'Good', 'Pass', 'Fail']

df['category'] = np.select(conditions, choices, default='Unknown')

Output:

   score result grade   category
0     85   pass     B       Good
1     45   fail     F       Fail
2     72   pass     C       Pass
3     90   pass     A  Excellent
4     60   pass     F       Pass
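
To see why the order of the conditions and the default value matter, here is a small sketch. The three-row frame (including a missing score) and the deliberately misordered second call are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [95.0, 85.0, np.nan]})

conditions = [
    df['score'] >= 90,
    df['score'] >= 80,
    df['score'] >= 60,
    df['score'] < 60,
]
choices = ['Excellent', 'Good', 'Pass', 'Fail']

# NaN fails every comparison, so the last row falls through to the default
ordered = np.select(conditions, choices, default='Unknown')

# Hypothetical mistake: the broadest condition comes first and shadows the rest
shadowed = np.select(
    [df['score'] >= 60, df['score'] >= 80, df['score'] >= 90],
    ['Pass', 'Good', 'Excellent'],
    default='Fail',
)

print(ordered.tolist())   # ['Excellent', 'Good', 'Unknown']
print(shadowed.tolist())  # ['Pass', 'Pass', 'Fail']
```

In the second call, 95 and 85 both satisfy score >= 60, so they are labelled 'Pass' before the narrower conditions are ever consulted.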

4. Boolean Indexing with .loc

Initialize the column with a default value, then overwrite subsets of rows using boolean masks.

# Initialize a new column
df['status'] = 'Neutral'

# Update values conditionally
df.loc[df['score'] >= 80, 'status'] = 'High'
df.loc[df['score'] < 60, 'status'] = 'Low'

Output:

   score result grade   category  status
0     85   pass     B       Good    High
1     45   fail     F       Fail     Low
2     72   pass     C       Pass Neutral
3     90   pass     A  Excellent    High
4     60   pass     F       Pass Neutral
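
Because each .loc assignment overwrites whatever is already in the selected rows, the order of the statements matters when masks overlap. A minimal sketch, assuming the same scores; the 'tier' column and its labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'score': [85, 45, 72, 90, 60]})

# Start from the broadest label, then overwrite with narrower ones
df['tier'] = 'Low'
df.loc[df['score'] >= 60, 'tier'] = 'Mid'
df.loc[df['score'] >= 90, 'tier'] = 'Top'  # also matches >= 60, so it must run last

print(df)
```

Swapping the last two assignments would leave the row with a score of 90 labelled 'Mid', since the broader mask would run after the narrower one.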

5. Mapping Values: map() with a Dictionary

Use a dictionary to map existing values to new ones. Values with no matching key in the dictionary become NaN.

# Map grades to remarks
grade_to_remark = {
    'A': 'Outstanding',
    'B': 'Very Good',
    'C': 'Average',
    'F': 'Needs Improvement'
}

df['remark'] = df['grade'].map(grade_to_remark)

Output:

   score result grade   category  status             remark
0     85   pass     B       Good    High          Very Good
1     45   fail     F       Fail     Low  Needs Improvement
2     72   pass     C       Pass Neutral            Average
3     90   pass     A  Excellent    High        Outstanding
4     60   pass     F       Pass Neutral  Needs Improvement
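
One thing to watch with map(): a value that has no key in the dictionary comes back as NaN. The sketch below assumes a grade 'D' that is absent from the dictionary, and the fallback text 'No remark' is a made-up placeholder.

```python
import pandas as pd

grade_to_remark = {
    'A': 'Outstanding',
    'B': 'Very Good',
    'C': 'Average',
    'F': 'Needs Improvement',
}

# 'D' is deliberately not a key in the dictionary
grades = pd.Series(['A', 'B', 'D'])

remarks = grades.map(grade_to_remark)          # the 'D' row becomes NaN
remarks_filled = remarks.fillna('No remark')   # replace NaN with a fallback label

print(remarks_filled.tolist())  # ['Outstanding', 'Very Good', 'No remark']
```

Checking remarks.isna().any() after a map() call is a quick way to catch keys you forgot to include.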

Key Notes:

  • np.where(): Best for simple binary conditions.
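  • apply(): Flexible for complex logic, but it calls a Python function once per value and is slower on large datasets.
  • np.select(): Vectorized handling of multiple conditions; the first matching condition wins.
  • .loc: Useful for direct assignment to subsets of the DataFrame.
  • map(): Ideal for one-to-one value replacement using a dictionary; unmapped values become NaN.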

For large datasets, prioritize vectorized operations (np.where, np.select) over apply() for better performance.
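
If you want to check this on your own machine, the following rough sketch builds a larger random frame, computes the grades both ways, confirms that the results agree, and prints the elapsed times. The row count, the random seed, and the variable names are arbitrary choices for this example, and actual timings will vary by machine and library version.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
big = pd.DataFrame({'score': rng.integers(0, 101, size=200_000)})

def assign_grade(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    else:
        return 'F'

# Element-by-element approach
start = time.perf_counter()
by_apply = big['score'].apply(assign_grade)
apply_seconds = time.perf_counter() - start

# Vectorized approach with the same thresholds
start = time.perf_counter()
by_select = np.select(
    [big['score'] >= 90, big['score'] >= 80, big['score'] >= 70],
    ['A', 'B', 'C'],
    default='F',
)
select_seconds = time.perf_counter() - start

# Both approaches should produce identical labels
print('same result:', by_apply.tolist() == by_select.tolist())
print(f'apply: {apply_seconds:.4f}s, np.select: {select_seconds:.4f}s')
```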
