How can I iterate over rows in a Pandas DataFrame?

In Pandas, iterating over rows in a DataFrame can be done using several methods, but vectorized operations are generally preferred for performance reasons. However, if you need explicit iteration, here are the most common approaches:

1. Using iterrows()

Iterates over rows as (index, Series) pairs.
Pros: Easy to use, provides access to the index.
Cons: Slow for large DataFrames (builds a new Series for each row) and does not preserve dtypes across a row.

python

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

for index, row in df.iterrows():
    print(f"Index: {index}, A: {row['A']}, B: {row['B']}")
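
Because iterrows() returns each row as a single Series, values in a row of mixed numeric columns may be upcast to a common dtype. The following is a minimal sketch of that effect; the frame and its column names (qty, price) are made up for illustration.

```python
import pandas as pd

# Hypothetical frame mixing an integer column and a float column
df_mixed = pd.DataFrame({"qty": [1, 2], "price": [1.5, 2.5]})

# iterrows() packs each row into one Series, so the row gets a single dtype
_, first_row = next(df_mixed.iterrows())
print(first_row.dtype, type(first_row["qty"]))

# itertuples() reads each column separately, so the integer stays an integer
first_tuple = next(df_mixed.itertuples(index=False))
print(type(first_tuple.qty))
```

If the original dtypes matter for your row-wise logic, itertuples() is usually the safer choice.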

2. Using itertuples()

Iterates over rows as namedtuples.
Pros: Faster than iterrows(), lightweight.
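Cons: Column names that are not valid Python identifiers (e.g., containing spaces), are repeated, or start with an underscore are replaced with positional names such as _1.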

python

for row in df.itertuples():
    print(f"Index: {row.Index}, A: {row.A}, B: {row.B}")
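
The column-name caveat can be seen directly by inspecting the namedtuple's fields. This sketch uses a made-up frame with a column called "unit price" and shows two possible workarounds: plain tuples via name=None, or zipping the columns you need.

```python
import pandas as pd

# Hypothetical frame with a column name that is not a valid Python identifier
df_spaces = pd.DataFrame({"unit price": [1.5, 2.5], "B": ["x", "y"]})

# The invalid name is replaced by a positional field name
fields = next(df_spaces.itertuples())._fields
print(fields)

# Workaround 1: plain tuples (name=None), unpacked by position
pairs = [(price, b) for price, b in df_spaces.itertuples(index=False, name=None)]

# Workaround 2: zip only the columns you need
labels = [f"{b}:{price}" for price, b in zip(df_spaces["unit price"], df_spaces["B"])]
print(pairs, labels)
```

Renaming the columns before iterating is another option if attribute access is preferred.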

3. Using apply() with axis=1

Applies a function row-wise.
Pros: Flexible, can return modified rows.
Cons: Still slower than vectorized operations.

python

def process_row(row):
    return row["A"] * 2  # Example: double column A's value

df["A_doubled"] = df.apply(process_row, axis=1)
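
apply() with axis=1 passes each row to the function as a Series, so the function can combine several columns. The sketch below uses a hypothetical helper, label_row, and also shows a vectorized equivalent for comparison, since simple cases like this one rarely need apply().

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

def label_row(row):
    # Hypothetical helper: combine both columns into one string
    return f"{row['B']}{row['A']}"

# Row-wise: label_row is called once per row
df["label"] = df.apply(label_row, axis=1)

# Vectorized equivalent: string concatenation on whole columns
df["label_vec"] = df["B"] + df["A"].astype(str)
print(df)
```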

4. Using a Simple Loop (Not Recommended)

Iterate using iloc or loc.
Cons: Very inefficient for large DataFrames (each df.iloc[i] lookup builds a new Series).

python

for i in range(len(df)):
    print(f"A: {df.iloc[i]['A']}, B: {df.iloc[i]['B']}")
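
If an index-based loop is unavoidable, scalar accessors such as .at (label-based) or .iat (position-based) avoid building a full row Series on each step. A minimal sketch, assuming the same small example frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

lines = []
for idx in df.index:
    # .at looks up a single cell by row label and column name
    lines.append(f"A: {df.at[idx, 'A']}, B: {df.at[idx, 'B']}")

print("\n".join(lines))
```

This is still a Python-level loop, so it remains much slower than a vectorized operation on large data.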

Key Recommendations:

  • Avoid Iteration When Possible: Use vectorized operations (e.g., df["A"] * 2 instead of row-wise loops).
  • For Small Data: itertuples() is the fastest iteration method.
  • For Transformations: Use apply() with axis=1 for row-wise logic.

Example: Vectorized vs. Iteration

Vectorized (Preferred):

python

df["A_squared"] = df["A"] ** 2  # Fast and efficient

Iteration (Avoid for Large Data):

python

squared_values = []
for row in df.itertuples():
    squared_values.append(row.A ** 2)
df["A_squared"] = squared_values

Performance Comparison:

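  • itertuples() is typically several times faster than iterrows() (roughly 5–10x, depending on the number of columns and their dtypes).
  • Vectorized operations are typically orders of magnitude faster than row-wise loops (roughly 100–1000x, depending on data size and the operation).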
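
These ratios vary with data size, dtypes, and pandas version, so it is worth measuring on your own data. The following is a rough benchmarking sketch using the standard timeit module; the frame size and the three function names are arbitrary choices for illustration.

```python
import timeit

import pandas as pd

# Arbitrary benchmark frame; increase the size to see the gap widen
df_bench = pd.DataFrame({"A": range(5_000)})

def with_iterrows():
    return [row["A"] ** 2 for _, row in df_bench.iterrows()]

def with_itertuples():
    return [row.A ** 2 for row in df_bench.itertuples()]

def vectorized():
    return (df_bench["A"] ** 2).tolist()

# Total time for 3 runs of each approach
timings = {
    fn.__name__: timeit.timeit(fn, number=3)
    for fn in (with_iterrows, with_itertuples, vectorized)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```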

When to Iterate:

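  • Row-specific logic that can’t be vectorized (e.g., calling an external service per row, or a running computation where each row depends on the previous result). Many conditional checks involving multiple columns can still be vectorized with boolean masks.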
  • Prototyping or small datasets.
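
Before falling back to a loop, it is worth checking whether the condition can be expressed on whole columns. A minimal sketch with boolean masks and numpy.where; the frame, its columns, and the thresholds are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical order data
df_orders = pd.DataFrame({"qty": [1, 5, 10], "price": [20.0, 3.0, 0.5]})

# Combine conditions on two columns into one boolean mask
is_bulk = (df_orders["qty"] >= 5) & (df_orders["price"] < 5.0)

# Pick a value per row based on the mask, without a Python loop
df_orders["tier"] = np.where(is_bulk, "bulk", "standard")
print(df_orders)
```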

For large-scale data, consider alternatives like Dask or PySpark if Pandas iteration becomes a bottleneck.
