In Pandas, iterating over rows in a DataFrame can be done using several methods, but vectorized operations are generally preferred for performance reasons. However, if you need explicit iteration, here are the most common approaches:
1. Using `iterrows()`
Iterates over rows as `(index, Series)` pairs.
Pros: Easy to use, provides access to the index.
Cons: Slow for large DataFrames (returns a new `Series` for each row).
```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

for index, row in df.iterrows():
    print(f"Index: {index}, A: {row['A']}, B: {row['B']}")
```
2. Using `itertuples()`
Iterates over rows as namedtuples.
Pros: Faster than `iterrows()`, lightweight.
Cons: Column names with spaces/special characters may cause issues (see the note after the example below).
```python
for row in df.itertuples():
    print(f"Index: {row.Index}, A: {row.A}, B: {row.B}")
```
3. Using `apply()` with `axis=1`
Applies a function row-wise.
Pros: Flexible, can return modified rows.
Cons: Still slower than vectorized operations.
```python
def process_row(row):
    return row["A"] * 2  # Example: double column A's value

df["A_doubled"] = df.apply(process_row, axis=1)
```
4. Using a Simple Loop (Not Recommended)
Iterate over row positions using `iloc` or `loc`.
Cons: Very inefficient for large DataFrames.
```python
for i in range(len(df)):
    print(f"A: {df.iloc[i]['A']}, B: {df.iloc[i]['B']}")
```
Key Recommendations:
- Avoid Iteration When Possible: Use vectorized operations (e.g., `df["A"] * 2`) instead of row-wise loops.
- For Small Data: `itertuples()` is the fastest iteration method.
- For Transformations: Use `apply()` with `axis=1` for row-wise logic.
Example: Vectorized vs. Iteration
Vectorized (Preferred):
```python
df["A_squared"] = df["A"] ** 2  # Fast and efficient
```
Iteration (Avoid for Large Data):
```python
squared_values = []
for row in df.itertuples():
    squared_values.append(row.A ** 2)
df["A_squared"] = squared_values
```
Performance Comparison:
- `itertuples()` is roughly 5–10x faster than `iterrows()`.
- Vectorized operations are roughly 100–1000x faster than row-wise loops.
When to Iterate:
- Row-specific logic that can’t easily be vectorized (e.g., conditional checks involving multiple columns; see the sketch after this list).
- Prototyping or small datasets.
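For example, a per-row check that combines several columns can be prototyped with `itertuples()`. A minimal sketch using the small `df` defined earlier (for a simple condition like this one, a boolean mask would usually still vectorize it):

```python
# Flag rows where A is greater than 1 and B is not "x"
flags = []
for row in df.itertuples():
    flags.append(row.A > 1 and row.B != "x")

df["flagged"] = flags
print(df[["A", "B", "flagged"]])
```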
For large-scale data, consider alternatives like Dask or PySpark if Pandas iteration becomes a bottleneck.
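As a rough illustration, the vectorized example above could be moved to Dask as follows (a minimal sketch, assuming the `dask[dataframe]` package is installed; the partition count is arbitrary):

```python
import dask.dataframe as dd

# Split the pandas DataFrame into partitions Dask can process in parallel
ddf = dd.from_pandas(df, npartitions=4)

# Same vectorized expression as before, now evaluated lazily per partition
ddf["A_squared"] = ddf["A"] ** 2

result = ddf.compute()  # materialize back into a pandas DataFrame
print(result)
```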