Data Science in Drilling - Episode 7

Zeyu Yan
May 31, 2022

Map, Apply, Agg and Transform Methods in Pandas

written by Zeyu Yan, Ph.D., Head of Data Science from Nvicta AI

Data Science in Drilling is a multi-episode series written by the technical team members at Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to give both data engineers and drilling engineers insight into state-of-the-art techniques that combine drilling engineering and data science.

This is another Pandas episode. Enjoy!

Enjoying great knowledge is just like enjoying a delicious seafood tower.

Introduction

Pandas' built-in methods map, apply, agg and transform are a common source of confusion. What is the difference between these methods? Which one should be used in a specific scenario? This article answers both questions.

What We'll Cover Today

Which methods can be used on Pandas DataFrames.
Which methods can be used on Pandas Series.
Which methods can be used together with Pandas' groupby method.

Methods Used on Pandas Series

The two methods which are usually used on Pandas Series are map and apply. To demonstrate how to use these two methods on a Pandas Series, let's first define a test Series:

import pandas as pd

x = pd.Series([1, 2, 3], index=['one', 'two', 'three'])
print(x)

The test Series looks like this:

one      1
two      2
three    3
dtype: int64

The map method can be conveniently used with a lambda function:

x.map(lambda x: 2 * x + 1)

The resulting Series is:

one      3
two      5
three    7
dtype: int64

Let's define another Series:

y = pd.Series([1.2345, 2.67234, 5.21889])
print(y)

It looks as follows:

0    1.23450
1    2.67234
2    5.21889
dtype: float64

Then round each element in the Series to two decimals using map:

y.map(lambda x: round(x, 2))

The resulting Series is:

0    1.23
1    2.67
2    5.22
dtype: float64

Besides lambda functions, custom functions can also be used with the map method. Define the following custom function:

def square(x):
    return x ** 2

Then map it to each element in the Series:

x.map(square)

The resulting Series is:

one      1
two      4
three    9
dtype: int64

The apply method can also be used on Pandas Series. For example:

x.apply(square)

The same results can be obtained:

one      1
two      4
three    9
dtype: int64

The apply method also works with lambda functions:

x.apply(lambda x: x ** 2)

The resulting Series is:

one      1
two      4
three    9
dtype: int64
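For element-wise functions like these, map and apply are interchangeable on a Series. As a quick check, the minimal sketch below compares the two results with Series.equals; the names via_map and via_apply are only illustrative and do not appear elsewhere in this article.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['one', 'two', 'three'])

def square(x):
    return x ** 2

# Apply the same element-wise function through both methods
via_map = x.map(square)
via_apply = x.apply(square)

# equals checks that values, index and dtype all match
print(via_map.equals(via_apply))  # True
```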

One benefit of the apply method is that additional arguments can be passed to the custom function through its args parameter. Say I have the following custom function:

def subtract(x, value):
    return x - value

This is how one can pass 1 as value to the subtract function:

x.apply(subtract, args=(1, ))

The resulting Series is:

one      0
two      1
three    2
dtype: int64

Let's define another customized function which can accept multiple input arguments:

def add(x, *args):
    for value in args:
        x += value
    return x

This is how one can pass multiple input arguments through apply:

x.apply(add, args=(1, 2, 3))

The resulting Series is:

one      7
two      8
three    9
dtype: int64
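Extra arguments can also be passed by keyword, since Series.apply forwards additional keyword arguments to the function. With map, a similar effect can be had by wrapping the call in a lambda. The sketch below reuses the subtract function from above; the names by_keyword and by_map are only illustrative.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['one', 'two', 'three'])

def subtract(x, value):
    return x - value

# Keyword arguments given to Series.apply are forwarded to the function
by_keyword = x.apply(subtract, value=1)

# With map, a lambda can supply the extra argument instead
by_map = x.map(lambda v: subtract(v, 1))

print(by_keyword.tolist())  # [0, 1, 2]
print(by_map.tolist())      # [0, 1, 2]
```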

Methods Used on Pandas DataFrames

The most common method used on Pandas DataFrames is the apply method. When used on a DataFrame, apply passes the data to the custom or lambda function as a Series, one column at a time by default (axis=0) or one row at a time (axis=1). First let's define a dummy DataFrame for testing:

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
print(df)

The dummy DataFrame is displayed as follows:

   A  B
0  4  9
1  4  9
2  4  9

Let's try to get the square root of each element in the DataFrame:

import numpy as np

df.apply(np.sqrt)

The resulting DataFrame is:

     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

Then let's calculate the mean value for each column of the DataFrame:

df.apply(np.mean, axis=0)

Here axis=0 means the function is applied to each column of the DataFrame, producing one value per column. The resulting Series is:

A    4.0
B    9.0
dtype: float64

If the mean value of each row needs to be calculated instead, axis=1 can be used:

df.apply(np.mean, axis=1)

The resulting Series is:

0    6.5
1    6.5
2    6.5
dtype: float64
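To see what apply actually hands to the function for each value of axis, here is a small sketch. The helper describe_input is hypothetical and only reports the type and index of its argument.

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

def describe_input(s):
    # s is a Series: one column when axis=0, one row when axis=1
    return f"{type(s).__name__} with index {list(s.index)}"

per_column = df.apply(describe_input, axis=0)
per_row = df.apply(describe_input, axis=1)

print(per_column)
print(per_row)
```

With axis=0 the function is called once per column and receives a Series indexed by the row labels 0, 1 and 2. With axis=1 it is called once per row and receives a Series indexed by the column names "A" and "B".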

Now let's say that I want to add a new column "C" to the DataFrame, where each row's value is the sum of columns "A" and "B" in that row. This can be implemented by combining the apply method with a lambda function:

df['C'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
print(df)

The resulting DataFrame is:

   A  B   C
0  4  9  13
1  4  9  13
2  4  9  13

This can also be achieved by combining the apply method with a custom function:

def custom_sum(row):
    return row['A'] + row['B']

df['D'] = df.apply(custom_sum, axis=1)
print(df)

The resulting DataFrame is:

   A  B   C   D
0  4  9  13  13
1  4  9  13  13
2  4  9  13  13
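For a simple sum like this one, adding the two columns directly gives the same values without a row-by-row function call. The sketch below compares the two approaches on the same dummy DataFrame; the names row_wise and vectorized are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

# Row-by-row sum through apply, as in the example above
row_wise = df.apply(lambda row: row['A'] + row['B'], axis=1)

# Column arithmetic: Pandas adds the two columns element by element
vectorized = df['A'] + df['B']

print(row_wise.tolist())
print(vectorized.tolist())
```

Row-wise apply remains useful when the logic for each row is too involved to express as column arithmetic.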

Methods Used with Groupby

The apply, agg and transform methods can all be used with Pandas' groupby method. Let's first see how the apply method works with the groupby method. When combined with the groupby method, apply passes the rows of each group to the custom or lambda function as a DataFrame. Define the dummy DataFrame as follows:

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'a', 'c'],
    'B': np.random.randint(10, size=5),
    'C': np.random.randint(5, size=5)
})
print(df)

The dummy DataFrame is:

Let's try to group the DataFrame by column "A" and divide each element by the sum of its column within the same group:

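# Select the numeric columns explicitly so the string column "A" stays out of the arithmetic
df.groupby('A')[['B', 'C']].apply(lambda x: x / x.sum())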

The resulting DataFrame is:

Then let's try to find the difference between the maximum and minimum values for each group:

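# Select the numeric columns explicitly so the string column "A" stays out of the arithmetic
df.groupby('A')[['B', 'C']].apply(lambda x: x.max() - x.min())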

The resulting DataFrame is:

Lastly, let's try to find the difference between the maximum value of column "C" and the minimum value of column "B" for each group:

df.groupby('A').apply(lambda x: x['C'].max() - x['B'].min())

The resulting Series is:

A
a   -1
b    0
c    3
dtype: int64
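Because the dummy DataFrame is generated randomly, the numbers differ from run to run. For a reproducible check, the sketch below uses fixed values for columns "B" and "C". These values are hypothetical, not the author's data, and were chosen so that the result matches the Series shown above.

```python
import pandas as pd

# Fixed, hypothetical values in place of the random integers
df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'a', 'c'],
    'B': [3, 7, 1, 5, 2],
    'C': [2, 7, 4, 0, 1],
})

# Each group arrives as a DataFrame, so several columns can be combined
spread = df.groupby('A')[['B', 'C']].apply(
    lambda g: g['C'].max() - g['B'].min()
)

print(spread.to_dict())  # {'a': -1, 'b': 0, 'c': 3}
```

For group "a", for instance, the maximum of column "C" is 2 and the minimum of column "B" is 3, which gives -1.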

The agg method is typically used to compute aggregate statistics, such as the minimum, maximum or sum, for each group of the DataFrame. For example, group the DataFrame by column "A" and find the minimum values of the other columns for each group:

df.groupby('A').agg('min')

The resulting DataFrame is:

Let's find both the minimum and the maximum values of the other columns for each group:

df.groupby('A').agg(['min', 'max'])

The resulting DataFrame is:

If only the minimum and the maximum values of column "B" for each group are wanted:

df.groupby('A')['B'].agg(['min', 'max'])

The resulting DataFrame is:

Lastly, different aggregations can be applied to different columns of the DataFrame by passing a Python dictionary to the agg method. Let's try to find both the minimum and the maximum values of column "B", and the sum of column "C", for each group:

df.groupby('A').agg({
    'B': ['min', 'max'],
    'C': 'sum'
})

The resulting DataFrame is:
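Since the dummy DataFrame is random, here is a sketch of the same dictionary-based aggregation on fixed, hypothetical values, so that the shape of the result can be checked. The name summary is only illustrative.

```python
import pandas as pd

# Fixed, hypothetical values in place of the random integers
df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'a', 'c'],
    'B': [3, 7, 1, 5, 2],
    'C': [2, 7, 4, 0, 1],
})

summary = df.groupby('A').agg({
    'B': ['min', 'max'],
    'C': 'sum'
})

print(summary)
print(summary.columns.tolist())  # [('B', 'min'), ('B', 'max'), ('C', 'sum')]
```

The resulting columns form a two-level index, so a single aggregated column is selected with a tuple, for example summary[('B', 'max')].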

The transform method is similar to the apply method, but it passes each column of each group to the function separately, as a Series, rather than the whole group at once. The output format of transform also differs from that of apply. Let's rerun one of the previous examples using transform instead of apply:

df.groupby('A').transform(lambda x: x.max() - x.min())

The resulting DataFrame is:

The main difference here is that the DataFrame produced by apply had 3 rows, one per group, while the DataFrame produced by transform has the same number of rows as the original DataFrame, with each group's result broadcast to every row of that group. Here is another example using the transform method:

df.groupby('A').transform(lambda x: (x - x.mean()) / x.std())

The resulting DataFrame is:

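The NaN values appear because group "b" contains only one element. Pandas computes the sample standard deviation by default (ddof=1), which is undefined for a single value, so x.std() returns NaN for that group and the division yields NaN as well.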
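To make the row counts and the NaN behaviour concrete, here is a small sketch on fixed, hypothetical values. The numeric columns are selected explicitly, and the names per_group, per_row and zscore are only illustrative.

```python
import pandas as pd

# Fixed, hypothetical values in place of the random integers
df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'a', 'c'],
    'B': [3, 7, 1, 5, 2],
    'C': [2, 7, 4, 0, 1],
})

grouped = df.groupby('A')[['B', 'C']]

# apply returns one row per group
per_group = grouped.apply(lambda x: x.max() - x.min())

# transform returns one row per original row
per_row = grouped.transform(lambda x: x.max() - x.min())

print(len(per_group), len(per_row))  # 3 5

# Group "b" has a single row, so its sample standard deviation is NaN
zscore = grouped.transform(lambda x: (x - x.mean()) / x.std())
print(zscore.loc[1])
```

Row 1 is the only member of group "b", so both of its standardized values come out as NaN.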

Conclusions

In this article, we went through how to use Pandas' built-in methods map, apply, agg and transform in different scenarios. More Pandas techniques will be covered in future episodes. Stay tuned!

Get in Touch

Thank you for reading! Please let us know if you like this series or if you have critiques. If this series was helpful to you, please follow us and share it with your friends.

If you or your company needs help with projects related to drilling automation and optimization, AI, or data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!
