20  Pandas中其他分组操作

20.1 介绍

在我们之前的课程中,您已经学习了如何使用 groupby()agg() 从分组中提取汇总统计。现在,我们将更进一步,探索一些额外有用的分组数据转换。让我们开始吧。

20.2 学习目标

在本课程结束时,您将能够:

  • 使用 transform() 将分组级别的汇总统计添加为新列。
  • 使用 value_counts() 统计组内的值。
  • 计算组内的累积和。

20.3 导入

运行以下单元以导入必要的库:

import pandas as pd
import vega_datasets as vd
import plotly.express as px
import warnings
import calendar

20.4 数据

我们将使用 weather 数据集作为例子。

weather_raw = vd.data.seattle_weather()

# 使用 query 选择2012年的数据,并添加一个月份列
weather = weather_raw.query("date.dt.year == 2012")
weather["month"] = pd.Categorical(
    weather["date"].dt.strftime("%B"),
    categories=list(calendar.month_name[1:]),
    ordered=True,
)
weather
/var/folders/vr/shb6ffvj2rl61kh7qqczhrgh0000gp/T/ipykernel_11107/375086048.py:5: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
date precipitation temp_max temp_min wind weather month
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle January
1 2012-01-02 10.9 10.6 2.8 4.5 rain January
2 2012-01-03 0.8 11.7 7.2 2.3 rain January
3 2012-01-04 20.3 12.2 5.6 4.7 rain January
4 2012-01-05 1.3 8.9 2.8 6.1 rain January
... ... ... ... ... ... ... ...
361 2012-12-27 4.1 7.8 3.3 3.2 rain December
362 2012-12-28 0.0 8.3 3.9 1.7 rain December
363 2012-12-29 1.5 5.0 3.3 1.7 rain December
364 2012-12-30 0.0 4.4 0.0 1.8 drizzle December
365 2012-12-31 0.0 3.3 -1.1 2.0 drizzle December

366 rows × 7 columns

现在让我们设置本课程其余部分的显示选项:

pd.options.display.max_rows = 20

并且让我们忽略在使用当前版本的 pandas 处理分类数据时出现的警告:

warnings.filterwarnings(
    "ignore"
)  ## 有一类关于使用当前版本 pandas 处理分类数据时出现的警告,我们可以忽略

20.5 使用 transform() 添加汇总统计

在上一课中,您学习了如何使用 agg() 计算平均值、中位数或标准差等汇总统计。

例如,要计算每个月的平均降水量(雨+雪),您可以使用:

weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))
mean_precip
month
January 5.590323
February 3.182759
March 5.903226
April 2.270000
May 1.683871
... ...
August 0.000000
September 0.030000
October 5.493548
November 7.016667
December 5.612903

12 rows × 1 columns

有时,我们希望将这些组级别的统计添加为原始 DataFrame 的新列。我们无法直接使用 agg() 的输出来实现:

# 无法工作
weather['mean_precip'] = weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))
weather
date precipitation temp_max temp_min wind weather month mean_precip
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle January NaN
1 2012-01-02 10.9 10.6 2.8 4.5 rain January NaN
2 2012-01-03 0.8 11.7 7.2 2.3 rain January NaN
3 2012-01-04 20.3 12.2 5.6 4.7 rain January NaN
4 2012-01-05 1.3 8.9 2.8 6.1 rain January NaN
... ... ... ... ... ... ... ... ...
361 2012-12-27 4.1 7.8 3.3 3.2 rain December NaN
362 2012-12-28 0.0 8.3 3.9 1.7 rain December NaN
363 2012-12-29 1.5 5.0 3.3 1.7 rain December NaN
364 2012-12-30 0.0 4.4 0.0 1.8 drizzle December NaN
365 2012-12-31 0.0 3.3 -1.1 2.0 drizzle December NaN

366 rows × 8 columns

但我们可以使用 transform() 来实现。transform() 将输出重新整形以匹配原始 DataFrame 的形状,允许我们将组级别的统计添加为新列。

weather['mean_precip_month'] = weather.groupby('month')['precipitation'].transform('mean')
weather
date precipitation temp_max temp_min wind weather month mean_precip mean_precip_month
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle January NaN 5.590323
1 2012-01-02 10.9 10.6 2.8 4.5 rain January NaN 5.590323
2 2012-01-03 0.8 11.7 7.2 2.3 rain January NaN 5.590323
3 2012-01-04 20.3 12.2 5.6 4.7 rain January NaN 5.590323
4 2012-01-05 1.3 8.9 2.8 6.1 rain January NaN 5.590323
... ... ... ... ... ... ... ... ... ...
361 2012-12-27 4.1 7.8 3.3 3.2 rain December NaN 5.612903
362 2012-12-28 0.0 8.3 3.9 1.7 rain December NaN 5.612903
363 2012-12-29 1.5 5.0 3.3 1.7 rain December NaN 5.612903
364 2012-12-30 0.0 4.4 0.0 1.8 drizzle December NaN 5.612903
365 2012-12-31 0.0 3.3 -1.1 2.0 drizzle December NaN 5.612903

366 rows × 9 columns

您可以以类似方式计算其他统计。例如,要计算每个月的降水中位数,您可以使用:

weather['prep_median_month'] = weather.groupby('month')['precipitation'].transform('median')    
weather
date precipitation temp_max temp_min wind weather month mean_precip mean_precip_month prep_median_month
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle January NaN 5.590323 3.0
1 2012-01-02 10.9 10.6 2.8 4.5 rain January NaN 5.590323 3.0
2 2012-01-03 0.8 11.7 7.2 2.3 rain January NaN 5.590323 3.0
3 2012-01-04 20.3 12.2 5.6 4.7 rain January NaN 5.590323 3.0
4 2012-01-05 1.3 8.9 2.8 6.1 rain January NaN 5.590323 3.0
... ... ... ... ... ... ... ... ... ... ...
361 2012-12-27 4.1 7.8 3.3 3.2 rain December NaN 5.612903 3.3
362 2012-12-28 0.0 8.3 3.9 1.7 rain December NaN 5.612903 3.3
363 2012-12-29 1.5 5.0 3.3 1.7 rain December NaN 5.612903 3.3
364 2012-12-30 0.0 4.4 0.0 1.8 drizzle December NaN 5.612903 3.3
365 2012-12-31 0.0 3.3 -1.1 2.0 drizzle December NaN 5.612903 3.3

366 rows × 10 columns

或者计算每个月的降水总和:

weather['precip_sum_month'] = weather.groupby('month')['precipitation'].transform('sum')
weather
date precipitation temp_max temp_min wind weather month mean_precip mean_precip_month prep_median_month precip_sum_month
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle January NaN 5.590323 3.0 173.3
1 2012-01-02 10.9 10.6 2.8 4.5 rain January NaN 5.590323 3.0 173.3
2 2012-01-03 0.8 11.7 7.2 2.3 rain January NaN 5.590323 3.0 173.3
3 2012-01-04 20.3 12.2 5.6 4.7 rain January NaN 5.590323 3.0 173.3
4 2012-01-05 1.3 8.9 2.8 6.1 rain January NaN 5.590323 3.0 173.3
... ... ... ... ... ... ... ... ... ... ... ...
361 2012-12-27 4.1 7.8 3.3 3.2 rain December NaN 5.612903 3.3 174.0
362 2012-12-28 0.0 8.3 3.9 1.7 rain December NaN 5.612903 3.3 174.0
363 2012-12-29 1.5 5.0 3.3 1.7 rain December NaN 5.612903 3.3 174.0
364 2012-12-30 0.0 4.4 0.0 1.8 drizzle December NaN 5.612903 3.3 174.0
365 2012-12-31 0.0 3.3 -1.1 2.0 drizzle December NaN 5.612903 3.3 174.0

366 rows × 11 columns

有了总和,我们可以轻松计算该月降水量在每天中所占的比例:

weather["precip_month_prop"] = weather["precipitation"] / weather["precip_sum_month"]
weather
date precipitation temp_max temp_min wind weather month mean_precip mean_precip_month prep_median_month precip_sum_month precip_month_prop
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle January NaN 5.590323 3.0 173.3 0.000000
1 2012-01-02 10.9 10.6 2.8 4.5 rain January NaN 5.590323 3.0 173.3 0.062897
2 2012-01-03 0.8 11.7 7.2 2.3 rain January NaN 5.590323 3.0 173.3 0.004616
3 2012-01-04 20.3 12.2 5.6 4.7 rain January NaN 5.590323 3.0 173.3 0.117138
4 2012-01-05 1.3 8.9 2.8 6.1 rain January NaN 5.590323 3.0 173.3 0.007501
... ... ... ... ... ... ... ... ... ... ... ... ...
361 2012-12-27 4.1 7.8 3.3 3.2 rain December NaN 5.612903 3.3 174.0 0.023563
362 2012-12-28 0.0 8.3 3.9 1.7 rain December NaN 5.612903 3.3 174.0 0.000000
363 2012-12-29 1.5 5.0 3.3 1.7 rain December NaN 5.612903 3.3 174.0 0.008621
364 2012-12-30 0.0 4.4 0.0 1.8 drizzle December NaN 5.612903 3.3 174.0 0.000000
365 2012-12-31 0.0 3.3 -1.1 2.0 drizzle December NaN 5.612903 3.3 174.0 0.000000

366 rows × 12 columns

练习

20.6 练习题:每日小费比例

使用 tips 数据集,计算: 1. 一个包含每天总小费的新列 daily_total_tips 2. 一个显示每位顾客该日总小费的比例的新列 tip_proportion

# Your code here:
tips = px.data.tips()
tips
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns

您的输出数据的前几行应类似于:

total_bill    tip     sex      smoker    day    time      size    daily_total_tips    tip_proportion
16.99         1.01    Female   No        Sun    Dinner    2       247.39              0.004083
10.34         1.66    Male     No        Sun    Dinner    3       247.39              0.006710
21.01         3.50    Male     No        Sun    Dinner    3       247.39              0.014148
23.68         3.31    Male     No        Sun    Dinner    2       247.39              0.013380

让我们重新初始化 weather DataFrame 为一个较小的列集合,用于本课程的其余部分:

weather = weather[['date', 'month', 'precipitation', 'wind', 'weather']]
weather
date month precipitation wind weather
0 2012-01-01 January 0.0 4.7 drizzle
1 2012-01-02 January 10.9 4.5 rain
2 2012-01-03 January 0.8 2.3 rain
3 2012-01-04 January 20.3 4.7 rain
4 2012-01-05 January 1.3 6.1 rain
... ... ... ... ... ...
361 2012-12-27 December 4.1 3.2 rain
362 2012-12-28 December 0.0 1.7 rain
363 2012-12-29 December 1.5 1.7 rain
364 2012-12-30 December 0.0 1.8 drizzle
365 2012-12-31 December 0.0 2.0 drizzle

366 rows × 5 columns

20.7 使用 value_counts() 统计组内的值

统计分类变量在组内的出现次数可以揭示有趣的模式,通常需要在使用 groupby() 后进行。

首先,让我们回顾一下 value_counts() 在整个 DataFrame 上的工作方式。

# 天气类型的计数
weather["weather"].value_counts()
weather
rain       191
sun        118
drizzle     31
snow        21
fog          5
Name: count, dtype: int64

我们可以添加 normalize=True 来获得比例:

weather['weather'].value_counts(normalize=True)
weather
rain       0.521858
sun        0.322404
drizzle    0.084699
snow       0.057377
fog        0.013661
Name: proportion, dtype: float64

现在,要统计每个月的天气类型,我们首先按 month 分组,然后选择 weather 列并应用 value_counts()

# 每个月天气类型的计数
weather.groupby('month')['weather'].value_counts()
month     weather
January   rain       18
          snow        7
          sun         4
          drizzle     2
          fog         0
                     ..
December  rain       23
          snow        5
          drizzle     2
          sun         1
          fog         0
Name: count, Length: 60, dtype: int64

这将返回一个具有多重索引的 Series,可以使用 reset_index() 转换为常规的 DataFrame:

weather.groupby('month')['weather'].value_counts().reset_index()
month weather count
0 January rain 18
1 January snow 7
2 January sun 4
3 January drizzle 2
4 January fog 0
... ... ... ...
55 December rain 23
56 December snow 5
57 December drizzle 2
58 December sun 1
59 December fog 0

60 rows × 3 columns

练习

20.8 练习题:按日统计吸烟者和非吸烟者人数

使用 tips 数据集,统计每一天吸烟者和非吸烟者的数量。

tips = px.data.tips()
tips

# Your code here:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns

您的结果的前几行应类似于:

day smoker  count
Fri Yes 15
Fri No  4
Sat No  45

20.9 在组内计算累积和

累积和有助于跟踪组内的运行总计。这是一种常用的操作。让我们看看如何对分组数据执行此操作。

回顾一下,以下是如何计算整个 DataFrame 的降水累积和:

# 降水的累积和
weather["precip_cumul"] = weather["precipitation"].cumsum()
weather
date month precipitation wind weather precip_cumul
0 2012-01-01 January 0.0 4.7 drizzle 0.0
1 2012-01-02 January 10.9 4.5 rain 10.9
2 2012-01-03 January 0.8 2.3 rain 11.7
3 2012-01-04 January 20.3 4.7 rain 32.0
4 2012-01-05 January 1.3 6.1 rain 33.3
... ... ... ... ... ... ...
361 2012-12-27 December 4.1 3.2 rain 1224.5
362 2012-12-28 December 0.0 1.7 rain 1224.5
363 2012-12-29 December 1.5 1.7 rain 1226.0
364 2012-12-30 December 0.0 1.8 drizzle 1226.0
365 2012-12-31 December 0.0 2.0 drizzle 1226.0

366 rows × 6 columns

要计算每个月的降水累积和,我们可以使用 groupby()cumsum():

# 每个月的降水累积和
weather["precip_cumul"] = weather.groupby("month")["precipitation"].cumsum()
weather
date month precipitation wind weather precip_cumul
0 2012-01-01 January 0.0 4.7 drizzle 0.0
1 2012-01-02 January 10.9 4.5 rain 10.9
2 2012-01-03 January 0.8 2.3 rain 11.7
3 2012-01-04 January 20.3 4.7 rain 32.0
4 2012-01-05 January 1.3 6.1 rain 33.3
... ... ... ... ... ... ...
361 2012-12-27 December 4.1 3.2 rain 172.5
362 2012-12-28 December 0.0 1.7 rain 172.5
363 2012-12-29 December 1.5 1.7 rain 174.0
364 2012-12-30 December 0.0 1.8 drizzle 174.0
365 2012-12-31 December 0.0 2.0 drizzle 174.0

366 rows × 6 columns

20.10 练习题:按日累积小费金额

使用 tips 数据集,计算每个 daytotal_bill 累积和,添加一个新列 cumul_total_bill_day。然后添加另一列 cumul_tip_day,包含每个 daytip 累积和。

tips = px.data.tips()
tips = tips.sort_values('day')
tips
# Your code here:
total_bill tip sex smoker day time size
96 27.28 4.00 Male Yes Fri Dinner 2
101 15.38 3.00 Female Yes Fri Dinner 2
98 21.01 3.00 Male Yes Fri Dinner 2
97 12.03 1.50 Male Yes Fri Dinner 2
95 40.17 4.73 Male Yes Fri Dinner 4
... ... ... ... ... ... ... ...
132 11.17 1.50 Female No Thur Lunch 2
131 20.27 2.83 Female No Thur Lunch 2
130 19.08 1.50 Male No Thur Lunch 2
128 11.38 2.00 Female No Thur Lunch 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns

20.11 总结

在本课程中,您学习了 pandas 中几个强大的 分组级数据转换:

  • 添加汇总统计:使用 transform() 将分组级别的计算添加为新列
  • 组内计数:使用 value_counts() 统计组内的出现次数
  • 计算累积和:跟踪组内的运行总计

这些技术使您能够分析数据中特定子集中的模式和统计信息。继续练习不同的数据集,以提升您的数据操作技能!

下次见!