import pandas as pd
import vega_datasets as vd
import plotly.express as px
import warnings
import calendar
20 Pandas中其他分组操作
20.1 介绍
在我们之前的课程中,您已经学习了如何使用 groupby()
和 agg()
从分组中提取汇总统计。现在,我们将更进一步,探索一些额外有用的分组数据转换。让我们开始吧。
20.2 学习目标
在本课程结束时,您将能够:
- 使用
transform()
将分组级别的汇总统计添加为新列。 - 使用
value_counts()
统计组内的值。 - 计算组内的累积和。
20.3 导入
运行以下单元以导入必要的库:
20.4 数据
我们将使用 weather
数据集作为例子。
= vd.data.seattle_weather()
weather_raw
# 使用 query 选择2012年的数据,并添加一个月份列
= weather_raw.query("date.dt.year == 2012")
weather "month"] = pd.Categorical(
weather["date"].dt.strftime("%B"),
weather[=list(calendar.month_name[1:]),
categories=True,
ordered
) weather
/var/folders/vr/shb6ffvj2rl61kh7qqczhrgh0000gp/T/ipykernel_11107/375086048.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
date | precipitation | temp_max | temp_min | wind | weather | month | |
---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January |
... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December |
366 rows × 7 columns
现在让我们设置本课程其余部分的显示选项:
= 20 pd.options.display.max_rows
并且让我们忽略在使用当前版本的 pandas 处理分类数据时出现的警告:
warnings.filterwarnings("ignore"
## 有一类关于使用当前版本 pandas 处理分类数据时出现的警告,我们可以忽略 )
20.5 使用 transform()
添加汇总统计
在上一课中,您学习了如何使用 agg()
计算平均值、中位数或标准差等汇总统计。
例如,要计算每个月的平均降水量(雨+雪),您可以使用:
'month').agg(mean_precip = ('precipitation', 'mean')) weather.groupby(
mean_precip | |
---|---|
month | |
January | 5.590323 |
February | 3.182759 |
March | 5.903226 |
April | 2.270000 |
May | 1.683871 |
... | ... |
August | 0.000000 |
September | 0.030000 |
October | 5.493548 |
November | 7.016667 |
December | 5.612903 |
12 rows × 1 columns
有时,我们希望将这些组级别的统计添加为原始 DataFrame 的新列。我们无法直接使用 agg()
的输出来实现:
# 无法工作
'mean_precip'] = weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))
weather[ weather
date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | |
---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN |
366 rows × 8 columns
但我们可以使用 transform()
来实现。transform()
将输出重新整形以匹配原始 DataFrame 的形状,允许我们将组级别的统计添加为新列。
'mean_precip_month'] = weather.groupby('month')['precipitation'].transform('mean')
weather[ weather
date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | |
---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 |
366 rows × 9 columns
您可以以类似方式计算其他统计。例如,要计算每个月的降水中位数,您可以使用:
'prep_median_month'] = weather.groupby('month')['precipitation'].transform('median')
weather[ weather
date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | prep_median_month | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 3.0 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 3.0 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 3.0 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 3.0 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 3.3 |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 3.3 |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 3.3 |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 3.3 |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 3.3 |
366 rows × 10 columns
或者计算每个月的降水总和:
'precip_sum_month'] = weather.groupby('month')['precipitation'].transform('sum')
weather[ weather
date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | prep_median_month | precip_sum_month | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 3.0 | 173.3 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 3.0 | 173.3 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 3.0 | 173.3 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 3.0 | 173.3 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 3.0 | 173.3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 3.3 | 174.0 |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 |
366 rows × 11 columns
有了总和,我们可以轻松计算该月降水量在每天中所占的比例:
"precip_month_prop"] = weather["precipitation"] / weather["precip_sum_month"]
weather[ weather
date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | prep_median_month | precip_sum_month | precip_month_prop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 3.0 | 173.3 | 0.000000 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.062897 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.004616 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.117138 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.007501 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 0.023563 |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 0.000000 |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 0.008621 |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 | 0.000000 |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 | 0.000000 |
366 rows × 12 columns
20.6 练习题:每日小费比例
使用 tips
数据集,计算: 1. 一个包含每天总小费的新列 daily_total_tips
2. 一个显示每位顾客该日总小费的比例的新列 tip_proportion
# Your code here:
= px.data.tips()
tips tips
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
您的输出数据的前几行应类似于:
total_bill tip sex smoker day time size daily_total_tips tip_proportion
16.99 1.01 Female No Sun Dinner 2 247.39 0.004083
10.34 1.66 Male No Sun Dinner 3 247.39 0.006710
21.01 3.50 Male No Sun Dinner 3 247.39 0.014148
23.68 3.31 Male No Sun Dinner 2 247.39 0.013380
让我们重新初始化 weather
DataFrame 为一个较小的列集合,用于本课程的其余部分:
= weather[['date', 'month', 'precipitation', 'wind', 'weather']]
weather weather
date | month | precipitation | wind | weather | |
---|---|---|---|---|---|
0 | 2012-01-01 | January | 0.0 | 4.7 | drizzle |
1 | 2012-01-02 | January | 10.9 | 4.5 | rain |
2 | 2012-01-03 | January | 0.8 | 2.3 | rain |
3 | 2012-01-04 | January | 20.3 | 4.7 | rain |
4 | 2012-01-05 | January | 1.3 | 6.1 | rain |
... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | December | 4.1 | 3.2 | rain |
362 | 2012-12-28 | December | 0.0 | 1.7 | rain |
363 | 2012-12-29 | December | 1.5 | 1.7 | rain |
364 | 2012-12-30 | December | 0.0 | 1.8 | drizzle |
365 | 2012-12-31 | December | 0.0 | 2.0 | drizzle |
366 rows × 5 columns
20.7 使用 value_counts()
统计组内的值
统计分类变量在组内的出现次数可以揭示有趣的模式,通常需要在使用 groupby()
后进行。
首先,让我们回顾一下 value_counts()
在整个 DataFrame 上的工作方式。
# 天气类型的计数
"weather"].value_counts() weather[
weather
rain 191
sun 118
drizzle 31
snow 21
fog 5
Name: count, dtype: int64
我们可以添加 normalize=True
来获得比例:
'weather'].value_counts(normalize=True) weather[
weather
rain 0.521858
sun 0.322404
drizzle 0.084699
snow 0.057377
fog 0.013661
Name: proportion, dtype: float64
现在,要统计每个月的天气类型,我们首先按 month
分组,然后选择 weather
列并应用 value_counts()
。
# 每个月天气类型的计数
'month')['weather'].value_counts() weather.groupby(
month weather
January rain 18
snow 7
sun 4
drizzle 2
fog 0
..
December rain 23
snow 5
drizzle 2
sun 1
fog 0
Name: count, Length: 60, dtype: int64
这将返回一个具有多重索引的 Series,可以使用 reset_index()
转换为常规的 DataFrame:
'month')['weather'].value_counts().reset_index() weather.groupby(
month | weather | count | |
---|---|---|---|
0 | January | rain | 18 |
1 | January | snow | 7 |
2 | January | sun | 4 |
3 | January | drizzle | 2 |
4 | January | fog | 0 |
... | ... | ... | ... |
55 | December | rain | 23 |
56 | December | snow | 5 |
57 | December | drizzle | 2 |
58 | December | sun | 1 |
59 | December | fog | 0 |
60 rows × 3 columns
20.8 练习题:按日统计吸烟者和非吸烟者人数
使用 tips
数据集,统计每一天吸烟者和非吸烟者的数量。
= px.data.tips()
tips
tips
# Your code here:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
您的结果的前几行应类似于:
day smoker count
Fri Yes 15
Fri No 4
Sat No 45
20.9 在组内计算累积和
累积和有助于跟踪组内的运行总计。这是一种常用的操作。让我们看看如何对分组数据执行此操作。
回顾一下,以下是如何计算整个 DataFrame 的降水累积和:
# 降水的累积和
"precip_cumul"] = weather["precipitation"].cumsum()
weather[ weather
date | month | precipitation | wind | weather | precip_cumul | |
---|---|---|---|---|---|---|
0 | 2012-01-01 | January | 0.0 | 4.7 | drizzle | 0.0 |
1 | 2012-01-02 | January | 10.9 | 4.5 | rain | 10.9 |
2 | 2012-01-03 | January | 0.8 | 2.3 | rain | 11.7 |
3 | 2012-01-04 | January | 20.3 | 4.7 | rain | 32.0 |
4 | 2012-01-05 | January | 1.3 | 6.1 | rain | 33.3 |
... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | December | 4.1 | 3.2 | rain | 1224.5 |
362 | 2012-12-28 | December | 0.0 | 1.7 | rain | 1224.5 |
363 | 2012-12-29 | December | 1.5 | 1.7 | rain | 1226.0 |
364 | 2012-12-30 | December | 0.0 | 1.8 | drizzle | 1226.0 |
365 | 2012-12-31 | December | 0.0 | 2.0 | drizzle | 1226.0 |
366 rows × 6 columns
要计算每个月的降水累积和,我们可以使用 groupby()
和 cumsum()
:
# 每个月的降水累积和
"precip_cumul"] = weather.groupby("month")["precipitation"].cumsum()
weather[ weather
date | month | precipitation | wind | weather | precip_cumul | |
---|---|---|---|---|---|---|
0 | 2012-01-01 | January | 0.0 | 4.7 | drizzle | 0.0 |
1 | 2012-01-02 | January | 10.9 | 4.5 | rain | 10.9 |
2 | 2012-01-03 | January | 0.8 | 2.3 | rain | 11.7 |
3 | 2012-01-04 | January | 20.3 | 4.7 | rain | 32.0 |
4 | 2012-01-05 | January | 1.3 | 6.1 | rain | 33.3 |
... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | December | 4.1 | 3.2 | rain | 172.5 |
362 | 2012-12-28 | December | 0.0 | 1.7 | rain | 172.5 |
363 | 2012-12-29 | December | 1.5 | 1.7 | rain | 174.0 |
364 | 2012-12-30 | December | 0.0 | 1.8 | drizzle | 174.0 |
365 | 2012-12-31 | December | 0.0 | 2.0 | drizzle | 174.0 |
366 rows × 6 columns
20.10 练习题:按日累积小费金额
使用 tips
数据集,计算每个 day
的 total_bill
累积和,添加一个新列 cumul_total_bill_day
。然后添加另一列 cumul_tip_day
,包含每个 day
的 tip
累积和。
= px.data.tips()
tips = tips.sort_values('day')
tips
tips# Your code here:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
96 | 27.28 | 4.00 | Male | Yes | Fri | Dinner | 2 |
101 | 15.38 | 3.00 | Female | Yes | Fri | Dinner | 2 |
98 | 21.01 | 3.00 | Male | Yes | Fri | Dinner | 2 |
97 | 12.03 | 1.50 | Male | Yes | Fri | Dinner | 2 |
95 | 40.17 | 4.73 | Male | Yes | Fri | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
132 | 11.17 | 1.50 | Female | No | Thur | Lunch | 2 |
131 | 20.27 | 2.83 | Female | No | Thur | Lunch | 2 |
130 | 19.08 | 1.50 | Male | No | Thur | Lunch | 2 |
128 | 11.38 | 2.00 | Female | No | Thur | Lunch | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
20.11 总结
在本课程中,您学习了 pandas 中几个强大的 分组级数据转换:
- 添加汇总统计:使用
transform()
将分组级别的计算添加为新列 - 组内计数:使用
value_counts()
统计组内的出现次数 - 计算累积和:跟踪组内的运行总计
这些技术使您能够分析数据中特定子集中的模式和统计信息。继续练习不同的数据集,以提升您的数据操作技能!
下次见!