20 Pandas中其他分组操作 – 用 Python 介绍数据科学

20.1 介绍

在我们之前的课程中,您已经学习了如何使用 groupby() 和 agg() 从分组中提取汇总统计。现在,我们将更进一步,探索一些额外有用的分组数据转换。让我们开始吧。

20.2 学习目标

在本课程结束时,您将能够:

使用 transform() 将分组级别的汇总统计添加为新列。
使用 value_counts() 统计组内的值。
计算组内的累积和。

20.3 导入

运行以下单元以导入必要的库:

import pandas as pd
import vega_datasets as vd
import plotly.express as px
import warnings
import calendar

20.4 数据

我们将使用 weather 数据集作为例子。

weather_raw = vd.data.seattle_weather()

# 使用 query 选择2012年的数据,并添加一个月份列
weather = weather_raw.query("date.dt.year == 2012")
weather["month"] = pd.Categorical(
    weather["date"].dt.strftime("%B"),
    categories=list(calendar.month_name[1:]),
    ordered=True,
)
weather

/var/folders/vr/shb6ffvj2rl61kh7qqczhrgh0000gp/T/ipykernel_11107/375086048.py:5: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

	date	precipitation	temp_max	temp_min	wind	weather	month
0	2012-01-01	0.0	12.8	5.0	4.7	drizzle	January
1	2012-01-02	10.9	10.6	2.8	4.5	rain	January
2	2012-01-03	0.8	11.7	7.2	2.3	rain	January
3	2012-01-04	20.3	12.2	5.6	4.7	rain	January
4	2012-01-05	1.3	8.9	2.8	6.1	rain	January
...	...	...	...	...	...	...	...
361	2012-12-27	4.1	7.8	3.3	3.2	rain	December
362	2012-12-28	0.0	8.3	3.9	1.7	rain	December
363	2012-12-29	1.5	5.0	3.3	1.7	rain	December
364	2012-12-30	0.0	4.4	0.0	1.8	drizzle	December
365	2012-12-31	0.0	3.3	-1.1	2.0	drizzle	December

366 rows × 7 columns

现在让我们设置本课程其余部分的显示选项:

pd.options.display.max_rows = 20

并且让我们忽略在使用当前版本的 pandas 处理分类数据时出现的警告:

warnings.filterwarnings(
    "ignore"
)  ## 有一类关于使用当前版本 pandas 处理分类数据时出现的警告,我们可以忽略

20.5 使用 `transform()` 添加汇总统计

在上一课中,您学习了如何使用 agg() 计算平均值、中位数或标准差等汇总统计。

例如,要计算每个月的平均降水量(雨+雪),您可以使用:

weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))

	mean_precip
month
January	5.590323
February	3.182759
March	5.903226
April	2.270000
May	1.683871
...	...
August	0.000000
September	0.030000
October	5.493548
November	7.016667
December	5.612903

12 rows × 1 columns

有时,我们希望将这些组级别的统计添加为原始 DataFrame 的新列。我们无法直接使用 agg() 的输出来实现:

# 无法工作
weather['mean_precip'] = weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))
weather

	date	precipitation	temp_max	temp_min	wind	weather	month	mean_precip
0	2012-01-01	0.0	12.8	5.0	4.7	drizzle	January	NaN
1	2012-01-02	10.9	10.6	2.8	4.5	rain	January	NaN
2	2012-01-03	0.8	11.7	7.2	2.3	rain	January	NaN
3	2012-01-04	20.3	12.2	5.6	4.7	rain	January	NaN
4	2012-01-05	1.3	8.9	2.8	6.1	rain	January	NaN
...	...	...	...	...	...	...	...	...
361	2012-12-27	4.1	7.8	3.3	3.2	rain	December	NaN
362	2012-12-28	0.0	8.3	3.9	1.7	rain	December	NaN
363	2012-12-29	1.5	5.0	3.3	1.7	rain	December	NaN
364	2012-12-30	0.0	4.4	0.0	1.8	drizzle	December	NaN
365	2012-12-31	0.0	3.3	-1.1	2.0	drizzle	December	NaN

366 rows × 8 columns

但我们可以使用 transform() 来实现。transform() 将输出重新整形以匹配原始 DataFrame 的形状,允许我们将组级别的统计添加为新列。

weather['mean_precip_month'] = weather.groupby('month')['precipitation'].transform('mean')
weather

	date	precipitation	temp_max	temp_min	wind	weather	month	mean_precip	mean_precip_month
0	2012-01-01	0.0	12.8	5.0	4.7	drizzle	January	NaN	5.590323
1	2012-01-02	10.9	10.6	2.8	4.5	rain	January	NaN	5.590323
2	2012-01-03	0.8	11.7	7.2	2.3	rain	January	NaN	5.590323
3	2012-01-04	20.3	12.2	5.6	4.7	rain	January	NaN	5.590323
4	2012-01-05	1.3	8.9	2.8	6.1	rain	January	NaN	5.590323
...	...	...	...	...	...	...	...	...	...
361	2012-12-27	4.1	7.8	3.3	3.2	rain	December	NaN	5.612903
362	2012-12-28	0.0	8.3	3.9	1.7	rain	December	NaN	5.612903
363	2012-12-29	1.5	5.0	3.3	1.7	rain	December	NaN	5.612903
364	2012-12-30	0.0	4.4	0.0	1.8	drizzle	December	NaN	5.612903
365	2012-12-31	0.0	3.3	-1.1	2.0	drizzle	December	NaN	5.612903

366 rows × 9 columns

您可以以类似方式计算其他统计。例如,要计算每个月的降水中位数,您可以使用:

weather['prep_median_month'] = weather.groupby('month')['precipitation'].transform('median')    
weather

	date	precipitation	temp_max	temp_min	wind	weather	month	mean_precip	mean_precip_month	prep_median_month
0	2012-01-01	0.0	12.8	5.0	4.7	drizzle	January	NaN	5.590323	3.0
1	2012-01-02	10.9	10.6	2.8	4.5	rain	January	NaN	5.590323	3.0
2	2012-01-03	0.8	11.7	7.2	2.3	rain	January	NaN	5.590323	3.0
3	2012-01-04	20.3	12.2	5.6	4.7	rain	January	NaN	5.590323	3.0
4	2012-01-05	1.3	8.9	2.8	6.1	rain	January	NaN	5.590323	3.0
...	...	...	...	...	...	...	...	...	...	...
361	2012-12-27	4.1	7.8	3.3	3.2	rain	December	NaN	5.612903	3.3
362	2012-12-28	0.0	8.3	3.9	1.7	rain	December	NaN	5.612903	3.3
363	2012-12-29	1.5	5.0	3.3	1.7	rain	December	NaN	5.612903	3.3
364	2012-12-30	0.0	4.4	0.0	1.8	drizzle	December	NaN	5.612903	3.3
365	2012-12-31	0.0	3.3	-1.1	2.0	drizzle	December	NaN	5.612903	3.3

366 rows × 10 columns

或者计算每个月的降水总和:

weather['precip_sum_month'] = weather.groupby('month')['precipitation'].transform('sum')
weather

	date	precipitation	temp_max	temp_min	wind	weather	month	mean_precip	mean_precip_month	prep_median_month	precip_sum_month
0	2012-01-01	0.0	12.8	5.0	4.7	drizzle	January	NaN	5.590323	3.0	173.3
1	2012-01-02	10.9	10.6	2.8	4.5	rain	January	NaN	5.590323	3.0	173.3
2	2012-01-03	0.8	11.7	7.2	2.3	rain	January	NaN	5.590323	3.0	173.3
3	2012-01-04	20.3	12.2	5.6	4.7	rain	January	NaN	5.590323	3.0	173.3
4	2012-01-05	1.3	8.9	2.8	6.1	rain	January	NaN	5.590323	3.0	173.3
...	...	...	...	...	...	...	...	...	...	...	...
361	2012-12-27	4.1	7.8	3.3	3.2	rain	December	NaN	5.612903	3.3	174.0
362	2012-12-28	0.0	8.3	3.9	1.7	rain	December	NaN	5.612903	3.3	174.0
363	2012-12-29	1.5	5.0	3.3	1.7	rain	December	NaN	5.612903	3.3	174.0
364	2012-12-30	0.0	4.4	0.0	1.8	drizzle	December	NaN	5.612903	3.3	174.0
365	2012-12-31	0.0	3.3	-1.1	2.0	drizzle	December	NaN	5.612903	3.3	174.0

366 rows × 11 columns

有了总和,我们可以轻松计算该月降水量在每天中所占的比例:

weather["precip_month_prop"] = weather["precipitation"] / weather["precip_sum_month"]
weather

	date	precipitation	temp_max	temp_min	wind	weather	month	mean_precip	mean_precip_month	prep_median_month	precip_sum_month	precip_month_prop
0	2012-01-01	0.0	12.8	5.0	4.7	drizzle	January	NaN	5.590323	3.0	173.3	0.000000
1	2012-01-02	10.9	10.6	2.8	4.5	rain	January	NaN	5.590323	3.0	173.3	0.062897
2	2012-01-03	0.8	11.7	7.2	2.3	rain	January	NaN	5.590323	3.0	173.3	0.004616
3	2012-01-04	20.3	12.2	5.6	4.7	rain	January	NaN	5.590323	3.0	173.3	0.117138
4	2012-01-05	1.3	8.9	2.8	6.1	rain	January	NaN	5.590323	3.0	173.3	0.007501
...	...	...	...	...	...	...	...	...	...	...	...	...
361	2012-12-27	4.1	7.8	3.3	3.2	rain	December	NaN	5.612903	3.3	174.0	0.023563
362	2012-12-28	0.0	8.3	3.9	1.7	rain	December	NaN	5.612903	3.3	174.0	0.000000
363	2012-12-29	1.5	5.0	3.3	1.7	rain	December	NaN	5.612903	3.3	174.0	0.008621
364	2012-12-30	0.0	4.4	0.0	1.8	drizzle	December	NaN	5.612903	3.3	174.0	0.000000
365	2012-12-31	0.0	3.3	-1.1	2.0	drizzle	December	NaN	5.612903	3.3	174.0	0.000000

366 rows × 12 columns

练习

20.6 练习题:每日小费比例

使用 tips 数据集,计算: 1. 一个包含每天总小费的新列 daily_total_tips 2. 一个显示每位顾客该日总小费的比例的新列 tip_proportion

# Your code here:
tips = px.data.tips()
tips

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
...	...	...	...	...	...	...	...
239	29.03	5.92	Male	No	Sat	Dinner	3
240	27.18	2.00	Female	Yes	Sat	Dinner	2
241	22.67	2.00	Male	Yes	Sat	Dinner	2
242	17.82	1.75	Male	No	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

244 rows × 7 columns

您的输出数据的前几行应类似于:

total_bill    tip     sex      smoker    day    time      size    daily_total_tips    tip_proportion
16.99         1.01    Female   No        Sun    Dinner    2       247.39              0.004083
10.34         1.66    Male     No        Sun    Dinner    3       247.39              0.006710
21.01         3.50    Male     No        Sun    Dinner    3       247.39              0.014148
23.68         3.31    Male     No        Sun    Dinner    2       247.39              0.013380

让我们重新初始化 weather DataFrame 为一个较小的列集合,用于本课程的其余部分:

weather = weather[['date', 'month', 'precipitation', 'wind', 'weather']]
weather

	date	month	precipitation	wind	weather
0	2012-01-01	January	0.0	4.7	drizzle
1	2012-01-02	January	10.9	4.5	rain
2	2012-01-03	January	0.8	2.3	rain
3	2012-01-04	January	20.3	4.7	rain
4	2012-01-05	January	1.3	6.1	rain
...	...	...	...	...	...
361	2012-12-27	December	4.1	3.2	rain
362	2012-12-28	December	0.0	1.7	rain
363	2012-12-29	December	1.5	1.7	rain
364	2012-12-30	December	0.0	1.8	drizzle
365	2012-12-31	December	0.0	2.0	drizzle

366 rows × 5 columns

20.7 使用 `value_counts()` 统计组内的值

统计分类变量在组内的出现次数可以揭示有趣的模式,通常需要在使用 groupby() 后进行。

首先,让我们回顾一下 value_counts() 在整个 DataFrame 上的工作方式。

# 天气类型的计数
weather["weather"].value_counts()

weather
rain       191
sun        118
drizzle     31
snow        21
fog          5
Name: count, dtype: int64

我们可以添加 normalize=True 来获得比例:

weather['weather'].value_counts(normalize=True)

weather
rain       0.521858
sun        0.322404
drizzle    0.084699
snow       0.057377
fog        0.013661
Name: proportion, dtype: float64

现在,要统计每个月的天气类型,我们首先按 month 分组,然后选择 weather 列并应用 value_counts()。

# 每个月天气类型的计数
weather.groupby('month')['weather'].value_counts()

month     weather
January   rain       18
          snow        7
          sun         4
          drizzle     2
          fog         0
                     ..
December  rain       23
          snow        5
          drizzle     2
          sun         1
          fog         0
Name: count, Length: 60, dtype: int64

这将返回一个具有多重索引的 Series,可以使用 reset_index() 转换为常规的 DataFrame:

weather.groupby('month')['weather'].value_counts().reset_index()

	month	weather	count
0	January	rain	18
1	January	snow	7
2	January	sun	4
3	January	drizzle	2
4	January	fog	0
...	...	...	...
55	December	rain	23
56	December	snow	5
57	December	drizzle	2
58	December	sun	1
59	December	fog	0

60 rows × 3 columns

练习

20.8 练习题:按日统计吸烟者和非吸烟者人数

使用 tips 数据集,统计每一天吸烟者和非吸烟者的数量。

tips = px.data.tips()
tips

# Your code here:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
...	...	...	...	...	...	...	...
239	29.03	5.92	Male	No	Sat	Dinner	3
240	27.18	2.00	Female	Yes	Sat	Dinner	2
241	22.67	2.00	Male	Yes	Sat	Dinner	2
242	17.82	1.75	Male	No	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

244 rows × 7 columns

您的结果的前几行应类似于:

day smoker  count
Fri Yes 15
Fri No  4
Sat No  45

20.9 在组内计算累积和

累积和有助于跟踪组内的运行总计。这是一种常用的操作。让我们看看如何对分组数据执行此操作。

回顾一下,以下是如何计算整个 DataFrame 的降水累积和:

# 降水的累积和
weather["precip_cumul"] = weather["precipitation"].cumsum()
weather

	date	month	precipitation	wind	weather	precip_cumul
0	2012-01-01	January	0.0	4.7	drizzle	0.0
1	2012-01-02	January	10.9	4.5	rain	10.9
2	2012-01-03	January	0.8	2.3	rain	11.7
3	2012-01-04	January	20.3	4.7	rain	32.0
4	2012-01-05	January	1.3	6.1	rain	33.3
...	...	...	...	...	...	...
361	2012-12-27	December	4.1	3.2	rain	1224.5
362	2012-12-28	December	0.0	1.7	rain	1224.5
363	2012-12-29	December	1.5	1.7	rain	1226.0
364	2012-12-30	December	0.0	1.8	drizzle	1226.0
365	2012-12-31	December	0.0	2.0	drizzle	1226.0

366 rows × 6 columns

要计算每个月的降水累积和,我们可以使用 groupby() 和 cumsum():

# 每个月的降水累积和
weather["precip_cumul"] = weather.groupby("month")["precipitation"].cumsum()
weather

	date	month	precipitation	wind	weather	precip_cumul
0	2012-01-01	January	0.0	4.7	drizzle	0.0
1	2012-01-02	January	10.9	4.5	rain	10.9
2	2012-01-03	January	0.8	2.3	rain	11.7
3	2012-01-04	January	20.3	4.7	rain	32.0
4	2012-01-05	January	1.3	6.1	rain	33.3
...	...	...	...	...	...	...
361	2012-12-27	December	4.1	3.2	rain	172.5
362	2012-12-28	December	0.0	1.7	rain	172.5
363	2012-12-29	December	1.5	1.7	rain	174.0
364	2012-12-30	December	0.0	1.8	drizzle	174.0
365	2012-12-31	December	0.0	2.0	drizzle	174.0

366 rows × 6 columns

20.10 练习题:按日累积小费金额

使用 tips 数据集,计算每个 day 的 total_bill 累积和,添加一个新列 cumul_total_bill_day。然后添加另一列 cumul_tip_day,包含每个 day 的 tip 累积和。

tips = px.data.tips()
tips = tips.sort_values('day')
tips
# Your code here:

	total_bill	tip	sex	smoker	day	time	size
96	27.28	4.00	Male	Yes	Fri	Dinner	2
101	15.38	3.00	Female	Yes	Fri	Dinner	2
98	21.01	3.00	Male	Yes	Fri	Dinner	2
97	12.03	1.50	Male	Yes	Fri	Dinner	2
95	40.17	4.73	Male	Yes	Fri	Dinner	4
...	...	...	...	...	...	...	...
132	11.17	1.50	Female	No	Thur	Lunch	2
131	20.27	2.83	Female	No	Thur	Lunch	2
130	19.08	1.50	Male	No	Thur	Lunch	2
128	11.38	2.00	Female	No	Thur	Lunch	2
243	18.78	3.00	Female	No	Thur	Dinner	2

244 rows × 7 columns

20.11 总结

在本课程中,您学习了 pandas 中几个强大的 分组级数据转换:

添加汇总统计:使用 transform() 将分组级别的计算添加为新列
组内计数:使用 value_counts() 统计组内的出现次数
计算累积和:跟踪组内的运行总计

这些技术使您能够分析数据中特定子集中的模式和统计信息。继续练习不同的数据集,以提升您的数据操作技能!

下次见!

20.1 介绍

20.2 学习目标

20.3 导入

20.4 数据

20.5 使用 transform() 添加汇总统计

20.6 练习题:每日小费比例

20.7 使用 value_counts() 统计组内的值

20.8 练习题:按日统计吸烟者和非吸烟者人数

20.9 在组内计算累积和

20.10 练习题:按日累积小费金额

20.11 总结

20.5 使用 `transform()` 添加汇总统计

20.7 使用 `value_counts()` 统计组内的值