15 Subsetting columns – Introduction to Data Science with Python

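15.1 Introduction

Today we begin our exploration of data manipulation with pandas!

Our first focus is selecting and renaming columns. Datasets often contain many columns you do not need, and you will want to narrow them down to just a few. Pandas makes this easy. Let's take a look.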

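15.2 Learning objectives

You can keep or drop columns from a DataFrame using square brackets [], filter(), and drop().
You can select columns based on regex patterns with filter().
You can use rename() to change column names.
You can use regex to clean column names.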

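15.3 About pandas

Pandas is a popular library for data manipulation and analysis. It is designed to make working with tabular data in Python easy.

If pandas is not yet installed, install it by running the following command in your terminal: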

pip install pandas

Then import pandas in your script with:

import pandas as pd

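15.4 The Yaounde COVID-19 dataset

In this lesson, we analyze results from a COVID-19 survey conducted in Yaoundé, Cameroon, in late 2020. The survey estimated how many people in the region had been infected with COVID-19 by testing for antibodies.

You can learn more about this dataset here: https://www.nature.com/articles/s41467-021-25946-0

To download the dataset, visit this link: https://raw.githubusercontent.com/the-graph-courses/idap_book/main/data/yaounde_data.zip

Then unzip the file and place yaounde_data.csv in a data folder in the same directory as your notebook.

Let's load and inspect the dataset: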

yao = pd.read_csv("data/yaounde_data.csv")
yao

	id	date_surveyed	age	age_category	age_category_3	sex	highest_education	occupation	weight_kg	height_cm	...	is_drug_antibio	is_drug_hydrocortisone	is_drug_other_anti_inflam	is_drug_antiviral	is_drug_chloro	is_drug_tradn	is_drug_oxygen	is_drug_other	is_drug_no_resp	is_drug_none
0	BRIQUETERIE_000_0001	2020-10-22	45	45 - 64	Adult	Female	Secondary	Informal worker	95	169	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	BRIQUETERIE_000_0002	2020-10-24	55	45 - 64	Adult	Male	University	Salaried worker	96	185	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
969	TSINGAOLIGA_026_0002	2020-11-11	31	30 - 44	Adult	Female	Secondary	Unemployed	66	169	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
970	TSINGAOLIGA_026_0003	2020-11-11	17	15 - 29	Child	Female	Secondary	Unemployed	67	162	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

971 rows × 53 columns

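15.5 Selecting columns with square brackets `[]`

In pandas, the most common way to select columns is simply to use square brackets [] and the column names. For example, to select the age and sex columns, we type: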

yao[["age", "sex"]]

	age	sex
0	45	Female
1	55	Male
...	...	...
969	31	Female
970	17	Female

971 rows × 2 columns

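Note the double square brackets [[]]. The inner pair creates a Python list of column names, and the outer pair does the selecting. With a single pair, pandas treats "age", "sex" as one key and looks for a single column with that name, so you get an error: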

yao["age", "sex"]

KeyError: ('age', 'sex')

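If you want to select a single column, you can omit the double square brackets, but the output will then be a Series rather than a DataFrame. Compare the following: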

yao["age"] # does not return a DataFrame

0      45
1      55
       ..
969    31
970    17
Name: age, Length: 971, dtype: int64

yao[["age"]]  # returns a DataFrame

	age
0	45
1	55
...	...
969	31
970	17

971 rows × 1 columns
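To make the difference concrete, here is a minimal sketch that checks the type of each result. It uses a small made-up DataFrame called toy (a hypothetical stand-in for yao, not part of the lesson's data) so that it runs without the dataset file.

```python
import pandas as pd

# Hypothetical toy data standing in for the yao DataFrame
toy = pd.DataFrame({"age": [45, 55, 31], "sex": ["Female", "Male", "Female"]})

single = toy["age"]    # single brackets: a one-dimensional Series
double = toy[["age"]]  # double brackets: a two-dimensional DataFrame

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame
print(single.shape)           # (3,)
print(double.shape)           # (3, 1)
```

The distinction matters later on, because some methods behave differently on a Series than on a DataFrame.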

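Key Point

15.6 Storing data subsets

Note that these selections do not modify the DataFrame itself. If we want a modified version, we need to create a new DataFrame to store the subset. For example, below we create a subset with only three columns: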

yao_subset = yao[["age", "sex", "igg_result"]]
yao_subset

	age	sex	igg_result
0	45	Female	Negative
1	55	Male	Positive
...	...	...	...
969	31	Female	Negative
970	17	Female	Negative

971 rows × 3 columns

If we want to overwrite a DataFrame, we can assign the subset back to the same name. Let's overwrite yao_subset so that it contains only the age column:

yao_subset = yao_subset[["age"]]
yao_subset

	age
0	45
1	55
...	...
969	31
970	17

971 rows × 1 columns

The yao_subset DataFrame has gone from 3 columns to 1.
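The same behavior can be checked on a small example. The sketch below uses a made-up DataFrame named toy (an assumption for illustration, not the lesson's dataset) to show that a selection leaves the source DataFrame untouched and that only reassignment changes what a name refers to.

```python
import pandas as pd

# Hypothetical toy data standing in for the yao DataFrame
toy = pd.DataFrame(
    {
        "age": [45, 55],
        "sex": ["Female", "Male"],
        "igg_result": ["Negative", "Positive"],
    }
)

toy[["age"]]  # the result is not stored, so toy itself is unchanged

toy_subset = toy[["age", "sex"]]   # store a two-column subset under a new name
toy_subset = toy_subset[["age"]]   # overwrite the subset with a narrower one

print(toy.shape[1])         # 3
print(toy_subset.shape[1])  # 1
```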

Practice

15.6.1 Practice Q: Select columns with `[]`

Use the [] operator to select the "weight_kg" and "height_cm" variables in the yao DataFrame. Assign the result to a new DataFrame called yao_weight_height. Then print this new DataFrame.

# Your code here

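Pro Tip

There are many ways to select columns in pandas. In your free time, you may choose to explore the .loc[] indexer and the .take() method, which provide additional functionality.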
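As a starting point for that exploration, here is a brief sketch of both approaches on a made-up DataFrame called toy (a hypothetical example, not the lesson's data). .loc[] selects by label, while .take() selects by integer position.

```python
import pandas as pd

# Hypothetical toy data standing in for the yao DataFrame
toy = pd.DataFrame(
    {"age": [45, 55], "sex": ["Female", "Male"], "weight_kg": [95, 96]}
)

# .loc[rows, columns]: ":" keeps every row, the list names the columns to keep
by_label = toy.loc[:, ["age", "sex"]]

# .loc[] also accepts a label slice, which includes both endpoints
by_slice = toy.loc[:, "sex":"weight_kg"]

# .take() picks columns by position when axis=1 (0 is the first column)
by_position = toy.take([0, 2], axis=1)

print(list(by_label.columns))     # ['age', 'sex']
print(list(by_slice.columns))     # ['sex', 'weight_kg']
print(list(by_position.columns))  # ['age', 'weight_kg']
```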

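15.7 Excluding columns with `drop()`

Sometimes it is more useful to drop the columns you do not need than to explicitly select the ones you do.

To drop columns, we can use the drop() method with the columns argument. To drop the age column, we type: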

yao.drop(columns=["age"])

	id	date_surveyed	age_category	age_category_3	sex	highest_education	occupation	weight_kg	height_cm	is_smoker	...	is_drug_antibio	is_drug_hydrocortisone	is_drug_other_anti_inflam	is_drug_antiviral	is_drug_chloro	is_drug_tradn	is_drug_oxygen	is_drug_other	is_drug_no_resp	is_drug_none
0	BRIQUETERIE_000_0001	2020-10-22	45 - 64	Adult	Female	Secondary	Informal worker	95	169	Non-smoker	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	BRIQUETERIE_000_0002	2020-10-24	45 - 64	Adult	Male	University	Salaried worker	96	185	Ex-smoker	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
969	TSINGAOLIGA_026_0002	2020-11-11	30 - 44	Adult	Female	Secondary	Unemployed	66	169	Non-smoker	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
970	TSINGAOLIGA_026_0003	2020-11-11	15 - 29	Child	Female	Secondary	Unemployed	67	162	Non-smoker	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

971 rows × 52 columns

To drop several columns:

yao.drop(columns=["age", "sex"])

	id	date_surveyed	age_category	age_category_3	highest_education	occupation	weight_kg	height_cm	is_smoker	is_pregnant	...	is_drug_antibio	is_drug_hydrocortisone	is_drug_other_anti_inflam	is_drug_antiviral	is_drug_chloro	is_drug_tradn	is_drug_oxygen	is_drug_other	is_drug_no_resp	is_drug_none
0	BRIQUETERIE_000_0001	2020-10-22	45 - 64	Adult	Secondary	Informal worker	95	169	Non-smoker	No	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	BRIQUETERIE_000_0002	2020-10-24	45 - 64	Adult	University	Salaried worker	96	185	Ex-smoker	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
969	TSINGAOLIGA_026_0002	2020-11-11	30 - 44	Adult	Secondary	Unemployed	66	169	Non-smoker	No	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
970	TSINGAOLIGA_026_0003	2020-11-11	15 - 29	Child	Secondary	Unemployed	67	162	Non-smoker	No response	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

971 rows × 51 columns

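Again, note that this does not modify the DataFrame itself. If we want a modified version, we need to create a new DataFrame to store the subset. For example, below we create a subset that drops age and sex: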

yao_subset = yao.drop(columns=["age", "sex"])
yao_subset

	id	date_surveyed	age_category	age_category_3	highest_education	occupation	weight_kg	height_cm	is_smoker	is_pregnant	...	is_drug_antibio	is_drug_hydrocortisone	is_drug_other_anti_inflam	is_drug_antiviral	is_drug_chloro	is_drug_tradn	is_drug_oxygen	is_drug_other	is_drug_no_resp	is_drug_none
0	BRIQUETERIE_000_0001	2020-10-22	45 - 64	Adult	Secondary	Informal worker	95	169	Non-smoker	No	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	BRIQUETERIE_000_0002	2020-10-24	45 - 64	Adult	University	Salaried worker	96	185	Ex-smoker	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
969	TSINGAOLIGA_026_0002	2020-11-11	30 - 44	Adult	Secondary	Unemployed	66	169	Non-smoker	No	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
970	TSINGAOLIGA_026_0003	2020-11-11	15 - 29	Child	Secondary	Unemployed	67	162	Non-smoker	No response	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

971 rows × 51 columns
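One detail worth knowing is what happens when a name passed to drop() does not exist. The sketch below, which uses a made-up DataFrame called toy and a deliberately wrong column name (both assumptions for illustration), shows the default KeyError and the errors="ignore" option that skips missing names.

```python
import pandas as pd

# Hypothetical toy data standing in for the yao DataFrame
toy = pd.DataFrame(
    {"age": [45, 55], "sex": ["Female", "Male"], "weight_kg": [95, 96]}
)

# drop() returns a new DataFrame; toy keeps all three columns
dropped = toy.drop(columns=["age", "sex"])
print(list(dropped.columns))  # ['weight_kg']
print(list(toy.columns))      # ['age', 'sex', 'weight_kg']

# By default, asking to drop a column that does not exist raises a KeyError
try:
    toy.drop(columns=["not_a_column"])
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# errors="ignore" drops the names that exist and silently skips the rest
safe = toy.drop(columns=["age", "not_a_column"], errors="ignore")
print(list(safe.columns))  # ['sex', 'weight_kg']
```

The default error is usually helpful, since it catches typos in column names; errors="ignore" is best reserved for cases where a column may legitimately be absent.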

Practice

15.7.1 Practice Q: Drop columns with `drop()`

From the yao DataFrame, drop the highest_education and consultation columns. Assign the result to a new DataFrame called yao_no_education_consultation. Print this new DataFrame.

# Your code here

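15.8 Using `filter()` to select columns by regex

The filter() method and its regex argument offer a powerful way to select columns based on patterns in their names. For example, to select columns containing the string "ig", we can write: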

yao.filter(regex="ig")

	highest_education	weight_kg	height_cm	neighborhood	igg_result	igm_result	symp_fatigue
0	Secondary	95	169	Briqueterie	Negative	Negative	No
1	University	96	185	Briqueterie	Positive	Negative	No
...	...	...	...	...	...	...	...
969	Secondary	66	169	Tsinga Oliga	Negative	Negative	No
970	Secondary	67	162	Tsinga Oliga	Negative	Negative	No

971 rows × 7 columns

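The regex argument specifies the pattern to match. Regex stands for regular expression: a sequence of characters that defines a search pattern.

To select columns that start with the string "ig", we write: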

yao.filter(regex="^ig")

	igg_result	igm_result
0	Negative	Negative
1	Positive	Negative
...	...	...
969	Negative	Negative
970	Negative	Negative

971 rows × 2 columns

The ^ symbol is a regex character that matches the beginning of the string.

To select columns that end with the string "result", we can write:

yao.filter(regex="result$")

	igg_result	igm_result
0	Negative	Negative
1	Positive	Negative
...	...	...
969	Negative	Negative
970	Negative	Negative

971 rows × 2 columns

The $ character is a regex character that matches the end of the string.
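The three patterns can be compared side by side on a small example. The sketch below uses a made-up DataFrame called toy with a handful of column names borrowed from the dataset (the values are invented), and adds one pattern not covered above: the | character, which matches either of two alternatives.

```python
import pandas as pd

# Hypothetical one-row toy data; only the column names matter here
toy = pd.DataFrame(
    {
        "weight_kg": [95],
        "igg_result": ["Negative"],
        "igm_result": ["Negative"],
        "is_smoker": ["Non-smoker"],
    }
)

contains_ig = toy.filter(regex="ig")     # "ig" anywhere in the name
starts_ig = toy.filter(regex="^ig")      # name starts with "ig"
ends_kg = toy.filter(regex="kg$")        # name ends with "kg"
either = toy.filter(regex="^is_|kg$")    # starts with "is_" OR ends with "kg"

print(list(contains_ig.columns))  # ['weight_kg', 'igg_result', 'igm_result']
print(list(starts_ig.columns))    # ['igg_result', 'igm_result']
print(list(ends_kg.columns))      # ['weight_kg']
print(list(either.columns))       # ['weight_kg', 'is_smoker']
```

Note that weight_kg matches the plain "ig" pattern because of the letters inside "weight", which is why anchoring with ^ or $ often gives a more precise selection.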

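Pro Tip

Regex is notoriously hard to memorize, but LLMs like ChatGPT are very good at generating the right patterns. For example, simply ask: "What is the regex for strings starting with 'ig'?"

Practice

15.8.1 Practice Q: Select columns with regex

Select all the columns in the yao DataFrame that start with "is_". Assign the result to a new DataFrame called yao_is_columns. Then print this new DataFrame.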

# Your code here

15.9 Changing column names with `rename()`

We can use the rename() method to change column names:

yao.rename(columns={"age": "patient_age", "sex": "patient_sex"})

	id	date_surveyed	patient_age	age_category	age_category_3	patient_sex	highest_education	occupation	weight_kg	height_cm	...	is_drug_antibio	is_drug_hydrocortisone	is_drug_other_anti_inflam	is_drug_antiviral	is_drug_chloro	is_drug_tradn	is_drug_oxygen	is_drug_other	is_drug_no_resp	is_drug_none
0	BRIQUETERIE_000_0001	2020-10-22	45	45 - 64	Adult	Female	Secondary	Informal worker	95	169	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	BRIQUETERIE_000_0002	2020-10-24	55	45 - 64	Adult	Male	University	Salaried worker	96	185	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
969	TSINGAOLIGA_026_0002	2020-11-11	31	30 - 44	Adult	Female	Secondary	Unemployed	66	169	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
970	TSINGAOLIGA_026_0003	2020-11-11	17	15 - 29	Child	Female	Secondary	Unemployed	67	162	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

971 rows × 53 columns
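A few properties of rename() are easy to miss, so here is a short sketch on a made-up DataFrame called toy (a hypothetical example, not the lesson's data). It shows that rename() returns a new DataFrame, that it also accepts a function, and that a misspelled old name is skipped without an error by default.

```python
import pandas as pd

# Hypothetical toy data standing in for the yao DataFrame
toy = pd.DataFrame({"age": [45, 55], "sex": ["Female", "Male"]})

# A dictionary maps old names to new names; toy itself is not modified
renamed = toy.rename(columns={"age": "patient_age", "sex": "patient_sex"})
print(list(renamed.columns))  # ['patient_age', 'patient_sex']
print(list(toy.columns))      # ['age', 'sex']

# A function is applied to every column name
upper = toy.rename(columns=str.upper)
print(list(upper.columns))  # ['AGE', 'SEX']

# A key that matches no column ("aeg" is a deliberate typo) is ignored by default
typo = toy.rename(columns={"aeg": "patient_age"})
print(list(typo.columns))  # ['age', 'sex']
```

Because a typo passes silently, it is worth printing the column names after renaming to confirm the change took effect.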

Practice

15.9.1 Practice Q: Rename columns with `rename()`

Rename the age_category column in the yao DataFrame to age_cat. Assign the result to a new DataFrame called yao_age_cat. Then print this new DataFrame.

# Your code here

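15.10 Cleaning messy column names

To clean column names, you can use regular expressions with the str.replace() method in pandas.

Here is how to do it on a test DataFrame with messy column names. Messy column names are names that contain spaces, special characters, or other non-alphanumeric characters.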

test_df = pd.DataFrame(
    {"good_name": range(3), "bad name": range(3), "bad*@name*2": range(3)}
)
test_df

	good_name	bad name	bad*@name*2
0	0	0	0
1	1	1	1
2	2	2	2

Such column names are not ideal because, for example, we cannot select them with the dot operator the way we can with clean names:

test_df.good_name  # this works

0    0
1    1
2    2
Name: good_name, dtype: int64

But this does not work:

test_df.bad name

      test_df.bad name
                 ^
SyntaxError: invalid syntax

We can automatically clean such names using the str.replace() method together with regular expressions.

clean_names = test_df.columns.str.replace(r'[^a-zA-Z0-9]', '_', regex=True)

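The regular expression r'[^a-zA-Z0-9]' matches any character that is not a letter (uppercase or lowercase) or a digit. The str.replace() method replaces each of these characters with an underscore ('_'), which makes the column names more readable and usable with dot notation.

Now we can replace the column names in the DataFrame with the cleaned names: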

test_df.columns = clean_names
test_df

	good_name	bad_name	bad__name_2
0	0	0	0
1	1	1	1
2	2	2	2
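The result above still contains a double underscore in bad__name_2, because each special character is replaced individually. As one possible refinement (a sketch, not the only reasonable convention), the pattern below adds a + so that a run of special characters becomes a single underscore, then strips stray underscores from the ends and lowercases the names. The variable name messy is made up for this example.

```python
import pandas as pd

# Same messy column names as the test DataFrame in this section
messy = pd.DataFrame(
    {"good_name": range(3), "bad name": range(3), "bad*@name*2": range(3)}
)

clean = (
    messy.columns
    .str.replace(r"[^a-zA-Z0-9]+", "_", regex=True)  # one "_" per run of bad characters
    .str.strip("_")                                  # remove leading/trailing underscores
    .str.lower()                                     # use lowercase throughout
)
messy.columns = clean

print(list(messy.columns))  # ['good_name', 'bad_name', 'bad_name_2']
```

One caveat of aggressive cleaning is that two different messy names can collapse into the same clean name, so it is worth checking that the cleaned names are still unique.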

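Practice

15.10.1 Practice Q: Clean column names with regex

Consider the DataFrame defined below, which has messy column names. Use the str.replace() method to clean the column names.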

cleaning_practice = pd.DataFrame(
    {"Aloha": range(3), "Bell Chart": range(3), "Animals@the zoo": range(3)}
)
cleaning_practice

	Aloha	Bell Chart	Animals@the zoo
0	0	0	0
1	1	1	1
2	2	2	2

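15.11 Wrap-up

Hopefully this lesson has shown you how intuitive and useful pandas is for data manipulation!

This is the first in a series of lessons on basic data wrangling techniques: see you in the next lesson to learn more.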
