24  Using LLMs in Python for Text Generation

24.1 Introduction

In this tutorial, we explore how to use large language models (LLMs) to generate text with OpenAI's API. We will use the gpt-4o-mini model to generate responses to both fixed and variable prompts, streamline our code with helper functions and vectorization, and process data with pandas DataFrames.

24.2 Learning Objectives

  • Set up the OpenAI client
  • Define and use simple functions to generate text
  • Apply functions to a DataFrame using vectorization

24.3 Setting Up the OpenAI Client

First, we need to set up the OpenAI client with your API key. Here, we store the key in a file named local_settings.py and then import it into our script.

from openai import OpenAI
import pandas as pd
import numpy as np
from local_settings import OPENAI_KEY

# Initialize the OpenAI client with your API key
client = OpenAI(api_key=OPENAI_KEY)

Alternatively, you can pass your API key directly when setting api_key, but be careful not to expose it in your code, especially if you plan to share or publish it.
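A common way to avoid hard-coding the key, sketched minimally below, is to read it from an environment variable. The name OPENAI_API_KEY is a convention we are assuming here, not something defined earlier in this tutorial:

```python
import os

# OPENAI_API_KEY is a conventional variable name; set it in your shell
# before running the script, e.g.:
#   export OPENAI_API_KEY="sk-..."
api_key = os.environ.get("OPENAI_API_KEY")

if api_key is None:
    # Fall back to the local_settings approach, or stop with a clear message.
    print("OPENAI_API_KEY is not set")

# client = OpenAI(api_key=api_key)  # then initialize the client as before
```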

24.4 Making an API Call

Let's make an API call that uses the gpt-4o-mini model to generate a response to a prompt.

response = client.chat.completions.create(
    model="gpt-4o-mini", messages=[{"role": "user", "content": "What is the most tourist-friendly city in France?"}]
)
print(response.choices[0].message.content)
Paris is often considered the most tourist-friendly city in France. It is renowned for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and Montmartre. The city's extensive public transportation system, including buses and the Metro, makes it easy for tourists to navigate. Additionally, Paris offers a wide range of accommodations, dining options, and cultural experiences catering to visitors. Many tourist information centers are also available to help travelers with any inquiries or guidance they may need. Other cities like Nice, Lyon, and Marseille are also popular with tourists but may not offer the same level of convenience and appeal as Paris.
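The chat completions response is a nested object, which is why we drill down with response.choices[0].message.content. As a rough stand-in (not the real OpenAI types), its shape looks like this:

```python
from types import SimpleNamespace

# A minimal stand-in for the nested response object (not the real OpenAI types).
# The API can return several candidate completions, so choices is a list;
# we usually take the first one.
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(role="assistant", content="Paris"))]
)

print(fake_response.choices[0].message.content)  # → Paris
```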

24.5 Defining a Helper Function

To streamline our code and avoid repetition, we will define a helper function for making API calls. API calls involve a fair amount of boilerplate, so wrapping this logic in a function keeps our code cleaner and easier to maintain.

If you forget how to structure an API call, refer to the OpenAI API documentation or search online for "OpenAI Python API example".

Here is how we define the llm_chat function:

def llm_chat(message):
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": message}]
    )
    return response.choices[0].message.content

This function takes a message as input, sends it to the LLM, and returns the generated response. The model parameter specifies which model to use, in this case gpt-4o-mini. We use this model because it offers a good balance of quality, speed, and cost. If you need a more capable model, you can use gpt-4o, but be careful not to exceed your API quota.

24.6 Fixed Questions

Let's start by sending a fixed question to the gpt-4o-mini model and getting a response.

# Example usage
response = llm_chat("What is the most tourist-friendly city in France?")
print(response)
Paris is often considered the most tourist-friendly city in France. It is a major global city known for its iconic landmarks, such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is well-equipped to accommodate tourists, with a wide range of hotels, restaurants, and attractions. Additionally, Paris has extensive public transportation options, including buses and the metro, making it easy for visitors to navigate.

Other cities in France that are also quite tourist-friendly include Nice, Lyon, and Marseille, each offering unique attractions and experiences. However, Paris remains the most popular destination for international tourists.

24.7 Practice: Get the Most Tourist-Friendly City in Brazil

Use the llm_chat function to ask the model for the most tourist-friendly city in Brazil. Store the response in a variable called rec_brazil. Print the response.

# Your code here

24.8 Variables as Prompt Inputs

Often, you will want to generate responses based on different inputs. Let's create a function that takes a country as input and asks the model for the most tourist-friendly city in that country.

def city_rec(country):
    prompt = f"What is the most tourist-friendly city in {country}?"
    return llm_chat(prompt)

Now you can get recommendations for different countries by calling city_rec("Country Name"):

city_rec("Nigeria")
'Lagos is often considered the most tourist-friendly city in Nigeria. As the largest city in the country and one of the fastest-growing cities in the world, Lagos offers a vibrant blend of culture, nightlife, beaches, and historical sites. Tourists can explore attractions like Lekki Conservation Centre, Tarkwa Bay Beach, the Nike Art Gallery, and the National Museum Lagos. The city also has a diverse culinary scene and a range of accommodations, making it accessible for both international and domestic travelers.\n\nOther cities like Abuja, the capital, and tourist destinations like Calabar, known for its festivals, and Port Harcourt, with its scenic beauty, also attract visitors, but Lagos remains the primary hub for tourists due to its numerous amenities and activities.'

However, if we try to use this function directly on a list of countries or a DataFrame column, it will not process each country individually. Instead, it will concatenate the whole list into a single string, which is not the behavior we want.

# Incorrect usage
country_df = pd.DataFrame({"country": ["Nigeria", "Chile", "France", "Canada"]})

response = city_rec(country_df["country"])

print(response)
Determining the "most tourist-friendly" city can be subjective and depends on various factors such as infrastructure, hospitality, safety, attractions, and overall visitor experience. However, I can provide a general idea based on popular perceptions:

1. **Nigeria**: Cities like Lagos and Abuja offer a vibrant culture and various attractions but may face challenges related to safety and infrastructure.
   
2. **Chile**: Santiago, the capital, is known for its friendly locals, accessibility, and numerous attractions, making it quite tourist-friendly.

3. **France**: Paris is one of the most famous tourist destinations in the world, known for its rich history, culture, and hospitality, making it very tourist-friendly.

4. **Canada**: Cities like Toronto and Vancouver are well-regarded for their friendliness, safety, and the quality of services offered to tourists.

Based on these considerations, **Paris, France** would likely be considered the most tourist-friendly city among the options given, primarily due to its extensive tourist infrastructure, world-renowned attractions, and overall visitor experience.

To process each country individually, we can use NumPy's vectorize function. It converts city_rec into a form that accepts arrays (such as lists or NumPy arrays) and applies the function element-wise.

# Vectorize the function
city_rec_vec = np.vectorize(city_rec)

# Apply the function to each country
country_df["city_rec"] = city_rec_vec(country_df["country"])
country_df
country city_rec
0 Nigeria Lagos is often considered the most tourist-fri...
1 Chile The most tourist-friendly city in Chile is oft...
2 France Paris is widely regarded as the most tourist-f...
3 Canada While several Canadian cities are known for be...

This code outputs a DataFrame with a new city_rec column containing a city recommendation for each country.
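As an aside, pandas' own Series.apply achieves the same element-wise behavior as np.vectorize. The sketch below uses a stand-in function that just builds the prompt string, so it runs without any API calls; swapping in city_rec would work the same way:

```python
import pandas as pd

# Stand-in for city_rec that returns the prompt instead of calling the LLM,
# so this sketch runs without an API key.
def city_prompt(country):
    return f"What is the most tourist-friendly city in {country}?"

df = pd.DataFrame({"country": ["Nigeria", "Chile"]})

# Series.apply calls the function once per element, just like np.vectorize.
df["prompt"] = df["country"].apply(city_prompt)
print(df["prompt"].tolist())
# → ['What is the most tourist-friendly city in Nigeria?',
#    'What is the most tourist-friendly city in Chile?']
```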


24.9 Practice: Get Local Dishes

Create a function called get_local_dishes that takes a country name as input and returns some of that country's most famous local dishes. Then vectorize this function and apply it to the country_df DataFrame to add a column of local-dish recommendations for each country.

# Your code here

24.10 Automated Summaries: The Movies Dataset

In this example, we will use the movies dataset from vega_datasets to generate an automated summary for each movie. We will convert each movie's data into a dictionary and feed it to the LLM to generate a one-paragraph summary of its performance.

First, let's load the movies dataset and preview the first few rows:

import pandas as pd
import vega_datasets as vd

# Load the movies dataset
movies = vd.data.movies().head()  # Use only the first 5 rows to save API credits
movies
Title US_Gross Worldwide_Gross US_DVD_Sales Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes
0 The Land Girls 146083.0 146083.0 NaN 8000000.0 Jun 12 1998 R NaN Gramercy None None None None NaN 6.1 1071.0
1 First Love, Last Rites 10876.0 10876.0 NaN 300000.0 Aug 07 1998 R NaN Strand None Drama None None NaN 6.9 207.0
2 I Married a Strange Person 203134.0 203134.0 NaN 250000.0 Aug 28 1998 None NaN Lionsgate None Comedy None None NaN 6.8 865.0
3 Let's Talk About Sex 373615.0 373615.0 NaN 300000.0 Sep 11 1998 None NaN Fine Line None Comedy None None 13.0 NaN NaN
4 Slam 1009819.0 1087521.0 NaN 1000000.0 Oct 09 1998 R NaN Trimark Original Screenplay Drama Contemporary Fiction None 62.0 3.4 165.0

Next, we convert each row of the DataFrame into a dictionary. This makes it easy to pass the data to the LLM.

# Convert each movie's data into a dictionary
movies.to_dict(orient="records")
[{'Title': 'The Land Girls',
  'US_Gross': 146083.0,
  'Worldwide_Gross': 146083.0,
  'US_DVD_Sales': nan,
  'Production_Budget': 8000000.0,
  'Release_Date': 'Jun 12 1998',
  'MPAA_Rating': 'R',
  'Running_Time_min': nan,
  'Distributor': 'Gramercy',
  'Source': None,
  'Major_Genre': None,
  'Creative_Type': None,
  'Director': None,
  'Rotten_Tomatoes_Rating': nan,
  'IMDB_Rating': 6.1,
  'IMDB_Votes': 1071.0},
 {'Title': 'First Love, Last Rites',
  'US_Gross': 10876.0,
  'Worldwide_Gross': 10876.0,
  'US_DVD_Sales': nan,
  'Production_Budget': 300000.0,
  'Release_Date': 'Aug 07 1998',
  'MPAA_Rating': 'R',
  'Running_Time_min': nan,
  'Distributor': 'Strand',
  'Source': None,
  'Major_Genre': 'Drama',
  'Creative_Type': None,
  'Director': None,
  'Rotten_Tomatoes_Rating': nan,
  'IMDB_Rating': 6.9,
  'IMDB_Votes': 207.0},
 {'Title': 'I Married a Strange Person',
  'US_Gross': 203134.0,
  'Worldwide_Gross': 203134.0,
  'US_DVD_Sales': nan,
  'Production_Budget': 250000.0,
  'Release_Date': 'Aug 28 1998',
  'MPAA_Rating': None,
  'Running_Time_min': nan,
  'Distributor': 'Lionsgate',
  'Source': None,
  'Major_Genre': 'Comedy',
  'Creative_Type': None,
  'Director': None,
  'Rotten_Tomatoes_Rating': nan,
  'IMDB_Rating': 6.8,
  'IMDB_Votes': 865.0},
 {'Title': "Let's Talk About Sex",
  'US_Gross': 373615.0,
  'Worldwide_Gross': 373615.0,
  'US_DVD_Sales': nan,
  'Production_Budget': 300000.0,
  'Release_Date': 'Sep 11 1998',
  'MPAA_Rating': None,
  'Running_Time_min': nan,
  'Distributor': 'Fine Line',
  'Source': None,
  'Major_Genre': 'Comedy',
  'Creative_Type': None,
  'Director': None,
  'Rotten_Tomatoes_Rating': 13.0,
  'IMDB_Rating': nan,
  'IMDB_Votes': nan},
 {'Title': 'Slam',
  'US_Gross': 1009819.0,
  'Worldwide_Gross': 1087521.0,
  'US_DVD_Sales': nan,
  'Production_Budget': 1000000.0,
  'Release_Date': 'Oct 09 1998',
  'MPAA_Rating': 'R',
  'Running_Time_min': nan,
  'Distributor': 'Trimark',
  'Source': 'Original Screenplay',
  'Major_Genre': 'Drama',
  'Creative_Type': 'Contemporary Fiction',
  'Director': None,
  'Rotten_Tomatoes_Rating': 62.0,
  'IMDB_Rating': 3.4,
  'IMDB_Votes': 165.0}]

Let's store this as a new column in the DataFrame:

movies["full_dict"] = movies.to_dict(orient="records")
movies
Title US_Gross Worldwide_Gross US_DVD_Sales Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes full_dict
0 The Land Girls 146083.0 146083.0 NaN 8000000.0 Jun 12 1998 R NaN Gramercy None None None None NaN 6.1 1071.0 {'Title': 'The Land Girls', 'US_Gross': 146083...
1 First Love, Last Rites 10876.0 10876.0 NaN 300000.0 Aug 07 1998 R NaN Strand None Drama None None NaN 6.9 207.0 {'Title': 'First Love, Last Rites', 'US_Gross'...
2 I Married a Strange Person 203134.0 203134.0 NaN 250000.0 Aug 28 1998 None NaN Lionsgate None Comedy None None NaN 6.8 865.0 {'Title': 'I Married a Strange Person', 'US_Gr...
3 Let's Talk About Sex 373615.0 373615.0 NaN 300000.0 Sep 11 1998 None NaN Fine Line None Comedy None None 13.0 NaN NaN {'Title': 'Let's Talk About Sex', 'US_Gross': ...
4 Slam 1009819.0 1087521.0 NaN 1000000.0 Oct 09 1998 R NaN Trimark Original Screenplay Drama Contemporary Fiction None 62.0 3.4 165.0 {'Title': 'Slam', 'US_Gross': 1009819.0, 'Worl...

Now, let's define a movie_performance function that takes a movie's data dictionary, builds a prompt, and calls the llm_chat function to get a summary:

def movie_performance(movie_data):
    prompt = f"Considering the following data on this movie {movie_data}, provide a one-paragraph summary of its performance for my report."
    return llm_chat(prompt)

We will vectorize this function so it can be applied to the whole full_dict column:

import numpy as np

# Vectorize the function to apply it across the DataFrame
movie_performance_vec = np.vectorize(movie_performance)

Let's test our function with an example:

# Example usage
movie_performance("Name: Kene's Movie, Sales: 100,000 USD")
"Kene's Movie has demonstrated a strong performance in the market, achieving sales of 100,000 USD. This figure indicates a solid reception among audiences, reflecting effective marketing strategies and a compelling storyline that resonated with viewers. The sales performance not only highlights the film's commercial viability but also suggests potential for future success, whether through additional revenue streams such as merchandise, streaming rights, or international distribution. Overall, Kene's Movie can be considered a success within its category, contributing positively to its production team's portfolio."

Finally, we apply the vectorized function to generate a summary for each movie:

# Generate a summary for each movie
movies["llm_summary"] = movie_performance_vec(movies["full_dict"])

You can now save the DataFrame, with the generated summaries, to a CSV file:

# Save the results to a CSV file
movies.to_csv("movies_output.csv", index=False)

This approach lets you generate detailed summaries based on each movie's full data, which is useful for automated reporting and data analysis.
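Because each LLM call costs money and time, one practical refinement is to cache responses so that repeated inputs are only sent once. The sketch below uses a stand-in function rather than the real API; functools.lru_cache works here because the prompt is a plain (hashable) string:

```python
from functools import lru_cache

calls = 0  # track how many "API calls" the stand-in makes

@lru_cache(maxsize=None)
def cached_chat(message):
    # Stand-in for llm_chat; a real version would call the API here.
    global calls
    calls += 1
    return f"summary of: {message}"

cached_chat("Slam")
cached_chat("Slam")   # served from the cache, no second call
print(calls)  # → 1
```

Note that lru_cache keys on the function arguments, so this only works for hashable inputs such as strings, not raw dictionaries.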


24.11 Practice: Weather Summaries

Using the first 5 rows of the seattle_weather dataset from vega_datasets, create a function that takes all of a day's weather columns and generates a summary of that day's conditions. The function should use the LLM to produce a one-paragraph, report-ready summary based on the provided data. Store the summaries in a column called weather_summary.

weather = vd.data.seattle_weather().head()
weather
date precipitation temp_max temp_min wind weather
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle
1 2012-01-02 10.9 10.6 2.8 4.5 rain
2 2012-01-03 0.8 11.7 7.2 2.3 rain
3 2012-01-04 20.3 12.2 5.6 4.7 rain
4 2012-01-05 1.3 8.9 2.8 6.1 rain
# Your code here

24.12 Wrapping Up

In this tutorial, we covered the basics of using OpenAI's LLMs for text generation in Python, created helper functions, and applied those functions to datasets using vectorization.

In the next lesson, we will look at structured outputs, which let us specify the format of the responses we get from the LLM. We will use this to extract structured data from unstructured text, a common task in data analysis.