2.4 Using httpx - Python 3 Web Crawler Development in Practice (Part 1)

2.4 Using httpx

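In Python web crawler development, httpx is a powerful, easy-to-use HTTP client library. It offers both a synchronous and an asynchronous API, which makes it convenient to send HTTP requests, handle responses, and manage HTTP connections. Unlike requests, which is a third-party library (not part of the Python standard library) and offers only a synchronous API, httpx supports async natively. This makes it more efficient when handling large numbers of concurrent requests, and well suited to building high-performance crawlers or web services. This section covers the basic usage of httpx, its advanced features, and its use in web crawlers.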

2.4.1 Installation and Basic Configuration

First, install httpx. In a Python environment this takes a single pip command:

pip install httpx

Once it is installed, you can start sending HTTP requests with httpx. Its API is concise and closely modeled on that of requests, with async support added.

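2.4.2 Sending HTTP Requests

Synchronous requests

Although async is its headline feature, httpx fully supports synchronous requests. The following example sends a GET request: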

import httpx
# Send a GET request
response = httpx.get('https://httpbin.org/get')
# Inspect the response
print(response.status_code)  # HTTP status code
print(response.text)         # response body as text
print(response.json())       # parse the body as JSON (raises an error if it is not valid JSON)

A POST request is sent in the same way:

response = httpx.post('https://httpbin.org/post', json={'key': 'value'})
print(response.json())

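Asynchronous requests

The async API of httpx is provided by the AsyncClient class. Its request methods are coroutines, so they must be awaited inside an async function that is run by an event loop such as asyncio.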

import asyncio
import httpx
async def fetch_url(url):
    # The client is closed automatically when the async with block exits
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.text
# Run the coroutine; asyncio.run() (Python 3.7+) creates the event loop,
# runs the coroutine to completion, and closes the loop afterwards
url = 'https://httpbin.org/get'
response_text = asyncio.run(fetch_url(url))
print(response_text)

2.4.3 Advanced Features

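Timeouts and retries

httpx lets you set a timeout per request or per client, which helps with unstable network connections or a target site that is temporarily unavailable. Retry support is narrower: httpx has no RetryConfig class and no retries argument on httpx.get(). Retries are configured on the transport and only cover failed connection attempts, so retrying on error status codes requires your own logic or a third-party library.

import httpx
# Set a timeout
try:
    response = httpx.get('https://example.com', timeout=5.0)  # 5-second timeout
    print(response.status_code)
except httpx.TimeoutException:
    print("Request timed out")
# Retry failed connection attempts up to 3 times via the transport
transport = httpx.HTTPTransport(retries=3)
try:
    with httpx.Client(transport=transport) as client:
        response = client.get('https://example.com')
        print(response.status_code)
except httpx.HTTPError as e:
    print(f"Request failed: {e}")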

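HTTP proxies

httpx can send requests through an HTTP proxy, which is useful for working around IP blocks or for testing how an application behaves in a particular network environment.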

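import httpx
# Recent httpx releases take proxy=...; the proxies= dict used by older releases has been removed.
# To use a different proxy per URL scheme, mount one transport per scheme prefix.
proxy_mounts = {
    'http://': httpx.HTTPTransport(proxy='http://10.10.1.10:3128'),
    'https://': httpx.HTTPTransport(proxy='http://10.10.1.10:1080'),
}
with httpx.Client(mounts=proxy_mounts) as client:
    response = client.get('https://httpbin.org/ip')
    print(response.json())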

Custom headers and authentication

You can add custom headers to a request, including an Authorization header for token-based authentication.

headers = {
    'User-Agent': 'Custom User-Agent',
    'Authorization': 'Bearer YOUR_TOKEN_HERE'
}
response = httpx.get('https://api.example.com/data', headers=headers)
print(response.json())

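2.4.4 Use in Web Crawlers

The async capability of httpx matters most in web crawlers. It lets you have requests to many target URLs in flight at the same time, which can significantly speed up data collection.

Suppose you need to scrape several pages from one site. You can use asyncio.gather to run the async requests concurrently: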

import asyncio
import httpx
async def fetch_page(client, url):
    response = await client.get(url)
    return response.text
async def main():
    urls = ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3']
    # Share one client so that connections are pooled and reused across requests
    async with httpx.AsyncClient() as client:
        tasks = [fetch_page(client, url) for url in urls]
        results = await asyncio.gather(*tasks)
    for result in results:
        print(result[:100] + '...')  # print the first 100 characters of each page
# Run the main coroutine
asyncio.run(main())

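In addition, sharing a single AsyncClient across requests lets httpx reuse pooled connections, which helps keep a long-running or highly concurrent crawler stable and fast.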

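2.4.5 Summary

httpx is a feature-rich, easy-to-use HTTP client library with a lot to offer in web crawler development. With support for both synchronous and asynchronous requests, flexible request configuration (timeouts, retries, proxies, and so on), and a convenient async programming interface, it makes building efficient, stable crawlers simpler. Later chapters of this book, Python 3 Web Crawler Development in Practice (Part 1), continue to explore the advanced features of httpx and further ways to apply it in web crawling.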