2.1 Using urllib - Python 3 Web Crawler Development in Practice (Part 1)

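2.1 Using urllib

urllib is a package in the Python standard library for working with URLs (Uniform Resource Locators) and the network operations around them, such as sending requests and handling responses. Because it ships with Python, it needs no extra installation, which makes it a good fit for basic crawler development. This section covers three of its core modules, urllib.request, urllib.parse and urllib.error, and uses examples to show how they provide the basic building blocks of a web crawler.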

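2.1.1 urllib.request: Sending Requests

urllib.request is the urllib module for opening and reading URLs. It offers a high-level interface for sending HTTP requests and receiving responses, and it is one of the modules you will use most when building a crawler.

Basic requests

The simplest approach is the urllib.request.urlopen() function. It takes a URL and, for HTTP and HTTPS URLs, returns an HTTPResponse object that carries the response metadata (status code, headers) and the response body.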

from urllib.request import urlopen
# Send the request
response = urlopen('http://example.com')
# Read the response body and decode it to text
html = response.read().decode('utf-8')
print(html)
# Close the response explicitly when not using a with statement
response.close()
# Or let a with statement close the response automatically
with urlopen('http://example.com') as response:
    html = response.read().decode('utf-8')
    print(html)
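
The examples above hard-code UTF-8 when decoding. As a minimal sketch, the helper below instead reads the charset declared in the Content-Type header through response.headers.get_content_charset() and falls back to a default when none is declared. The name read_text and the default_charset parameter are illustrative assumptions, not part of urllib, and the data: URL in the usage line is there only so the snippet runs without network access.

```python
from urllib.request import urlopen


def read_text(url, default_charset='utf-8'):
    """Fetch a URL and decode the body with the charset the response declares."""
    with urlopen(url) as response:
        # response.headers is an email.message.Message; get_content_charset()
        # returns the charset from Content-Type, or None if none is declared.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)


# urlopen also understands data: URLs, which allows an offline demonstration.
print(read_text('data:text/plain;charset=utf-8,hello'))
```

For a real page you would pass an http:// or https:// URL instead; pages that declare their encoding only inside the HTML are not handled by this sketch.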

Adding HTTP headers

Crawlers often need to imitate a browser when sending requests. To do that, attach HTTP header information through a Request object.

from urllib.request import Request, urlopen
url = 'http://example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
request = Request(url, headers=headers)
with urlopen(request) as response:
    html = response.read().decode('utf-8')
    print(html)
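
A Request object can be inspected before it is sent, which helps when debugging headers. The sketch below builds a request without touching the network. The helper name build_request, the demo User-Agent string and the Accept-Language value are assumptions made for illustration. Note that Request normalises header names with str.capitalize(), so 'User-Agent' is stored as 'User-agent'.

```python
from urllib.request import Request


def build_request(url, user_agent, extra_headers=None):
    """Create a Request carrying a User-Agent plus optional extra headers."""
    request = Request(url, headers={'User-Agent': user_agent})
    for name, value in (extra_headers or {}).items():
        # add_header() stores the name in capitalize() form, e.g. 'Accept-language'
        request.add_header(name, value)
    return request


request = build_request(
    'http://example.com',
    'Mozilla/5.0 (compatible; demo)',
    {'Accept-Language': 'en-US'},
)
print(request.get_method())              # GET, because no request body was given
print(request.get_header('User-agent'))  # look up with the normalised name
print(request.header_items())
```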

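Handling redirects

By default, urlopen() follows HTTP redirects automatically. urllib.request has no simple switch to turn this off. Instead, build an opener with an HTTPRedirectHandler subclass whose redirect_request() method returns None. The 3xx response is then raised as an HTTPError.

from urllib.request import HTTPRedirectHandler, build_opener
from urllib.error import HTTPError

class NoRedirectHandler(HTTPRedirectHandler):
    # Returning None tells urllib not to follow the redirect
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

url = 'http://example.com/redirect'
opener = build_opener(NoRedirectHandler)
try:
    with opener.open(url) as response:
        # Reached only when the server did not redirect
        print(response.status)
except HTTPError as e:
    # With redirects disabled, a 3xx response surfaces here
    print(e.code, e.headers.get('Location'))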

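2.1.2 urllib.parse: Parsing URLs

The urllib.parse module provides functions for parsing and building URLs, which is useful when dealing with complex URL structures or generating new URLs.

Parsing URLs

The urlparse() function splits a complete URL into its components: scheme, network location, path, parameters, query and fragment.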

from urllib.parse import urlparse
url = 'http://user:pass@www.example.com:80/path;param?query=string#fragment'
parsed_url = urlparse(url)
print(parsed_url.scheme)  # http
print(parsed_url.netloc)  # user:pass@www.example.com:80
print(parsed_url.path)    # /path
print(parsed_url.params)  # param
print(parsed_url.query)   # query=string
print(parsed_url.fragment) # fragment
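
Beyond the six fields printed above, the result of urlparse() also exposes the pieces of netloc as separate attributes. A short sketch using the same URL:

```python
from urllib.parse import urlparse

url = 'http://user:pass@www.example.com:80/path;param?query=string#fragment'
parts = urlparse(url)

# These attributes are derived from netloc
print(parts.username)  # user
print(parts.password)  # pass
print(parts.hostname)  # www.example.com
print(parts.port)      # 80 (an int, or None when the URL has no port)
```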

Building URLs

urlunparse() is the inverse of urlparse(): it takes a ParseResult, or any iterable of six components, and assembles them back into a URL string.

from urllib.parse import ParseResult, urlunparse
data = ('http', 'www.example.com', '/path', 'param', 'query=string', 'fragment')
parsed_result = ParseResult(*data)
new_url = urlunparse(parsed_result)
print(new_url)  # http://www.example.com/path;param?query=string#fragment
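
In crawler code, URLs are more often assembled from query parameters or from relative links found in a page. The following sketch uses urlencode(), urlunparse() and urljoin() from urllib.parse; the /search path, the parameter names and the page URLs are made-up examples.

```python
from urllib.parse import urlencode, urljoin, urlunparse

# Encode a dict as a query string, then assemble the six URL components
query = urlencode({'q': 'python urllib', 'page': 2})
search_url = urlunparse(('https', 'www.example.com', '/search', '', query, ''))
print(search_url)  # https://www.example.com/search?q=python+urllib&page=2

# Resolve links found in a page against the URL of that page
page_url = 'http://example.com/a/b.html'
print(urljoin(page_url, '../c.html'))  # http://example.com/c.html
print(urljoin(page_url, '/d.html'))    # http://example.com/d.html
```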

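2.1.3 urllib.error: Handling Exceptions

Exception handling is an essential part of network programming. The urllib.error module defines the exceptions that urllib.request raises, chiefly URLError and HTTPError.

URLError

URLError is the base class of the exceptions in urllib.error. It typically signals low-level failures such as a failed network connection, a host name that cannot be resolved or an unsupported URL scheme.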

from urllib.request import urlopen
from urllib.error import URLError
try:
    response = urlopen('http://invalid-url.com')
except URLError as e:
    print(e.reason)

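HTTPError

HTTPError is a subclass of URLError raised for HTTP error responses such as 404 (Not Found) or 500 (Internal Server Error). Because it is a subclass, catch it before URLError.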

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
try:
    response = urlopen('http://example.com/nonexistent-page')
except HTTPError as e:
    print(f'HTTP Error: {e.code} {e.reason}')
except URLError as e:
    print(f'URL Error: {e.reason}')
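
Because HTTPError inherits from URLError, the order of the checks matters: the more specific class has to come first. The sketch below makes that relationship visible without any network access by constructing the exceptions directly; describe_error is a hypothetical helper name, and the URL and messages are illustrative values.

```python
from urllib.error import HTTPError, URLError


def describe_error(exc):
    """Return a short description, checking the most specific class first."""
    if isinstance(exc, HTTPError):
        # An HTTPError would also match URLError, so it must be tested first
        return f'HTTP Error: {exc.code} {exc.reason}'
    if isinstance(exc, URLError):
        return f'URL Error: {exc.reason}'
    return f'Unexpected error: {exc!r}'


print(issubclass(HTTPError, URLError))  # True
# HTTPError(url, code, msg, hdrs, fp): headers and body are omitted here
print(describe_error(HTTPError('http://example.com/missing', 404, 'Not Found', None, None)))
print(describe_error(URLError('connection refused')))
```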

2.1.4 In Practice: Fetching Web Page Data

Putting the pieces together, we can write a simple crawler that fetches the content of a given web page.

from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
def fetch_url(url, headers=None):
    """
    Send an HTTP request and return the response body.
    :param url: target URL
    :param headers: request headers (dict), defaults to None
    :return: response body as a string, or None on failure
    """
    try:
        # Request expects a dict, so fall back to an empty one when headers is None
        request = Request(url, headers=headers or {})
        with urlopen(request) as response:
            return response.read().decode('utf-8')
    except HTTPError as e:
        print(f'HTTP Error: {e.code} {e.reason}')
        return None
    except URLError as e:
        print(f'URL Error: {e.reason}')
        return None
# Usage example
url = 'http://example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
html = fetch_url(url, headers)
if html:
    print(html[:1000] + '...')  # Print only the first 1000 characters to keep the output short

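That covers the basic use of urllib in Python crawler development. With these fundamentals you can build crawlers that fetch pages, parse URLs and recover from network errors. Keep in mind that a crawler should follow each site's robots.txt rules, respect the site's data copyright and terms of use, and avoid placing unnecessary load on its servers.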
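
The robots.txt check mentioned above can also be done with the standard library, through urllib.robotparser. The following is a minimal sketch that assumes the rules are already available as text; the rules and URLs are invented for illustration. In a real crawler you would typically call set_url() and read() on the parser to download the site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so the example runs offline
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) reports whether the rules allow the request
print(parser.can_fetch('*', 'http://example.com/index.html'))         # True
print(parser.can_fetch('*', 'http://example.com/private/data.html'))  # False
```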