Python Web Scraping for Beginners (Part 6): Using the urllib Library

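Preface
1. urllib overview
2. The urllib.request module
- 2.1 Sending a GET request
- 2.2 Sending a POST request
- 2.3 Adding headers
- 2.4 Handling exceptions
3. The urllib.error module
4. The urllib.parse module
- 4.1 Parsing URLs
- 4.2 URL encoding and decoding
- 4.3 Joining URLs
5. The urllib.robotparser module
6. Worked example: scraping the Douban Movie Top 250
7. urllib vs requests
8. Things to keep in mind
Summary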

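Preface

Welcome to the sixth article in the "Python Web Scraping for Beginners" series. Today we look at urllib, the package in Python's standard library for working with URLs.
urllib is Python's built-in HTTP request library, so it works without installing anything. It provides functions and classes for sending requests, handling responses, parsing URLs, and more. Many people now prefer the requests library, but urllib is still worth learning: it ships with every Python installation, it appears in a great deal of existing code, and it can be the better choice in some special situations, for example when third-party packages are not available.
In this article I walk through urllib's four main modules (request, error, parse, and robotparser) and show how each is used with practical code examples.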

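1. urllib overview

urllib is the collection of URL-handling modules in the Python standard library, so there is nothing to install with pip.

It consists of several modules:

urllib.request: opens and reads URLs
urllib.error: contains the exceptions raised by urllib.request
urllib.parse: parses URLs
urllib.robotparser: parses robots.txt files

Together these modules cover network requests and URL handling. The sections below go through the main features and usage of each one.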

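2. The urllib.request module

urllib.request is the most commonly used module in urllib. It provides functions and classes for opening URLs (mostly HTTP).

We can use it to send GET and POST requests the way a browser would.

2.1 Sending a GET request

Sending a GET request with urllib.request is simple: call the urlopen() function.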

import urllib.request
import gzip
import io

url = 'https://www.python.org/'
response = urllib.request.urlopen(url)

# Read the Content-Encoding response header
content_encoding = response.headers.get('Content-Encoding')

# Read the response body
data = response.read()

# Decompress the body if it is gzip-compressed
if content_encoding == 'gzip':
    buf = io.BytesIO(data)
    with gzip.GzipFile(fileobj=buf) as f:
        data = f.read()

print(data.decode('utf-8'))

This code opens the Python website and prints the page's HTML.
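GET requests often carry query parameters. As a minimal sketch, the snippet below builds such a URL with urllib.parse.urlencode() instead of concatenating strings by hand. The endpoint https://httpbin.org/get and the parameter names are assumptions chosen for illustration, and the request itself is left commented out so the snippet runs without network access.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, used only for illustration.
base_url = 'https://httpbin.org/get'
params = {'q': 'python urllib', 'page': 1}

# urlencode() escapes each value and joins the pairs with '&'.
query = urlencode(params)
url = f'{base_url}?{query}'
print(url)  # https://httpbin.org/get?q=python+urllib&page=1

# To actually send it, pass the URL to urlopen(); a timeout (in seconds)
# keeps the call from hanging on an unresponsive server:
# import urllib.request
# with urllib.request.urlopen(url, timeout=10) as response:
#     print(response.status)
```

Passing a timeout is optional, but it is usually a good idea in a crawler so that one slow server does not stall the whole run.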

2.2 Sending a POST request

Sending a POST request takes a little more work, because we need a Request object:

import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
data = urllib.parse.urlencode({'name': 'John', 'age': 25}).encode('utf-8')
req = urllib.request.Request(url, data=data, method='POST')
response = urllib.request.urlopen(req)

print(response.read().decode('utf-8'))

This code sends a POST request to httpbin.org with two form parameters, name and age.
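Many APIs expect a JSON body rather than form fields. The sketch below shows one way to build such a request with the same Request class. The payload is hypothetical, and the request is only inspected locally rather than sent, so the snippet runs offline.

```python
import json
import urllib.request

# Hypothetical payload; httpbin.org/post simply echoes back what it receives.
payload = {'name': 'John', 'age': 25}
body = json.dumps(payload).encode('utf-8')

req = urllib.request.Request(
    'http://httpbin.org/post',
    data=body,
    headers={'Content-Type': 'application/json'},
    method='POST',
)

# Inspect the request locally before sending anything.
print(req.get_method())                # POST
print(req.get_header('Content-type'))  # application/json
print(json.loads(req.data))            # {'name': 'John', 'age': 25}

# Sending it works the same way as the form example:
# response = urllib.request.urlopen(req, timeout=10)
```

The only differences from the form version are how the body is serialized and the Content-Type header that tells the server how to read it.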

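2.3 Adding headers

Real crawlers often need to add headers so that requests look like they come from a browser.

Headers can be passed in when the Request object is created: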

import urllib.request
import gzip
import io

url = 'https://www.python.org/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)

# Read the Content-Encoding response header
content_encoding = response.headers.get('Content-Encoding')

# Read the response body
data = response.read()

# Decompress the body if it is gzip-compressed
if content_encoding == 'gzip':
    buf = io.BytesIO(data)
    with gzip.GzipFile(fileobj=buf) as f:
        data = f.read()

# Try to decode the body as UTF-8
try:
    print(data.decode('utf-8'))
except UnicodeDecodeError:
    print("Cannot decode data with 'utf-8' encoding.")

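With the headers added, the script still prints the page's HTML. The difference is that the server now receives a browser-like User-Agent instead of urllib's default one.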
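To check which headers a request will carry, the Request object can be inspected before it is sent. This is a small sketch with a shortened, made-up User-Agent string. Note that Request normalizes header names with str.capitalize(), which is why the lookups use 'User-agent' rather than 'User-Agent'.

```python
import urllib.request

# Shortened, hypothetical User-Agent string for illustration.
req = urllib.request.Request(
    'https://www.python.org/',
    headers={'User-Agent': 'Mozilla/5.0 (example)'},
)

# Headers can also be added after construction.
req.add_header('Accept-Language', 'en-US,en;q=0.8')

# Request stores header names in capitalize() form,
# so 'User-Agent' is looked up as 'User-agent'.
print(req.has_header('User-agent'))       # True
print(req.get_header('User-agent'))       # Mozilla/5.0 (example)
print(req.get_header('Accept-language'))  # en-US,en;q=0.8
print(req.get_method())                   # GET (no data was supplied)
```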

2.4 Handling exceptions

Network requests can fail in many ways. We can handle those failures with a try-except statement:

import urllib.request
import urllib.error
import gzip
import io

try:
    # Send the request and get the response
    response = urllib.request.urlopen('https://www.python.org/')
    # Read the Content-Encoding response header
    content_encoding = response.headers.get('Content-Encoding')
    # Read the response body
    data = response.read()
    # Decompress the body if it is gzip-compressed
    if content_encoding == 'gzip':
        buf = io.BytesIO(data)
        with gzip.GzipFile(fileobj=buf) as f:
            data = f.read()
    # Try to decode the body as UTF-8
    print(data.decode('utf-8'))
except urllib.error.HTTPError as e:
    print(f'HTTPError: {e.code}, {e.reason}')
except urllib.error.URLError as e:
    print(f'URLError: {e.reason}')
except UnicodeDecodeError:
    print("Cannot decode data with 'utf-8' encoding.")

This code catches HTTPError and URLError, both of which are defined in the urllib.error module. HTTPError is handled first because it is a subclass of URLError: if the URLError clause came first, it would catch HTTP errors as well and the HTTPError clause would never run.

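3. The urllib.error module

The urllib.error module defines the exception classes that urllib.request can raise. There are two main ones:

URLError: the base class for the exceptions raised by urllib.request.
HTTPError: a subclass of URLError for errors on HTTP and HTTPS URLs, such as a 404 or 500 response. It carries the status code in its code attribute.

We already saw how to catch and handle these exceptions in the example above.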
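Because HTTPError is a subclass of URLError, any code that tells the two apart has to test for HTTPError first. The sketch below illustrates this with exceptions constructed by hand so that it runs offline. The describe() helper and the example URL are made up for illustration; in a real crawler these exceptions would be raised by urlopen().

```python
import io
import urllib.error
from email.message import Message


def describe(exc):
    """Return a short description, checking the more specific class first."""
    if isinstance(exc, urllib.error.HTTPError):
        return f'HTTP error {exc.code}: {exc.reason}'
    if isinstance(exc, urllib.error.URLError):
        return f'URL error: {exc.reason}'
    return f'Other error: {exc}'


# Exceptions built by hand for illustration; normally urlopen() raises them.
http_err = urllib.error.HTTPError(
    'https://example.com/missing', 404, 'Not Found', Message(), io.BytesIO()
)
url_err = urllib.error.URLError('connection refused')

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(describe(http_err))  # HTTP error 404: Not Found
print(describe(url_err))   # URL error: connection refused
```

The same ordering rule applies to except clauses: the HTTPError clause goes before the URLError clause.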

4. The urllib.parse module

The urllib.parse module provides utility functions for working with URLs: parsing, quoting, splitting, and joining.

4.1 Parsing URLs

from urllib.parse import urlparse

url = 'https://www.python.org/doc/?page=1#introduction'
parsed = urlparse(url)

print(parsed)
print(f'Scheme: {parsed.scheme}')
print(f'Netloc: {parsed.netloc}')
print(f'Path: {parsed.path}')
print(f'Params: {parsed.params}')
print(f'Query: {parsed.query}')
print(f'Fragment: {parsed.fragment}')

This code parses the URL and prints each of its components.
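Once a URL is parsed, the query string can be broken down further and the URL rebuilt with a changed parameter. Here is a brief sketch using parse_qs() and the named tuple's _replace() method. The extra lang=en parameter is an assumption added for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# 'lang=en' is a hypothetical extra parameter added for this example.
url = 'https://www.python.org/doc/?page=1&lang=en#introduction'
parsed = urlparse(url)

# parse_qs() returns a dict that maps each key to a list of values.
query = parse_qs(parsed.query)
print(query)  # {'page': ['1'], 'lang': ['en']}

# ParseResult is a named tuple, so _replace() returns a modified copy.
query['page'] = ['2']
next_page = parsed._replace(query=urlencode(query, doseq=True))
print(next_page.geturl())
# https://www.python.org/doc/?page=2&lang=en#introduction
```

This pattern is handy for pagination, where only one query parameter changes from page to page.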

4.2 URL encoding and decoding

When working with URLs, we often need to encode and decode parameters:

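from urllib.parse import urlencode, unquote_plus

params = {'name': 'John Doe', 'age': 30, 'city': 'New York'}
encoded = urlencode(params)
print(f'Encoded: {encoded}')

decoded = unquote_plus(encoded)
print(f'Decoded: {decoded}')

urlencode() turns a dictionary into a URL-encoded string. By default it encodes spaces as '+', so unquote_plus() is used to decode here; plain unquote() would leave the '+' signs in place.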
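For single values rather than whole dictionaries, urllib.parse also offers quote() and quote_plus() together with their matching decoders. The sketch below encodes a made-up string containing Chinese characters and a space, to show how the two pairs differ.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = '电影 top'  # hypothetical sample text

# quote() percent-encodes UTF-8 bytes and writes a space as %20.
path_part = quote(text)
print(path_part)  # %E7%94%B5%E5%BD%B1%20top

# quote_plus() writes a space as '+', the form used in query strings.
query_part = quote_plus(text)
print(query_part)  # %E7%94%B5%E5%BD%B1+top

# Each decoder reverses its matching encoder.
print(unquote(path_part) == text)        # True
print(unquote_plus(query_part) == text)  # True

# unquote() does not touch '+', so mixing the pairs changes the text.
print(unquote(query_part))  # 电影+top
```

As a rule of thumb, quote() suits path segments and quote_plus() suits query-string values.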

4.3 Joining URLs

from urllib.parse import urljoin

base_url = 'https://www.python.org/doc/'
relative_url = 'tutorial/index.html'
full_url = urljoin(base_url, relative_url)

print(full_url)

urljoin() combines a base URL and a relative URL into a complete URL.
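urljoin() follows the same resolution rules a browser applies to links, which leads to a few results that can be surprising at first. The cases below are a short illustration; the relative paths are arbitrary examples.

```python
from urllib.parse import urljoin

base = 'https://www.python.org/doc/'

# A relative path is resolved against the base "directory".
print(urljoin(base, 'tutorial/index.html'))
# https://www.python.org/doc/tutorial/index.html

# Without the trailing slash, 'doc' counts as a file name and is replaced.
print(urljoin('https://www.python.org/doc', 'tutorial/index.html'))
# https://www.python.org/tutorial/index.html

# A leading slash restarts from the site root.
print(urljoin(base, '/about/'))  # https://www.python.org/about/

# '..' climbs one level up.
print(urljoin(base, '../downloads/'))  # https://www.python.org/downloads/

# An absolute URL replaces the base entirely.
print(urljoin(base, 'https://docs.python.org/3/'))  # https://docs.python.org/3/
```

In a crawler this is typically used to turn the href values found in a page into absolute URLs, with the page's own URL as the base.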

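5. The urllib.robotparser module

The urllib.robotparser module provides the RobotFileParser class for parsing robots.txt files.

robots.txt is the file a website uses to tell crawlers which pages they may fetch and which they may not.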

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.python.org/robots.txt')
rp.read()

print(rp.can_fetch('*', 'https://www.python.org/'))
print(rp.can_fetch('*', 'https://www.python.org/admin/'))

This code reads the robots.txt file of the Python website and then checks whether fetching certain URLs is allowed.
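RobotFileParser can also parse rules that are already in memory, which makes it easy to experiment without touching the network. The robots.txt content and the example.com URLs below are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration; not taken from a real site.
robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() takes an iterable of lines

print(rp.can_fetch('*', 'https://example.com/index.html'))   # True
print(rp.can_fetch('*', 'https://example.com/admin/users'))  # False
print(rp.crawl_delay('*'))                                   # 2
```

crawl_delay() returns None when the file has no Crawl-delay rule for the given user agent, so a crawler would normally fall back to its own default delay in that case.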

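6. Worked example: scraping the Douban Movie Top 250

Now let's put what we have learned to use and write a real crawler that collects information from the Douban Movie Top 250.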

import urllib.request
import urllib.error
from bs4 import BeautifulSoup


def get_movie_info(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    try:
        req = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(req)
        html = response.read().decode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        movie_list = soup.find('ol', class_='grid_view')
        for movie_li in movie_list.find_all('li'):
            rank = movie_li.find('em').string
            title = movie_li.find('span', class_='title').string
            rating = movie_li.find('span', class_='rating_num').string
            # Not every movie has a one-line quote
            inq = movie_li.find('span', class_='inq')
            quote = inq.string if inq else "N/A"
            print(f"Rank: {rank}")
            print(f"Title: {title}")
            print(f"Rating: {rating}")
            print(f"Quote: {quote}")
            print('-' * 50)
    except urllib.error.URLError as e:
        # HTTPError has both 'code' and 'reason', so check 'code' first
        if hasattr(e, 'code'):
            print(f"The server couldn't fulfill the request. Error code: {e.code}")
        elif hasattr(e, 'reason'):
            print(f'Failed to reach the server. Reason: {e.reason}')


# Scrape the first 5 pages
for i in range(5):
    url = f'https://movie.douban.com/top250?start={i*25}'
    get_movie_info(url)

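This crawler fetches the first 5 pages of the Douban Movie Top 250, 25 movies per page, for 125 movies in total. It sends requests with urllib.request as covered above, parses the HTML with BeautifulSoup (from the third-party beautifulsoup4 package), and handles the exceptions that may occur.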

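7. urllib vs requests

Although urllib is part of the standard library, many developers prefer the requests library in day-to-day work.

The reasons are:

Ease of use: the requests API is designed to be friendlier, and is more intuitive and simpler to use.
More features: requests takes care of many things that urllib leaves to you, such as keeping a session and handling cookies.
Exception handling: exception handling in requests is more intuitive and consistent.

Still, urllib has advantages of its own as a standard-library module:

No installation: as part of the standard library, urllib works without installing anything.
Low-level control: urllib exposes more low-level operations, which can be an advantage in some special cases.

In most cases, if your project can use third-party libraries, requests is probably the better choice. urllib is still worth knowing, because it is a basic part of Python network programming and can be more useful in some special cases.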

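8. Things to keep in mind

When using urllib for web scraping, keep these points in mind:

Respect robots.txt: parse the robots.txt file with urllib.robotparser and follow the site's crawling rules.
Set a suitable User-Agent: include a suitable User-Agent in the headers so the site does not identify the request as a crawler and block it.
Throttle your crawl: add reasonable delays so you don't put too much load on the target site.
Handle exceptions: handle network errors and HTTP errors properly.
Decode responses: decode response content correctly and account for different character encodings.
Encode URLs: URL-encode parameters correctly when building URLs.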
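Several of these points can be sketched in a few lines. The example below is a rough illustration, not a complete crawler: page_urls() and decode_body() are hypothetical helpers, the delay value is arbitrary, and the response headers and body are simulated so that the code runs offline. With a real response, response.headers offers the same get_content_charset() method.

```python
import time
from email.message import Message
from urllib.parse import urlencode


def page_urls(base_url, pages, page_size=25):
    """Yield one URL per page, URL-encoding the 'start' offset."""
    for i in range(pages):
        query = urlencode({'start': i * page_size})
        yield f'{base_url}?{query}'


def decode_body(data, headers, default='utf-8'):
    """Decode bytes with the charset from Content-Type, else a default."""
    charset = headers.get_content_charset() or default
    return data.decode(charset, errors='replace')


# Simulated response headers and body, so the sketch runs offline.
headers = Message()
headers['Content-Type'] = 'text/html; charset=gbk'
body = '电影'.encode('gbk')
print(decode_body(body, headers))  # 电影

DELAY_SECONDS = 0.1  # hypothetical value; real crawls usually wait longer
for url in page_urls('https://movie.douban.com/top250', 3):
    print(url)  # the request itself would go here
    time.sleep(DELAY_SECONDS)
```

Using errors='replace' keeps the crawler running when a page contains a few bytes that do not match the declared charset, at the cost of replacing those characters.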

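Summary

In this article we covered how to use urllib from the Python standard library: sending GET and POST requests, handling exceptions, parsing and building URLs, and parsing robots.txt files. We then applied all of it in a real crawler.
Although requests is more popular in day-to-day development, urllib is still well worth mastering. It is a basic part of Python network programming, and it can be the better tool in some special cases.
I hope this article has given you a deeper understanding of urllib and that you can put it to good use in your own scraping projects. Whatever tools you use, follow the ethical norms of web scraping and respect the rules of the sites you visit and the rights of other users.

If you have any questions or ideas, feel free to get in touch.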
