您的位置:首页 > 汽车 > 时评 > Python爬虫(6) --深层爬取

Python爬虫(6) --深层爬取

2024/11/18 16:35:56 来源:https://blog.csdn.net/m0_74896766/article/details/140674463  浏览:    关键词:Python爬虫(6) --深层爬取

深层爬取

在前面几篇的内容中,我们都是爬取网页表面的信息,这次我们通过表层内容,深度爬取内部数据。

接着按照之前的步骤,我们先访问表层页面:

  1. 指定url
  2. 发送请求
  3. 获取你想要的数据
  4. 数据解析

我们试着将以下豆瓣读书页面的书籍进一步爬取:

https://book.douban.com/tag/%E4%BA%92%E8%81%94%E7%BD%91

在这里插入图片描述

在浏览器点击这本书,我们要通过这个页面进去这本书的详细页面爬取它的详细信息,它的详细页面链接在href标签中。

爬取

指定url

url = "https://book.douban.com/tag/%E4%BA%92%E8%81%94%E7%BD%91"

发送请求

import fake_useragent
import requests
head = {"User-Agent": fake_useragent.UserAgent().random
}
resp = requests.get(url, headers=head)

获取想要的数据

from lxml import etree
res_text = resp.text

数据解析

tree = etree.HTML(res_text)
a_list = tree.xpath("//ul[@class='subject-list']/li/div[2]/h2/a")

定位到位置之后,我们要取到具体的href标签中的链接,再次进行访问请求:

在这里插入图片描述

进入深层

爬取这个详细页面的内容:

for a in a_list:# 1、urlbook_url = "".join(a.xpath("./@href"))# 2、发送请求book_res = requests.get(book_url, headers=head)# 3、获取想要的信息book_text = book_res.text# 4、数据解析book_tree = etree.HTML(book_text)book_name = "".join(book_tree.xpath("//span[@property='v:itemreviewed']/text()"))author = "".join(book_tree.xpath("//div[@class='subject clearfix']/div[2]/span[1]/a/text()"))publish = "".join(book_tree.xpath("//div[@class='subject clearfix']/div[2]/a[1]/text()"))y = "".join(book_tree.xpath("//span[@class='pl' and text()='出版年:']/following-sibling::text()[1]"))page = "".join(book_tree.xpath("//span[@class='pl' and text()='页数:']/following-sibling::text()[1]"))price = "".join(book_tree.xpath("//span[@class='pl' and text()='定价:']/following-sibling::text()[1]"))bind = "".join(book_tree.xpath("//span[@class='pl' and text()='装帧:']/following-sibling::text()[1]"))isbn = "".join(book_tree.xpath("//span[@class='pl' and text()='ISBN:']/following-sibling::text()[1]"))

完整代码显示

# 通过表层内容 深度爬取内部数据
import time
import fake_useragent
import requests
from lxml import etreehead = {"User-Agent": fake_useragent.UserAgent().random
}if __name__ == '__main__':# 1、urlurl = "https://book.douban.com/tag/%E4%BA%92%E8%81%94%E7%BD%91"# 2、发送请求resp = requests.get(url, headers=head)time.sleep(5)  #请求时停留5秒,不然请求太快可能会被网页拒绝# 3、获取想要的数据res_text = resp.text# print(res_text)# 4、数据解析tree = etree.HTML(res_text)a_list = tree.xpath("//ul[@class='subject-list']/li/div[2]/h2/a")for a in a_list:time.sleep(3)# 1、urlbook_url = "".join(a.xpath("./@href"))# 2、发送请求book_res = requests.get(book_url, headers=head)# 3、获取想要的信息book_text = book_res.text# 4、数据解析book_tree = etree.HTML(book_text)book_name = "".join(book_tree.xpath("//span[@property='v:itemreviewed']/text()"))author = "".join(book_tree.xpath("//div[@class='subject clearfix']/div[2]/span[1]/a/text()"))publish = "".join(book_tree.xpath("//div[@class='subject clearfix']/div[2]/a[1]/text()"))y = "".join(book_tree.xpath("//span[@class='pl' and text()='出版年:']/following-sibling::text()[1]"))page = "".join(book_tree.xpath("//span[@class='pl' and text()='页数:']/following-sibling::text()[1]"))price = "".join(book_tree.xpath("//span[@class='pl' and text()='定价:']/following-sibling::text()[1]"))bind = "".join(book_tree.xpath("//span[@class='pl' and text()='装帧:']/following-sibling::text()[1]"))isbn = "".join(book_tree.xpath("//span[@class='pl' and text()='ISBN:']/following-sibling::text()[1]"))print(book_name, author, publish, y, page, price, bind, isbn)# print(a_list)pass

总结

其实与爬取视频的操作相差不大,先定位页面位置,再找到深层页面的链接,获取想要的信息。

版权声明:

本网仅为发布的内容提供存储空间,不对发表、转载的内容提供任何形式的保证。凡本网注明“来源:XXX网络”的作品,均转载自其它媒体,著作权归作者所有,商业转载请联系作者获得授权,非商业转载请注明出处。

我们尊重并感谢每一位作者,均已注明文章来源和作者。如因作品内容、版权或其它问题,请及时与我们联系,联系邮箱:809451989@qq.com,投稿邮箱:809451989@qq.com