
Contents

Web scraping
- Requests
- BeautifulSoup
- Pandas
- Selenium

Web scraping

A scraper's work can be split into three stages: request, parse, and store.
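The three stages can be sketched with the standard library alone, so the shape of the pipeline is visible before any third-party package is installed. This is a minimal sketch, not the article's method: the rest of the article uses Requests, BeautifulSoup, and Pandas for these stages, and the names `fetch`, `parse`, `store`, `LinkParser`, and `sample_html` are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Request: download a page and decode it as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


class LinkParser(HTMLParser):
    """Parse: collect (href, text) for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def store(rows, fileobj):
    """Store: write the parsed rows as CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(["href", "text"])
    writer.writerows(rows)


# Offline demo: an inline HTML string stands in for fetch(url)
sample_html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
rows = parse(sample_html)
buffer = io.StringIO()
store(rows, buffer)
print(buffer.getvalue())
```

For a live page, `rows = parse(fetch(url))` would replace the inline string, and `store` could be handed an open file instead of the in-memory buffer.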

Requests

For static pages, Requests can fetch the page HTML directly.

The main things to know here are the request and response objects.

Usage

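pip install requests

import requests as rq

url = ""      # target URL
data = {}     # request parameters
headers = {}  # request headers

# Send the request with parameters
response = rq.get(url, params=data, headers=headers)
response = rq.post(url, data=data, headers=headers)

# Commonly used attributes of the response object
print(response.encoding)
response.encoding = "utf-8"  # switch the encoding to UTF-8
print(response.status_code)  # status code
print(response.url)          # requested URL
print(response.headers)      # response headers
print(response.cookies)      # cookies
print(response.text)         # page source as a string
print(response.content)      # page source as bytes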
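To see what the request object actually carries, the request can be prepared without being sent. The sketch below assumes a hypothetical endpoint (`https://example.com/search`), made-up parameters, and a made-up User-Agent string. The `fetch` helper is also an illustration, showing one reasonable way to add a timeout and status check.

```python
import requests

# Hypothetical endpoint, parameters, and headers, for illustration only
url = "https://example.com/search"
params = {"q": "python", "page": 1}
headers = {"User-Agent": "my-crawler/0.1"}

# Build the request without sending it, to inspect the final URL and headers
prepared = requests.Request("GET", url, params=params, headers=headers).prepare()
print(prepared.method, prepared.url)


def fetch(url, params=None, headers=None):
    """Send a GET request and return the decoded page source."""
    response = requests.get(url, params=params, headers=headers, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    response.encoding = response.apparent_encoding  # encoding guessed from the body
    return response.text
```

Nothing touches the network until `fetch` is called, so the prepared request is a cheap way to check how `params` end up in the query string.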

BeautifulSoup

A library that turns HTML into Python objects.

Usage

The main things to know are the BeautifulSoup object and the Tag object.

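pip install beautifulsoup4

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # build the BeautifulSoup object
soup.prettify()  # pretty-print the HTML

# Only the lookup methods are covered here, although BeautifulSoup can also modify the tree

# Commonly used method on the BeautifulSoup object:
# find_all(name, attrs, recursive, string, **kwargs)
# name accepts a tag name or a regular expression; attrs filters by tag attributes

# Commonly used attributes and methods on Tag objects
tag.contents  # child nodes as a list; the IDE offers no completion on Tag objects, but contents and children do exist
for child in title_tag.children:  # iterate over the tag's child nodes
    print(child)
tag['class']  # value of the given attribute
tag.text      # text inside the tag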
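The lookup calls listed above can be tried on a small inline document. This is a minimal sketch; the HTML in `html_doc` and the class names in it are invented for the example.

```python
import re

from bs4 import BeautifulSoup

# Invented sample document
html_doc = """
<html><head><title>Demo page</title></head>
<body>
<p class="title"><b>Job list</b></p>
<a class="job" href="/jobs/1" id="link1">Engineer</a>
<a class="job" href="/jobs/2" id="link2">Designer</a>
<a class="other" href="/about" id="link3">About</a>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find_all by tag name plus an attribute filter
job_links = soup.find_all("a", attrs={"class": "job"})
titles = [a.text for a in job_links]
hrefs = [a["href"] for a in job_links]

# find_all with a regular expression on the tag name (matches body and b)
by_regex = soup.find_all(re.compile("^b"))

# Tag attributes: class is multi-valued, so it comes back as a list
first_p = soup.find("p")
p_classes = first_p["class"]
p_children = first_p.contents

print(titles, hrefs, p_classes)
```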

Pandas

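A library that makes data processing and analysis easier.

Usage

The main objects are DataFrame and Series. pandas.ExcelWriter is the pandas class for writing DataFrame objects to Excel files. With ExcelWriter you can control the write process, for example by choosing the engine, setting the date format, and choosing the write mode (overwrite or append).

Reference: https://gairuo.com/p/pandas-data-cleaning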

pip install pandas

import pandas as pd

# Build sample data
data_existing = {
    'Name': ['Zhao Liu', 'Zhou Qi'],
    'Age': [28, 32],
    'City': ['Beijing', 'Shanghai'],
}
df_existing = pd.DataFrame(data_existing)

data_new = {
    'Name': ['Sun Ba', 'Wu Jiu'],
    'Age': [26, 29],
    'City': ['Guangzhou', 'Shenzhen'],
}
df_new = pd.DataFrame(data_new)

# First, create an Excel file containing the initial data
with pd.ExcelWriter('output_combined.xlsx', mode='w', engine='openpyxl') as writer:
    df_existing.to_excel(writer, sheet_name='People', index=False)

# Then use ExcelWriter to append the new data to the same worksheet
with pd.ExcelWriter('output_combined.xlsx', mode='a', engine='openpyxl', if_sheet_exists='overlay') as writer:
    # In append mode writer.sheets already maps existing sheet names to worksheets;
    # recent pandas versions do not allow assigning to writer.sheets
    startrow = writer.sheets['People'].max_row
    df_new.to_excel(writer, sheet_name='People', index=False, header=False, startrow=startrow)
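Since DataFrame and Series are the two objects to know, here is a minimal sketch of how scraped records typically move through them. The records, column names, and the choice of CSV output are assumptions made for the example.

```python
import pandas as pd

# Invented records, as a parser might return them (note the duplicate row)
records = [
    {"name": "Zhao Liu", "age": 28, "city": "Beijing"},
    {"name": "Zhou Qi", "age": 32, "city": "Shanghai"},
    {"name": "Zhou Qi", "age": 32, "city": "Shanghai"},
]

# A list of dicts becomes a DataFrame; drop exact duplicate rows
df = pd.DataFrame(records).drop_duplicates().reset_index(drop=True)

# Selecting one column yields a Series
ages = df["age"]
print(ages.mean())

# Boolean indexing filters rows
older = df[df["age"] > 30]

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)
```

Writing to CSV avoids the extra openpyxl dependency; the ExcelWriter approach is the better fit when several sheets or appends to an existing workbook are needed.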

Selenium

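A framework for automating browsers. It works well for scraping dynamic pages whose data is loaded through Ajax requests.

Official documentation: https://www.selenium.dev/zh-cn/documentation/webdriver/getting_started/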

Installation

1. pip install selenium

2. Download the driver that matches your browser

https://developer.chrome.google.cn/docs/chromedriver/downloads?hl=zh-cn

Drivers for versions above 114:

https://googlechromelabs.github.io/chrome-for-testing/

Usage

The key pieces are the WebDriver class, Options, and Service.

Basic operations

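Page load strategies:

Strategy	Ready state	Notes
normal	complete	Default; waits for all resources to download
eager	interactive	DOM access is ready, but other resources such as images may still be loading
none	Any	Does not block WebDriver at all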

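from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Browser options
options = webdriver.ChromeOptions()
options.page_load_strategy = 'normal'  # page load strategy
options.add_argument("--headless")  # headless mode (no browser window)
options.add_experimental_option("detach", True)  # keep the browser open after the script finishes

# Browser service (three alternatives; log_path is the path of a log file you choose)
service = webdriver.ChromeService(port=1234)
service = webdriver.ChromeService(log_output=log_path)  # write driver logs to a file
service = webdriver.ChromeService(service_args=['--append-log', '--readable-timestamp'], log_output=log_path)

# Waits
driver = webdriver.Chrome(options=options, service=service)
driver.implicitly_wait(10)  # implicit wait, set globally; 10 seconds is an example value
driver.get('http://www.baidu.com')
# driver is the WebDriver object, 10 is the maximum wait in seconds, and 0.5 means the element is polled every 0.5 seconds.
# until() takes the condition to wait for; EC supplies the conditions, here checking that the element is present in the page DOM.
login_btn = WebDriverWait(driver, 10, 0.5).until(EC.presence_of_element_located((By.ID, "s-top-loginbtn")))
login_btn.click()

# Element operations
# driver.find_element(by, value)  the find_* methods locate an element
# element.is_displayed()
# element.is_enabled()
# element.is_selected()
# element.get_attribute(name)     read an attribute value
# element.text                    read the element's text

# End the session
driver.quit()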

Cookie handling

import json
import selenium.webdriver as webdriver

# Fetch the cookies and save them locally
driver = webdriver.Chrome()
driver.get("http://www.baidu.com")
cookies = driver.get_cookies()
with open("cookie.txt", "w") as f:
    json.dump(cookies, f)

# Send a request that carries the saved cookies
with open("cookie.txt", "r") as f:
    existCookies = json.load(f)
for cookie in existCookies:
    driver.add_cookie(cookie)  # cookies can only be added one dict at a time
driver.get("http://www.baidu.com")
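The save and restore steps can be wrapped in two small helpers. This is a sketch under assumptions: `save_cookies`, `load_cookies`, and `StubDriver` are hypothetical names, the stub exposes only `get_cookies` and `add_cookie` so the round trip runs without a browser, and the cookie values are invented.

```python
import json
import os
import tempfile


def save_cookies(driver, path):
    """Dump the session's cookies (a list of dicts) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(driver.get_cookies(), f)


def load_cookies(driver, path):
    """Add saved cookies back one dict at a time; return how many were added."""
    with open(path, "r", encoding="utf-8") as f:
        cookies = json.load(f)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)


class StubDriver:
    """Hypothetical stand-in with the two cookie methods used above."""

    def __init__(self, cookies=None):
        self._cookies = list(cookies or [])

    def get_cookies(self):
        return list(self._cookies)

    def add_cookie(self, cookie):
        self._cookies.append(cookie)


cookie_path = os.path.join(tempfile.mkdtemp(), "cookie.json")

# Invented cookie, shaped like an entry returned by get_cookies()
first_session = StubDriver(
    [{"name": "session_id", "value": "abc123", "domain": "example.com", "path": "/"}]
)
save_cookies(first_session, cookie_path)

second_session = StubDriver()
restored = load_cookies(second_session, cookie_path)
print(restored, second_session.get_cookies())
```

With a real WebDriver, the browser has to be on the cookie's domain before `add_cookie` is called, so a fresh session would call `driver.get(...)` on that site first, load the cookies, and then reload the page.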

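P.S. If no browser window appears after startup and the console shows nothing, there are two things to try: first check whether the driver version matches the browser, and if it does, switch to selenium 4.1.1 and try again.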