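In the previous articles we introduced several powerful HTML parsing tools: BeautifulSoup, XPath, and PyQuery. All of them extract data based on the structure of the HTML. Sometimes, however, the text we need to process has no clean structure, or we only care about strings in a particular format. In those cases regular expressions are a very effective tool. This article shows how to use Python's re module and regular expressions to extract data from web pages.

I. Introduction to Regular Expressions

A regular expression (regex for short) is a tool for matching and searching text by pattern. It uses a dedicated syntax to describe string patterns, which can be used to:

Search: find text that fits a pattern
Match: check whether a piece of text fits a pattern
Extract: pull the parts that fit a pattern out of the text
Replace: replace the parts of the text that fit a pattern

In web scraping, regular expressions are especially well suited to extracting data with a uniform format, such as email addresses, phone numbers, URLs, and product prices.

II. Python re Module Basics

Python's re module provides the interface for regular expression operations. The most commonly used functions are: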

import re

# Sample text
text = "Contact us: contact@example.com or call 400-123-4567"

# 1. re.search() - find the first match
# (inside [...] a "|" is a literal character, so the TLD class is [A-Za-z])
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
email_match = re.search(email_pattern, text)
if email_match:
    print(f"Email found: {email_match.group()}")

# 2. re.findall() - find all matches
phone_pattern = r'\d{3}-\d{3}-\d{4}'
phones = re.findall(phone_pattern, text)
print(f"Phones found: {phones}")

# 3. re.sub() - replace
masked_text = re.sub(email_pattern, '[email hidden]', text)
print(f"After substitution: {masked_text}")

# 4. re.split() - split
words = re.split(r'\s+', text)
print(f"After splitting: {words}")

# 5. re.compile() - compile the pattern once and reuse it
email_regex = re.compile(email_pattern)
email_match = email_regex.search(text)
print(f"Using the compiled pattern: {email_match.group()}")

Output:

Email found: contact@example.com
Phones found: ['400-123-4567']
After substitution: Contact us: [email hidden] or call 400-123-4567
After splitting: ['Contact', 'us:', 'contact@example.com', 'or', 'call', '400-123-4567']
Using the compiled pattern: contact@example.com

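Important re module functions and methods

Function/method	Description
`re.search(pattern, string)`	Scans the string for the first match; returns a Match object or None
`re.match(pattern, string)`	Matches only at the start of the string; returns a Match object or None
`re.findall(pattern, string)`	Returns a list of all matches (a list of groups if the pattern contains capture groups)
`re.finditer(pattern, string)`	Returns an iterator over all matches; each item is a Match object
`re.sub(pattern, repl, string)`	Replaces every match and returns the new string
`re.split(pattern, string)`	Splits the string at each match and returns a list
`re.compile(pattern)`	Compiles the pattern and returns a reusable Pattern object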
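The difference between matching at the start and searching anywhere is easy to miss, so here is a minimal sketch of it. The sample sentence is made up for illustration, and `re.fullmatch()` is an addition that does not appear in the table above: it is the standard-library function that requires the whole string to fit the pattern.

```python
import re

# Hypothetical sample text, used only for illustration
text = "Order 2023-07-15 shipped, order 2023-08-01 pending"
date_pattern = r"\d{4}-\d{2}-\d{2}"

# re.match() only tries at position 0, so it finds nothing here
at_start = re.match(date_pattern, text)

# re.search() scans forward and returns the first match
first = re.search(date_pattern, text)

# re.fullmatch() succeeds only if the entire string fits the pattern
whole = re.fullmatch(date_pattern, "2023-07-15")
partial = re.fullmatch(date_pattern, "2023-07-15 shipped")

print(at_start)
print(first.group())
print(whole is not None, partial is not None)
```

For validation tasks `re.fullmatch()` is usually the safer choice, because `re.match()` accepts any string that merely starts with a valid prefix.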

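Common Match object methods

re.search(), re.match(), and each item produced by re.finditer() give you a Match object, which has the following commonly used methods:

import re

text = "Product ID: ABC-12345, price: ¥199.99"
pattern = r'(\w+)-(\d+)'
match = re.search(pattern, text)
if match:
    print(f"Full match: {match.group()}")      # the whole match
    print(f"Group 1: {match.group(1)}")        # contents of the first parentheses
    print(f"Group 2: {match.group(2)}")        # contents of the second parentheses
    print(f"All groups: {match.groups()}")     # tuple of all groups
    print(f"Match start: {match.start()}")     # index where the match starts
    print(f"Match end: {match.end()}")         # index just past the end of the match
    print(f"Match span: {match.span()}")       # (start, end) tuple

Output:

Full match: ABC-12345
Group 1: ABC
Group 2: 12345
All groups: ('ABC', '12345')
Match start: 12
Match end: 21
Match span: (12, 21)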
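When a text contains several hits, `re.finditer()` gives access to the same Match methods for each of them. The following is a small sketch under assumed data: the item string and the names `code_pattern` and `rows` are hypothetical.

```python
import re

# Hypothetical sample text with three product codes
text = "Items: ABC-12345, XY-678, QRS-90"
code_pattern = re.compile(r"([A-Z]+)-(\d+)")

rows = []
for m in code_pattern.finditer(text):
    prefix, number = m.groups()           # both capture groups as a tuple
    rows.append((prefix, int(number), m.span()))

for row in rows:
    print(row)
```

Unlike `re.findall()`, which returns only the strings, each Match object here also carries the position of the hit, which helps when you need to look at the surrounding text.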

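III. Regular Expression Syntax

1. Basic character matching

Metacharacter	Description
`.`	Matches any single character (except a newline)
`^`	Matches the start of the string
`$`	Matches the end of the string
`*`	Matches the preceding element zero or more times
`+`	Matches the preceding element one or more times
`?`	Matches the preceding element zero or one time
`{n}`	Matches the preceding element exactly n times
`{n,}`	Matches the preceding element at least n times
`{n,m}`	Matches the preceding element between n and m times
`\`	Escape character
`[]`	Character set: matches any one character inside the brackets
`[^]`	Negated character set: matches any character not inside the brackets
`|`	Alternation: matches the expression before it or the one after it
`()`	Group: captures the matched substring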
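To make the table concrete, the sketch below runs a handful of small patterns against a made-up word list and records which words each pattern accepts. The word list and the helper `matching()` are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical sample words, chosen only to exercise each metacharacter
samples = ["cat", "cart", "caat", "ct", "c.t", "dog"]

patterns = {
    r"^ca*t$": "* allows zero or more 'a'",
    r"^ca+t$": "+ needs at least one 'a'",
    r"^ca?t$": "? allows zero or one 'a'",
    r"^ca{2}t$": "{2} needs exactly two 'a'",
    r"^c.t$": ". matches any single character",
    r"^c\.t$": "\\. matches a literal dot only",
    r"^c[ar]+t$": "[ar] matches 'a' or 'r'",
    r"^c[^a]t$": "[^a] matches anything except 'a'",
    r"^(cat|dog)$": "| chooses between alternatives",
}

def matching(pattern):
    """Return the sample words that the pattern matches."""
    return [s for s in samples if re.search(pattern, s)]

results = {p: matching(p) for p in patterns}
for p, note in patterns.items():
    print(f"{p:<14} {note:<34} {results[p]}")
```

The `^` and `$` anchors are what force each pattern to cover the whole word; without them `ca*t` would also be found inside longer strings.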

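2. Common predefined character classes

Class	Description
`\d`	Matches a digit; equivalent to [0-9] for ASCII text
`\D`	Matches a non-digit; equivalent to [^0-9] for ASCII text
`\w`	Matches a letter, digit, or underscore; equivalent to [a-zA-Z0-9_] for ASCII text (in Python 3 str patterns it also matches Unicode letters such as Chinese characters unless re.ASCII is set)
`\W`	Matches any character that \w does not match
`\s`	Matches any whitespace character, including spaces, tabs, and newlines
`\S`	Matches any non-whitespace character
`\b`	Matches a word boundary
`\B`	Matches a position that is not a word boundary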
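The classes above can be tried out on a short string. This is a minimal sketch with invented sample text; the last two lines rely on the standard `re.ASCII` flag to show how `\w` behaves on non-ASCII text in Python 3.

```python
import re

# Hypothetical sample text (note the tab before "to")
text = "Order_42 shipped on 2023-07-15\tto Room 7B"

digits = re.findall(r"\d+", text)     # runs of digits
words = re.findall(r"\w+", text)      # runs of letters, digits, underscores
fields = re.split(r"\s+", text)       # split on any whitespace, including the tab

# \b and \B: "on" as a whole word versus "on" at the end of a longer word
boundary_text = "on, onto, moon, on"
whole_word = re.findall(r"\bon\b", boundary_text)
word_ending = re.findall(r"\Bon\b", boundary_text)

# In Python 3, \w is Unicode-aware unless re.ASCII is passed
mixed = "价格 price_1"
unicode_words = re.findall(r"\w+", mixed)
ascii_words = re.findall(r"\w+", mixed, re.ASCII)

print(digits)
print(words)
print(fields)
print(len(whole_word), len(word_ending))
print(unicode_words, ascii_words)
```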

3. Practical examples

import re

# Sample text
text = """
Email: user@example.com, admin@test.org
Phone: 13812345678, 021-87654321
URL: https://www.example.com, http://test.org
Price: ¥99.99, $29.99, €19.99
IP address: 192.168.1.1
"""

# Email addresses
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(f"Emails: {emails}")

# Mobile numbers (11 digits starting with 1, the mainland China format)
mobile_phones = re.findall(r'1\d{10}', text)
print(f"Mobile numbers: {mobile_phones}")

# Landline numbers (with area code)
landline_phones = re.findall(r'\d{3,4}-\d{7,8}', text)
print(f"Landline numbers: {landline_phones}")

# URLs (scheme and host only)
urls = re.findall(r'https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(f"URLs: {urls}")

# Prices (several currencies)
prices = re.findall(r'[¥$€]\d+\.\d{2}', text)
print(f"Prices: {prices}")

# IP addresses (checks the shape only, not that each part is at most 255)
ips = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', text)
print(f"IP addresses: {ips}")

Output:

Emails: ['user@example.com', 'admin@test.org']
Mobile numbers: ['13812345678']
Landline numbers: ['021-87654321']
URLs: ['https://www.example.com', 'http://test.org']
Prices: ['¥99.99', '$29.99', '€19.99']
IP addresses: ['192.168.1.1']

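4. Groups and backreferences

Groups are created with parentheses () and let you extract parts of a match. A pattern can also refer back to an earlier group:

import re

# Extract a date and reformat it
date_text = "Date: 2023-07-15"
date_pattern = r'(\d{4})-(\d{2})-(\d{2})'

# Use groups to extract year, month, and day
match = re.search(date_pattern, date_text)
if match:
    year, month, day = match.groups()
    print(f"Year: {year}, month: {month}, day: {day}")
    # Reformat as a Chinese-style date
    chinese_date = f"{year}年{month}月{day}日"
    print(f"Chinese-style date: {chinese_date}")

# Use a backreference to find repeated words.
# The pattern needs whitespace between the two copies, so it works on
# space-separated text; unsegmented Chinese text would not match.
text_with_repeats = "We need need to remove repeated repeated words"
repeat_pattern = r'\b(\w+)\s+\1\b'
repeats = re.findall(repeat_pattern, text_with_repeats)
print(f"Repeated words: {repeats}")

# Use sub() with a group reference in the replacement
html = "<div>Title</div><div>Content</div>"
replaced = re.sub(r'<div>(.*?)</div>', r'<p>\1</p>', html)
print(f"After substitution: {replaced}")

Output:

Year: 2023, month: 07, day: 15
Chinese-style date: 2023年07月15日
Repeated words: ['need', 'repeated']
After substitution: <p>Title</p><p>Content</p>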
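Numbered references such as `\1` become hard to read once a pattern has more than two or three groups. As a minimal sketch, the same idea can be written with named groups: `\g<name>` in a replacement string and a replacement function are both standard features of `re.sub()`. The sample sentence, the `MONTHS` list, and the helper `to_long_form()` are assumptions made up for this example.

```python
import re

ISO_DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical sample text
text = "Released 2023-07-15, updated 2023-08-01"

# \g<name> refers to a named group inside the replacement string
us_style = ISO_DATE.sub(r"\g<month>/\g<day>/\g<year>", text)

def to_long_form(match):
    """Build the replacement from the named groups of one match."""
    day = int(match.group("day"))
    month = MONTHS[int(match.group("month")) - 1]
    return f"{day} {month} {match.group('year')}"

# A function as the replacement is called once per match
long_style = ISO_DATE.sub(to_long_form, text)

print(us_style)
print(long_style)
```

A replacement function is the usual choice when the new text has to be computed (here, converting the month number to a name) rather than just rearranged.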

5. Greedy and non-greedy matching

By default, quantifiers (*, +, ?, {n,m}) are greedy: they match as many characters as possible. Adding ? after a quantifier makes it non-greedy (lazy), so it matches as few characters as possible.

import re

text = "<div>Part one</div><div>Part two</div>"

# Greedy: matches from the first <div> to the last </div>
greedy_pattern = r'<div>.*</div>'
greedy_match = re.search(greedy_pattern, text)
print(f"Greedy result: {greedy_match.group()}")

# Non-greedy: matches each <div>...</div> pair
non_greedy_pattern = r'<div>.*?</div>'
non_greedy_matches = re.findall(non_greedy_pattern, text)
print(f"Non-greedy result: {non_greedy_matches}")

Output:

Greedy result: <div>Part one</div><div>Part two</div>
Non-greedy result: ['<div>Part one</div>', '<div>Part two</div>']
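A lazy quantifier is not the only way to stop a match early. When the value is terminated by a known character, a negated character class does the same job and states the intent directly. The sketch below compares the three variants on a made-up HTML fragment.

```python
import re

# Hypothetical HTML fragment
html = '<a href="/one">One</a> <a href="/two">Two</a>'

# Greedy: .* runs to the last double quote in the string
greedy = re.findall(r'href="(.*)"', html)

# Lazy: .*? stops at the first closing quote
lazy = re.findall(r'href="(.*?)"', html)

# Negated class: "anything except a quote", no backtracking needed
negated = re.findall(r'href="([^"]*)"', html)

print(greedy)
print(lazy)
print(negated)
```

The lazy and negated-class versions return the same two links here. The negated class also matches across line breaks without needing re.DOTALL, which is a detail to keep in mind when the HTML is spread over several lines.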

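IV. Using Regular Expressions in Web Scraping

In web scraping, regular expressions are typically used to:

Extract data that an HTML parser handles poorly
Pull structured information out of messy text
Clean and format data
Validate data formats

Let's look at some practical examples:

1. Extracting all links from a web page

import re
import requests

def extract_all_links(url):
    """Extract all links from a web page"""
    try:
        # Fetch the page
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0.4472.124'}
        response = requests.get(url, headers=headers, timeout=10)
        html = response.text

        # Extract all links with a regular expression.
        # Note: this pattern does not cover every form an HTML link can take,
        # but it works for most simple cases.
        link_pattern = r'<a[^>]*href=["\'](.*?)["\'][^>]*>(.*?)</a>'
        links = re.findall(link_pattern, html)

        # Return a list of (link URL, link text) tuples
        return links
    except Exception as e:
        print(f"Error while extracting links: {e}")
        return []

# Example usage
if __name__ == "__main__":
    links = extract_all_links("https://example.com")
    print(f"Found {len(links)} links:")
    for url, text in links[:5]:  # show only the first five
        print(f"Text: {text.strip()}, URL: {url}")

Output (the actual result depends on the site's content):

Found 1 links:
Text: More information..., URL: https://www.iana.org/domains/example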

2. Extracting the date and title from a news page

import re

def extract_news_info(html):
    """Extract the title and date from news HTML"""
    # Title
    title_pattern = r'<h1[^>]*>(.*?)</h1>'
    title_match = re.search(title_pattern, html)
    title = title_match.group(1) if title_match else "Title not found"

    # Date (several common formats)
    date_patterns = [
        r'\d{4}年\d{1,2}月\d{1,2}日',  # 2023年7月15日 (Chinese format)
        r'\d{4}-\d{1,2}-\d{1,2}',      # 2023-7-15
        r'\d{1,2}/\d{1,2}/\d{4}'       # 7/15/2023
    ]
    date = "Date not found"
    for pattern in date_patterns:
        date_match = re.search(pattern, html)
        if date_match:
            date = date_match.group()
            break

    return {"title": title, "date": date}

# Mock news page HTML (the date keeps its Chinese format on purpose)
mock_html = """
<!DOCTYPE html>
<html>
<head><title>Sample news site</title>
</head>
<body><header><h1>Chinese scientists achieve a major breakthrough</h1><div class="meta">Published: 2023年7月15日 Author: Zhang San</div></header><article><p>This is the body of the news article...</p></article>
</body>
</html>
"""

# Extract the information
news_info = extract_news_info(mock_html)
print(f"News title: {news_info['title']}")
print(f"Publication date: {news_info['date']}")

Output:

News title: Chinese scientists achieve a major breakthrough
Publication date: 2023年7月15日

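3. Extracting product prices from an e-commerce site

import re

def extract_prices(html):
    """Extract product prices from HTML"""
    # Common price formats
    price_patterns = [
        r'¥\s*(\d+(?:\.\d{2})?)',              # ¥ price
        r'￥\s*(\d+(?:\.\d{2})?)',              # ￥ (full-width) price
        r'人民币\s*(\d+(?:\.\d{2})?)',          # price after "人民币" (RMB)
        r'价格[：:]\s*(\d+(?:\.\d{2})?)',       # number after "价格：" (price:)
        r'<[^>]*class="[^"]*price[^"]*"[^>]*>\s*[¥￥]?\s*(\d+(?:\.\d{2})?)'  # element with a price class
    ]

    # Key each hit by the position of its number, so a price matched by more
    # than one pattern (for example "¥" and the price class) is counted once
    found = {}
    for pattern in price_patterns:
        for m in re.finditer(pattern, html):
            found[m.start(1)] = float(m.group(1))

    # Return the prices as floats, in document order
    return [found[pos] for pos in sorted(found)]

# Sample HTML. The Chinese price cues are kept because the patterns target them:
# "优惠价：人民币" = "sale price: RMB", "价格：…，支持分期付款" = "price: …, installments available"
example_html = """
<div class="product"><h2>Great-value laptop</h2><span class="price">¥4999.00</span><span class="original-price">￥5999.00</span>
</div>
<div class="product"><h2>Professional monitor</h2><span class="price">¥2499.00</span><p>优惠价：人民币2299.00</p>
</div>
<div class="summary">价格：1999.99，支持分期付款
</div>
"""

# Extract the prices
prices = extract_prices(example_html)
print(f"Extracted prices: {prices}")
if prices:
    print(f"Lowest price: ¥{min(prices)}")
    print(f"Highest price: ¥{max(prices)}")
    print(f"Average price: ¥{sum(prices)/len(prices):.2f}")

Output:

Extracted prices: [4999.0, 5999.0, 2499.0, 2299.0, 1999.99]
Lowest price: ¥1999.99
Highest price: ¥5999.0
Average price: ¥3559.20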

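4. Cleaning data with regular expressions

import re
from html import unescape

# Either a URL (scheme and host captured separately from the rest)
# or a single unwanted special character
URL_OR_SPECIAL = re.compile(r'(https?://[^\s/]+)(\S*)|[^\w\s.,?!，。？！]', re.IGNORECASE)

def _normalize(match):
    """Lowercase the scheme and host of a URL; drop any other special character."""
    if match.group(1):
        return match.group(1).lower() + match.group(2)
    return ''

def clean_text(text):
    """Clean up text data"""
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Decode HTML entities such as &nbsp; and &lt;
    text = unescape(text)
    # Normalize URLs and remove special characters in one pass,
    # so the ":" and "/" inside URLs survive
    text = URL_OR_SPECIAL.sub(_normalize, text)
    # Collapse whitespace last, after the removals above
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Text to clean
dirty_text = """
<div>This is   some <b>HTML</b> text with tags and extra   spaces.</div>
Some special characters: &nbsp; &lt; &gt; &#39; &quot;
And a URL: HTTPS://Example.COM/path
"""

# Clean the text
clean_result = clean_text(dirty_text)
print(f"Before:\n{dirty_text}")
print(f"\nAfter:\n{clean_result}")

Output:

Before:

<div>This is   some <b>HTML</b> text with tags and extra   spaces.</div>
Some special characters: &nbsp; &lt; &gt; &#39; &quot;
And a URL: HTTPS://Example.COM/path


After:
This is some HTML text with tags and extra spaces. Some special characters And a URL https://example.com/path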

V. Case Study: Analyzing a Complete Web Page

Let's combine what we have covered and use regular expressions to analyze a complete web page, extracting several kinds of information:

import re
import requests

def analyze_webpage(url):
    """Analyze a web page with regular expressions"""
    try:
        # Fetch the page
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0.4472.124'}
        response = requests.get(url, headers=headers, timeout=10)
        html = response.text

        # Page title
        title_match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
        title = title_match.group(1) if title_match else "Title not found"

        # All links
        links = re.findall(r'<a[^>]*href=["\'](.*?)["\'][^>]*>(.*?)</a>', html, re.IGNORECASE | re.DOTALL)

        # All images
        images = re.findall(r'<img[^>]*src=["\'](.*?)["\'][^>]*>', html, re.IGNORECASE)

        # Meta description
        meta_desc_match = re.search(r'<meta[^>]*name=["\']description["\'][^>]*content=["\'](.*?)["\'][^>]*>', html, re.IGNORECASE)
        meta_desc = meta_desc_match.group(1) if meta_desc_match else "Description not found"

        # All h1-h3 headings (\1 makes the closing tag match the opening level)
        headings = re.findall(r'<h([1-3])[^>]*>(.*?)</h\1>', html, re.IGNORECASE | re.DOTALL)

        # Return the analysis
        return {
            "title": title,
            "meta_description": meta_desc,
            "links_count": len(links),
            "images_count": len(images),
            "headings": [f"H{level}: {content.strip()}" for level, content in headings],
            "links": [(href, text.strip()) for href, text in links[:5]]  # first five links only
        }
    except Exception as e:
        print(f"Error while analyzing the page: {e}")
        return None

# Use a real web page as the example
analysis = analyze_webpage("https://example.com")
if analysis:
    print(f"Page title: {analysis['title']}")
    print(f"Meta description: {analysis['meta_description']}")
    print(f"Link count: {analysis['links_count']}")
    print(f"Image count: {analysis['images_count']}")
    print("\nMain headings:")
    for heading in analysis['headings']:
        print(f"- {heading}")
    print("\nSome links:")
    for url, text in analysis['links']:
        if text:
            print(f"- {text} -> {url}")
        else:
            print(f"- {url}")

Output (using example.com):

Page title: Example Domain
Meta description: Description not found
Link count: 1
Image count: 0

Main headings:
- H1: Example Domain

Some links:
- More information... -> https://www.iana.org/domains/example

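VI. Regular Expression Optimization and Best Practices

1. Performance

import re
import time

# Test text: the same record repeated 1000 times
test_text = "ID: ABC123456789" * 1000
PATTERN = r'ID: [A-Z]+\d+'

# Time several ways of writing and calling the same pattern.
# time.perf_counter() or the timeit module gives finer resolution for timings this short.
def test_regex_performance():
    results = {}

    # Baseline: module-level function
    start = time.time()
    re.findall(PATTERN, test_text)
    results["Baseline"] = time.time() - start

    # Word boundaries added. This also changes what matches here,
    # because the records are concatenated without separators.
    start = time.time()
    re.findall(r'\bID: [A-Z]+\d+\b', test_text)
    results["Word boundaries"] = time.time() - start

    # Raw string: the r prefix changes how Python reads the literal,
    # not how fast the pattern runs, so this repeats the baseline
    start = time.time()
    re.findall(r'ID: [A-Z]+\d+', test_text)
    results["Raw string"] = time.time() - start

    # Precompiled (compilation happens outside the timed region)
    pattern = re.compile(PATTERN)
    start = time.time()
    pattern.findall(test_text)
    results["Precompiled"] = time.time() - start

    # Precompiled with a flag. IGNORECASE changes the semantics;
    # it is not a speed optimization.
    pattern = re.compile(PATTERN, re.IGNORECASE)
    start = time.time()
    pattern.findall(test_text)
    results["Precompiled + IGNORECASE"] = time.time() - start

    return results

# Show the results
performance = test_regex_performance()
print("Performance results (execution time in seconds)")
print("-" * 40)
for name, time_taken in performance.items():
    print(f"{name}: {time_taken:.6f}")

Output (actual values vary by machine; differences this small are within timer noise, and re caches recently compiled patterns, so the module-level functions are rarely much slower than a precompiled pattern):

Performance results (execution time in seconds)
----------------------------------------
Baseline: 0.001995
Word boundaries: 0.001996
Raw string: 0.000997
Precompiled: 0.000000
Precompiled + IGNORECASE: 0.001996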

2. Regular expression best practices

import re

def regex_best_practices():
    # 1. Use raw strings to avoid escaping problems
    file_path = r'C:\Users\username\Documents'  # r prefix
    print(f"File path: {file_path}")

    # 2. Precompile patterns that are used frequently
    email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

    # 3. Use named groups for readability
    date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
    date_match = date_pattern.search("Date: 2023-07-20")
    if date_match:
        print(f"Year: {date_match.group('year')}")
        print(f"Month: {date_match.group('month')}")
        print(f"Day: {date_match.group('day')}")

    # 4. Use appropriate flags
    html_fragment = "<p>This is a paragraph</p>"
    pattern_with_flags = re.compile(r'<p>(.*?)</p>', re.IGNORECASE | re.DOTALL)
    match = pattern_with_flags.search(html_fragment)
    if match:
        print(f"Paragraph content: {match.group(1)}")

    # 5. Do not overuse regular expressions.
    # For simple string operations, built-in methods are usually faster.
    text = "Hello, World!"
    # Not recommended: re.sub(r'Hello', 'Hi', text)
    # Recommended:
    replaced = text.replace("Hello", "Hi")
    print(f"After replacement: {replaced}")

    # 6. Limit backtracking
    # Avoid: r'(a+)+'  (nested quantifiers can cause catastrophic backtracking)
    # Prefer: r'a+'

    # 7. Test edge cases (fullmatch() requires the whole string to be an email)
    test_cases = ["user@example.com", "user@example", "user.example.com"]
    for case in test_cases:
        if email_pattern.fullmatch(case):
            print(f"Valid email: {case}")
        else:
            print(f"Invalid email: {case}")

# Demonstrate the best practices
regex_best_practices()

Output:

File path: C:\Users\username\Documents
Year: 2023
Month: 07
Day: 20
Paragraph content: This is a paragraph
After replacement: Hi, World!
Valid email: user@example.com
Invalid email: user@example
Invalid email: user.example.com

3. Common mistakes and pitfalls

import re

def common_regex_pitfalls():
    print("Common regex pitfalls and fixes:")

    # 1. Greedy quantifiers match too much
    html = "<div>Part one</div><div>Part two</div>"
    print("\n1. Greedy matching:")
    print(f"Original HTML: {html}")
    greedy_result = re.findall(r'<div>.*</div>', html)
    print(f"Greedy .* : {greedy_result}")
    non_greedy_result = re.findall(r'<div>.*?</div>', html)
    print(f"Non-greedy .*? : {non_greedy_result}")

    # 2. Using .* on multi-line text
    multiline_text = """<tag>
    multi-line content
</tag>"""
    print("\n2. The dot does not match newlines:")
    print(f"Original text:\n{multiline_text}")
    no_flag_result = re.search(r'<tag>(.*)</tag>', multiline_text)
    print(f"Without DOTALL: {no_flag_result}")
    with_flag_result = re.search(r'<tag>(.*)</tag>', multiline_text, re.DOTALL)
    captured = with_flag_result.group(1) if with_flag_result else None
    print(f"With DOTALL: {captured!r}")

    # 3. Unescaped special characters
    special_chars_text = "Price: $50.00 (USD)"
    print("\n3. Unescaped special characters:")
    print(f"Original text: {special_chars_text}")
    # Unescaped, $ is an end-of-string anchor and ( ) form a group, so the
    # pattern compiles without error but silently fails to match
    unescaped_result = re.search(r'Price: $50.00 (USD)', special_chars_text)
    print(f"Unescaped pattern: {unescaped_result}")
    escaped_result = re.search(r'Price: \$50\.00 \(USD\)', special_chars_text)
    print(f"Escaped pattern: {escaped_result.group() if escaped_result else None}")

    # 4. Line endings
    newline_text = "line one\nline two\r\nline three"
    print("\n4. Line endings:")
    print(f"Original text: {repr(newline_text)}")
    lines1 = re.split(r'\n', newline_text)
    print(f"Splitting on \\n only: {lines1}")
    lines2 = re.split(r'\r?\n', newline_text)
    print(f"Splitting on \\r?\\n: {lines2}")

    # 5. Unnecessary capture groups
    phone_text = "Phone: 123-456-7890"
    print("\n5. Unnecessary capture groups:")
    print(f"Original text: {phone_text}")
    with_capture = re.search(r'Phone: (\d{3})-(\d{3})-(\d{4})', phone_text)
    print(f"With capture groups: {with_capture.groups() if with_capture else None}")
    non_capture = re.search(r'Phone: (?:\d{3})-(?:\d{3})-(\d{4})', phone_text)
    print(f"With non-capturing groups: {non_capture.groups() if non_capture else None}")

# Demonstrate the pitfalls and fixes
common_regex_pitfalls()

Output:

Common regex pitfalls and fixes:

1. Greedy matching:
Original HTML: <div>Part one</div><div>Part two</div>
Greedy .* : ['<div>Part one</div><div>Part two</div>']
Non-greedy .*? : ['<div>Part one</div>', '<div>Part two</div>']

2. The dot does not match newlines:
Original text:
<tag>
    multi-line content
</tag>
Without DOTALL: None
With DOTALL: '\n    multi-line content\n'

3. Unescaped special characters:
Original text: Price: $50.00 (USD)
Unescaped pattern: None
Escaped pattern: Price: $50.00 (USD)

4. Line endings:
Original text: 'line one\nline two\r\nline three'
Splitting on \n only: ['line one', 'line two\r', 'line three']
Splitting on \r?\n: ['line one', 'line two', 'line three']

5. Unnecessary capture groups:
Original text: Phone: 123-456-7890
With capture groups: ('123', '456', '7890')
With non-capturing groups: ('7890',)
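For the special-character pitfall, escaping by hand is error-prone when the text to find comes from a variable or from user input. The standard library's `re.escape()` does the escaping for you. The sketch below is a minimal illustration; the sample text and the helper `count_literal()` are hypothetical.

```python
import re

def count_literal(needle, haystack):
    """Count occurrences of needle, treating it as plain text rather than a pattern."""
    return len(re.findall(re.escape(needle), haystack))

# Hypothetical sample text
text = "Price: $50.00 (USD), was $50.00 (USD) last week"
needle = "$50.00 (USD)"

# Used directly as a pattern, "$" is an anchor and "(...)" a group: nothing matches
unescaped_hits = re.findall(needle, text)

# Escaped, every character of the needle is matched literally
escaped_hits = count_literal(needle, text)

print(unescaped_hits)
print(escaped_hits)
```

If all you need is a literal search, `str.count()` or the `in` operator is simpler still. `re.escape()` is most useful when a literal fragment has to be embedded inside a larger pattern.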

VII. Combining Regular Expressions with Other Parsing Methods

In real scraping projects we usually combine regular expressions with an HTML parsing library and let each do what it is best at:

import re
import requests
from bs4 import BeautifulSoup

def combined_parsing_approach(url):
    """Parse a web page with BeautifulSoup and regular expressions together"""
    try:
        # Fetch the page
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0.4472.124'}
        response = requests.get(url, headers=headers)
        html = response.text

        # Parse the HTML structure with BeautifulSoup (the 'lxml' parser must be installed)
        soup = BeautifulSoup(html, 'lxml')

        # 1. Use BeautifulSoup to find the main container
        main_content = soup.find('main') or soup.find('div', id='content') or soup.find('div', class_='content')
        if not main_content:
            print("Could not find the main content container")
            return None

        # HTML of the container
        content_html = str(main_content)

        # 2. Use regular expressions to extract specific information
        # All email addresses
        emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', content_html)
        # All phone numbers
        phones = re.findall(r'\b(?:\d{3}[-.]?){2}\d{4}\b', content_html)
        # All prices
        prices = re.findall(r'[$¥€£](\d+(?:\.\d{2})?)', content_html)

        # 3. Use BeautifulSoup again for structured extraction
        paragraphs = main_content.find_all('p')
        paragraph_texts = [p.get_text().strip() for p in paragraphs]

        return {
            "emails": emails,
            "phones": phones,
            "prices": prices,
            "paragraphs_count": len(paragraph_texts),
            "first_paragraph": paragraph_texts[0] if paragraph_texts else ""
        }
    except Exception as e:
        print(f"Error while parsing the page: {e}")
        return None

# Sample HTML
example_html = """
<!DOCTYPE html>
<html>
<head><title>Sample page</title></head>
<body><header>Site title</header><main><h1>Welcome</h1><p>This is a sample paragraph with the email contact@example.com and the phone number 123-456-7890.</p><div class="product"><h2>Product A</h2><p>Price: ¥99.99</p></div><div class="product"><h2>Product B</h2><p>Price: $199.99</p></div><p>If you have questions, contact support@example.com or call 987-654-3210.</p></main><footer>Footer</footer>
</body>
</html>
"""

# Mock request and response
class MockResponse:
    def __init__(self, text):
        self.text = text

def mock_get(url, headers):
    return MockResponse(example_html)

# Back up the original requests.get and swap in the mock
original_get = requests.get
requests.get = mock_get
try:
    # Parse with the combined approach
    result = combined_parsing_approach("https://example.com")
    print("Combined parsing result:")
    if result:
        print(f"Emails found: {result['emails']}")
        print(f"Phones found: {result['phones']}")
        print(f"Prices found: {result['prices']}")
        print(f"Paragraph count: {result['paragraphs_count']}")
        print(f"First paragraph: {result['first_paragraph']}")
finally:
    # Restore the original requests.get even if parsing fails
    requests.get = original_get

Output:

Combined parsing result:
Emails found: ['contact@example.com', 'support@example.com']
Phones found: ['123-456-7890', '987-654-3210']
Prices found: ['99.99', '199.99']
Paragraph count: 4
First paragraph: This is a sample paragraph with the email contact@example.com and the phone number 123-456-7890.

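VIII. When to Use Regular Expressions and When Not To

Regular expressions are powerful, but they do not suit every situation. Here are some guidelines:

1. Cases where regular expressions fit well

import re

def when_to_use_regex():
    print("Cases where regular expressions fit well:")

    # 1. Extracting strings that follow a specific format
    text = "User ID: ABC-12345, product code: XYZ-67890"
    ids = re.findall(r'[A-Z]+-\d+', text)
    print(f"1. Extract formatted IDs: {ids}")

    # 2. Validating a data format (fullmatch() checks the whole string)
    email = "user@example.com"
    is_valid = bool(re.fullmatch(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', email))
    print(f"2. Validate email format: is {email} valid? {is_valid}")

    # 3. Complex string replacement
    html = "<b>bold text</b> and <i>italic text</i>"
    text_only = re.sub(r'<[^>]+>', '', html)
    print(f"3. Complex replacement: {text_only}")

    # 4. Extracting data from unstructured text
    unstructured = "Price range: 100-200 yuan, size: 15x20 cm"
    price_range = re.search(r'Price range: (\d+)-(\d+) yuan', unstructured)
    size = re.search(r'size: (\d+)x(\d+) cm', unstructured)
    print(f"4. Extract from unstructured text: price from {price_range.group(1)} to {price_range.group(2)}, size {size.group(1)}x{size.group(2)}")

when_to_use_regex()

Output:

Cases where regular expressions fit well:
1. Extract formatted IDs: ['ABC-12345', 'XYZ-67890']
2. Validate email format: is user@example.com valid? True
3. Complex replacement: bold text and italic text
4. Extract from unstructured text: price from 100 to 200, size 15x20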

2. Cases where regular expressions are a poor fit

from bs4 import BeautifulSoup
import re

def when_not_to_use_regex():
    print("\nCases where regular expressions are a poor fit:")

    # 1. Parsing well-structured HTML/XML
    html = """
    <div class="product">
        <h2>Product name</h2>
        <p class="price">¥99.99</p>
        <ul class="features">
            <li>Feature 1</li>
            <li>Feature 2</li>
        </ul>
    </div>
    """
    print("1. Parsing HTML:")
    print("  With regular expressions (not recommended):")
    title_regex = re.search(r'<h2>(.*?)</h2>', html)
    price_regex = re.search(r'<p class="price">(.*?)</p>', html)
    features_regex = re.findall(r'<li>(.*?)</li>', html)
    print(f"  - Title: {title_regex.group(1) if title_regex else 'Not found'}")
    print(f"  - Price: {price_regex.group(1) if price_regex else 'Not found'}")
    print(f"  - Features: {features_regex}")

    print("\n  With BeautifulSoup (recommended):")
    soup = BeautifulSoup(html, 'lxml')
    title_bs = soup.find('h2').text
    price_bs = soup.find('p', class_='price').text
    features_bs = [li.text for li in soup.find_all('li')]
    print(f"  - Title: {title_bs}")
    print(f"  - Price: {price_bs}")
    print(f"  - Features: {features_bs}")

    # 2. Simple string operations
    print("\n2. Simple string operations:")
    text = "Hello, World!"
    print("  With regular expressions (not recommended):")
    replaced_regex = re.sub(r'Hello', 'Hi', text)
    contains_world_regex = bool(re.search(r'World', text))
    print(f"  - Replace: {replaced_regex}")
    print(f"  - Contains 'World'? {contains_world_regex}")

    print("\n  With string methods (recommended):")
    replaced_str = text.replace('Hello', 'Hi')
    contains_world_str = 'World' in text
    print(f"  - Replace: {replaced_str}")
    print(f"  - Contains 'World'? {contains_world_str}")

    # 3. Complex nested structures
    nested_html = """<div><p>Paragraph 1</p><div><p>Nested paragraph</p></div><p>Paragraph 2</p></div>"""
    print("\n3. Nested structures:")
    print("  With regular expressions (hard and error-prone):")
    paragraphs_regex = re.findall(r'<p>(.*?)</p>', nested_html)
    print(f"  - All paragraphs: {paragraphs_regex}  # cannot tell nesting levels apart")

    print("\n  With BeautifulSoup (recommended):")
    soup = BeautifulSoup(nested_html, 'lxml')
    top_paragraphs = [p.text for p in soup.find('div').find_all('p', recursive=False)]
    nested_paragraphs = [p.text for p in soup.find('div').find('div').find_all('p')]
    print(f"  - Top-level paragraphs: {top_paragraphs}")
    print(f"  - Nested paragraphs: {nested_paragraphs}")

when_not_to_use_regex()

Output:

Cases where regular expressions are a poor fit:
1. Parsing HTML:
  With regular expressions (not recommended):
  - Title: Product name
  - Price: ¥99.99
  - Features: ['Feature 1', 'Feature 2']

  With BeautifulSoup (recommended):
  - Title: Product name
  - Price: ¥99.99
  - Features: ['Feature 1', 'Feature 2']

2. Simple string operations:
  With regular expressions (not recommended):
  - Replace: Hi, World!
  - Contains 'World'? True

  With string methods (recommended):
  - Replace: Hi, World!
  - Contains 'World'? True

3. Nested structures:
  With regular expressions (hard and error-prone):
  - All paragraphs: ['Paragraph 1', 'Nested paragraph', 'Paragraph 2']  # cannot tell nesting levels apart

  With BeautifulSoup (recommended):
  - Top-level paragraphs: ['Paragraph 1', 'Paragraph 2']
  - Nested paragraphs: ['Nested paragraph']

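IX. Summary

Regular expressions are a valuable tool in web scraping, and they are especially well suited to:

Extracting data in a specific format: email addresses, phone numbers, prices, and so on
Cleaning and normalizing text: removing HTML tags, filtering special characters, and so on
Validating data formats: checking whether data fits a given pattern
Extracting information from unstructured or semi-structured text

When using regular expressions, keep these best practices in mind:

Use raw strings: in Python, the r prefix marks a raw string and avoids escaping problems
Precompile frequently used patterns: re.compile() keeps the pattern in one place and avoids repeated cache lookups
Use named groups for readability: the (?P<name>...) syntax
Mind greedy versus non-greedy matching: use non-greedy quantifiers such as *? and +? where needed
Use flags appropriately: re.IGNORECASE, re.DOTALL, and so on
Do not lean too heavily on regular expressions: for structured HTML, prefer a dedicated parsing library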
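As a closing illustration of the flags item, the sketch below shows how the result of the same pattern changes with re.IGNORECASE, re.DOTALL, and re.MULTILINE. The HTML fragment and the three-line string are made up for this example.

```python
import re

# Hypothetical fragment: an upper-case tag whose content spans two lines
html = "<P>first\nparagraph</P>\n<p>second</p>"
pattern = r"<p>(.*?)</p>"

no_flags = re.findall(pattern, html)                   # lower-case, single-line only
ignore_case = re.findall(pattern, html, re.IGNORECASE) # <P> is accepted, but "." stops at "\n"
both = re.findall(pattern, html, re.IGNORECASE | re.DOTALL)

# re.MULTILINE makes ^ and $ match at every line, not just the whole string
lines = "alpha\nbeta\ngamma"
first_only = re.findall(r"^\w+", lines)
every_line = re.findall(r"^\w+", lines, re.MULTILINE)

print(no_flags)
print(ignore_case)
print(both)
print(first_only, every_line)
```

Flags are combined with `|`, and each one changes what the pattern means, so it is worth adding only the ones the input actually requires.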

Next: [Python Web Scraping Explained] Part 6: Handling Dynamically Loaded Web Content
