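Python: Getting Started with the BeautifulSoup Library
2023-06-13 09:12:58
Background: when writing a web crawler, the Requests module fetches the full content of a web page, and the BeautifulSoup module extracts content from that page. This article gives a brief introduction to using the BeautifulSoup module.
HTML page used in the examples: https://python123.io/ws/demo.html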
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
The HTML above forms a tag tree.
BeautifulSoup is a library for parsing, traversing, and maintaining that tag tree.
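Contents:
1 Parsers for the BeautifulSoup library
2 Basic elements of the BeautifulSoup class
3 HTML traversal methods based on the bs4 library
3.1 Downward traversal of the tag tree
3.2 Upward traversal of the tag tree
3.3 Sideways traversal of the tag tree
4 The prettify() method of the bs4 library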
1 Parsers for the BeautifulSoup library
soup = BeautifulSoup('<html>data</html>','html.parser')
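The second argument in the line above names the parser that builds the tag tree. The following is a minimal sketch that assumes only `'html.parser'` (Python's built-in parser) is available; bs4 also accepts names such as `'lxml'`, `'xml'` and `'html5lib'`, but those rely on separately installed packages. The variable names `root_name` and `root_text` are illustrative only.

```python
from bs4 import BeautifulSoup

# 'html.parser' ships with Python, so no extra install is needed.
soup = BeautifulSoup('<html>data</html>', 'html.parser')

root_name = soup.html.name    # name of the parsed <html> tag
root_text = soup.html.string  # the single string inside <html>

print(root_name, root_text)
```

Switching parsers only changes the second argument; the rest of the code in this article would stay the same.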
2 Basic elements of the BeautifulSoup class
<p class="title"> ... </p>
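The source shows only the example tag here, so the sketch below assumes that the "basic elements" are the bs4 objects `Tag` (with its `name` and `attrs`), `NavigableString` and `Comment`. The HTML snippet, including its trailing comment, is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Illustrative snippet: one <p> tag followed by an HTML comment.
html = ('<p class="title"><b>The demo python introduces several python courses.</b></p>'
        '<!-- a note -->')
soup = BeautifulSoup(html, 'html.parser')

tag = soup.p                 # Tag: the first <p> element
text = tag.b.string          # NavigableString: the text inside <b>
comment = soup.contents[-1]  # Comment: the trailing <!-- ... --> node

print(tag.name)              # tag name
print(tag.attrs)             # attributes as a dict; class is stored as a list
print(type(text).__name__, type(comment).__name__)
```

Note that `Comment` is a special kind of string in bs4, so code that reads `.string` may want to check the type before treating the value as visible text.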
3 HTML traversal methods based on the bs4 library
3.1 Downward traversal of the tag tree
from bs4 import BeautifulSoup
import requests
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
soup.body.contents
['\n',
<p class="title"><b>The demo python introduces several python courses.</b></p>,
'\n',
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>,
'\n']
len(soup.body.contents)
5
# Iterate over the direct children
for child in soup.body.children:
    print(child)
# Iterate over all descendants
for child in soup.body.descendants:
    print(child)
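The output above shows that `.contents` also counts the `'\n'` strings between tags. A minimal offline sketch of the same traversal follows, using an inline stand-in for the demo page's `<body>` (shortened text, same layout) so that no network request is needed; the `isinstance` filter is one possible way to keep only tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline stand-in for the <body> of the demo page (text shortened).
html = (
    '<body>\n'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>\n'
    '</body>'
)
soup = BeautifulSoup(html, 'html.parser')

# .contents is a list and includes the '\n' strings between the tags.
all_children = soup.body.contents

# .children is an iterator over the same nodes; keep only Tag objects.
tag_children = [child for child in soup.body.children if isinstance(child, Tag)]

# .descendants walks the whole subtree in document order.
tag_descendants = [node.name for node in soup.body.descendants
                   if isinstance(node, Tag)]

print(len(all_children))
print([t.name for t in tag_children])
print(tag_descendants)
```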
3.2 Upward traversal of the tag tree
from bs4 import BeautifulSoup
import requests
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
soup.title.parent
<head><title>This is a python demo page</title></head>
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
p
body
html
[document]
This iterates over all ancestor nodes, including the soup object itself, whose name is printed as [document].
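The same upward walk can be sketched offline with a small inline document (made up for illustration). In the bs4 versions this sketch assumes, `.parents` stops at the soup object and does not yield `None`, so the `is None` branch in the loop above acts as a defensive check; `soup.parent` itself is `None`.

```python
from bs4 import BeautifulSoup

# Illustrative document with a single link nested three levels deep.
html = '<html><body><p><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Collect the name of every ancestor of the first <a> tag, nearest first.
ancestor_names = [parent.name for parent in soup.a.parents]

# The soup object is the root of the tree, so it has no parent.
root_parent = soup.parent

print(ancestor_names)
print(root_parent)
```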
3.3 Sideways traversal of the tag tree
Sideways traversal happens only among nodes that share the same parent node.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
soup.a.next_sibling
' and '
soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
# Iterate over the following siblings
for sibling in soup.a.next_siblings:
    print(sibling)
# Iterate over the preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)
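As the outputs above show, a sibling is not necessarily a tag: the node next to an `<a>` can be a plain string such as `' and '`. The sketch below uses a shortened inline paragraph (an assumption, not the real page text) and filters the siblings by type.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened stand-in for the "course" paragraph of the demo page.
html = ('<p>Courses: <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.a

# Every node after the first <a> under the same parent, strings included.
following = list(first_link.next_siblings)

# Keep only the siblings that are tags.
tag_siblings = [node for node in following if isinstance(node, Tag)]

print(repr(first_link.previous_sibling))
print(following)
print([t['id'] for t in tag_siblings])
```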
4 The prettify() method of the bs4 library
from bs4 import BeautifulSoup
import requests
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
soup.prettify()
'<html>\n <head>\n <title>\n This is a python demo page\n </title>\n </head>\n <body>\n <p class="title">\n <b>\n The demo python introduces several python courses.\n </b>\n </p>\n <p class="course">\n Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n Basic Python\n </a>\n and\n <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n Advanced Python\n </a>\n .\n </p>\n </body>\n</html>'
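The .prettify() method adds a '\n' after each tag (<>) and after its content in the HTML text and, as the output above shows, indents nested tags, which makes the markup easier to read.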
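A minimal offline sketch of the same behaviour on a one-line snippet (made up for illustration) follows. Because the exact indentation and any trailing newline may differ between bs4 versions, the example strips each line before inspecting it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>Demo</b></p>', 'html.parser')

# prettify() returns a str with each tag and each string on its own line.
pretty = soup.prettify()
print(pretty)

# Strip the indentation to see which pieces ended up on separate lines.
pieces = [line.strip() for line in pretty.splitlines()]
print(pieces)
```

Calling `print()` on the result shows the formatted layout, whereas evaluating `soup.prettify()` in an interactive session shows the raw string with visible `\n` escapes, as in the output above.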
References:
[1] China University MOOC (中国大学MOOC): Python Web Crawling and Information Extraction (https://www.icourse163.org/course/BIT-1001870001)