A Detailed Guide to Using BeautifulSoup

Time: 2023-09-14 08:59:02

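The Beautiful Soup library
Beautiful Soup: literally, "delicious soup"
An excellent third-party Python library
It parses HTML and XML documents and extracts information from them
Beautiful Soup turns the markup you give it into a parse tree that you can navigate and search
How it works: it treats any document you hand it as a pot of soup, and then cooks that soup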

I. Installation

pip3 install beautifulsoup4
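
A quick way to confirm the installation worked is to import the package and print its version. This is only a minimal sketch, and it assumes the command above ran in the same Python environment you are about to use.

```python
import bs4

# The package is installed as "beautifulsoup4" but imported as "bs4".
version = bs4.__version__
print("Beautiful Soup version:", version)
```

If the import raises ModuleNotFoundError, the package was most likely installed for a different interpreter.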

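II. Importing the library and parsing

from bs4 import BeautifulSoup
soup = BeautifulSoup('<html>data</html>', 'html.parser')  # parse an HTML string into a tag tree (a BeautifulSoup object)
with open('D://demo.html', 'rb') as f:  # parse an HTML file; binary mode lets Beautiful Soup detect the encoding
    soup1 = BeautifulSoup(f, 'html.parser')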
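
The two ways of feeding markup to BeautifulSoup can be tried without an existing file. The sketch below is illustrative only: the markup string is made up, and a temporary file stands in for demo.html.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Made-up markup used only for this illustration.
markup = "<html><head><title>Demo</title></head><body><p>data</p></body></html>"

# 1) Parse markup held in a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2) Parse markup from an open file handle (a temporary file stands in for demo.html).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)
    with open(path, "rb") as f:  # binary mode: Beautiful Soup detects the encoding
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.title.string)
print(soup_from_file.p.string)
```

Both objects hold the same tree, so everything shown for one applies to the other.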

III. The four parsers supported by bs4

Parser                           Usage                                Requirement
Python's built-in HTML parser    BeautifulSoup(mk, 'html.parser')     install bs4 (no extra package needed)
lxml's HTML parser               BeautifulSoup(mk, 'lxml')            pip install lxml
lxml's XML parser                BeautifulSoup(mk, 'xml')             pip install lxml
html5lib parser                  BeautifulSoup(mk, 'html5lib')        pip install html5lib
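
Because lxml and html5lib are optional installs, a script can check which parser is available before asking for it. The following is one possible sketch; pick_parser is a hypothetical helper written for this example, not part of bs4.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser():
    """Hypothetical helper: prefer lxml when installed, else the built-in parser."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser = pick_parser()
# The closing </p> is deliberately missing; both parsers still produce a <p> tag.
soup = BeautifulSoup("<p>unclosed paragraph", parser)
print(parser, "->", soup.p.string)
```

Different parsers may repair broken markup differently, so it is usually safer to name the parser explicitly than to rely on whichever one happens to be installed.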

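IV. Basic elements of the bs4 library

1. Tag: a tag, delimited by an opening <name> and a closing </name>
2. Name: the name of the tag, <tag>.name (e.g. the name of <p>…</p> is p)
3. Attributes: the attributes of the tag, <tag>.attrs, returned as a dictionary
4. NavigableString: the string inside a tag, <tag>.string (e.g. the string of <p>data</p> is data)
5. Comment: a comment inside a tag, a special kind of NavigableString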

import requests
from bs4 import BeautifulSoup

url = 'https://top.baidu.com/board?tab=realtime'

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_16_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
}

rsp = requests.get(url=url, headers=headers)

# html = rsp.content.decode("utf-8")
html = rsp.text
bs = BeautifulSoup(html, "html.parser")

# print(bs.prettify())  # print the parsed HTML page, pretty-printed
# print(bs.title)  # the title tag of the parsed page
print(bs.a)  # access a tag by name (a is the tag name); if there are several a tags, only the first is returned
print(bs.a.attrs)  # the attributes of that a tag
print(type(bs.a.attrs))  # the type of the attributes: a dictionary
print(bs.a.attrs['class'])  # the value of the class attribute (looked up like any dictionary key)
print(bs.a.string)  # the string inside that a tag

# Telling comments and ordinary text apart: compare the types of the two strings
newsoup = BeautifulSoup('<b><!--This is a comment--></b><p>This is a comment</p>', 'html.parser')
print(newsoup.b.string)  # the text of the comment
print(type(newsoup.b.string))  # the type of the comment (<class 'bs4.element.Comment'>)
print(newsoup.p.string)  # the string of the p tag
print(type(newsoup.p.string))  # the type of the p tag's string (<class 'bs4.element.NavigableString'>)
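
The same basic elements can be inspected without any network access. The sketch below uses a small made-up HTML fragment; the attribute values are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up fragment: one paragraph with attributes, one tag holding only a comment.
html = '<p class="title main" id="intro">data</p><b><!--a note--></b>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(isinstance(p, Tag))            # p is a Tag object
print(p.name)                        # the tag name
print(p.attrs)                       # all attributes as a dict; class is a list of values
print(p["id"])                       # a single attribute, dictionary style
print(type(p.string).__name__)       # type of the text inside <p>
print(type(soup.b.string).__name__)  # type of the comment inside <b>
```

Note that class is a multi-valued attribute, so its value comes back as a list rather than a single string.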

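V. Searching for information

Method	Description	Return type
<>.find_all()	searches all descendants and returns every match	list (ResultSet)
<>.find()	searches all descendants and returns the first match	single node, or None
<>.find_parents()	searches all ancestors and returns every match	list (ResultSet)
<>.find_parent()	searches all ancestors and returns the first match	single node, or None
<>.find_next_siblings()	searches all following siblings and returns every match	list (ResultSet)
<>.find_next_sibling()	searches all following siblings and returns the first match	single node, or None
<>.find_previous_siblings()	searches all preceding siblings and returns every match	list (ResultSet)
<>.find_previous_sibling()	searches all preceding siblings and returns the first match	single node, or None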
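
The search methods listed above can be sketched on a small tree. The HTML below is made up for illustration, and the id values are assumptions.

```python
from bs4 import BeautifulSoup

# Made-up document: a list of three items inside a div.
html = """
<div id="box">
  <ul>
    <li id="one">first</li>
    <li id="two">second</li>
    <li id="three">third</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")                   # every match, as a list-like ResultSet
middle = soup.find("li", id="two")                # first match only
nearest_ul = middle.find_parent("ul")             # closest matching ancestor
div_ancestors = middle.find_parents("div")        # all matching ancestors
next_li = middle.find_next_sibling("li")          # first following sibling that is an li
prev_li = middle.find_previous_sibling("li")      # first preceding sibling that is an li
later_ids = [li["id"] for li in all_items[0].find_next_siblings("li")]
missing = soup.find("table")                      # no match: None, not an error

print(len(all_items), nearest_ul.name, div_ancestors[0]["id"])
print(prev_li["id"], next_li["id"], later_ids, missing)
```

The singular methods return None when nothing matches, so it is worth checking the result before reading attributes from it.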


import re  # needed by the regular-expression filters below

# for link in bs.find_all('a'):
#     print(link)  # print every a tag
#     print(link.get('href'))  # print the href value (the URL) of each a tag
# print(bs.find_all('a'))  # all a tags in the soup; same as print(bs('a'))
# print(bs.find_all(['a', 'b']))  # all a tags and b tags in the soup
# print(bs.find_all(True))  # every tag in the soup
# for i in bs.find_all(True):  # print the name of every tag in the soup
#     print(i.name)
# for i in bs.find_all(re.compile('^b')):  # print the names of all tags whose name starts with b
#     print(i.name)
# print(bs.find_all(id='link1'))  # all tags whose id attribute is 'link1'
# print(bs.find_all('a', id='link1'))  # all a tags whose id attribute is 'link1'
# print(bs.find_all('div', attrs={'class': 'category-wrap_iQLoo horizontal_1eKyQ'}))  # all div tags whose class attribute is exactly this string
# class is a reserved word in Python, so the keyword argument is written class_ with a trailing underscore
print(bs.find_all('div', class_='category-wrap_iQLoo horizontal_1eKyQ'))
# print(bs.find_all('a', id=re.compile('^link')))  # all a tags whose id starts with link
# print(bs.find_all(string='Basic Python'))  # list of the strings that are exactly 'Basic Python'
# print(bs.find_all(string=re.compile('Python')))  # list of the strings that contain Python
# print(bs('a'))  # shorthand for bs.find_all('a')
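
Since the calls above depend on a live page, here is a self-contained sketch of the same find_all filters on made-up markup. The tag contents, ids, classes, and example.com URLs are all assumptions for illustration.

```python
import re

from bs4 import BeautifulSoup

# Made-up page with one b tag and three links.
html = """
<body>
  <b>Basic Python</b>
  <a id="link1" class="course free" href="http://example.com/basic">Basic Python</a>
  <a id="link2" class="course" href="http://example.com/advanced">Advanced Python</a>
  <a id="other" href="http://example.com/about">About</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                           # filter by tag name
by_list = soup.find_all(["a", "b"])                    # any of several tag names
by_regex = soup.find_all(re.compile("^b"))             # tag names starting with b
by_id = soup.find_all("a", id=re.compile("^link"))     # id values starting with link
by_class = soup.find_all("a", class_="course")         # tags carrying the CSS class course
by_exact_class = soup.find_all("a", class_="course free")  # exact class attribute string
by_string = soup.find_all(string="Basic Python")       # strings equal to the given text
by_string_re = soup.find_all(string=re.compile("Python"))  # strings containing Python
hrefs = [a.get("href") for a in by_name]               # pull the URL out of each link

print([t.name for t in by_regex])
print([t["id"] for t in by_id])
print(hrefs)
```

A regular expression passed to find_all is applied with search semantics, which is why the patterns here are anchored with ^ when a prefix match is intended.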

VI. Sending email

import smtplib
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.utils import formataddr

mail_host = 'smtp.qq.com'
mail_port = 465  # SSL port, passed as an integer
login_sender = 'XXX@qq.com'
login_pass = 'XXX'
sendName = "XX@qq.com"
resName = "XXX@qq.com"
title = "get_baidu_six_hot"

def sendQQ(receivers):
    # 'mixed' is the multipart subtype for a message that carries attachments
    msg = MIMEMultipart('mixed')
    # attach the Excel file and the Python script
    for filename in ('baidu_hot_six.xlsx', 'get_baidu_six_hot.py'):
        with open(filename, 'rb') as f:
            part = MIMEApplication(f.read())
        part.add_header('Content-Disposition', 'attachment', filename=filename)
        msg.attach(part)

    msg['From'] = formataddr((sendName, login_sender))
    msg['To'] = ', '.join(receivers)
    # subject of the email
    msg['Subject'] = title
    try:
        server = smtplib.SMTP_SSL(mail_host, mail_port)
        server.login(login_sender, login_pass)
        server.sendmail(login_sender, receivers, msg.as_string())
        print("Sent to the mailboxes of " + ", ".join(receivers))
        server.quit()
    except smtplib.SMTPException:
        print("Failed to send the email!")

sendQQ(['XXXX@qq.com', 'XXX@qq.com'])
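
As an alternative to the MIME classes used above, the standard library's EmailMessage API can build the same kind of message. The sketch below only constructs the message and does not send it; build_message is a hypothetical helper, and the addresses and attachment bytes are placeholders.

```python
from email.message import EmailMessage


def build_message(sender, receivers, subject, attachments):
    """Hypothetical helper: attachments maps a filename to its raw bytes."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(receivers)
    msg["Subject"] = subject
    msg.set_content("See the attached files.")
    for filename, data in attachments.items():
        # Generic binary attachment; a more specific MIME type could be used instead.
        msg.add_attachment(
            data, maintype="application", subtype="octet-stream", filename=filename
        )
    return msg


msg = build_message(
    "sender@example.com",                       # placeholder address
    ["a@example.com", "b@example.com"],         # placeholder recipients
    "get_baidu_six_hot",
    {"baidu_hot_six.xlsx": b"placeholder bytes"},
)
names = [part.get_filename() for part in msg.iter_attachments()]
print(msg.get_content_type(), names)
```

A message built this way can be handed to the send_message method of an smtplib.SMTP_SSL connection in place of the sendmail call with msg.as_string().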
