CSS Selectors: The BeautifulSoup4 Parser (Part 1)

Keywords: CSS, selectors, parser

Published: 2023-09-27 14:25:57

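Like lxml, Beautiful Soup is an HTML/XML parser. Its main job is likewise to parse HTML/XML and extract data from it.

lxml only traverses part of the document, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree. Its time and memory costs are therefore much higher, so it performs worse than lxml.

Beautiful Soup makes parsing HTML fairly simple. Its API is very user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and lxml's XML parser.

Beautiful Soup 3 is no longer being developed, so current projects should use Beautiful Soup 4. It can be installed with pip: pip install beautifulsoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0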

Example

First, import the bs4 library.

# beautifulsoup4_test.py

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object
soup = BeautifulSoup(html)

# Alternatively, create the object from a local HTML file
# soup = BeautifulSoup(open("index.html"))

# Pretty-print the contents of the soup object
print(soup.prettify())

Output

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

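Running this under IPython 2 produces a warning.

The warning means that no parser was specified explicitly, so Beautiful Soup used the best HTML parser available on this system ("lxml"). If the same code runs on another system, or in a different virtual environment, a different parser may be chosen and the behavior may differ.

We can avoid this by naming the lxml parser explicitly: soup = BeautifulSoup(html, "lxml").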
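The effect of naming a parser can be sketched as follows. This is a minimal illustration, not part of the original example: it assumes the standard-library parser "html.parser" so that it runs without lxml installed, and the short html string is made up for the demonstration.

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical fragment of the sample document.
html = "<p class='title'><b>The Dormouse's story</b></p>"

# Naming the parser makes the result independent of which parsers
# happen to be installed. "html.parser" ships with Python;
# passing "lxml" instead requires the lxml package.
soup = BeautifulSoup(html, "html.parser")

first_b = soup.b.string
print(first_b)
```

With the parser named explicitly, the "no parser was explicitly specified" warning is not raised.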

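The four kinds of objects

Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:

Tag
NavigableString
BeautifulSoup
Comment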

1. Tag

Put simply, a Tag is one of the tags in the HTML, for example:

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

HTML tags such as title, head, a and p, together with the content they enclose, are Tags. Let's use Beautiful Soup to fetch some of them:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object
soup = BeautifulSoup(html)

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.head)
# <head><title>The Dormouse's story</title></head>

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.p)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print(type(soup.p))
# <class 'bs4.element.Tag'>

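Writing soup followed by a tag name is an easy way to fetch that tag and its content. The resulting objects are of type bs4.element.Tag. Note, however, that this only finds the first matching tag in the whole document. Querying all matching tags is covered later.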
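The "first match only" behavior can be seen by comparing attribute access with find_all. The sketch below uses a shortened, hypothetical version of the sample markup and assumes the "html.parser" parser.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup with three links.
html = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.a returns only the first <a> tag in the document.
first_id = soup.a["id"]

# find_all("a") returns every <a> tag, in document order.
all_ids = [a["id"] for a in soup.find_all("a")]

# A tag name that does not occur in the document yields None.
missing = soup.table

print(first_id)
print(all_ids)
print(missing)
```

Because a missing tag gives None rather than raising an error, chained access such as soup.table.tr fails with an AttributeError, so it is worth checking for None first.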

A Tag has two important attributes: name and attrs.

print(soup.name)
# [document]
# The soup object itself is a special case: its name is [document].

print(soup.head.name)
# head
# For any other tag, the value is the name of the tag itself.

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# This prints all attributes of the p tag; the result is a dictionary.

print(soup.p["class"])  # same as soup.p.get("class")
# ['title']
# Passing the attribute name to get() is equivalent.

# Attributes and content can also be modified
soup.p["class"] = "newClass"
print(soup.p)
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

# An attribute can also be deleted
del soup.p["class"]
print(soup.p)
# <p name="dromouse"><b>The Dormouse's story</b></p>
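Square-bracket access and get() return the same value when the attribute exists, but they differ when it is missing. The sketch below illustrates this on a one-line, hypothetical p tag, assuming the "html.parser" parser.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a two-valued class attribute.
soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

# class is a multi-valued attribute, so it comes back as a list.
classes = p["class"]

# Other attributes, such as name here, come back as plain strings.
name = p["name"]

# get() returns None (or a supplied default) for a missing attribute...
missing = p.get("id")
fallback = p.get("id", "no-id")

# ...while square-bracket access raises KeyError.
try:
    p["id"]
    raised = False
except KeyError:
    raised = True

print(classes, name, missing, fallback, raised)
```

When an attribute may be absent, get() is usually the safer choice.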

2. NavigableString

Now that we have the tag, how do we get the text inside it? That is easy: use .string. For example:

print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>
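One caveat: .string only yields text when a tag has a single child (it follows a lone child tag down to its text). For a tag with several children it returns None, and get_text() can be used instead. A minimal sketch on hypothetical markup, assuming the "html.parser" parser:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup: one paragraph with a single child,
# one paragraph that mixes text and a tag.
html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time <a id="link2">Lacie</a> lived in a well.</p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

# The only child of title_p is <b>, so .string descends into it.
single = title_p.string

# story_p has several children, so .string is None.
multiple = story_p.string

# get_text() joins all the text inside the tag.
full_text = story_p.get_text()

print(repr(single), multiple, full_text)
```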

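3. BeautifulSoup

The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a Tag object, since it is a special kind of Tag. We can inspect its type, name and attributes to get a feel for it:

print(type(soup.name))
# <type 'unicode'>
# (Python 2; under Python 3 this prints <class 'str'>)

print(soup.name)
# [document]

print(soup.attrs)  # the document itself has no attributes
# {}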
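That the BeautifulSoup object behaves like a Tag can be checked directly. The following sketch uses a shortened, hypothetical document and assumes the "html.parser" parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

# BeautifulSoup is a subclass of Tag, so Tag methods work on it.
is_tag = isinstance(soup, Tag)

doc_name = soup.name      # the special name "[document]"
doc_attrs = soup.attrs    # empty: the document has no attributes

# Tag methods such as find() can be called on the soup object itself.
title_text = soup.find("title").string

print(is_tag, doc_name, doc_attrs, title_text)
```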

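4. Comment

A Comment object is a special type of NavigableString. Its output does not include the comment markers.

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.a.string)
#  Elsie

print(type(soup.a.string))
# <class 'bs4.element.Comment'>

The content of the a tag is actually a comment, but when we print it through .string the comment markers have already been stripped.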
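Since the markers are stripped, a comment can be mistaken for ordinary text. One way to guard against that is to check the type before using the value. The sketch below is an illustration on hypothetical two-link markup, assuming the "html.parser" parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: the first link holds a comment, the second real text.
html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

# The comment text keeps its surrounding spaces but loses <!-- and -->.
comment_text = soup.a.string

visible = []
for a in soup.find_all("a"):
    s = a.string
    # Skip strings that are really comments.
    if isinstance(s, Comment):
        continue
    visible.append(str(s))

print(repr(comment_text))
print(visible)
```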
