39 Web Scraping - Searching the Document Tree with BeautifulSoup4

Date: 2023-09-11 14:15:43

`find_all(name, attrs, recursive, text, **kwargs)`
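
The examples in this article call find_all on a soup object that the text never constructs. A minimal setup sketch follows, assuming the input is the familiar "three sisters" HTML implied by the outputs shown in the examples. The html_doc string is a reconstruction for illustration, not necessarily the author's exact input, and the choice of "html.parser" is likewise an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, reconstructed from the outputs in this article.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

# Parse with the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list-like result of every matching tag.
links = soup.find_all("a")
link_ids = [a["id"] for a in links]
print(link_ids)  # ['link1', 'link2', 'link3']
```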

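1) The name parameter

The name parameter matches every tag whose name equals name; string objects (text nodes) are ignored automatically.

A. Passing a string
The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly. The following example finds all <b> tags in the document: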

soup.find_all('b')
# [<b>The Dormouse's story</b>]

print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

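B. Passing a regular expression
If you pass a regular expression, Beautiful Soup filters tag names with the pattern's search() method, so an unanchored pattern can match anywhere in the name. The following example uses ^ to find every tag whose name starts with b, which means both <body> and <b> are found.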

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
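
The ^ anchor in the pattern above matters because Beautiful Soup applies the pattern with search(), which can match at any position in the tag name. The sketch below contrasts an anchored and an unanchored pattern on a small hypothetical document (the html_doc string is made up for illustration).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document, small enough to check the result by eye.
html_doc = "<html><head><title>T</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Anchored: only tag names that start with "b".
anchored = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any tag name that contains "t" anywhere.
unanchored = [tag.name for tag in soup.find_all(re.compile("t"))]

print(anchored)    # ['body', 'b']
print(unanchored)  # ['html', 'title']
```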

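C. Passing a list
If you pass a list, Beautiful Soup returns everything that matches any item in the list. The following code finds all <a> tags and <b> tags in the document: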

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
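
Note that in the output above the <b> tag comes first even though "a" is listed first: results follow document order, not the order of the list. A small sketch of this behavior, using a hypothetical one-line document:

```python
from bs4 import BeautifulSoup

# Hypothetical document: one <b> tag followed by two <a> tags.
html_doc = '<p><b>story</b> <a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# The order of names in the list does not affect the order of the results.
names_ab = [tag.name for tag in soup.find_all(["a", "b"])]
names_ba = [tag.name for tag in soup.find_all(["b", "a"])]

print(names_ab)  # ['b', 'a', 'a']
print(names_ba)  # ['b', 'a', 'a']
```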

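2) Keyword arguments

Any keyword argument that is not one of the named parameters is treated as a filter on the tag attribute of that name. For example, passing id searches each tag's id attribute: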

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
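
Keyword filters are not limited to id. The sketch below shows a few common variants on a hypothetical document: class_ (with a trailing underscore, because class is a reserved word in Python), a regular expression as the attribute value, and the attrs dictionary for attribute names that are not valid Python identifiers. The data-role attribute is invented for this illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document; the data-role attribute exists only for this example.
html_doc = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3" data-role="last">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_id = soup.find_all(id="link2")                       # exact attribute value
by_class = soup.find_all(class_="sister")               # class needs the trailing underscore
by_href = soup.find_all(href=re.compile("elsie"))       # regex applied to the attribute value
by_attrs = soup.find_all(attrs={"data-role": "last"})   # names that are not valid identifiers

print([tag["id"] for tag in by_id])     # ['link2']
print(len(by_class))                    # 3
print([tag["id"] for tag in by_href])   # ['link1']
print([tag["id"] for tag in by_attrs])  # ['link3']
```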

3) The text parameter

The text parameter searches the strings in the document rather than the tags. Like name, it accepts a string, a regular expression, or a list.

soup.find_all(text="Elsie")
# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]
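
Since Beautiful Soup 4.4.0 this parameter is also available under the name string, and text is the older spelling. Used alone it returns the matching strings themselves; combined with a tag name it returns the tags whose string matches. A minimal sketch on a hypothetical two-link document, assuming a Beautiful Soup version that accepts string:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document with two links.
html_doc = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# String filter alone: the result contains text nodes, not tags.
strings = soup.find_all(string=re.compile("^L"))

# Tag name plus string filter: the result contains the matching tags.
tags = soup.find_all("a", string="Elsie")

print([str(s) for s in strings])     # ['Lacie']
print([tag["id"] for tag in tags])   # ['link1']
```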
