您现在的位置是：首页 > 后端

当前栏目

python BeautifulSoup抓取Web上的数据

Python Web 数据抓取 BeautifulSoup

2023-09-11 14:15:45 时间

python 模块 BeautifulSoup 抓取 Web 数据

互联网上海量的数据是任何研究领域或个人兴趣的丰富资源。为了有效地收集这些数据，您需要熟练掌握网络爬虫。
您将学习如何：

使用requests、Beautiful Soup从 Web抓取和解析数据
从头到尾遍历网络抓取管道
构建一个从 Web 获取工作机会并在控制台中显示相关信息的脚本

在这里插入图片描述

1. Beautiful Soup 简介

简单来说，beautiful Soup 是 python 的一个库，最主要的功能是从网页抓取数据。官方解释如下：

Beautiful Soup 提供一些简单的、python 式的函数用来处理导航、搜索、修改分析树等功能。它是一个工具箱，通过解析文档为用户提供需要抓取的数据，因为简单，所以不需要多少代码就可以写出一个完整的应用程序。 Beautiful Soup 自动将输入文档转换为 Unicode 编码，输出文档转换为 utf-8 编码。你不需要考虑编码方式，除非文档没有指定一个编码方式，这时，Beautiful Soup 就不能自动识别编码方式了。然后，你仅仅需要说明一下原始编码方式就可以了。 Beautiful Soup 已成为和 lxml、html6lib 一样出色的 python 解释器，为用户灵活地提供不同的解析策略或强劲的速度。

2. Beautiful Soup 安装

Beautiful Soup 3 目前已经停止开发，推荐在现在的项目中使用 Beautiful Soup 4，不过它已经被移植到 BS4 了，也就是说导入时我们需要 import bs4 。所以这里我们用的版本是 Beautiful Soup 4.3.2 (简称 BS4)，另外据说 BS4 对 Python3 的支持不够好，不过我用的是 Python2.7.7，如果有小伙伴用的是 Python3 版本，可以考虑下载 BS3 版本。可以利用 pip 或者 easy_install 来安装，以下两种方法均可。

easy_install beautifulsoup4
#或者
pip install beautifulsoup4

如果想安装最新的版本，请直接下载安装包来手动安装，下载完成之后解压运行下面的命令即可完成安装。

sudo python setup.py install

然后需要安装 lxml

easy_install lxml
#或者
pip install lxml
#或者
sudo apt-get install libxml2-dev libxslt-dev python-dev
wget https://github.com/lxml/lxml/releases/download/lxml-4.8.0/lxml-4.8.0.tar.gz
tar -zxvf lxml-4.8.0.tar.gz
cd lxml-4.8.0/
python3 setup.py install

另一个可供选择的解析器是纯 Python 实现的 html5lib , html5lib 的解析方式与浏览器相同，可以选择下列方法来安装 html5lib:

easy_install html5lib
#or
pip install html5lib

3. 创建 Beautiful Soup 对象

解析器：

lxml
html.parser

3.1 第一种: html变量

prettify()获取，解析器html.parser
demo.py

from bs4 import BeautifulSoup


html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

#第一种：创建 beautifulsoup 对象
soup = BeautifulSoup(html,'html.parser')
#soup = BeautifulSoup(html,'lxml')

print(soup.prettify())

执行python3 demo.py,输出：

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

3.2 第二种 open

将html本地化index.html，创建 beautifulsoup 对象

#soup = BeautifulSoup(open('index.html'),'html.parser')

3.3 第三种 with

index2.html

<!DOCTYPE html>
<html>
    <head>
        <title>Header</title>
        <meta charset="utf-8">                   
    </head>

    <body>
        <h2>Operating systems</h2>

        <ul id="mylist" style="width:150px">
            <li>Solaris</li>
            <li>FreeBSD</li>
            <li>Debian</li>                      
            <li>NetBSD</li>           
            <li>Windows</li>         
        </ul>

        <p>
          FreeBSD is an advanced computer operating system used to 
          power modern servers, desktops, and embedded platforms.
        </p>

        <p>
          Debian is a Unix-like computer operating system that is 
          composed entirely of free software.
        </p>        

    </body>    
</html>

demo2.py

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print(soup.h2)
    print(soup.head)
    print(soup.li)
    print("HTML: {0}, name: {1}, text: {2}".format(soup.h2, 
        soup.h2.name, soup.h2.text))


#<h2>Operating systems</h2>
#<head>
#<title>Header</title>
#<meta charset="utf-8"/>
#</head>
#<li>Solaris</li>
#HTML: <title>Header</title>, name: title, text: Header

3.4 第四种：网页抓取

#!/usr/bin/python3

from bs4 import BeautifulSoup
import requests as req

resp = req.get("http://www.something.com")

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.title.text)
print(soup.title.parent)

执行：

$ python3 scraping.py 
<title>Something.</title>
Something.
<head><title>Something.</title></head>

4. Beautiful Soup获取tag

以demo.py为例

soup.title
soup.head
soup.name
soup.a
soup.p
soup.attrs
soup.h2
soup.li

print(soup.title)
#<title>The Dormouse's story</title>

print(soup.head)
#<head><title>The Dormouse's story</title></head>

print(soup.a)
#<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.p)
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print(soup.name)
print(soup.head.name)
#[document]
#head

print type(soup.name)
#<type 'unicode'>

print soup.attrs 
#{} 空字典



print(type(soup.a))
#<class 'bs4.element.Tag'>
print(soup.a)
#a 标签里的内容实际上是注释，但是如果我们利用 .string 来输出它的内容
print(soup.a.string)
print(type(soup.a.string))
#<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
# Elsie 
#<class 'bs4.element.Comment'>


print(soup.name)
print(soup.head.name)
#[document]
#head



#获取p
print(soup.p.attrs)
#{'class': ['title'], 'name': 'dromouse'}

print(soup.p['class'])
#['title']

print(soup.p.get('class'))
#['title']

soup.p['class']="newClass"
print(soup.p)
#<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

del soup.p['class']
print(soup.p)
#<p name="dromouse"><b>The Dormouse's story</b></p>

#获取标签内部的文字
```bash
print(soup.p.string)
#The Dormouse's story

print(type(soup.p.string))
#<class 'bs4.element.NavigableString'>

5. Beautiful Soup 遍历

5.1 recursiveChildGenerator()

demo2.py

#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index2.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    for child in soup.recursiveChildGenerator():

        if child.name:

            print(child.name)

执行：

$ python3  demo2.py
html
head
title
meta
body
h2
ul
li
li
li
li
li
p
p

5.2 children 子元素

get_children.py

!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index2.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    root = soup.html

    print(root.head.contents)
    print(root.head.contents[0])
    print(root.head.children)

    root_childs = [e.name for e in root.children if e.name is not None]
    print(root_childs)

执行：

$ python3 demo3.py
['\n', <title>Header</title>, '\n', <meta charset="utf-8"/>, '\n']
<list_iterator object at 0x7f76282a5f98>
['head', 'body']

5.3 descendants 后继元素

#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index2.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    root = soup.body

    root_childs = [e.name for e in root.descendants if e.name is not None]
    print(root_childs)

执行

$ python3 get_descendants.py 
['h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p']

在这里插入图片描述

6. find_all( name , attrs , recursive , text , **kwargs )

6.1 tag

find_all () 方法搜索当前 tag 的所有 tag 子节点，并判断是否符合过滤器的条件 1）name 参数 name 参数可以查找所有名字为 name 的 tag, 字符串对象会被自动忽略掉 A. 传字符串最简单的过滤器是字符串。在搜索方法中传入一个字符串参数，Beautiful Soup 会查找与字符串完整匹配的内容，下面的例子用于查找文档中所有的标签。

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""


soup = BeautifulSoup(html,'html.parser')

print(soup.find_all('b'))
print(soup.find_all('a'))
#[<b>The Dormouse's story</b>]
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

传正则表达式如果传入正则表达式作为参数，Beautiful Soup 会通过正则表达式的 match () 来匹配内容。下面例子中找出所有以 b 开头的标签，这表示和标签都应该被找到

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

传列表如果传入列表参数，Beautiful Soup 会将与列表中任一元素匹配的内容返回。

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

传 True True 可以匹配任何值，下面代码查找到所有的 tag, 但是不会返回字符串节点

for tag in soup.find_all(True):
    print(tag.name)

#html
#head
#title
#body
#p
#b
#p
#a
#a
#a
#p

传方法如果没有合适过滤器，那么还可以定义一个方法，方法只接受一个元素参数 [4] , 如果这个方法返回 True 表示当前元素匹配并且被找到，如果不是则反回 False 下面方法校验了当前元素，如果包含 class 属性却不包含 id 属性，那么将返回 True:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

将这个方法作为参数传入 find_all () 方法，将得到所有

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

6.2 keyword 参数

注意：如果一个指定名字的参数不是搜索内置的参数名，搜索时会把该参数当作指定名字 tag 的属性来搜索，如果包含一个名字为 id 的参数，Beautiful Soup 会搜索每个 tag 的”id” 属性。

print(soup.find_all(id='link2'))
#[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

如果传入 href 参数，Beautiful Soup 会搜索每个 tag 的”href” 属性

print(soup.find_all(href=re.compile("elsie")))
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

使用多个指定名字的参数可以同时过滤 tag 的多个属性

soup.find_all(href=re.compile("elsie"), id='link1')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

在这里我们想用 class 过滤，不过 class 是 python 的关键词，这怎么办？加个下划线就可以

print(soup.find_all("a", class_="sister"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

有些 tag 属性在搜索不能使用，比如 HTML5 中的 data-* 属性

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
print(data_soup.find_all(data-foo="value"))
# SyntaxError: keyword can't be an expression

但是可以通过 find_all () 方法的 attrs 参数定义一个字典参数来搜索包含特殊属性的 tag

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
print(data_soup.find_all(attrs={"data-foo": "value"}))
#[<div data-foo="value">foo!</div>]

6.3 text 参数

通过 text 参数可以搜搜文档中的字符串内容。与 name 参数的可选值一样，text 参数接受字符串，正则表达式，列表，True

print(soup.find_all(text="Elsie"))
# [u'Elsie']

print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
# [u'Elsie', u'Lacie', u'Tillie']

print(soup.find_all(text=re.compile("Dormouse")))
[u"The Dormouse's story", u"The Dormouse's story"]

limit 参数 find_all () 方法返回全部的搜索结构，如果文档树很大那么搜索会很慢。如果我们不需要全部结果，可以使用 limit 参数限制返回结果的数量。效果与 SQL 中的 limit 关键字类似，当搜索到的结果数量达到 limit 的限制时，就停止搜索返回结果。文档树中有 3 个 tag 符合搜索条件，但结果只返回了 2 个，因为我们限制了返回数量。

print(soup.find_all("a", limit=2))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

recursive 参数调用 tag 的 find_all () 方法时，Beautiful Soup 会检索当前 tag 的所有子孙节点，如果只想搜索 tag 的直接子节点，可以使用参数 recursive=False . 一段简单的文档:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

是否使用 recursive 参数的搜索结果:

print(soup.html.find_all("title"))
#[<title>The Dormouse's story</title>]
print(soup.html.find_all("title", recursive=False))
#[]

7. find( name , attrs , recursive , text , **kwargs )

它与 find_all () 方法唯一的区别是 find_all () 方法的返回结果是值包含一个元素的列表，而 find () 方法直接返回结果

#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index2.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    #print(soup.find("ul", attrs={ "id" : "mylist"}))
    print(soup.find("ul", id="mylist"))

执行：

$ python3 demo4.py
<ul id="mylist" style="width:150px">
<li>Solaris</li>
<li>FreeBSD</li>
<li>Debian</li>
<li>NetBSD</li>
<li>Windows</li>
</ul>

8. BeautifulSoup CSS 选择器

我们在写 CSS 时，标签名不加任何修饰，类名前加点，id 名前加 #，在这里我们也可以利用类似的方法来筛选元素，用到的方法是 soup.select()，返回类型是 list

8.1 通过标签名查找

print(soup.select('title'))
#[<title>The Dormouse's story</title>]


print(soup.select('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


print(soup.select('b'))
#[<b>The Dormouse's story</b>]

8.2 通过类名查找

print(soup.select('.sister'))
##[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

8.3 通过 id 名查找

print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

8.4 组合查找

组合查找即和写 class 文件时，标签名与类名、id 名进行的组合原理是一样的，例如查找 p 标签中，id 等于 link1 的内容，二者需要用空格分开

print(soup.select('p #link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

直接子标签查找

print(soup.select("head > title"))
#[<title>The Dormouse's story</title>]

8.5 属性查找

查找时还可以加入属性元素，属性需要用中括号括起来，注意属性和标签属于同一节点，所以中间不能加空格，否则会无法匹配到。

print(soup.select('a[class="sister"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

同样，属性仍然可以与上述查找方式组合，不在同一节点的空格隔开，同一节点的不加空格

print(soup.select('p a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

以上的 select 方法返回的结果都是列表形式，可以遍历形式输出，然后用 get_text () 方法来获取它的内容。

soup = BeautifulSoup(html, 'lxml')
print type(soup.select('title'))
print soup.select('title')[0].get_text()

for title in soup.select('title'):
    print title.get_text()

#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index2.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print(soup.select("li:nth-of-type(3)"))

#[<li>Debian</li>]

9. BeautifulSoup 追加元素

append()方法将新标签附加到 HTML 文档。

append_tag.py

#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    newtag = soup.new_tag('li')
    newtag.string='OpenBSD'

    ultag = soup.ul

    ultag.append(newtag)

    print(ultag.prettify())

输出

<ul id="mylist" style="width:150px">
 <li>
  Solaris
 </li>
 <li>
  FreeBSD
 </li>
 <li>
  Debian
 </li>
 <li>
  NetBSD
 </li>
 <li>
  Windows
 </li>
 <li>
  OpenBSD
 </li>
</ul>

10. BeautifulSoup 插入元素

insert()方法在指定位置插入标签。

insert_tag.py

#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index2.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    newtag = soup.new_tag('li')
    newtag.string='OpenBSD'

    ultag = soup.ul

    ultag.insert(2, newtag)

    print(ultag.prettify())

输出：

<ul id="mylist" style="width:150px">
 <li>
  Solaris
 </li>
 <li>
  OpenBSD
 </li>
 <li>
  FreeBSD
 </li>
 <li>
  Debian
 </li>
 <li>
  NetBSD
 </li>
 <li>
  Windows
 </li>
</ul>

11. BeautifulSoup 替换文字

replace_with()替换元素的文本。

replace_text.py

#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index2.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    tag = soup.find(text="Windows")
    tag.replace_with("OpenBSD")

    print(soup.ul.prettify())

输出

<ul id="mylist" style="width:150px">
 <li>
  Solaris
 </li>
 <li>
  FreeBSD
 </li>
 <li>
  Debian
 </li>
 <li>
  NetBSD
 </li>
 <li>
  OpenBSD
 </li>
</ul>

12. BeautifulSoup 删除元素

decompose()方法从树中删除标签并销毁它。

decompose_tag.py

#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index2.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    ptag2 = soup.select_one("p:nth-of-type(2)")

    ptag2.decompose()

    print(soup.body.prettify())

输出

<body>
 <h2>
  Operating systems
 </h2>
 <ul id="mylist" style="width:150px">
  <li>
   Solaris
  </li>
  <li>
   FreeBSD
  </li>
  <li>
   Debian
  </li>
  <li>
   NetBSD
  </li>
  <li>
   Windows
  </li>
 </ul>
 <p>
  FreeBSD is an advanced computer operating system used to 
          power modern servers, desktops, and embedded platforms.
 </p>
</body>

参考：

在这里插入图片描述

猜你喜欢

《HBase企业应用开发实战》—— 2.5　停止HBase集群
Mysql和redis事物对比
Android 应用程序恶意软件分析
webpack常见问题收藏
【bzoj4423】[AMPPZ2013]Bytehattan 对偶图+并查集
Git常用指令教程系列（一）
Metasploit运行环境内存不要低于2GB
java（Android）跨Module调用对应类方法需求解决方案
软考高级-信息系统管理师之风险管理（最新版）

相关主题

Java Web(五)Web
python注释
python的if语句
python web 框架
Python超矩形

zl程序教程

当前栏目

python BeautifulSoup抓取Web上的数据

python 模块 BeautifulSoup 抓取 Web 数据

文章目录

1. Beautiful Soup 简介

2. Beautiful Soup 安装

3. 创建 Beautiful Soup 对象

3.1 第一种: html变量

3.2 第二种 open

3.3 第三种 with

3.4 第四种：网页抓取

4. Beautiful Soup获取tag

5. Beautiful Soup 遍历

5.1 recursiveChildGenerator()

5.2 children 子元素

5.3 descendants 后继元素

6. find_all( name , attrs , recursive , text , **kwargs )

6.1 tag

6.2 keyword 参数

6.3 text 参数

7. find( name , attrs , recursive , text , **kwargs )

8. BeautifulSoup CSS 选择器

8.1 通过标签名查找

8.2 通过类名查找

8.3 通过 id 名查找

8.4 组合查找

8.5 属性查找

9. BeautifulSoup 追加元素

10. BeautifulSoup 插入元素

11. BeautifulSoup 替换文字

12. BeautifulSoup 删除元素

相关文章