Web Scraping with Python: Reading Notes, Part 1

2023-09-14 09:07:13

Part I: Building Scrapers

Chapter 1: Your First Web Scraper

1.1 Network Connections

# py3: fetch a page with urllib and print the raw response
from urllib.request import urlopen

url = "http://www.baidu.com"
html = urlopen(url)  # returns an http.client.HTTPResponse
print(html.read())   # read() returns bytes, not str

Official documentation: https://docs.python.org/3/library/urllib.html
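Since read() returns bytes rather than str, decode them if you want readable text. A minimal sketch, assuming the page is UTF-8 encoded:

# decode the raw bytes into a string
from urllib.request import urlopen

resp = urlopen("http://www.baidu.com")
raw = resp.read()
print(raw.decode("utf-8", errors="replace")[:200])  # first 200 characters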

1.2 BeautifulSoup

Installation:

pip install beautifulsoup4

Official documentation (Chinese translation):
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/index.html

A virtual environment is recommended; see the companion article on installing Python and virtual environments on a Mac. Minimal setup commands are sketched below.
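A minimal setup on macOS/Linux (the venv name here is just the conventional default):

python3 -m venv venv
source venv/bin/activate
pip install beautifulsoup4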

# py3: parse the page with BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.baidu.com"
html = urlopen(url)

# name the parser explicitly; omitting it triggers a warning
soup = BeautifulSoup(html.read(), "html.parser")

print(soup.title)
# <title>百度一下,你就知道</title>

Adding exception handling

# Exceptions to handle:
# 1. the server does not exist
# 2. the page does not exist

# py3
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.baidu.com"

try:
    html = urlopen(url)

except Exception as e:
    print(e)

else:
    soup = BeautifulSoup(html.read(), "html.parser")
    print(soup.title)  # <title>百度一下,你就知道</title>

    # accessing a missing tag returns None instead of raising
    tag = soup.xxxxx
    print(type(tag))  # <class 'NoneType'>

finally:
    pass

Reorganizing the code


# py3
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    """
    Handle three failure modes:
    1. the server does not exist
    2. the page does not exist
    3. the tag does not exist
    """
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        return None

    try:
        soup = BeautifulSoup(html.read(), "html.parser")
        title = soup.head.title
    except AttributeError:
        return None

    return title


url = "http://www.baidu.com"
title = getTitle(url)
if title is None:
    print("title is None")
else:
    print(title)
    # <title>百度一下,你就知道</title>

Chapter 2: Advanced HTML Parsing

2.1 Filtering Tags

find and find_all take the same parameters:

name=None,      tag name(s); a list acts as an OR match
attrs={},       attributes; multiple entries act as an AND match
recursive=True, whether to search descendants recursively
text=None,      match on a tag's text content
limit=None,     cap the number of results; find <=> find_all(limit=1)
**kwargs,       keyword filters, e.g. class_ for the class attribute

The example below uses attrs and text; a sketch of the remaining parameters follows it.

url = "http://www.pythonscraping.com/pages/warandpeace.html"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

# attrs: every <span> whose class is "green"
name_list = soup.find_all("span", {"class": "green"})
for name in name_list:
    print(name.get_text())

# text: find the tag whose text is exactly "Chapter 1"
h1 = soup.find(text="Chapter 1")
print(h1)
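A minimal sketch of the remaining parameters against the same page (the tag names and limits here are illustrative, not from the book):

# name as a list is an OR match: every <h1> or <span>
headers_and_spans = soup.find_all(["h1", "span"])

# limit caps the result count; find is just the limit=1 case
first_two_green = soup.find_all("span", {"class": "green"}, limit=2)

# class_ is the keyword shorthand for the class attribute
green_spans = soup.find_all(class_="green")

# recursive=False searches only the soup's direct children
top_level = soup.find_all("span", recursive=False)

print(len(headers_and_spans), len(first_two_green), len(green_spans), len(top_level))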

2.2 BeautifulSoup Objects

Four object types:
BeautifulSoup   the document as a whole
Tag             a tag
NavigableString the text inside a tag
Comment         a comment
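All four types can be seen on a tiny document; a minimal sketch (the markup is made up for illustration):

from bs4 import BeautifulSoup

markup = "<p><b>bold</b><!-- a comment --></p>"
soup = BeautifulSoup(markup, "html.parser")

print(type(soup))           # <class 'bs4.BeautifulSoup'>
print(type(soup.b))         # <class 'bs4.element.Tag'>
print(type(soup.b.string))  # <class 'bs4.element.NavigableString'>

comment = soup.p.contents[1]  # the second child of <p> is the comment node
print(type(comment))          # <class 'bs4.element.Comment'>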

2.3 Navigating Trees

# children and descendants
children
descendants

# siblings
next_siblings
previous_siblings
next_sibling
previous_sibling

# parents
parent
parents

url = "http://www.pythonscraping.com/pages/page3.html"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", {"id": "giftList"})

from bs4.element import Tag

# print the table row by row, skipping whitespace-only children
# for tr in table.children:
#     if isinstance(tr, Tag):
#         for td in tr.children:
#             print(td.get_text(), end="|")
#     print("\n")

# walk from the image up to its parent cell, then to the sibling
# cell just before it, which holds the price
img = table.find("img", {"src": "../img/gifts/img1.jpg"})
price = img.parent.previous_sibling.get_text()
print(price)
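Continuing from the same table, the difference between children and descendants shows up in what each iterator yields; a small sketch:

# children yields only direct children (rows and whitespace strings);
# descendants recurses into every nested node
rows = [tr for tr in table.children if isinstance(tr, Tag)]
print(len(rows))                      # the <tr> rows only
print(len(list(table.descendants)))   # every node anywhere under the table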

2.4 Regular Expressions

Official documentation: https://docs.python.org/3/library/re.html
Related article: Python Programming: the re Regex Library

A regex for email addresses:
[A-Za-z0-9\._+]+@[A-Za-z0-9]+\.(com|cn|org|edu|net)
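The pattern can be tested directly with re before wiring it into a scraper; a minimal sketch (the sample addresses are made up):

import re

email_re = re.compile(r"[A-Za-z0-9\._+]+@[A-Za-z0-9]+\.(com|cn|org|edu|net)")
print(bool(email_re.fullmatch("ryan.mitchell@example.com")))  # True
print(bool(email_re.fullmatch("not-an-email")))               # False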
import re

url = "http://www.pythonscraping.com/pages/page3.html"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

# match any gift image path such as ../img/gifts/img1.jpg
regex = re.compile(r"\.\./img/gifts/img.*\.jpg")
imgs = soup.find_all("img", {"src": regex})
for img in imgs:
    print(img.get("src"))  # get() reads an attribute; see also img.attrs

2.5 Lambda Expressions

# the lambda is called with each tag and must return True or False;
# here: keep every tag that has exactly two attributes
tags = soup.find_all(lambda tag: len(tag.attrs) == 2)
print(tags)
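A lambda can also stand in for a text filter; a small sketch against the same page (the exact price string is illustrative and may carry surrounding whitespace in practice):

# a predicate equivalent to searching by text
cheap = soup.find_all(lambda tag: tag.get_text() == "$15.00")
print(cheap)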

2.6 HTML Parser Libraries

lxml: http://lxml.de/
html.parser: https://docs.python.org/3/library/html.parser.html
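The parser name passed to BeautifulSoup decides the trade-off: html.parser ships with the standard library, while lxml is faster and more lenient with malformed markup but must be installed separately. A quick sketch:

from bs4 import BeautifulSoup

markup = "<p>hello"  # note the unclosed tag
print(BeautifulSoup(markup, "html.parser").p)  # built in, no extra install
print(BeautifulSoup(markup, "lxml").p)         # assumes: pip install lxml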