Learning Python from Scratch - Web Scraping - 3. Scraping an Entire Web Novel with CSS Selectors

Tags: Python, web scraping basics, CSS, selectors

Time: 2023-09-14 09:04:58

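This course series now formally enters the Python web scraping stage. The exact chapters depend on what is actually published; click the [Python web scraping] category column to view them in reverse order:

[Important: do not scrape content that harms other people or national interests. Although the techniques in this course can scrape any content on the internet, the course brings no profit and is shared only for learning.]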

Development environment: [Win10]

Development tool: [Visual Studio 2019]

Python version: [3.7]

1. Create a new empty project [T3]:

2. Required modules: [requests] and [scrapy.selector]

from requests import get
from scrapy.selector import Selector

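3. Get the access path. Site to scrape: [http://www.zongheng.com/]

We will simply scrape the first novel listed: click [Start reading]

Go to the table of contents page:

[Use Inspect Element to locate the corresponding element] · [Table of contents URL: http://book.zongheng.com/showchapter/1079911.html]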

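4. CSS selectors [scrapy.selector]

Introduction to CSS selectors

In CSS, a selector is a pattern used to select the elements you want to style. Whether CSS controls the elements of an HTML page one-to-one, one-to-many, or many-to-one, it always does so through selectors.

Basic syntax of CSS selectors

Class selector: matches on an element's class attribute; for example, .box selects the elements with class="box";
ID selector: matches on an element's id attribute; for example, #box selects the element with id="box";
Element selector: selects document elements directly; for example, p selects all p elements and div selects all div elements;
Attribute selector: selects elements that have a given attribute; for example, *[title] selects all elements that have a title attribute, and a[href] selects all a elements that have an href attribute;
Descendant selector: selects elements that are descendants of another element; for example, li a selects all a elements anywhere under an li;
Child selector: selects elements that are direct children of another element; for example, h1 > strong selects all strong elements whose parent is an h1;
Adjacent sibling selector: selects an element that immediately follows another element with the same parent; for example, h1 + p selects every p element that comes directly after an h1;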

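Using CSS selectors in Scrapy

Taking the a element as an example:

response.css('a'): returns a SelectorList of Selector objects;
response.css('a').extract(): returns a list of strings, each one the full HTML of an a tag;
response.css('a::text').extract_first(): returns the text of the first a tag;
response.css('a::attr(href)').extract_first(): returns the value of the href attribute of the first a tag;
response.css('a[href*=image]::attr(href)').extract(): returns the href values of all a tags whose href contains "image";
response.css('a[href*=image] img::attr(src)').extract(): returns the src attributes of all img tags under a tags whose href contains "image";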

5. Get the information for all chapter pages from the table of contents URL

5.1 Get the a tags:

5.2 Get the access paths of all chapters:

from requests import get
from scrapy.selector import Selector

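# Download the table of contents page and decode the bytes as UTF-8
html=get("http://book.zongheng.com/showchapter/1079911.html").content.decode("utf-8")
sel=Selector(text=html)
# Collect the href attribute of every a tag nested in a ul li
result=sel.css("ul li a::attr(href)").extract()
for x in result:
    print(x)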

Add a check so that only links containing the ID 1079911 from the table of contents URL are printed:

from requests import get
from scrapy.selector import Selector

html=get("http://book.zongheng.com/showchapter/1079911.html").content.decode("utf-8")
sel=Selector(text=html)
result=sel.css("ul li a::attr(href)").extract()
for x in result:
    # Keep only links that contain the ID from the table of contents URL
    if "1079911" in x:
        print(x)

6. Fetch the content returned by each chapter URL (to avoid being blocked, only 2 chapters are visited per test run)

from requests import get
from scrapy.selector import Selector

html=get("http://book.zongheng.com/showchapter/1079911.html").content.decode("utf-8")
sel=Selector(text=html)
result=sel.css("ul li a::attr(href)").extract()
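# To avoid being banned, only visit the first two chapters while testing:
# count starts at 3 and the loop stops once it has been decremented to 0
count=3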
for x in result:
    if "1079911" in x:
        count -= 1
        if count==0:
            break
        html=get(x).content.decode("utf-8")
        sel=Selector(text=html)
        title=sel.css("div.title_txtbox::text").extract()[0]
        print(title)
        info=sel.css("div.content p::text").extract()
        for j in info:
            print(j)

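7. Save the fetched content as [txt files] · the test run is still limited to 2 iterations

Because file names cannot contain certain special characters, the full-width colon [：] in the title is replaced with [_]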

from requests import get
from scrapy.selector import Selector

html=get("http://book.zongheng.com/showchapter/1079911.html").content.decode("utf-8")
sel=Selector(text=html)
result=sel.css("ul li a::attr(href)").extract()
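# To avoid being banned, only visit the first two chapters while testing:
# count starts at 3 and the loop stops once it has been decremented to 0
count=3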

for x in result:
    if "1079911" in x:
        count -= 1
        if count==0:
            break
        html=get(x).content.decode("utf-8")
        sel=Selector(text=html)
        title=sel.css("div.title_txtbox::text").extract()[0]
        title=title.replace("：","_")
        info=sel.css("div.content p::text").extract()
        # Join all paragraphs into a single string
        strInfo="".join(info)
        # The with statement closes the file even if the write fails
        with open(title+".txt","w",encoding="utf-8") as file:
            file.write(strInfo)
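Chapter titles may contain other characters that Windows rejects in file names (\ / : * ? " < > |), not just the colon. As a rough sketch, a helper like the following could replace all of them at once; safe_filename and its "untitled" fallback are hypothetical names introduced here, not part of the original code.

```python
import re

# Characters Windows does not allow in file names, plus the full-width
# colon that the tutorial replaces.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|：]')


def safe_filename(title, ext=".txt"):
    """Turn a chapter title into a file name by replacing risky characters."""
    name = _INVALID_CHARS.sub("_", title).strip()
    # Fall back to a placeholder if nothing usable is left.
    return (name or "untitled") + ext


print(safe_filename("第一章：开始"))
print(safe_filename('What? A "test" / demo'))
```

The result can be passed straight to open() in place of the title-based name built above.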

8. Final run (to avoid being blocked, wait 1~3 s between requests, controlled with time):

from requests import get
from scrapy.selector import Selector
import time
import random
html=get("http://book.zongheng.com/showchapter/1079911.html").content.decode("utf-8")
sel=Selector(text=html)
result=sel.css("ul li a::attr(href)").extract()

for x in result:
    if "1079911" in x:
        html=get(x).content.decode("utf-8")
        sel=Selector(text=html)
        title=sel.css("div.title_txtbox::text").extract()[0]
        title=title.replace("：","_")
        info=sel.css("div.content p::text").extract()
        # Join all paragraphs into a single string
        strInfo="".join(info)
        # The with statement closes the file even if the write fails
        with open(title+".txt","w",encoding="utf-8") as file:
            file.write(strInfo)
        # Pause 1~3 s after each chapter; randint includes both endpoints
        timeStop=random.randint(1,3)
        time.sleep(timeStop)
        print("Done:",title)
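The pause between requests can be isolated in a small function, which also makes it easy to check without actually waiting. This is a sketch under assumptions: polite_pause and its sleep parameter are hypothetical names, and random.uniform is swapped in for randint so the delay is not limited to whole seconds.

```python
import random
import time


def polite_pause(min_s=1.0, max_s=3.0, sleep=time.sleep):
    """Wait a random time between min_s and max_s seconds; return the delay."""
    delay = random.uniform(min_s, max_s)
    # sleep is injectable so the function can be exercised without waiting.
    sleep(delay)
    return delay


# Demonstration with a stand-in for time.sleep that only records the delay.
recorded = []
delay = polite_pause(sleep=recorded.append)
print("would have slept for", round(delay, 2), "seconds")
```

In the download loop, calling polite_pause() with no arguments would replace the randint and time.sleep lines.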

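After a long wait, the result is as shown below:

9. Summary:

a) CSS selectors can be used in countless ways; only frequent practice makes them second nature.

b) I recommend finding several more websites on your own and trying all kinds of CSS selectors to extract the information you need.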

