19. Python crawler: scraping NetEase News content with the Scrapy framework


1. Requirements

Scrape the titles and body content of NetEase News.

  1. Parse the detail-page URLs of the five target sections out of the news homepage (the homepage is static and can be scraped directly).
  2. The news titles inside each section page are loaded dynamically (dynamic loading).
  3. Parse each news item's detail-page URL, fetch the detail page's source, and extract the article content from it.

【Preliminary setup】

Create the project: scrapy startproject wangyiPro
Create the spider: scrapy genspider wangyi www.xxx.com

Configure settings.py

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/...'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
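
The downloader middleware and item pipeline used later also need to be registered here. A minimal sketch of the extra settings.py entries, assuming the class names that scrapy startproject generates for this project (WangyiproDownloaderMiddleware appears in middlewares.py below; WangyiproPipeline is the generated default and is assumed here):

# Enable the downloader middleware that swaps in selenium-rendered pages
DOWNLOADER_MIDDLEWARES = {
   'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}

# Enable the item pipeline that receives the scraped title/content items
ITEM_PIPELINES = {
   'wangyiPro.pipelines.WangyiproPipeline': 300,
}

# Optional: only show errors so the console output stays readable
LOG_LEVEL = 'ERROR'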

2. Analysis and implementation

(1) Get the detail-page URLs of the five sections

We need the detail-page addresses of five sections: 国内 (domestic), 国际 (international), 军事 (military), 航空 (aviation), and 无人机 (drones). They all sit in li tags under the same ul.
XPath: //*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li

The target links are in the href attribute of the a tag inside the li elements at list indices 3, 4, 6, 7, and 8 (matching the alist used in the code below).

  • Code
import scrapy


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    allowed_domains = ['www.xxx.com']
    start_urls = ['https://news.163.com/']
    models_urls = []  # stores the detail-page urls of the five sections

    def parse(self, response):
        # all section entries are li tags under the same ul
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        alist = [3, 4, 6, 7, 8]  # list indices of the five target sections
        for index in alist:
            model_url = li_list[index].xpath('./a/@href').extract_first()
            self.models_urls.append(model_url)
            print(model_url)
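
At this point the five section URLs can be sanity-checked by running the spider from the project directory; the print call above writes them to the console:

scrapy crawl wangyi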

(2) Parse each section

The five section pages returned by the downloader contain only static data, while the news lists on the live pages are loaded dynamically. To obtain the dynamically loaded news data, we hook a downloader middleware (sitting between the engine and the downloader) that swaps in rendered responses.

wangyi.py

    # the news titles inside each section are loaded dynamically
    def parse_model(self, response):  # parse each section page for news titles and detail-page urls
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()

            item = WangyiproItem()
            item['title'] = title

            # request the news detail page, passing the item along via meta
            yield scrapy.Request(url=new_detail_url, callback=self.parse_detail, meta={'item': item})

wangyi.py also instantiates a selenium browser object, which the middleware uses to render the dynamically loaded section pages.
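
The full spider below creates that browser in __init__ and exposes it as self.bro. A minimal sketch of a headless variant, assuming selenium with Chrome (the headless options are an assumption; the original setup opens a visible browser window):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')     # render without opening a window (assumption)
options.add_argument('--disable-gpu')
# assumes chromedriver is on PATH; otherwise pass its location as in the spider below
bro = webdriver.Chrome(options=options)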

middlewares.py

from scrapy import signals
from scrapy.http import HtmlResponse
from time import sleep


class WangyiproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    # intercept the responses of the five section pages and tamper with them
    def process_response(self, request, response, spider):
        # grab the browser object defined on the spider class
        bro = spider.bro
        # pick out the responses to tamper with:
        # the url identifies the request, and the request identifies the response
        if request.url in spider.models_urls:
            # re-request the section page through selenium so the
            # dynamically loaded news data gets rendered
            bro.get(request.url)
            sleep(2)
            page_text = bro.page_source  # now contains the dynamically loaded news data
            # build a new response object (one that meets our needs: it contains
            # the dynamically loaded news data) to replace the original one
            new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            # responses of all other requests pass through unchanged
            return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

process_response(self, request, response, spider) intercepts the responses of the five section pages and tampers with them: the url identifies the request, and the request in turn identifies the response. The method takes the bro browser object from the spider, uses it to fetch the page source containing the dynamically loaded news data, and wraps that source in a new response object that replaces the original static one (the middleware must be registered under DOWNLOADER_MIDDLEWARES in settings.py, as shown earlier).

(3) Parse the detail-page information behind each section's titles

wangyi.py

import scrapy
from selenium import webdriver
from wangyiPro.items import WangyiproItem


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://news.163.com/']
    models_urls = []  # stores the detail-page urls of the five sections

    # instantiate a browser object shared with the downloader middleware
    def __init__(self):
        # raw string so the backslashes in the Windows path are not treated as escapes
        self.bro = webdriver.Chrome(executable_path=r'D:\WebCrawler\scrapy_program\wangyiPro\wangyiPro\spiders\chromedriver')

    # parse the homepage for the detail-page urls of the five sections
    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        alist = [3, 4, 6, 7, 8]  # list indices of the five target sections
        for index in alist:
            model_url = li_list[index].xpath('./a/@href').extract_first()
            self.models_urls.append(model_url)

        # request each section page in turn
        for url in self.models_urls:
            yield scrapy.Request(url, callback=self.parse_model)

    # the news titles inside each section are loaded dynamically
    def parse_model(self, response):  # parse each section page for news titles and detail-page urls
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()

            item = WangyiproItem()
            item['title'] = title

            # request the news detail page, passing the item along via meta
            yield scrapy.Request(url=new_detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):  # parse the article content
        content = response.xpath('//*[@id="endText"]//text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content

        yield item

    # close the browser when the spider shuts down
    def closed(self, spider):
        self.bro.quit()

items.py

import scrapy


class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()

pipelines.py

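A minimal pipeline sketch, assuming the default WangyiproPipeline class name generated by scrapy startproject; it just prints each item, and real storage logic (file or database writes) would replace the print:

class WangyiproPipeline:
    # called once for every item yielded by the spider
    def process_item(self, item, spider):
        print(item)
        return item

With the middleware and pipeline registered in settings.py, start the crawl from the project directory with scrapy crawl wangyi.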