Scrapy Notes 2: Scraping Information from Multiple Page Levels

Posted: 2023-04-18 14:52:41

yield scrapy.Request(item['url'], meta={'item': item}, callback=self.detail_parse)
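When you issue a request with scrapy.Request, you can pass meta={'item': item} to carry the data collected so far into the new request. In the callback for that request, retrieve it with item = response.meta['item'] (meta is a dict, so use square brackets, not a call) and keep adding newly scraped fields to item.
This works for any number of request levels: each callback picks up the item, adds to it, and passes it on, and only the last callback returns it.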
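
To make the hand-off concrete without any network access, here is a minimal sketch that only simulates the pattern using the standard library. FakeRequest, FakeResponse and crawl are hypothetical stand-ins invented for this illustration; they are not Scrapy classes and say nothing about how Scrapy is implemented internally. The only point is to show one item travelling through three callback levels via meta.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL, a callback, a meta dict."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse:
    """Hypothetical stand-in for a response; exposes the meta of its request."""

    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta


def parse(response):
    # Level 1: start the item, then hand it to the next request via meta.
    item = {'title': 'listing title'}
    yield FakeRequest(response.url + 'detail', detail_parse, meta={'item': item})


def detail_parse(response):
    # Level 2: pick up the same item and keep adding to it.
    item = response.meta['item']
    item['zhize'] = 'detail text'
    yield FakeRequest(response.url + '?more', detail_parse2, meta={'item': item})


def detail_parse2(response):
    # Level 3: finish the item and yield it as the final result.
    item = response.meta['item']
    item['test'] = 'extra text'
    yield item


def crawl(start_url):
    """Run callbacks until no requests remain; collect every yielded item."""
    queue = deque([FakeRequest(start_url, parse)])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(FakeResponse(request)):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl('https://www.xxx.com/')
print(items)
```

Only detail_parse2 yields the item, so exactly one complete item comes out per listing entry, carrying the fields added at every level.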

Code example:

Spider module

# -*- coding: utf-8 -*-
import scrapy

from ..items import TencentItem


class TencentSpider(scrapy.Spider):
    # Spider name
    name = 'xxx'
    # Domains the spider is allowed to crawl
    allowed_domains = ['www.xxx.com']
    # Entry URL for the crawl
    start_urls = ['https://www.xxx.com/']
    # Counter of listing pages crawled so far
    count = 1
    # Number of listing pages to crawl; 1 means only the first page
    page_end = 1

    def parse(self, response):
        nodeList = response.xpath(
            "//table[@class='tablelist']/tr[@class='odd'] | "
            "//table[@class='tablelist']/tr[@class='even']")
        for node in nodeList:
            item = TencentItem()
            item['title'] = node.xpath("./td[1]/a/text()").extract()[0]
            # The position cell may be empty, so fall back to ''
            item['position'] = node.xpath("./td[2]/text()").extract_first(default='')
            item['num'] = node.xpath("./td[3]/text()").extract()[0]
            item['address'] = node.xpath("./td[4]/text()").extract()[0]
            item['time'] = node.xpath("./td[5]/text()").extract()[0]
            # Resolve the relative detail link against the current page URL
            item['url'] = response.urljoin(node.xpath("./td[1]/a/@href").extract()[0])
            # Crawl the detail page, carrying the item along in meta
            yield scrapy.Request(item['url'], meta={'item': item}, callback=self.detail_parse)
            # A deeper page still has to be crawled, so do not yield the item here
            # yield item

        # Follow pagination; extract_first() returns None when there is no next link
        nextPage = response.xpath("//a[@id='next']/@href").extract_first()
        # Stop at the page limit or on the last page
        if (nextPage is not None and nextPage != 'javascript:;'
                and self.count < self.page_end):
            # Increment the page counter
            self.count += 1
            # Request the next listing page
            yield scrapy.Request(response.urljoin(nextPage), callback=self.parse)

    def detail_parse(self, response):
        # Receive the data scraped at the previous level
        item = response.meta['item']
        # Extract data from the first-level detail page
        item['zhize'] = response.xpath(
            "//*[@id='position_detail']/div/table/tr[3]/td/ul[1]").xpath('string(.)').extract()[0]
        item['yaoqiu'] = response.xpath(
            "//*[@id='position_detail']/div/table/tr[4]/td/ul[1]").xpath('string(.)').extract()[0]
        # Crawl the second-level detail page
        yield scrapy.Request(item['url'] + "&123", meta={'item': item}, callback=self.detail_parse2)
        # A deeper page still has to be crawled, so do not return the item here
        # return item

    def detail_parse2(self, response):
        # Receive the data scraped at the previous levels
        item = response.meta['item']
        # Extract data from the second-level detail page
        item['test'] = "111111111111111111"
        # Finally hand the completed item back to the Scrapy engine
        return item

Item module

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # Job title
    title = scrapy.Field()
    # Job category
    position = scrapy.Field()
    # Number of openings
    num = scrapy.Field()
    # Work location
    address = scrapy.Field()
    # Publication date
    time = scrapy.Field()
    # Detail page link
    url = scrapy.Field()
    # Job responsibilities
    zhize = scrapy.Field()
    # Job requirements
    yaoqiu = scrapy.Field()
    # Test field (set in detail_parse2)
    test = scrapy.Field()
