Crawling Second-Level Link Pages with a Scrapy Spider

Time: 2023-04-18 14:52:39

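I. Setting up the Scrapy environment: see my blog post "Setting Up a Virtual Environment for the Crawler Framework"

II. Configuring the Scrapy settings

1. Set the user agent

Open the target page and refresh it, open the browser's developer tools, select one of the requests in the Network tab, find the User-Agent request header under Headers, and copy its value into USER_AGENT.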

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'

2. Set robots.txt compliance to False (you know why)

ROBOTSTXT_OBEY = False
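
For context on what this setting switches off: when ROBOTSTXT_OBEY is True, Scrapy checks the site's robots.txt and skips URLs it disallows. The sketch below is not Scrapy's own implementation; it uses the standard library's urllib.robotparser on a made-up robots.txt to show the kind of check involved. The rules, the /private/ path, and the is_allowed helper are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site serves its own rules.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("https://blog.test.net/u42/article/list/1"))
    print(is_allowed("https://blog.test.net/private/page"))
```

Running a check like this against the target site's real robots.txt shows which of the URLs you plan to crawl would have been skipped with the setting left on.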

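3. Maximum number of concurrent requests

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 2

4. Set a download delay so the crawler behaves more like a human visitor and the IP is less likely to be blocked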

DOWNLOAD_DELAY = 3
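
A note on what this value means in practice: assuming RANDOMIZE_DOWNLOAD_DELAY is left at its default of True, Scrapy does not wait exactly DOWNLOAD_DELAY seconds but a random time between 0.5 and 1.5 times that value. The helper below is a minimal sketch of that calculation, not code taken from Scrapy; the name randomized_delay is made up for illustration.

```python
import random

DOWNLOAD_DELAY = 3  # same value as in settings.py above


def randomized_delay(base_delay, rng=random):
    """Pick a wait time between 0.5x and 1.5x of base_delay, in seconds."""
    return rng.uniform(0.5 * base_delay, 1.5 * base_delay)


if __name__ == "__main__":
    # With DOWNLOAD_DELAY = 3, each wait falls between 1.5 and 4.5 seconds.
    for _ in range(3):
        print(round(randomized_delay(DOWNLOAD_DELAY), 2))
```

With three list pages and a handful of detail pages per list, a delay of this size keeps the crawl slow but well within a few minutes.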

5. Enable the downloader middleware

DOWNLOADER_MIDDLEWARES = { 'xymtest.middlewares.XymtestDownloaderMiddleware': 543, }

6. Enable the item pipeline

ITEM_PIPELINES = { 'xymtest.pipelines.XymtestPipeline': 300, }

7. Uncomment the last few lines (the HTTP cache settings)

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

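III. Writing the spider code

1. Define the items to scrape

import scrapy


class XymtestItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()

2. Create a .py file in the spiders directory and write the spider code.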

import scrapy
from scrapy import Request

from xymtest.items import XymtestItem


class testInformation(scrapy.Spider):
    name = 'test'
    # Do not include the string of digits that follows the domain name
    allowed_domains = ['blog.test.net']
    # Every list page shares the prefix https://blog.test.net/u42/article/list/
    # followed by a page number: .../list/1 is the first page, .../list/2 the
    # second. A str cannot be concatenated with an int, so the page number is
    # converted with str() before it is appended to the shared prefix.
    start_urls = ['https://blog.test.net/u42/article/list/' + str(x) for x in range(1, 4)]

    def parse(self, response):
        # XPath of the first title:  //*[@id="mainBox"]/main/div[2]/div[1]/h4/a
        # XPath of the second title: //*[@id="mainBox"]/main/div[2]/div[2]/h4/a
        # The shared part is //*[@id="mainBox"]/main/div[2]; only the index of
        # the next div differs, so leaving that index out selects every entry.
        li_list = response.xpath('//*[@id="mainBox"]/main/div[2]/div')

        # Loop over all entries; xq is one entry of li_list
        for xq in li_list:
            item = XymtestItem()
            # Title text, addressed relative to the shared part held in li_list
            item_list = xq.xpath('h4/a/text()').extract()
            # Some entries return no title text, so indexing straight away
            # would raise an error. Index 1 is read below, so make sure at
            # least two text nodes exist first.
            if len(item_list) > 1:
                # strip() removes the surrounding whitespace
                item['title'] = item_list[1].strip()
                # href of this title
                url = xq.xpath('h4/a/@href').extract()[0]
                # Passing the item in meta and parse_detail as the callback
                # hands the item on to the second-level page
                yield Request(url, meta={'item': item}, callback=self.parse_detail)

    def parse_detail(self, response):
        item = response.meta['item']
        # XPath of the content to scrape on the second-level page
        item['content'] = response.xpath('//*[@id="mainBox"]/main/div[1]/div[2]/div/div/span/text()').extract()[0]
        yield item

That's it. The code above implements a basic crawl of second-level links. The next step is to store the scraped data in a database so that we can use it. Want to know more? Keep following me!
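
To see how the item travels from the list page to the detail page without running Scrapy, here is a small offline model of the same flow. It is only a sketch under stated assumptions: ToyRequest, ToyResponse, crawl, the PAGES markup, and the 'list/1' and 'detail/N' URLs are all made up for illustration, and the standard library's xml.etree.ElementTree stands in for Scrapy's selectors. What it mirrors from the spider is the relative lookup of h4/a inside each row and the hand-off of the item through meta to a second callback.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical pages standing in for the blog: URL -> markup.
PAGES = {
    "list/1": (
        "<main><div>"
        "<div><h4><a href='detail/1'> First post </a></h4></div>"
        "<div><p>a row without a title</p></div>"
        "<div><h4><a href='detail/2'> Second post </a></h4></div>"
        "</div></main>"
    ),
    "detail/1": "<main><span>Body of the first post</span></main>",
    "detail/2": "<main><span>Body of the second post</span></main>",
}


class ToyRequest:
    """A URL plus the callback to run on its response and data to carry."""

    def __init__(self, url, callback, meta=None):
        self.url, self.callback, self.meta = url, callback, meta or {}


class ToyResponse:
    """Parsed page for a request; meta is copied over from the request."""

    def __init__(self, request):
        self.meta = request.meta
        self.root = ET.fromstring(PAGES[request.url])


def parse(response):
    # Select every row, then address the title relative to each row.
    for row in response.root.findall("div/div"):
        anchor = row.find("h4/a")
        if anchor is None:  # rows without a title are skipped
            continue
        item = {"title": anchor.text.strip()}
        yield ToyRequest(anchor.get("href"), parse_detail, meta={"item": item})


def parse_detail(response):
    # The half-filled item arrives through meta and is completed here.
    item = response.meta["item"]
    item["content"] = response.root.find("span").text
    yield item


def crawl(start_url):
    """Process requests in FIFO order and collect the finished items."""
    queue, items = deque([ToyRequest(start_url, parse)]), []
    while queue:
        request = queue.popleft()
        for result in request.callback(ToyResponse(request)):
            if isinstance(result, ToyRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


if __name__ == "__main__":
    for scraped in crawl("list/1"):
        print(scraped)
```

The point of the model is that each callback only yields: either a new request that carries the partial item, or the finished item. Scrapy's engine plays the role of crawl here, which is why the spider itself never calls parse_detail directly.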
