18. Python crawler: UA spoofing and proxy IPs in a Scrapy downloader middleware
2023-09-11 14:20:02
Middleware
- Downloader middleware
  - Position: between the engine and the downloader
  - Role: intercept, in batch, every request and response in the project
  - Intercepting requests:
    (1) UA spoofing: `process_request`
    (2) Proxy IP setup: `process_exception`, returning the request
  - Intercepting responses: tamper with the response data or the response object
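The hooks listed above can be sketched without Scrapy installed, since a downloader middleware is just a plain class whose methods Scrapy calls by name. The sketch below is a minimal stand-in; `FakeRequest` and the proxy address are invented for illustration and are not part of Scrapy's API:

```python
import random


class DemoDownloaderMiddleware:
    """Minimal stand-in for a Scrapy downloader middleware (runs without Scrapy)."""

    user_agent_list = ["UA-1", "UA-2"]  # illustrative values only

    def process_request(self, request, spider):
        # UA spoofing: overwrite the User-Agent header on every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        return None  # returning None lets the request continue through the chain

    def process_exception(self, request, exception, spider):
        # On failure, attach a proxy and return the request so it is resent
        request.meta['proxy'] = 'http://127.0.0.1:8888'  # placeholder proxy
        return request


class FakeRequest:
    """Hypothetical request object, used only so this demo can run standalone."""

    def __init__(self, url):
        self.url = url
        self.headers = {}
        self.meta = {}


req = FakeRequest('http://www.baidu.com/s?wd=ip')
DemoDownloaderMiddleware().process_request(req, spider=None)
print(req.headers['User-Agent'])
```

Returning `None` from `process_request` continues normal processing, while returning the request from `process_exception` reschedules it, which is the pattern the real middleware below relies on.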
【Preparation】
Create the project:
scrapy startproject middlePro
Create the spider:
scrapy genspider middle www.xxx.com
The resulting file structure is the standard Scrapy project layout (figure omitted).
Code
Middleware file
Change middlewares.py to:
```python
import random


class MiddleproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]

    # Intercept requests
    def process_request(self, request, spider):
        # UA spoofing
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        # Set a fixed proxy here to verify that the proxy setting takes effect
        request.meta['proxy'] = 'http://183.146.213.198:80'
        return None

    # Intercept all responses
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    # Intercept requests that raised an exception
    def process_exception(self, request, exception, spider):
        if request.url.split(':')[0] == 'http':
            # assign a proxy matching the scheme
            request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
        else:
            request.meta['proxy'] = 'https://' + random.choice(self.PROXY_https)
        return request  # resend the corrected request
```
- Data: a UA pool plus HTTP and HTTPS proxy lists
- Methods: intercept requests (process_request), intercept all responses (process_response), intercept failed requests (process_exception)
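The scheme check inside process_exception can be exercised on its own outside Scrapy. The helper name `choose_proxy` below is hypothetical; the proxy addresses are the same illustrative ones used in the middleware and are unlikely to still be live:

```python
import random

# Same illustrative proxy lists as in the middleware above
PROXY_HTTP = ['153.180.102.104:80', '195.208.131.189:56055']
PROXY_HTTPS = ['120.83.49.90:9000', '95.189.112.214:35508']


def choose_proxy(url):
    """Pick a proxy matching the URL scheme, mirroring process_exception."""
    if url.split(':')[0] == 'http':
        return 'http://' + random.choice(PROXY_HTTP)
    return 'https://' + random.choice(PROXY_HTTPS)


print(choose_proxy('http://www.baidu.com/s?wd=ip'))
```

Splitting on the first `:` yields the scheme, so http and https requests are routed to proxies that speak the matching protocol.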
Change settings.py to:

```python
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}
```

This registers and enables the middleware.
Change middle.py to:

```python
import scrapy


class MiddleSpider(scrapy.Spider):
    # crawl Baidu's search page for "ip" to see which IP the server observes
    name = 'middle'
    # allowed_domains = ['www.xxxx.com']
    start_urls = ['http://www.baidu.com/s?wd=ip']

    def parse(self, response):
        page_text = response.text
        with open('./ip.html', 'w', encoding='utf-8') as fp:
            fp.write(page_text)
```
The result is saved locally: run `scrapy crawl middle`, then open ip.html to check which IP the site reports.
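To confirm the proxy took effect, one can scan the saved page for an IP address instead of reading it by eye. The HTML sample below is invented for illustration; real Baidu markup will differ:

```python
import re

# Invented sample of what the saved ip.html might contain
page_text = '<span class="ip">183.146.213.198</span>'

# Match dotted-quad IPv4 addresses in the page
ips = re.findall(r'\d{1,3}(?:\.\d{1,3}){3}', page_text)
print(ips)  # ['183.146.213.198']
```

If the middleware's proxy is working, the extracted IP should match the proxy address rather than the machine's real IP.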