Scrapy Images Pipeline: Learning Notes
To use Scrapy you first need to install it. These notes use a Python 3.6 environment.

On Windows, activate the python3.6 environment:

    activate python36

On macOS:

    mac@macdeMacBook-Pro:~$ source activate python36
    (python36) mac@macdeMacBook-Pro:~$
Install Scrapy, then create the project and spider:

    (python36) mac@macdeMacBook-Pro:~$ pip install scrapy
    (python36) mac@macdeMacBook-Pro:~$ scrapy --version
    Scrapy 1.8.0 - no active project

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      bench         Run quick benchmark test
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy

      [ more ]      More commands available when run from project directory

    Use "scrapy <command> -h" to see more info about a command
    (python36) mac@macdeMacBook-Pro:~$ scrapy startproject images
    New Scrapy project 'images', using template directory '/Users/mac/anaconda3/envs/python36/lib/python3.6/site-packages/scrapy/templates/project', created in:
        /Users/mac/images

    You can start your first spider with:
        cd images
        scrapy genspider example example.com
    (python36) mac@macdeMacBook-Pro:~$ cd images
    (python36) mac@macdeMacBook-Pro:~/images$ scrapy genspider -t crawl pexels www.pexels.com
    Created spider 'pexels' using template 'crawl' in module:
      images.spiders.pexels
    (python36) mac@macdeMacBook-Pro:~/images$
In settings.py, disable robots.txt compliance:
ROBOTSTXT_OBEY = False
Analyze the URL patterns of the target site, www.pexels.com:
https://www.pexels.com/photo/man-using-black-camera-3136161/
https://www.pexels.com/video/beach-waves-and-sunset-855633/
https://www.pexels.com/photo/white-vehicle-2569855/
https://www.pexels.com/photo/monochrome-photo-of-city-during-daytime-3074526/
From these, the extraction rule for photo detail pages is:
rules = (
Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=True),
)
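The rule's pattern can be sanity-checked against the example URLs above with plain `re`, outside Scrapy (a quick check of the pattern itself, not of the LinkExtractor):

```python
import re

# The same pattern used in the Rule's LinkExtractor above.
pattern = re.compile(r'^https://www.pexels.com/photo/.*/$')

matches = bool(pattern.match('https://www.pexels.com/photo/white-vehicle-2569855/'))
skipped = bool(pattern.match('https://www.pexels.com/video/beach-waves-and-sunset-855633/'))

print(matches)  # True:  photo detail pages are followed
print(skipped)  # False: /video/ pages are ignored
```

Only the photo detail URLs match; the /video/ URL is filtered out by the literal `photo` path segment.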
The images pipeline requires two fields on the item:
    class ImagesItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        image_urls = scrapy.Field()
        images = scrapy.Field()
image_urls holds the scraped image URLs and must be populated by the spider.
images is where the pipeline records the download results and verifies image integrity; it is only filled in after the pipeline runs, which is why the field does not show up when the item is printed inside the spider.
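A sketch of what a processed item looks like once `scrapy.pipelines.images.ImagesPipeline` has run; all values are made-up placeholders, only the shape matters (in Scrapy 1.8 each `images` entry carries `url`, `path`, and `checksum`):

```python
# All values below are made-up placeholders, for illustration only;
# the real entries are written by ImagesPipeline after each download.
downloaded = {
    'image_urls': ['https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg'],
    'images': [{
        'url': 'https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg',
        'path': 'full/0a79c461f9b3e548700e1a63d62f5f6f68ea1538.jpg',  # relative to IMAGES_STORE
        'checksum': 'b0974ea6c88740bed33ccaffca7d7515',               # MD5 of the file body
    }],
}
print(sorted(downloaded['images'][0]))  # ['checksum', 'path', 'url']
```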
In pexels.py, import the item class and build an item object:
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from images.items import ImagesItem


    class PexelsSpider(CrawlSpider):
        name = 'pexels'
        allowed_domains = ['www.pexels.com']
        start_urls = ['http://www.pexels.com/']

        rules = (
            Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=False),
        )

        def parse_item(self, response):
            item = ImagesItem()
            item['image_urls'] = response.xpath('//img[contains(@src,"photos")]/@src').extract()
            print(item['image_urls'])
            return item
In settings.py, enable the images pipeline and set the storage path:
    ITEM_PIPELINES = {
        # 'images.pipelines.ImagesPipeline': 300,
        'scrapy.pipelines.images.ImagesPipeline': 1
    }
    IMAGES_STORE = '/www/crawl'  # download directory for the images
    # which item field holds the URLs to download
    IMAGES_URLS_FIELD = 'image_urls'
Start the crawler:
scrapy crawl pexels --nolog
The images are downloaded.
However, the downloaded images are not full resolution; the query string appended to each image URL has to be stripped.
In settings.py, also enable the project's own pipeline, giving it a higher priority (a lower number) so it runs first:
    ITEM_PIPELINES = {
        'images.pipelines.ImagesPipeline': 1,
        'scrapy.pipelines.images.ImagesPipeline': 2
    }
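The numbers in ITEM_PIPELINES are priorities: lower numbers run first, so the project's URL-cleaning pipeline sees each item before the built-in download pipeline. A minimal sketch of that ordering, with dummy classes standing in for the real pipelines:

```python
# Dummy stand-ins for the two pipelines. Scrapy calls process_item()
# on every enabled pipeline in ascending order of its priority number.
class CleanUrlsPipeline:        # stands in for images.pipelines.ImagesPipeline (priority 1)
    def process_item(self, item, spider):
        item['order'].append('clean')
        return item

class DownloadPipeline:         # stands in for scrapy.pipelines.images.ImagesPipeline (priority 2)
    def process_item(self, item, spider):
        item['order'].append('download')
        return item

pipelines = [(1, CleanUrlsPipeline()), (2, DownloadPipeline())]

item = {'order': []}
for priority, pipeline in sorted(pipelines, key=lambda pair: pair[0]):
    item = pipeline.process_item(item, spider=None)

print(item['order'])  # ['clean', 'download'] -- URLs are cleaned before downloading
```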
In the pipeline file, strip the query string from each URL:
    class ImagesPipeline(object):
        def process_item(self, item, spider):
            tmp = item['image_urls']
            item['image_urls'] = []
            for i in tmp:
                if '?' in i:
                    item['image_urls'].append(i.split('?')[0])
                else:
                    item['image_urls'].append(i)
            return item
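Applied to a Pexels-style image URL (a hypothetical example URL), the stripping removes the resizing parameters that make the CDN serve a downsized variant:

```python
# Hypothetical Pexels-style image URL: the query string asks the CDN
# for a downsized variant, so removing it yields the larger original.
url = 'https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg?auto=compress&cs=tinysrgb&h=350'
clean = url.split('?')[0] if '?' in url else url
print(clean)  # https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg
```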
With that, the full-size images are downloaded. Note that the images pipeline still converts (and therefore recompresses) every image, so if you need the completely untouched originals, which are very large, use the files pipeline instead.
If you want to store the image URLs in MySQL instead of downloading the files, see:
https://www.cnblogs.com/php-linux/p/11792393.html
Images pipeline: configure minimum width and height thresholds:
    IMAGES_MIN_HEIGHT = 800
    IMAGES_MIN_WIDTH = 600
    IMAGES_EXPIRES = 90  # days; images downloaded within this period are not downloaded again
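The minimum-size settings make the pipeline skip any image that falls short in either dimension. A standalone sketch of the filter those settings express:

```python
# Standalone mirror of the filter implied by IMAGES_MIN_WIDTH / IMAGES_MIN_HEIGHT:
# an image is downloaded only if BOTH dimensions meet the configured minimums.
MIN_WIDTH, MIN_HEIGHT = 600, 800

def keep(width, height):
    return width >= MIN_WIDTH and height >= MIN_HEIGHT

print(keep(1920, 1080))  # True
print(keep(500, 900))    # False: width below IMAGES_MIN_WIDTH
```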
Generate thumbnails:
    IMAGES_THUMBS = {
        'small': (50, 50),
        'big': (600, 600)
    }
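With IMAGES_THUMBS set, each image is stored once at full size under full/ plus one downscaled copy per size under thumbs/&lt;name&gt;/, all named after the SHA-1 hash of the image URL. A sketch of the resulting layout under IMAGES_STORE, assuming the hashing scheme used by Scrapy 1.8's ImagesPipeline:

```python
import hashlib

def image_paths(url, thumbs=('small', 'big')):
    # Downloaded files are named after the SHA-1 hash of the image URL.
    name = hashlib.sha1(url.encode('utf-8')).hexdigest()
    paths = ['full/%s.jpg' % name]                             # the full-size image
    paths += ['thumbs/%s/%s.jpg' % (t, name) for t in thumbs]  # one copy per thumb size
    return paths

# Hypothetical image URL, for illustration.
for path in image_paths('https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg'):
    print(path)
```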