Scrapy Images Pipeline: Learning Notes
To use Scrapy you first need to install it. These notes use a Python 3.6 environment.

On Windows, activate the python3.6 environment:

    activate python36

On macOS:

    mac@macdeMacBook-Pro:~$ source activate python36
    (python36) mac@macdeMacBook-Pro:~$
Install Scrapy, then create the project and spider:

    (python36) mac@macdeMacBook-Pro:~$ pip install scrapy
    (python36) mac@macdeMacBook-Pro:~$ scrapy --version
    Scrapy 1.8.0 - no active project

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      bench         Run quick benchmark test
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy

      [ more ]      More commands available when run from project directory

    Use "scrapy <command> -h" to see more info about a command
    (python36) mac@macdeMacBook-Pro:~$ scrapy startproject images
    New Scrapy project 'images', using template directory '/Users/mac/anaconda3/envs/python36/lib/python3.6/site-packages/scrapy/templates/project', created in:
        /Users/mac/images

    You can start your first spider with:
        cd images
        scrapy genspider example example.com
    (python36) mac@macdeMacBook-Pro:~$ cd images
    (python36) mac@macdeMacBook-Pro:~/images$ scrapy genspider -t crawl pexels www.pexels.com
    Created spider 'pexels' using template 'crawl' in module:
      images.spiders.pexels
    (python36) mac@macdeMacBook-Pro:~/images$
In settings.py, disable robots.txt compliance:
ROBOTSTXT_OBEY = False
Analyze the URL patterns of the target site, www.pexels.com:
https://www.pexels.com/photo/man-using-black-camera-3136161/
https://www.pexels.com/video/beach-waves-and-sunset-855633/
https://www.pexels.com/photo/white-vehicle-2569855/
https://www.pexels.com/photo/monochrome-photo-of-city-during-daytime-3074526/
From these, the extraction rule for photo detail pages is:
rules = (
Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=True),
)
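The rule's pattern can be sanity-checked against the example URLs above with plain `re`, outside Scrapy (a quick check of the pattern itself, not of the LinkExtractor):

```python
import re

# The same pattern used in the Rule's LinkExtractor above.
pattern = re.compile(r'^https://www.pexels.com/photo/.*/$')

matches = bool(pattern.match('https://www.pexels.com/photo/white-vehicle-2569855/'))
skipped = bool(pattern.match('https://www.pexels.com/video/beach-waves-and-sunset-855633/'))

print(matches)  # True:  photo detail pages are followed
print(skipped)  # False: /video/ pages are ignored
```

Only the photo detail URLs match; the /video/ URL is filtered out by the literal `photo` path segment.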
The images pipeline requires two fields on the item:
    class ImagesItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        image_urls = scrapy.Field()
        images = scrapy.Field()
image_urls holds the scraped image URLs and must be populated by the spider.
images is where the pipeline records the download results and verifies image integrity; it is only filled in after the pipeline runs, which is why the field does not show up when the item is printed inside the spider.
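A sketch of what a processed item looks like once `scrapy.pipelines.images.ImagesPipeline` has run; all values are made-up placeholders, only the shape matters (in Scrapy 1.8 each `images` entry carries `url`, `path`, and `checksum`):

```python
# All values below are made-up placeholders, for illustration only;
# the real entries are written by ImagesPipeline after each download.
downloaded = {
    'image_urls': ['https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg'],
    'images': [{
        'url': 'https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg',
        'path': 'full/0a79c461f9b3e548700e1a63d62f5f6f68ea1538.jpg',  # relative to IMAGES_STORE
        'checksum': 'b0974ea6c88740bed33ccaffca7d7515',               # MD5 of the file body
    }],
}
print(sorted(downloaded['images'][0]))  # ['checksum', 'path', 'url']
```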
In pexels.py, import the item class and build an item object:
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from images.items import ImagesItem


    class PexelsSpider(CrawlSpider):
        name = 'pexels'
        allowed_domains = ['www.pexels.com']
        start_urls = ['http://www.pexels.com/']

        rules = (
            Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=False),
        )

        def parse_item(self, response):
            item = ImagesItem()
            item['image_urls'] = response.xpath('//img[contains(@src,"photos")]/@src').extract()
            print(item['image_urls'])
            return item
In settings.py, enable the images pipeline and set the storage path:
    ITEM_PIPELINES = {
        # 'images.pipelines.ImagesPipeline': 300,
        'scrapy.pipelines.images.ImagesPipeline': 1
    }
    IMAGES_STORE = '/www/crawl'  # download directory for the images
    # which item field holds the URLs to download
    IMAGES_URLS_FIELD = 'image_urls'
Start the crawler:
scrapy crawl pexels --nolog
The images are downloaded.
However, the downloaded images are not full resolution; the query string appended to each image URL has to be stripped.
In settings.py, also enable the project's own pipeline, giving it a higher priority (a lower number) so it runs first:
    ITEM_PIPELINES = {
        'images.pipelines.ImagesPipeline': 1,
        'scrapy.pipelines.images.ImagesPipeline': 2
    }
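The numbers in ITEM_PIPELINES are priorities: lower numbers run first, so the project's URL-cleaning pipeline sees each item before the built-in download pipeline. A minimal sketch of that ordering, with dummy classes standing in for the real pipelines:

```python
# Dummy stand-ins for the two pipelines. Scrapy calls process_item()
# on every enabled pipeline in ascending order of its priority number.
class CleanUrlsPipeline:        # stands in for images.pipelines.ImagesPipeline (priority 1)
    def process_item(self, item, spider):
        item['order'].append('clean')
        return item

class DownloadPipeline:         # stands in for scrapy.pipelines.images.ImagesPipeline (priority 2)
    def process_item(self, item, spider):
        item['order'].append('download')
        return item

pipelines = [(1, CleanUrlsPipeline()), (2, DownloadPipeline())]

item = {'order': []}
for priority, pipeline in sorted(pipelines, key=lambda pair: pair[0]):
    item = pipeline.process_item(item, spider=None)

print(item['order'])  # ['clean', 'download'] -- URLs are cleaned before downloading
```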
In the pipeline file, strip the query string from each URL:
    class ImagesPipeline(object):
        def process_item(self, item, spider):
            tmp = item['image_urls']
            item['image_urls'] = []
            for i in tmp:
                if '?' in i:
                    item['image_urls'].append(i.split('?')[0])
                else:
                    item['image_urls'].append(i)
            return item
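Applied to a Pexels-style image URL (a hypothetical example URL), the stripping removes the resizing parameters that make the CDN serve a downsized variant:

```python
# Hypothetical Pexels-style image URL: the query string asks the CDN
# for a downsized variant, so removing it yields the larger original.
url = 'https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg?auto=compress&cs=tinysrgb&h=350'
clean = url.split('?')[0] if '?' in url else url
print(clean)  # https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg
```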
With that, the full-size images are downloaded. Note that the images pipeline still converts (and therefore recompresses) every image, so if you need the completely untouched originals, which are very large, use the files pipeline instead.
If you want to store the image URLs in MySQL instead of downloading the files, see:
https://www.cnblogs.com/php-linux/p/11792393.html
Images pipeline: configure minimum width and height thresholds:
    IMAGES_MIN_HEIGHT = 800
    IMAGES_MIN_WIDTH = 600
    IMAGES_EXPIRES = 90  # days; images downloaded within this period are not downloaded again
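The minimum-size settings make the pipeline skip any image that falls short in either dimension. A standalone sketch of the filter those settings express:

```python
# Standalone mirror of the filter implied by IMAGES_MIN_WIDTH / IMAGES_MIN_HEIGHT:
# an image is downloaded only if BOTH dimensions meet the configured minimums.
MIN_WIDTH, MIN_HEIGHT = 600, 800

def keep(width, height):
    return width >= MIN_WIDTH and height >= MIN_HEIGHT

print(keep(1920, 1080))  # True
print(keep(500, 900))    # False: width below IMAGES_MIN_WIDTH
```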
Generate thumbnails:
    IMAGES_THUMBS = {
        'small': (50, 50),
        'big': (600, 600)
    }
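With IMAGES_THUMBS set, each image is stored once at full size under full/ plus one downscaled copy per size under thumbs/&lt;name&gt;/, all named after the SHA-1 hash of the image URL. A sketch of the resulting layout under IMAGES_STORE, assuming the hashing scheme used by Scrapy 1.8's ImagesPipeline:

```python
import hashlib

def image_paths(url, thumbs=('small', 'big')):
    # Downloaded files are named after the SHA-1 hash of the image URL.
    name = hashlib.sha1(url.encode('utf-8')).hexdigest()
    paths = ['full/%s.jpg' % name]                             # the full-size image
    paths += ['thumbs/%s/%s.jpg' % (t, name) for t in thumbs]  # one copy per thumb size
    return paths

# Hypothetical image URL, for illustration.
for path in image_paths('https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg'):
    print(path)
```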