您现在的位置是：首页 > 后端

当前栏目

一个人，一座城，你到底在乎什么？Python 爬虫告诉你

Python 爬虫一个什么到底告诉

2023-09-11 14:15:51 时间

资源分享【这份资料必须领取~】

大家好！

有时候，我们想知道生活在这座城市的人每天交流的事情，然后对数据进行一些分析，方便我们更好地了解城市的特征及居民的需求

以重庆为例，最火爆的论坛是购物狂，每天都有大量的帖子内容产生，基本上囊括了重庆人的衣食住行

本篇文章将介绍使用 Scrapy 爬取该论坛数据的完整教程

1. 爬虫

目标对象：

aHR0cHM6Ly9nby5jcW1tZ28uY29tL3RocmVhZC9sYXN0ZXN0P3BhZ2U9MQ==

我们需要爬取所有页面帖子中所有的回帖信息

1-1 安装依赖

# 安装依赖pip3 install Scrapy# 生成词云pip3 install jiebapip3 install stylecloud

1-2 创建项目及爬虫

# 创建项目scrapy startproject cq_all# 创建一个 CrawlSpider 爬虫cd cq_allscrapy genspider -t crawl talk "HOST"

1-3 定义 Item 实体对象

在 items.py 文件中，将需要爬取的数据定义为 Item

这里只需要爬取帖子的标题、URL 及所有回帖内容

# items.pyimport scrapyclass CqAllItem(scrapy.Item):    # define the fields for your item here like:    # name = scrapy.Field()    title = scrapy.Field()  # 帖子标题    url = scrapy.Field()  # URL    content = scrapy.Field()  # 内容

1-4 编写爬虫

在 spiders 文件夹下的爬虫文件中编写具体的爬虫逻辑

首先通过定义 rules，指定要爬取的地址，然后使用 Xpath 语法解析出所有文本内容

PS：通过设定 follow 为 True，可以爬取某一个帖子所有的回帖数据

# talk.pyfrom scrapy.linkextractors import LinkExtractorfrom scrapy.spiders import CrawlSpider, Ruleclass TalkSpider(CrawlSpider):    name = 'talk_all'    allowed_domains = ['HOST']    start_urls = ['https://HOST/thread/lastest?page={}'.format(i + 1) for i in range(20)]    rules = (        Rule(LinkExtractor(allow=r'https://HOST/forum-\d+-thread-\S+.html', deny=r'safeRedirect'),             callback='parse_forum',             follow=True),    )        def parse_forum(self, response):        """        解析帖子        :param response:        :return:        """        try:            # 帖子标题            title = response.xpath('//*[@id="view-hd"]//span/text()').extract()[0]            # URL            url = response.url            # 获取内容，并获取其真实文字内容            content = get_content(                response.xpath('//*[@id="view-bd"]//table[@class="view-data"]//div//text()').extract())            # 导航标题栏            tab_titles = response.xpath('//*[@id="nav"]//a/text()').extract()            # 过滤特殊帖子，比如：公告            if not in_exclude_tabs(tab_titles):                if content:                    # print("帖子URL：", url, "，标题：", title, "，内容:", content)                    item = CqAllItem()                    item['title'] = title                    item['url'] = url                    item['content'] = content                    yield item                else:                    # print("（过滤掉）帖子URL：", response.url, "内容为空")                    pass            else:                # print("（过滤掉）过滤的帖子，Tab:", tab_titles, "，地址：", response.url)                pass        except Exception as e:            print("产生异常，异常信息:", str(e))

为了过滤无效信息，这里对回帖内容及 Tab 进行一次初步过滤

# 排除的Tab，过滤无效信息exclude_tabs = ['公告','网友中心']def get_content(contents):    """    去掉空格换行符    :param contents:    :return:    """    result = ''    for item in contents:        result += item.strip()    return resultdef in_exclude_tabs(tab_titles):    result = False    for tab_title in tab_titles:        if tab_title in exclude_tabs:            result = True            break    return result

1-5 反反爬

为了应对网站的反爬，这里需要进行如下处理

下载延迟

自定义请求头

禁用重定向

# settings.py# Obey robots.txt rulesROBOTSTXT_OBEY = FalseDOWNLOAD_DELAY = 1# 禁用重定向REDIRECTS_ENABLED = False# talk.py# 自定义请求头custom_settings = {        "COOKIES_ENABLED": False,        "DOWNLOAD_DELAY": 3,        'DEFAULT_REQUEST_HEADERS': {            'Accept': '*/*',            'Accept-Encoding': 'gzip, deflate, br',            'Accept-Language': 'zh-CN,zh;q=0.9',            'Connection': 'keep-alive',            'Content-type': 'application/x-www-form-urlencoded;',            'Host': 'go.cqmmgo.com',            'Origin': 'https://HOST',            'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',            'sec-ch-ua-mobile': '?0',            'sec-ch-ua-platform': "Windows",            'Sec-Fetch-Mode': 'cors',            'Sec-Fetch-Site': 'same-origin',            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',            'Referer': 'https://HOST/',            'Cookie': '*',        }    }

1-6 自定义下载管道 Pipline

在 piplines.py 文件中，自定义 2 个下载管道，分别将数据写入到本地 CSV 文件和文本文件中

from scrapy.exceptions import DropItemfrom scrapy.exporters import CsvItemExporterfrom cq_all.items import CqAllItemclass DuplicatesPipeline(object):    """    Pipline去重复的帖子    """    def __init__(self):        self.talk_set = set()    def process_item(self, item, spider):        name = item['title']        if name in self.talk_set:            raise DropItem("Duplicate book found:%s" % item)        self.talk_set.add(name)        return itemclass CqAllPipeline(object):    """保存CSV"""    def __init__(self):        self.file = open("./cq_all.csv", 'wb')        self.exporter = CsvItemExporter(self.file, fields_to_export=[            'title', 'url', 'content'        ])        self.exporter.start_exporting()    def process_item(self, item, spider):        if isinstance(item, CqAllItem):            self.exporter.export_item(item)        return item    # 关闭资源    def close_spider(self, spider):        self.exporter.finish_exporting()        self.file.close()class CqAllTxtPipline(object):    """    保存到txt文件中    """    def __init__(self):        self.file = open('result.txt', 'w', encoding='utf-8')    def open_spider(self, spider):        # self.file = open('result.txt', 'w', encoding='utf-8')        pass    def close_spider(self, spider):        self.file.close()    def process_item(self, item, spider):        try:            if isinstance(item, CqAllItem):                self.file.write(item['content'] + '\n')        except:            pass

1-7 配置爬虫配置文件

打开 settings.py 文件，配置默认请求头及数据管道

# settings.py# Override the default request headers:DEFAULT_REQUEST_HEADERS = {    'Accept': '*/*',    'Accept-Encoding': 'gzip, deflate, br',    'Accept-Language': 'zh-CN,zh;q=0.9',    'Connection': 'keep-alive',    'Content-type': 'application/x-www-form-urlencoded;',    'Host': 'HOST',    'Origin': 'https://HOST',    'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',    'sec-ch-ua-mobile': '?0',    'sec-ch-ua-platform': "Windows",    'Sec-Fetch-Mode': 'cors',    'Sec-Fetch-Site': 'same-origin',    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',    'Referer': 'https://HOST/'}ITEM_PIPELINES = {    'cq_all.pipelines.DuplicatesPipeline': 1,    'cq_all.pipelines.CqAllPipeline': 2,    'cq_all.pipelines.CqAllTxtPipline': 3,}

1-8 爬虫主入口

在爬虫项目根目录下创建一个文件，通过下面的方式运行单个爬虫

# main.pyfrom scrapy.cmdline import executeimport sys, osdef start_scrapy():    sys.path.append(os.path.dirname(__file__))    # 运行单个爬虫    execute(["scrapy", "crawl", "talk_all"])def pre():    """    删除临时文件    :return:    """    if os.path.exists('cq.png'):        os.remove("cq.png")    if os.path.exists('result.txt'):        os.remove('result.txt')if __name__ == '__main__':    pre()    start_scrapy()

2. 词云可视化

在爬虫类重写 close 方法，根据本地文件绘制成词云图片保存到本地

# talk.pyclass TalkSpider(CrawlSpider):  def close(spider, reason):    # 绘制词图    gene_word_cloud()    return None

2-1 内容预处理

从爬取文本中读取数据后进行预处理，过滤一些无用的字符

def gene_word_cloud():    """根据文件生成词云"""    with open('result.txt', 'r', encoding='utf-8') as f:        data = f.read()        # 文本预处理        new_data = re.findall('[\u4e00-\u9fa5]+', data, re.S)        new_data = "/".join(new_data)

2-2 分词后去除单词

接着使用 jieba 进行分词，然后过滤掉单词

import jieba, stylecloud# 解析word_list_pre = jieba.cut(new_data, cut_all=True)# 去除单词word_list = [item for item in word_list_pre if len(item) > 1]result = " ".join(word_list)  # 分词用空格隔开

2-3 停用词汇

为了保证生成词云的有效性，我们一般需要自定义停用词汇

这里推荐下面 2 个词汇，大家可以在此基础上进行二次编辑

https://github.com/goto456/stopwords/blob/master/cn_stopwords.txt

https://github.com/fwwdn/sensitive-stop-words/blob/master/stopword.dic

def get_cn_stop_words():    """    从文件中读取停用中文词    :return:    """    stop_words = set()    # 停用词汇    # https://github.com/fwwdn/sensitive-stop-words/blob/master/stopword.dic    # https://github.com/goto456/stopwords/blob/master/cn_stopwords.txt    with open('cn_stopwords.txt', encoding='utf-8') as f:        con = f.readlines()        for i in con:            i = i.replace("\n", "")  # 去掉读取每一行数据的\n            if i:                stop_words.add(i)    return list(stop_words)

2-4 绘制词云

使用 stylecloud 绘制词云的时候，我们需要指定图片大小、字体、调色方案、方向、蒙版、停用词汇等参数

其中

调色方案参考：

https://jiffyclub.github.io/palettable/cartocolors/qualitative/

词云蒙版参考：

https://fa5.dashgame.com/#/

import jieba, stylecloudimage_path = 'cq.png'    # 中文禁用词    stop_words = get_cn_stop_words()    stylecloud.gen_stylecloud(        text=result,  # 上面分词的结果作为文本传给text参数        size=1024,        font_path='msyh.ttc',  # 字体设置        palette='cartocolors.qualitative.Pastel_10',  # 调色方案选取，从palettable里选择        gradient='horizontal',  # 渐变色方向选了垂直方向        icon_name='fas fa-dove',  # 蒙版选取，从Font Awesome里选        stopwords=True,  # 布尔值，用于筛除常见禁用词        max_words=200,        collocations=False,        custom_stopwords=stop_words,        output_name=image_path)

2-5 消息推送

我们读取本地图片，使用企业微信机器人发送出去

import re, base64, hashlibimport requestsdef send_wx():    """    将词云图片发送到微信    :return:    """    with open(image_path, 'rb') as file:  # 转换图片成base64格式        data = file.read()        image_data = str(base64.b64encode(data), 'utf-8')    with open(image_path, 'rb') as file:  # 图片的MD5值        md = hashlib.md5()        md.update(file.read())        image_md5 = md.hexdigest()    url = 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xx'    headers = {"Content-Type": "application/json"}    data = {        "msgtype": "image",        "image": {            "base64": image_data,            "md5": image_md5        }    }    result = requests.post(url, headers=headers, json=data)    return result

3. 最后

在本机运行成功之后，我们就可以将程序部署到服务器了

需要注意的是，服务器一般没有中文字体，我们需要额外安装字体

以 CentOS 为例，我们首先需要安装 fontconfig 和 ttmkfdir，然后下载中文字体放置到中文字体目录下，接着修改配置文件 fonts.conf，最后刷新字体缓存即可

猜你喜欢

Swiper
BackPropagation
企业构建ERP系统不得不考虑的因素
第77节:Java中的事务和数据库连接池和DBUtiles
SpringBoot项目@RestController使用 redirect 重定向无效
【C++】从C到C++
程序人生 | 一名软件测试，怎么样达到月薪2w+?
尤金卡巴斯基:网络安全已陷“黑暗时代”,我们该如何应对?
浅析JavaScript的7种异常类型及如何快速排查
笔试题-2023-复睿微-数字芯片设计【纯净题目版】
基于QT（C++）+SQL Server 2008 实现相机租赁系统【100010742】
When using 'npm' it requires me to login to github
SpringCloud之Hystrix：集群容错框架
linux内核睡眠状态解析

相关主题

Python爬虫教程
Python爬虫基础2
python 匿名函数
python-爬虫
python爬虫框架
【python】协程
python:爬虫入门
聊聊Python