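Chengdu's nucleic acid testing system crashed, residents' overnight complaints pushed Neusoft to No. 1 on the trending list, and I scraped the comment section with Python and found...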

Posted: 2023-06-13 09:16:05

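It was almost 11 p.m. on September 2, 2022 when I opened Weibo and saw that the topic 东软 (Neusoft) had climbed to No. 1 on the Weibo trending list.

So I instinctively clicked into the topic to see what was going on. I won't say much here; a few screenshots tell the story on their own.

Since this is a Python account, let's use Python to scrape the comments on the hottest post under the Neusoft topic.

Here is the main code: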

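import re
import requests

# Fetch one page of comments
def get_one_page(url):
    headers = {
        'User-agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3880.4 Safari/537.36',
        'Host' : 'weibo.cn',
        'Accept' : 'application/json, text/plain, */*',
        'Accept-Language' : 'zh-CN,zh;q=0.9',
        # 'br' is left out: requests only decodes Brotli when a Brotli package is installed
        'Accept-Encoding' : 'gzip, deflate',
        # Paste the Cookie from your own logged-in weibo.cn session
        'Cookie' : 'your own Cookie',
        'DNT' : '1',
        'Connection' : 'keep-alive'
    }
    # Request the page HTML; verify=False (kept from the original) skips TLS
    # certificate checks, and the timeout keeps a stalled request from hanging
    response = requests.get(url, headers=headers, verify=False, timeout=10)
    # Request succeeded
    if response.status_code == 200:
        # Return the HTML document so it can be passed to the parser
        return response.text
    return None

# Parse one page and append its comments to comments.txt
def save_one_page(html):
    # Nothing to parse if the request failed
    if html is None:
        return
    # Each comment body sits in a <span class="ctt"> element
    comments = re.findall('<span class="ctt">(.*?)</span>', html)
    with open('comments.txt', 'a+', encoding='utf-8') as fp:
        # Skip the first match, as the original code does
        for comment in comments[1:]:
            # Strip any nested HTML tags
            result = re.sub('<.*?>', '', comment)
            # Drop replies to other comments
            if '回复@' not in result:
                # Write one comment per line
                fp.write(result + '\n')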
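To see what the parsing step does without touching the network, here is a minimal sketch that applies the same two regular expressions to a string and returns a list instead of writing to a file. The function name `extract_comments` and the sample markup are made up for illustration; real weibo.cn pages may be structured differently.

```python
import re

# Same patterns as in save_one_page
COMMENT_RE = re.compile(r'<span class="ctt">(.*?)</span>')
TAG_RE = re.compile(r'<.*?>')

def extract_comments(html):
    """Return the plain-text comments found in one page of HTML."""
    if not html:
        return []
    results = []
    # The first span is skipped, mirroring comments[1:] in the original
    for span in COMMENT_RE.findall(html)[1:]:
        text = TAG_RE.sub('', span)   # strip nested tags such as <a>...</a>
        if '回复@' not in text:        # drop replies to other comments
            results.append(text)
    return results

# Made-up sample markup (an assumption, not real Weibo output)
sample = (
    '<span class="ctt">original post text</span>'
    '<span class="ctt">系统又崩了<a href="#">[图片]</a></span>'
    '<span class="ctt">回复@someone: 同意</span>'
    '<span class="ctt">排队两小时</span>'
)
print(extract_comments(sample))
```

Returning a list keeps parsing separate from file I/O, which makes the regular expressions easier to check on their own.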

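I have covered scraping Weibo comments before, so I won't go into detail here; readers who are unfamiliar with it can refer to the earlier post on scraping Weibo comments.

With the comments scraped, let's look at them as a word cloud. The code is as follows: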
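For readers who have not seen that earlier post, the loop that ties the two functions together might look roughly like the sketch below. The helper `crawl_comments`, the `page` query parameter, and the delay value are all assumptions for illustration and are not taken from the original code.

```python
import time

def crawl_comments(base_url, max_pages, fetch, save, delay=2.0):
    """Fetch pages 1..max_pages and pass each successful page to `save`.

    `fetch` and `save` are expected to behave like get_one_page and
    save_one_page. Returns the number of pages saved.
    """
    saved = 0
    for page in range(1, max_pages + 1):
        # Assumed URL scheme: a `page` query parameter selects the page
        html = fetch(f'{base_url}?page={page}')
        if html is None:
            # Non-200 response: stop instead of hammering the server
            break
        save(html)
        saved += 1
        # Pause between requests to keep the load low
        time.sleep(delay)
    return saved
```

It would be called as `crawl_comments(comment_url, 50, get_one_page, save_one_page)`, where `comment_url` is the comment-page address of the post you want to scrape. Passing the two functions in as arguments also makes it possible to exercise the loop with stand-ins.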

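import jieba
import numpy as np
from PIL import Image
from wordcloud import WordCloud

def jieba_():
    # Load the stop words, one per line
    with open('stop_words.txt', 'r', encoding='utf-8') as f:
        stop_words = {line.strip() for line in f}
    # Read the scraped comments as text
    with open('comments.txt', 'r', encoding='utf-8') as f:
        content = f.read()
    # Segment the text with jieba
    word_list = jieba.cut(content)
    words = []
    for word in word_list:
        # Skip whitespace tokens and stop words
        if word.strip() and word not in stop_words:
            words.append(word)
    global word_cloud
    # Join the words with full-width commas
    word_cloud = '，'.join(words)

def cloud():
    # Open the image used as the word cloud mask
    cloud_mask = np.array(Image.open('bg.png'))
    # Configure the word cloud
    wc = WordCloud(
        # White background
        background_color='white',
        # Mask that defines the shape
        mask=cloud_mask,
        # Maximum number of words shown
        max_words=200,
        # A font that can render Chinese characters
        font_path='./fonts/simhei.ttf',
        # Maximum font size
        max_font_size=100
    )
    global word_cloud
    # Build the word cloud from the joined text
    x = wc.generate(word_cloud)
    # Render it as an image
    image = x.to_image()
    # Display the image
    image.show()
    # Save the image to disk
    wc.to_file('melon.png')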
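A word cloud only hints at relative frequency, so it can help to print the raw counts as well. The sketch below counts tokens with `collections.Counter`. The function name `top_words` and its `tokenize` parameter are hypothetical; the tokenizer is passed in so that `jieba.cut` can be used on real data while the demo line runs without any third-party package.

```python
from collections import Counter

def top_words(text, stop_words, tokenize, n=10):
    """Return the n most frequent tokens in `text`, skipping stop words."""
    stop = set(stop_words)  # set membership tests are faster than list scans
    counts = Counter(
        token for token in tokenize(text)
        if token.strip() and token not in stop
    )
    return counts.most_common(n)

# Demo with whitespace splitting standing in for a real segmenter
print(top_words('崩 了 系统 崩 了 排队', ['了'], str.split, n=1))
```

On the scraped data this would be called with the contents of comments.txt, the stop-word list, and `jieba.cut` as the tokenizer.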

Here is the result:

As I recall, this is not the first time a problem like this has happened, so I won't say more here.
