zl程序教程 (zl Programming Tutorials)

Blind dates, blind dates: the nightmare of young people everywhere. So let's scrape some data from a matchmaking site~

Posted: 2023-09-14 09:05:34

Preface 😋

Good morning, good afternoon, or good evening, everyone~

Development environment:

  • Python 3.8

  • Pycharm

Modules used:

  • requests

  • parsel

  • csv

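Implementation steps:

  1. Send the request: make the script look like a browser and request the URL

  2. Get the data: read the response the server returns (the same content shown under "Response" in the browser's developer tools)

  3. Parse the data: extract the fields we want (the basic profile information)

  4. Save the data: write the records to a spreadsheet (CSV) file; images can be saved to a folder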

Code

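# HTTP request module ---> third-party; install from the command line with: pip install requests
import requests
# HTML parsing module ---> third-party; install from the command line with: pip install parsel
import parsel
# csv module ---> built in, no installation needed
import csv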

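# Create the output file (append mode; newline='' leaves line endings to the csv module)
f = open('对象_1.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    '标题',      # title
    '幸运号',    # lucky number
    '性别',      # gender
    '年龄',      # age
    '星座',      # zodiac sign
    '年薪',      # annual salary
    '学历',      # education
    '身高',      # height
    '爱情宣言',  # love declaration
    '照片',      # photo
    '详情页',    # detail page
])
# Write the header row (in append mode this adds another header row on every run)
csv_writer.writeheader()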
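
Because the file above is opened with `mode='a'`, every run of the script appends a fresh header row. One possible way around that, sketched here with only the standard library, is to write the header only when the file is new or empty. The helper name `append_rows`, the English field names, and the temporary demo file are assumptions for illustration.

```python
import csv
import os
import tempfile


def append_rows(path, fieldnames, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' lets the csv module control line endings itself.
    with open(path, mode='a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo in a temporary directory with hypothetical field names and values.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.csv')
fields = ['title', 'age']
append_rows(demo_path, fields, [{'title': 'first run', 'age': '28'}])
append_rows(demo_path, fields, [{'title': 'second run', 'age': '30'}])

with open(demo_path, encoding='utf-8', newline='') as f:
    lines = f.read().splitlines()
```

The `with` block also closes the file automatically, which the bare `open()` call in the script leaves to the end of the program.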

# URL of the listing page
link = 'https://www.19lou.com/r/1/19lnsxq-3.html'
# Request headers that imitate a browser
headers = {
    'Cookie': '_Z3nY0d4C_=37XgPK9h; _DM_SID_=abfbcfb2fade7d35ee39c33b5eef7e13; screen=2543; pm_count=%7B%7D; dayCount=%5B%5D; cuid=Hd93N5CDQEk5bODgyK4cOrzXujbQHL84; JSESSIONID=370A8DC7AD014A912504354C3491C5F5; f39big=ip53; f9big=u87; _DM_S_=dc952385e06e9ac73264931ecd4bd0bc; Hm_lvt_5185a335802fb72073721d2bb161cd94=1659515619,1659592454,1659611492; fr_adv=bbs_huatan_ck; fr_adv_last=merry_thread_pc; _dm_userinfo=%7B%22uid%22%3A0%2C%22stage%22%3A%22%22%2C%22city%22%3A%22%E6%B9%96%E5%8D%97%3A%E9%95%BF%E6%B2%99%22%2C%22ip%22%3A%22175.0.62.249%22%2C%22sex%22%3A%221%22%2C%22frontdomain%22%3A%22www.19lou.com%22%2C%22category%22%3A%22%E6%83%85%E6%84%9F%2C%E5%A9%9A%E5%BA%86%2C%E6%97%B6%E5%B0%9A%22%7D; _dm_tagnames=%5B%7B%22k%22%3A%2219%E6%A5%BC%E5%A5%B3%E7%94%9F%E7%9B%B8%E4%BA%B2%22%2C%22c%22%3A29%7D%2C%7B%22k%22%3A%22%E5%A5%B3%E7%94%9F%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A31%7D%2C%7B%22k%22%3A%22%E7%A1%95%E5%A3%AB%22%2C%22c%22%3A2%7D%2C%7B%22k%22%3A%22%E5%A4%A9%E7%A7%A4%E5%BA%A7%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%A5%B3%E7%94%9F%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A21%7D%2C%7B%22k%22%3A%22%E7%9B%B8%E4%BA%B2%E8%AE%BA%E5%9D%9B%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%E7%9B%B8%E4%BA%B2%E7%BD%91%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%E5%BE%81%E5%A9%9A%E7%BD%91%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%A4%A9%E8%9D%8E%E5%BA%A7%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%221986%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9C%AC%E7%A7%91%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%81%8B%E7%88%B1%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E7%A6%BB%E5%BC%82%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%81%8B%E7%88%B1%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A1%7D%5D; Hm_lpvt_5185a335802fb72073721d2bb161cd94=1659619705',
    'Host': 'www.19lou.com',
    'Referer': 'https://www.19lou.com/r/1/19lnsxq-4.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
}
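
A headers dict like the one above usually starts as `Name: value` lines copied from the browser's developer tools. The tutorial converts them with a regex find-and-replace in the editor (pattern `(.*?): (.*)`, replacement `'$1': '$2',`). The same conversion can be sketched in Python; the sample header text below is shortened, and building the dict directly is an alternative added here, not something the article does.

```python
import re

# Header lines as copied from the developer tools ("Name: value" per line).
# The User-Agent value is shortened for the example.
raw_headers = """Host: www.19lou.com
Referer: https://www.19lou.com/r/1/19lnsxq-4.html
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64)"""

# Same pattern and replacement as the editor trick; Python writes groups as \1 and \2.
as_dict_lines = re.sub(r'(.*?): (.*)', r"'\1': '\2',", raw_headers)

# Alternative: skip the text rewrite and build the dict directly.
headers_from_text = dict(
    re.match(r'(.*?): (.*)', line).groups()
    for line in raw_headers.splitlines()
)
```

The lazy `(.*?)` stops at the first colon followed by a space, so the colon inside `https://` in the Referer value is left alone.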
# Send the request
response_1 = requests.get(url=link, headers=headers)
# Get the data; to inspect the raw HTML: print(response_1.text)
# Parse the data: wrap the HTML string in a Selector
selector_1 = parsel.Selector(response_1.text)
# Extract content with CSS selectors
title_list = selector_1.css('.item-hd h3::text').getall()  # post titles
# Detail-page links
href = selector_1.css('.item-bd .cont a::attr(href)').getall()
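# Loop over each title together with its link (the loop variable index holds a link, not a number)
for title, index in zip(title_list, href):
    # Replace http with https
    url = index.replace('http:', 'https:')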
    """
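    1. Send the request: make the script look like a browser and request the URL
        - How does Python code imitate a browser request?
            The request headers are a dict, so we write them out as complete key-value pairs
        - How to convert the copied headers in bulk
            Ctrl + R in PyCharm opens the replace box; enable regex mode and enter
            (.*?): (.*)
            '$1': '$2',
        - <Response [200]> means the request succeeded,
            but not necessarily that the response contains the data you want...
        - response = requests.get(url=url, headers=headers)
            response is a variable name of our own choosing
            requests.get() calls the get() function of the requests module
            url=url: the left url is the parameter name of get(), the right url is the value we pass in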
    
    """
    # The detail-page URL to request
    # url = 'https://www.19lou.com/forum-164-thread-83331619167048422-1-1.html'
    # Request headers that imitate a browser
    headers = {
        'Cookie': '_Z3nY0d4C_=37XgPK9h; _DM_SID_=abfbcfb2fade7d35ee39c33b5eef7e13; screen=2543; pm_count=%7B%7D; dayCount=%5B%5D; cuid=Hd93N5CDQEk5bODgyK4cOrzXujbQHL84; JSESSIONID=370A8DC7AD014A912504354C3491C5F5; f39big=ip53; f9big=u87; _DM_S_=dc952385e06e9ac73264931ecd4bd0bc; Hm_lvt_5185a335802fb72073721d2bb161cd94=1659515619,1659592454,1659611492; fr_adv=bbs_huatan_ck; _dm_tagnames=%5B%7B%22k%22%3A%22%E5%A5%B3%E7%94%9F%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A30%7D%2C%7B%22k%22%3A%2219%E6%A5%BC%E5%A5%B3%E7%94%9F%E7%9B%B8%E4%BA%B2%22%2C%22c%22%3A27%7D%2C%7B%22k%22%3A%22%E7%A1%95%E5%A3%AB%22%2C%22c%22%3A2%7D%2C%7B%22k%22%3A%22%E5%A4%A9%E7%A7%A4%E5%BA%A7%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%A5%B3%E7%94%9F%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A21%7D%2C%7B%22k%22%3A%22%E7%9B%B8%E4%BA%B2%E8%AE%BA%E5%9D%9B%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%E7%9B%B8%E4%BA%B2%E7%BD%91%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9D%AD%E5%B7%9E%E5%BE%81%E5%A9%9A%E7%BD%91%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%A4%A9%E8%9D%8E%E5%BA%A7%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%221986%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%9C%AC%E7%A7%91%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%81%8B%E7%88%B1%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E7%A6%BB%E5%BC%82%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E6%81%8B%E7%88%B1%22%2C%22c%22%3A1%7D%2C%7B%22k%22%3A%22%E5%BE%81%E5%8F%8B%22%2C%22c%22%3A1%7D%5D; _dm_userinfo=%7B%22uid%22%3A0%2C%22stage%22%3A%22%22%2C%22city%22%3A%22%E6%B9%96%E5%8D%97%3A%E9%95%BF%E6%B2%99%22%2C%22ip%22%3A%22175.0.62.249%22%2C%22sex%22%3A%221%22%2C%22frontdomain%22%3A%22www.19lou.com%22%2C%22category%22%3A%22%E6%83%85%E6%84%9F%2C%E5%A9%9A%E5%BA%86%2C%E6%97%B6%E5%B0%9A%22%7D; Hm_lpvt_5185a335802fb72073721d2bb161cd94=1659615006; fr_adv_last=merry_thread_pc',
        'Host': 'www.19lou.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
    }
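    # Send the request --> <Response [200]> means it succeeded
    # requests.get() requests the URL with the headers attached as a browser disguise; the result is stored in the variable response
    response = requests.get(url=url, headers=headers)
    # 2. Get the data: the server's response, matching "Response" in the developer tools; inspect it with print(response.text)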
    """
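    3. Parse the data: extract the fields we want (the basic profile information)
    Parsing modules: bs4, lxml, parsel, ...
    - Parsing methods: learn them all; none is best overall, only best suited to the task
        re: extracts directly from the string data
    
        css: extracts data by tag attributes
        xpath: extracts data by tag nodes
    Today we use CSS selectors:
        extract data by tag attributes
    
    css and xpath both need a type conversion first, into a parseable object,
        because response.text is a plain string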
    
    """

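Closing words 💝

That's the end of this article!

If you have suggestions or questions, leave a comment or send me a private message! Let's keep at it together (ง •_•)ง

If you liked it, follow the blogger, or like, bookmark, and comment on the article!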
