Scraping and Parsing HTML Pages with cheerio in Node.js

Date: 2023-09-27 14:19:37

Reposted from https://www.jianshu.com/p/8e4a83e7c376

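cheerio runs in Node.js, and its API and syntax closely resemble jQuery's. jQuery itself can also be used in Node.js with the help of the third-party library jsdom; see https://www.npmjs.com/package/jquery for details.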

Installation

npm install cheerio

Usage

const cheerio = require('cheerio')
const $ = cheerio.load('<h2 class="title">Hello world</h2>')
 
$('h2.title').text('Hello there!')
$('h2').addClass('welcome')
 
$.html()
//=> <html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>

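Practical example

Scenario

Extract the list of articles from this page: https://support.fcoin.com/hc/zh-cn/sections/360000782633-最新公告

Analyzing the HTML source

The elements of interest are the <a> tags with class="article-list-link". Extracting each one's href attribute and text content is all the task requires.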

<ul class="article-list">
          
            <li class="article-list-item ">
              
              <a href="/hc/zh-cn/articles/360006803454-FT%E9%A2%84%E5%85%88%E5%8F%91%E8%A1%8C%E9%83%A8%E5%88%86%E5%AE%9E%E8%A1%8C-%E8%A7%A3%E5%86%BB%E5%8D%B3%E9%94%81%E4%BB%93-%E5%8E%9F%E5%88%99%E7%9A%84%E5%85%AC%E5%91%8A" class="article-list-link">FT预先发行部分实行“解冻即锁仓”原则的公告</a>
            </li>
          
            <li class="article-list-item ">
              
              <a href="/hc/zh-cn/articles/360006823933-%E5%85%B3%E4%BA%8EFInsur%E8%BF%90%E4%BD%9C%E6%9C%BA%E5%88%B6%E7%9A%84%E4%B8%80%E7%B3%BB%E5%88%97%E8%AF%B4%E6%98%8E" class="article-list-link">关于FInsur运作机制的一系列说明</a>
            </li>
...

Code with comments

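// Note: the request package has been deprecated since 2020, although it still installs and works
const request = require('request')
const cheerio = require('cheerio')

// Wrap request in a Promise that resolves with the response body
const http = (uri) => {
  return new Promise((resolve, reject) => {
    request({
      uri: uri,
      method: 'GET'
    }, (err, response, body) => {
      if (err) {
        // Reject on a network error instead of resolving with an undefined body
        return reject(err)
      }
      resolve(body)
    })
  })
}

(function () {
  // Define the target URL
  const target = 'https://support.fcoin.com/hc/zh-cn/sections/360000782633-%E6%9C%80%E6%96%B0%E5%85%AC%E5%91%8A'
  // Send a GET request using the request library
  http(target).then(html => {
    // Load the HTML and initialize cheerio
    const $ = cheerio.load(html)
    // Select the target nodes: <a> elements with the article-list-link CSS class
    const linksDom = $('a.article-list-link')
    // Iterate over the matched elements
    linksDom.each((index, item) => {
      // Get the title; note that this uses $(item), not item itself
      const title = $(item).text()
      // Likewise, get the link address
      let url = $(item).attr('href')
      // Decoding is optional; it makes the Chinese characters readable in the output
      url = decodeURIComponent(url)
      // The href is relative to the site root, so prepend the scheme and host extracted from the target URL
      url = target.match(/(\w+:\/\/[^/:]+)([^# ]*)/)[1] + url
      // Print to the console to preview the result
      console.log(title)
      console.log(url)
    })
  }).catch(err => {
    // Report request failures
    console.error(err)
  })
})()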
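
The regular expression in the code above assumes that every href is root-relative. As an alternative sketch using Node's built-in URL class, the href can be resolved against the page URL, which also copes with absolute and path-relative links. Here resolveLink and displayUrl are hypothetical helper names, and the article path is a made-up placeholder.

```javascript
// Resolve an href from the page against the URL of the page itself
function resolveLink(href, pageUrl) {
  return new URL(href, pageUrl).href
}

// Decode percent-escapes for display only; decodeURI leaves reserved characters such as %2F encoded
function displayUrl(url) {
  return decodeURI(url)
}

const target = 'https://support.fcoin.com/hc/zh-cn/sections/360000782633-%E6%9C%80%E6%96%B0%E5%85%AC%E5%91%8A'
// Placeholder href in the same shape as the ones in the list markup
const href = '/hc/zh-cn/articles/1-%E5%85%AC%E5%91%8A'

const absolute = resolveLink(href, target)
console.log(absolute)
console.log(displayUrl(absolute))
```

Keeping the encoded form for requests and decoding only for display avoids producing URLs that are no longer valid once reserved characters are decoded.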

Preview of the result

(Screenshot: result.png)

Comparison

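Compared with parsing by regular expressions alone, cheerio takes less effort and makes the intent of the code clearer, which makes it particularly well suited to working with HTML text.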
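
To make the contrast concrete, here is a rough sketch of the same extraction done with a regular expression alone. The pattern and the extractWithRegex name are illustrative assumptions, not code from the original article. The pattern only matches when href comes before class, which shows how brittle the approach is.

```javascript
// Illustrative pattern: requires href first, then class="article-list-link", with no other attributes
const LINK_PATTERN = /<a\s+href="([^"]*)"\s+class="article-list-link"\s*>([^<]*)<\/a>/g

function extractWithRegex(html) {
  return Array.from(html.matchAll(LINK_PATTERN), (m) => ({
    href: m[1],
    title: m[2].trim()
  }))
}

const sameOrder = '<a href="/a/1" class="article-list-link">One</a>'
const swappedOrder = '<a class="article-list-link" href="/a/1">One</a>'

console.log(extractWithRegex(sameOrder))    // one match
console.log(extractWithRegex(swappedOrder)) // no match, although the markup is equivalent HTML
```

A selector such as a.article-list-link is unaffected by attribute order, extra attributes, or whitespace, which is the practical advantage described above.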
