Scraping news website content with Python, using the findall function

Published: 2023-04-18 14:43:58

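This exercise scrapes the news items linked from news sites' homepages and saves them locally. For each article it collects the title, time, source, comment count and body text.

Tools: Python 3.6, Google Chrome

Procedure:

I. Libraries: urllib, requests, BeautifulSoup

1. urllib: Python's built-in HTTP request library, so it needs no installation. It lets Python request a web page and read what comes back.

The main call used:

data = urllib.request.urlopen(qurl).read()
# qurl is the page URL; the call returns the page content (as bytes) in data

2. requests: a simple, easy-to-use HTTP library for Python, much more concise than urllib. I used both libraries in this exercise; they serve the same purpose.

data = requests.get(url).text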
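One practical difference between the two calls: urlopen(...).read() returns raw bytes, while the .text attribute of a requests response is already a decoded string. Below is a minimal sketch of decoding fetched bytes, assuming the page is encoded as either UTF-8 or GBK. The helper name decode_page and the sample bytes are made up for illustration.

```python
def decode_page(raw, encodings=("utf-8", "gbk")):
    """Try each encoding in turn; fall back to a lossy decode with the last one."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: drop undecodable bytes rather than fail
    return raw.decode(encodings[-1], "ignore")


# Stand-in for the bytes returned by urllib.request.urlopen(qurl).read()
sample = "<title>新闻</title>".encode("gbk")
print(decode_page(sample))
```

Trying strict decodes first avoids silently dropping characters, which is what decode("gbk", "ignore") does when the guess about the encoding is wrong.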

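3. BeautifulSoup

Once the two libraries above have fetched the page data, we need to extract the parts we want, and this is where BeautifulSoup comes in. It parses the document so that we can pull out the news title, the body text and so on.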
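To make the parsing step concrete, here is a small sketch that runs BeautifulSoup on an inline HTML string instead of a fetched page. The HTML fragment, the selector and the example.com URLs are invented for illustration; only BeautifulSoup(), select(), get_text() and get() are the calls the rest of this post relies on.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment standing in for a fetched page
html = """
<ul class="news">
  <li><a href="http://example.com/a1">First headline</a></li>
  <li><a href="http://example.com/a2">Second headline</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# select() takes a CSS selector and returns a list of matching tags
links = soup.select("ul.news > li > a")
pairs = [(a.get_text(), a.get("href")) for a in links]
print(pairs)
```

The "html.parser" backend ships with Python; 'lxml', used later in this post, is a separate package that has to be installed.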

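4. re

Python's built-in regular expression library. It is used later, through re.findall, on a response that BeautifulSoup could not extract anything from.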
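Since re.findall does the extraction later on, a short sketch of how it behaves may help. The sample string is made up; the point is the difference between a greedy group (.+) and the non-greedy group (.+?) that the patterns in this post use.

```python
import re

# Made-up text with two "url" fields in a row
text = '"url":"a","url":"b",'

# Greedy: (.+) runs to the last '",' in the string, swallowing both fields
greedy = re.findall(r'"url":"(.+)",', text)

# Non-greedy: (.+?) stops at the first '",' after each field
lazy = re.findall(r'"url":"(.+?)",', text)

print(greedy)
print(lazy)
```

findall returns a list holding the captured group for every match, which is why the non-greedy form yields one entry per field.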

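II. Scrape the news homepage to get the links of all the articles

The homepage only shows the news titles; the details of each item are on a separate page reached through the title link. So the first step is to save all the article links from the homepage to a txt file. Code first, explanation after.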

def getQQurl():  # collect all news links on the Tencent News homepage
    url = "http://news.qq.com/"
    urldata = requests.get(url).text
    soup = BeautifulSoup(urldata, 'lxml')
    news_titles = soup.select("div.text > em.f14 > a.linkto")
    # create a TXT file to hold every link found on the homepage
    fo = open("D:/news/QQ链接.txt", "w+")
    # walk the returned list and write one link per line
    for n in news_titles:
        title = n.get_text()
        link = n.get("href")
        fo.writelines(link + "\n")
    fo.close()

The first two lines of the function were explained above, so here are the third and fourth:

soup = BeautifulSoup(urldata, 'lxml')  # parse the fetched page, using lxml as the parser
news_titles = soup.select("div.text > em.f14 > a.linkto")

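Inspecting the homepage source shows that most of the news links sit in a tags with class linkto, nested inside em.f14 within div.text.

So soup.select() can pick out every element matching div.text > em.f14 > a.linkto, which gives a list. get("href") then pulls the link out of each element, and the links are written to the TXT file.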

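On most news homepages, the links of different sections sit under different tags in the source, so the selection rules differ too. To collect news from several sections, write several rules.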
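One way to handle several rules is to loop over a list of selectors and merge the results. This is only a sketch: the function name collect_links, the second selector and the sample HTML are assumptions for illustration, not taken from any real homepage.

```python
from bs4 import BeautifulSoup


def collect_links(html, selectors):
    """Apply every selector rule and return the hrefs found, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    seen, links = set(), []
    for rule in selectors:
        for a in soup.select(rule):
            href = a.get("href")
            if href and href not in seen:
                seen.add(href)
                links.append(href)
    return links


# Made-up page with two sections that are marked up differently
html = """
<div class="text"><em class="f14"><a class="linkto" href="http://example.com/1">A</a></em></div>
<ul class="side">
  <li><a href="http://example.com/2">B</a></li>
  <li><a href="http://example.com/1">A again</a></li>
</ul>
"""
rules = ["div.text > em.f14 > a.linkto", "ul.side > li > a"]
print(collect_links(html, rules))
```

Keeping the rules in a list means a new section only needs one more selector string, and the set prevents the same article from being fetched twice.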

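III. Fetch the news data for each link in the link file

Once all the links are in a file, what remains is to read them one by one and extract each article's data in much the same way the links were obtained in step II.

Inspecting an article page's source shows that the title sits in the title tag and the body text in p tags, so they can be selected with

content = soup.select('p')  # body text
title = soup.select('title')  # title

The time, source and other fields can be selected in the same way.

Each selection comes back as a list, so the items are written to the file one by one. The complete function: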

def getqqtext():
    qqf = open("D:/news/QQ链接.txt", "r")
    qqurl = qqf.readlines()  # read the file into a list of links
    qqf.close()
    i = 0
    # walk the list, request each page and pick out the article fields
    for qurl in qqurl:
        try:
            # strip() removes the newline that readlines() keeps on each link
            data = urllib.request.urlopen(qurl.strip()).read()
            data2 = data.decode("gbk", "ignore")
            soup = BeautifulSoup(data2, "html.parser")
            # select() locates elements by CSS selector and returns a list
            content = soup.select('p')  # body text
            title = soup.select('title')  # title
            time = soup.select('div.a_Info > span.a_time')
            author = soup.select('div.a_Info > span.a_source')
            # write the article to a local file; pages without a time are skipped
            if (len(time) != 0):
                fo = open("D:/news/新闻/腾讯" + str(i) + ".txt", "w+")
                if (len(title) != 0):
                    fo.writelines(" " + title[0].get_text().strip() + "\n")
                fo.writelines("时间：" + time[0].get_text().strip() + "\n")
                fo.writelines("评论数: 0" + "\n")
                if (len(author) != 0):
                    fo.writelines("来源：" + author[0].get_text() + "\n")
                for m in range(0, len(content)):
                    con = content[m].get_text().strip()
                    if (len(con) != 0):
                        fo.writelines("\n" + con)
                fo.close()
        except Exception as err:
            print(err)
        i += 1
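The repeated len(...) != 0 checks in the function above could be folded into a small helper. The sketch below is one possible shape, not the author's code: the name first_text and the sample page are hypothetical, and it runs on an inline string so it needs no network access.

```python
from bs4 import BeautifulSoup


def first_text(soup, selector, default=""):
    """Return the stripped text of the first match, or default if nothing matches."""
    found = soup.select(selector)
    return found[0].get_text().strip() if found else default


# Made-up article page for illustration
html = """
<html><head><title> Sample headline </title></head>
<body>
<div class="a_Info"><span class="a_time">2018-07-15 10:00</span></div>
<p>First paragraph.</p><p>   </p><p>Second paragraph.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")
title = first_text(soup, "title")
when = first_text(soup, "div.a_Info > span.a_time")
source = first_text(soup, "div.a_Info > span.a_source", default="unknown")
# Keep only the non-empty paragraphs, as the loop in getqqtext() does
body = [p.get_text().strip() for p in soup.select("p") if p.get_text().strip()]
print(title, when, source, body)
```

A missing field then becomes a default value instead of an IndexError, so one absent tag does not discard the whole article.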

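IV. Special cases on other sites

NetEase News has a news ranking page, so I scraped that page directly. It groups news by category and has rankings by follow-up posts and by comments. Its source is interesting to analyse, and you can try scraping the follow-up and comment counts as well. The code is further down.

Sina News's comment counts are dynamic data and cannot be found in the page source, so I used Chrome's developer tools to analyse the dynamic data (online tutorials cover the method) and found the page where Sina keeps the comment counts. It seems to be written in PHP and BeautifulSoup could not extract anything from it, so I used re to extract top_num (the hot count) and the links. Note that the links on this page come in an odd, non-standard form, something like http://ent.sina.com.cn/m/v…., so they still have to be converted. I won't go into detail here; see the code.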
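The extraction and the link conversion can be sketched on a made-up response. The two patterns are the ones used in the full code; the sample text, the example.com URLs and in particular the "\/" escaping of slashes are assumptions about what the real response looks like.

```python
import re

# Made-up response in the shape the two patterns expect.
# Assumption: slashes in the links arrive escaped as "\/".
text = (
    'var data = {"data":['
    '{"title":"A","url":"http:\\/\\/example.com\\/a.shtml","top_num":"1204",},'
    '{"title":"B","url":"http:\\/\\/example.com\\/b.shtml","top_num":"87",}]};'
)

urls = re.findall(r'"url":"(.+?)",', text)
top_nums = re.findall(r'"top_num":"(.+?)",', text)

# Turn each escaped "\/" back into a plain "/"
clean_urls = [u.replace("\\/", "/") for u in urls]

for url, num in zip(clean_urls, top_nums):
    print(num, url)
```

Because both lists come from the same response in the same order, zip() pairs each link with its count, which is the same alignment the full code relies on when it indexes the two files by position.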


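V. Summary

The whole process comes down to roughly three steps, and they apply to the other sites too. The key is to analyse the page source: different data sits in different places on different pages, and soup.select() can be adapted to each case with a matching rule. Selection rules for some commonly scraped sites can also be found online for reference.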

VI. Full code

It runs as is, but the paths need changing to suit your machine. Only two folders are required: D:/news and D:/news/新闻
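If the two folders do not exist yet, they could be created before the script runs. A small sketch using pathlib; the helper name ensure_news_dirs is hypothetical, and the base path is a parameter so the same helper works for whatever location is chosen.

```python
from pathlib import Path


def ensure_news_dirs(base):
    """Create <base> and <base>/新闻 if they are missing; return both paths."""
    root = Path(base)
    articles = root / "新闻"
    # parents=True also creates the root; exist_ok=True makes reruns harmless
    articles.mkdir(parents=True, exist_ok=True)
    return root, articles
```

Calling ensure_news_dirs("D:/news") at the start of main() would match the paths used in the listing that follows.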

import io
import re
import sys
import urllib.request

import requests
from bs4 import BeautifulSoup

# make print() emit UTF-8 whatever the console's default encoding is
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

# Fetch NetEase news from the ranking page
def get163news():
    url = "http://news.163.com/rank/"
    wbdata = requests.get(url).text  # request the ranking page and take its text
    soup = BeautifulSoup(wbdata, 'lxml')  # parse it into a BeautifulSoup object
    news_titles = soup.select("td a")  # select() returns a list of matching elements
    comment = soup.select("td.cBlue")  # count cells; the fetching steps match the other sites
    # Walk the link list and write title, time, source, comment count and body to txt files.
    # The offsets 3, 29, 60 and 30 are the author's values for the ranking table layout.
    start = 3
    i = 30
    while start <= 270:
        for n in range(start, start + 29):
            try:
                link = news_titles[n].get("href")
                neteasedata = urllib.request.urlopen(link).read()
                neteasedata2 = neteasedata.decode("gbk", "ignore")
                soup = BeautifulSoup(neteasedata2, "html.parser")
                content = soup.select('p')
                title = soup.select('title')
                time = soup.select('div.post_time_source')
                author = soup.select('div.post_time_source > a.ne_article_source')
                if (len(time) != 0):
                    fo = open("D:/news/新闻/网易" + str(i) + ".txt", "w+")
                    if (len(title) != 0):
                        fo.writelines(" " + title[0].get_text().strip() + "\n")
                    fo.writelines("时间：" + time[0].get_text().strip() + "\n")
                    fo.writelines("评论数: " + comment[i].get_text() + "\n")
                    if (len(author) != 0):
                        fo.writelines(author[0].get_text() + "\n")
                    # the first two p tags are skipped
                    for m in range(2, len(content)):
                        try:
                            con = content[m].get_text().strip()
                            if (len(con) != 0):
                                fo.writelines("\n" + con)
                        except Exception as err:
                            print(err)
                    fo.close()
            except Exception as err:
                print(err)
            i += 1
        start += 60
        i = start

# Collect all news links on the Tencent News homepage
def getQQurl():
    url = "http://news.qq.com/"
    wbdata = requests.get(url).text
    soup = BeautifulSoup(wbdata, 'lxml')
    news_titles = soup.select("div.text > em.f14 > a.linkto")
    # create a TXT file to hold every link found on the homepage
    fo = open("D:/news/QQ链接.txt", "w+")
    # walk the returned list and write one link per line
    for n in news_titles:
        title = n.get_text()
        link = n.get("href")
        fo.writelines(link + "\n")
    fo.close()

# Fetch each saved Tencent link and save the article locally
def getqqtext():
    qqf = open("D:/news/QQ链接.txt", "r")
    qqurl = qqf.readlines()  # read the file into a list of links
    qqf.close()
    i = 0
    # walk the list, request each page and pick out the article fields
    for qurl in qqurl:
        try:
            # strip() removes the newline that readlines() keeps on each link
            data = urllib.request.urlopen(qurl.strip()).read()
            data2 = data.decode("gbk", "ignore")
            soup = BeautifulSoup(data2, "html.parser")
            # select() locates elements by CSS selector and returns a list
            content = soup.select('p')  # body text
            title = soup.select('title')  # title
            time = soup.select('div.a_Info > span.a_time')
            author = soup.select('div.a_Info > span.a_source')
            # write the article to a local file
            fo = open("D:/news/新闻/腾讯" + str(i) + ".txt", "w+")
            if (len(title) != 0):
                fo.writelines(" " + title[0].get_text().strip() + "\n")
            if (len(time) != 0):
                fo.writelines("时间：" + time[0].get_text().strip() + "\n")
            if (len(author) != 0):
                fo.writelines("来源：" + author[0].get_text() + "\n")
            for m in range(0, len(content)):
                con = content[m].get_text().strip()
                if (len(con) != 0):
                    fo.writelines("\n" + con)
            fo.close()
        except Exception as err:
            print(err)
        i += 1

# Collect all news links on the Sohu News homepage
def getsohuurl():
    url = "http://news.sohu.com/"
    wbdata = requests.get(url).text
    soup = BeautifulSoup(wbdata, 'lxml')
    news_titles = soup.select("div.list16 > ul > li > a")
    fo = open("D:/news/sohu链接.txt", "w+")
    # write one link per line
    for n in news_titles:
        title = n.get_text()
        link = n.get("href")
        fo.writelines(link + "\n")
    fo.close()

# Fetch each saved Sohu link and save the article locally
def getsohutext():
    sohuf = open("D:/news/sohu链接.txt", "r")
    sohuurl = sohuf.readlines()
    sohuf.close()
    i = 0
    for sohuu in sohuurl:
        try:
            # strip() removes the newline that readlines() keeps on each link
            sohudata = urllib.request.urlopen(sohuu.strip()).read()
            sohudata2 = sohudata.decode("utf-8", "ignore")
            soup = BeautifulSoup(sohudata2, "html.parser")
            content = soup.select('p')
            title = soup.select('title')
            time = soup.select('div.article-info > span.time')
            author = soup.select('div.date-source > span.original-link')
            # pages without a time are skipped
            if (len(time) != 0):
                fo = open("D:/news/新闻/搜狐" + str(i) + ".txt", "w+")
                if (len(title) != 0):
                    fo.writelines(" " + title[0].get_text().strip() + "\n")
                fo.writelines("时间：" + time[0].get_text().strip() + "\n")
                fo.writelines("评论数: 0" + "\n")
                if (len(author) != 0):
                    fo.writelines(author[0].get_text() + "\n")
                for m in range(0, len(content)):
                    con = content[m].get_text().strip()
                    if (len(con) != 0):
                        fo.writelines("\n" + con)
                fo.close()
        except Exception as err:
            print(err)
        i += 1

# Collect Sina News links and their hot counts from the ranking data pages
def getsinaurl():
    url = ['http://top.news.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=qbpdpl&top_time=20180715&top_show_num=100&top_order=DESC&js_var=comment_all_data',
           'http://top.news.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=www_www_all_suda_suda&top_time=20180715&top_show_num=100&top_order=DESC&js_var=all_1_data01',
           'http://top.collection.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=wbrmzf_qz&top_time=20180715&top_show_num=10&top_order=DESC&js_var=wbrmzf_qz_1_data&call_back=showContent',
           'http://top.news.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=total_slide_suda&top_time=20180715&top_show_num=100&top_order=DESC&js_var=slide_image_1_data',
           'http://top.news.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=wbrmzfgwxw&top_time=20180715&top_show_num=10&top_order=DESC&js_var=wbrmzfgwxw_1_data&call_back=showContent',
           'http://top.news.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=news_china_suda&top_time=20180715&top_show_num=20&top_order=DESC&js_var=news_',
           'http://top.news.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=gnxwpl&top_time=20180715&top_show_num=20&top_order=DESC&js_var=news_']
    furl = open("D:/news/sina链接1.txt", "w+")  # links, one per line
    fcom = open("D:/news/sinacom.txt", "w+")  # hot counts, one per line
    for u in url:
        try:
            wbdata = requests.get(u).text
            # save the raw response, then read it back as text
            fo = open("D:/news/sinau.txt", "w+")
            fo.write(wbdata)
            fo.close()
            text = open("D:/news/sinau.txt", "r").read()
            # BeautifulSoup is no help here, so pull the fields out with regular expressions
            allurl = re.findall('"url":"(.+?)",', text)
            topnum = re.findall('"top_num":"(.+?)",', text)
            print(len(allurl))
            print(len(topnum))
            for n in allurl:
                furl.writelines(n + "\n")
            for n in topnum:
                fcom.writelines(n + "\n")
        except Exception as err:
            print(err)
    furl.close()
    fcom.close()

# Fetch each saved Sina link and save the article locally
def getsinanews():
    sinaf1 = open("D:/news/sina链接1.txt", "r")
    sinaf2 = open("D:/news/sinacom.txt", "r")
    sinaurl = sinaf1.readlines()
    sinacom = sinaf2.readlines()
    sinaf1.close()
    sinaf2.close()
    i = 0
    for surl in sinaurl:
        try:
            # convert the escaped "\/" in the saved link to "/" and drop the trailing newline
            realurl = surl.strip().replace('\\/', '/')
            sinadata = urllib.request.urlopen(realurl).read()
            sinadata2 = sinadata.decode("utf-8", "ignore")
            soup = BeautifulSoup(sinadata2, "html.parser")
            content = soup.select('p')
            title = soup.select('title')
            time = soup.select('div.date-source > span.date')
            author = soup.select('div.date-source > a.source')
            # pages without a time are skipped
            if (len(time) != 0):
                fo = open("D:/news/新闻/新浪" + str(i) + ".txt", "w+")
                if (len(title) != 0):
                    fo.writelines(" " + title[0].get_text().strip() + "\n")
                fo.writelines("时间：" + time[0].get_text().strip() + "\n")
                # sinacom[i] still carries its own newline from readlines()
                fo.writelines("评论数: " + sinacom[i])
                if (len(author) != 0):
                    fo.writelines(author[0].get_text() + "\n")
                for m in range(0, len(content)):
                    con = content[m].get_text().strip()
                    if (len(con) != 0):
                        fo.writelines("\n" + con)
                fo.close()
        except Exception as err:
            print(err)
        i += 1

def main():
    get163news()
    getQQurl()
    getqqtext()
    getsinaurl()
    getsinanews()
    getsohuurl()
    getsohutext()


if __name__ == "__main__":
    main()

PS: I'm a programming beginner who is just starting out, so please bear with me. You're welcome to follow my Weibo: 努力学习的小谯同学
