Scraping a Novel with Python

Time: 2023-09-11 14:21:01

Preface

This script, written on a Mac, scrapes a novel from the web. In Python it takes only a few lines of code.

Code

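# coding=utf-8

import re
import sys
import urllib2  # Python 2 only; this script targets Python 2

# chardet, bs4 (BeautifulSoup) and codecs were also imported originally,
# but nothing below uses them, so they are left out here.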

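class Spider(object):

    def __init__(self):
        # Matches each chapter link on the table-of-contents page:
        # group 1 is the chapter URL, group 2 is the chapter title
        self.aTag=re.compile(r'<a href="(http://www\.44pq\.com/read/[0-9]+?_[0-9]+?\.html)"[^>]*?>(.+?)</a>')
        # Matches the chapter body inside the reader <div>
        self.contentTag=re.compile(r'<div class="readerContent" id="content">(.+?)</div>',re.I|re.S)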

    def getHtml(self, url):
        # Send a browser-like User-Agent header with the request
        headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
        req=urllib2.Request(url,headers=headers)
        response = urllib2.urlopen(req)
        # Return the raw bytes; callers decode them from GB18030
        return response.read()
    
    def Run(self):
        bookurl="http://www.44pq.com/read/13567.html"
        bookname="地球上唯一的魔法师"
        text=[]
        # Collect every chapter link (URL, title) from the table-of-contents page
        matchs=self.aTag.finditer(self.getHtml(bookurl))
        alist=list(matchs)
        total = len(alist)
        print "total {0}".format(total)
        i=0
        for m in alist:
            i+=1
            text.append(m.group(2).decode("gb18030"))
            text.append(self.getContent(m.group(1)))
            # Write each chapter as soon as it is fetched, then empty the buffer
            self.writeFile(bookname,"\n\n".join(text))
            del text[:]
            print "{0}/{1}".format(i,total)
        # Flush anything still buffered (a no-op when writing every chapter)
        self.writeFile(bookname,"\n\n".join(text))
        print "done!"

    def writeFile(self,filename,text):
        # Append to <filename>.txt, encoding explicitly as UTF-8
        with open(filename+".txt","a") as f:
            f.write(text.encode("utf-8"))


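    def getContent(self,url):
        c=self.getHtml(url)
        m=self.contentTag.search(c)
        if m is None:
            # Page layout did not match; return an empty chapter instead of crashing
            return u""
        c=m.group(1)
        # Strip the remaining HTML tags, then drop &nbsp; entities
        c=re.sub("<[^>]+?>","",c)
        c=c.replace("&nbsp;","")
        return c.decode("gb18030")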


if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    spider = Spider()
    spider.Run()

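One clarification, since the CSDN editor does not reliably preserve indentation: in the code above, the two lines

self.writeFile(bookname,"\n\n".join(text))
del text[:]

belong inside the for loop. They must not be aligned with the for keyword.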

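The script only uses re, sys and urllib2; the other imports (chardet, BeautifulSoup, codecs) are unnecessary and can be deleted. The example scrapes the novel 《地球上唯一的魔法师》 (The Only Mage on Earth). aTag is the regular expression that matches every chapter link on the novel's table-of-contents page, and contentTag is the regular expression that matches the text of a chapter.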
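To make the roles of the two patterns concrete, here is a minimal sketch that runs them against small HTML fragments. It is written for Python 3 (the script above is Python 2), and the fragments toc_html and chapter_html are made up for illustration; the real pages on the site may be laid out differently.

```python
import re

# Same patterns as in Spider.__init__, written as raw strings with the dots escaped.
A_TAG = re.compile(
    r'<a href="(http://www\.44pq\.com/read/[0-9]+?_[0-9]+?\.html)"[^>]*?>(.+?)</a>'
)
CONTENT_TAG = re.compile(
    r'<div class="readerContent" id="content">(.+?)</div>', re.I | re.S
)

# Hypothetical HTML fragments, only for illustration.
toc_html = (
    '<li><a href="http://www.44pq.com/read/13567_1.html" title="c1">Chapter 1</a></li>'
    '<li><a href="http://www.44pq.com/read/13567_2.html">Chapter 2</a></li>'
    '<li><a href="http://www.44pq.com/about.html">About</a></li>'
)
chapter_html = (
    '<div class="readerContent" id="content">'
    "&nbsp;&nbsp;First line.<br/>\n&nbsp;&nbsp;Second line."
    "</div>"
)

# Each match yields (chapter URL, chapter title); the "About" link does not match.
chapters = [(m.group(1), m.group(2)) for m in A_TAG.finditer(toc_html)]

# Extract the chapter body, then strip tags and &nbsp; entities,
# similar to what getContent does.
body = CONTENT_TAG.search(chapter_html).group(1)
body = re.sub(r"<[^>]+?>", "", body).replace("&nbsp;", "")

print(chapters)
print(body)
```

Only links of the form /read/<digits>_<digits>.html are collected, which is how the script tells chapter links apart from other links on the page.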

Note that this code writes to the file once for every chapter it fetches, so that memory usage does not grow too large.

self.writeFile(bookname,"\n\n".join(text))
del text[:]

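If needed, you can instead write to the file once every N chapters; this only takes one simple extra condition. How to trade memory usage against the number of file writes is a judgment each person makes differently.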
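The "write once every N chapters" variant could look roughly like the sketch below. It is Python 3, not part of the original script, and the names write_in_batches, append_text and batch_size are hypothetical. It also adds a blank line after each batch so that consecutive batches do not run together, which the original writeFile call does not do.

```python
import os
import tempfile


def append_text(path, text):
    # Append text to the file, encoded as UTF-8.
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)


def write_in_batches(chapters, path, batch_size=10):
    """Append (title, content) pairs to path, writing once per batch_size chapters.

    Returns the number of file writes performed.
    """
    buffer = []
    writes = 0
    for title, content in chapters:
        buffer.append(title)
        buffer.append(content)
        # The buffer holds two entries per chapter (title and content).
        if len(buffer) >= 2 * batch_size:
            append_text(path, "\n\n".join(buffer) + "\n\n")
            writes += 1
            del buffer[:]
    # Flush whatever is left over after the loop.
    if buffer:
        append_text(path, "\n\n".join(buffer) + "\n\n")
        writes += 1
    return writes


# Usage with placeholder data: 25 chapters, 10 chapters per write.
demo_chapters = [("Chapter %d" % i, "text %d" % i) for i in range(1, 26)]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
demo_writes = write_in_batches(demo_chapters, demo_path, batch_size=10)
print(demo_writes)
```

With batch_size=1 this behaves like the per-chapter writes in the script; larger values hold more text in memory in exchange for fewer writes.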
