Practical Tips | Python from Beginner to Writing POCs: Web Crawlers

Published: 2023-09-27 14:21:06

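"Python from Beginner to Writing POCs" is a complete series of original tutorials by Exp1ore, an author on the i春秋 forum. If you want to learn Python systematically, don't miss it!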

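When it comes to web crawlers, a great many of them seem to be written in Python.

Useful tools include the re module, BeautifulSoup, pyspider, pyquery and so on. You will of course also need requests, urllib, and urllib2 (Python 2 only), and there is also hackhttp, a library developed by the company 四叶草.

PS: BeautifulSoup, requests, and pyspider are all third-party libraries, so they have to be installed first.

The BeautifulSoup module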

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

Create an object with BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> html = """
... <html>
... <head>
... <title>The Dormouse's story</title>
... </head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
...
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
... <p class="story">...</p>
... </body>
... </html>
... """
>>>
>>> soup = BeautifulSoup(html)
C:\Python27\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file <stdin>. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))
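
The warning in the session above goes away once a parser is named explicitly, as the message itself suggests. Here is a minimal sketch, assuming Python 3 with the bs4 package installed; the shortened HTML string is only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document, for illustration only.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
</body></html>
"""

# Naming the parser avoids the "No parser was explicitly specified" warning
# and keeps the parsing behaviour the same from one machine to the next.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # text inside <title>
first_p_class = soup.p["class"]  # class is multi-valued, so this is a list

print(title_text)
print(first_p_class)
```

Note that under Python 3 the strings are printed without the u prefix seen in Python 2 sessions.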

Ways to navigate the structured data:

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
u'title'
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']
[u'title']
>>> soup.head
<head>\n<title>The Dormouse's story</title>\n</head>
>>> soup.p.attrs
{u'class': [u'title']}

For a crawler, suppose you want to collect every link. Inspecting the HTML shows that they all sit in <a> tags, so a simple loop does the job.

>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
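
When installing third-party packages is not an option, the same link collection can be approximated with the standard library. The sketch below swaps BeautifulSoup for Python 3's html.parser module. The LinkCollector class and the extract_links helper are names made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<p><a href="http://example.com/elsie" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie">Lacie</a> and '
    '<a name="anchor-only">no link here</a></p>'
)
print(extract_links(sample))
```

An <a> tag without an href is simply skipped, which is the one case where the BeautifulSoup loop would instead print None.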

And what if I want to grab all of the text?

Then use the following command:

>>> print soup.get_text()
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
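
By default get_text() joins the text nodes with no separator, so words on either side of a tag can end up glued together. A small sketch of one way around that, assuming Python 3 with bs4 installed and using a shortened sample string: pass a separator and strip=True.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, for illustration only.
html = (
    '<p class="story">Once upon a time '
    '<a href="http://example.com/elsie">Elsie</a> and '
    '<a href="http://example.com/lacie">Lacie</a> lived in a well.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Put each text node on its own line and trim surrounding whitespace.
text = soup.get_text("\n", strip=True)
lines = text.split("\n")

for line in lines:
    print(line)
```

Each text node ends up on its own line, which is usually easier to post-process than one long run of text.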

Next, let's write a simple crawler: a subdomain lookup tool that queries 站长帮手.

First, capture and analyze the request. Here I use Burp.

POST /subdomain/ HTTP/1.1
Host: i.links.cn
Content-Length: 34
Cache-Control: max-age=0
Origin: http://i.links.cn
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://i.links.cn/subdomain/
Accept-Language: zh-CN,zh;q=0.8
Cookie: ASPSESSIONIDCCRSRCQS=NFFNBODCNABACIGOEODDFKLG; __guid=12224748.1912086146849820700.1503481265395.9385; UM_distinctid=15e0e7780082dd-0f197d4291ddaa-5d4e211f-1fa400-15e0e7780091e6; linkhelper=sameipb3=1&sameipb4=1&sameipb2=1; serverurl=; ASPSESSIONIDQARRSARR=DNCFMEADGBBFOICPGKMFCNPK; safedog-flow-item=; monitor_count=2; umid=umid=f449b116e07d1d4f3d2dc5352b7fede9&querytime=2017%2D8%2D24+14%3A09%3A09; CNZZDATA30012337=cnzz_eid%3D226371595-1503478989-%26ntime%3D1503554751
Connection: close

domain=ichunqiu.com&b2=1&b3=1&b4=1

This shows that it is a POST request, and the POST data submitted is:

domain=ichunqiu.com&b2=1&b3=1&b4=1
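
To make the structure of that body explicit, it can be split into its fields and rebuilt with the standard library. This is only a sketch, assuming Python 3; build_body is a helper name invented here, and what the b2, b3 and b4 fields mean to the site is not known from the capture, so they are copied through unchanged.

```python
from urllib.parse import parse_qsl, urlencode

# The body exactly as captured in the request above.
captured_body = "domain=ichunqiu.com&b2=1&b3=1&b4=1"

# Split the form-encoded body into (name, value) pairs.
fields = parse_qsl(captured_body)
print(fields)


def build_body(domain):
    """Rebuild the body for another domain; only the 'domain' field changes."""
    pairs = [(name, domain if name == "domain" else value) for name, value in fields]
    return urlencode(pairs)


rebuilt = build_body("ichunqiu.com")
# The length matches the Content-Length header of the captured request.
print(rebuilt, len(rebuilt))
```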

So we use the requests module:

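# -*- coding: utf-8 -*-
import requests

url = 'http://i.links.cn/subdomain/'
payload = 'domain=ichunqiu.com&b2=1&b3=1&b4=1'  # POST body captured in Burp
r = requests.post(url=url, data=payload)
print r.text  # Python 2 print statement; this is the line that raises the error below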
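
To see what requests would put on the wire without contacting the site, a request can be prepared but not sent. This is a sketch, assuming Python 3 with requests installed; the dict form of the fields is my own restatement of the captured body. It also shows the difference between data= (form body) and params= (URL query string), which matters because the later listings in this article pass the fields through params=.

```python
import requests

url = "http://i.links.cn/subdomain/"
fields = {"domain": "ichunqiu.com", "b2": "1", "b3": "1", "b4": "1"}

# Prepare the requests without sending them, so nothing touches the network.
as_form = requests.Request("POST", url, data=fields).prepare()
as_query = requests.Request("POST", url, params=fields).prepare()

# data= puts the fields in the request body as a URL-encoded form.
print(as_form.body)
print(as_form.headers["Content-Type"])

# params= puts the same fields in the URL and leaves the body empty.
print(as_query.url)
print(as_query.body)
```

Whether the site accepts the fields from the query string as well as from the body is a property of the server, so it is worth checking both against the captured request.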

This raised an error:

Traceback (most recent call last):
  File "demo.py", line 8, in <module>
    print r.text
UnicodeEncodeError: 'gbk' codec can't encode character u'\xcf' in position 386: illegal multibyte sequence

So we need to change the encoding:

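import requests

url = 'http://i.links.cn/subdomain/'
payload = "domain=ichunqiu.com&b2=1&b3=1&b4=1"
# note: params= sends the fields in the URL query string (the first attempt used data=)
r = requests.post(url=url, params=payload)
# r.text was decoded as ISO-8859-1; encoding it back yields the page's original bytes
con = r.text.encode('ISO-8859-1')
print con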
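
Why does encoding to ISO-8859-1 help? ISO-8859-1 maps every byte to exactly one character, so text that was decoded with it by mistake can be turned back into the original bytes without loss. The sketch below (Python 3) simulates this with a made-up GBK string. That the real page is served as GBK is an assumption, suggested only by the 'gbk' codec named in the traceback.

```python
# Assumption: the server sends GBK bytes without declaring a charset.
page_text = "子域名查询"              # made-up page content
raw_bytes = page_text.encode("gbk")   # what would travel over the wire

# Decoding with the wrong codec produces mojibake but loses no information.
wrongly_decoded = raw_bytes.decode("iso-8859-1")

# Encoding back with the same codec recovers the original bytes...
recovered_bytes = wrongly_decoded.encode("iso-8859-1")

# ...which can then be decoded with the codec the page really uses.
correct_text = recovered_bytes.decode("gbk")

print(recovered_bytes == raw_bytes)
print(correct_text)
```

With requests, another option is to set r.encoding to the page's real encoding before reading r.text, again assuming you know what that encoding is.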

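After that the page prints correctly, and we can bring in re or BeautifulSoup.

Here I use re, which is simple and direct. Looking at the page source, the results sit between the following pieces of code: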

value="http://ichunqiu.com"/><input

import re

# capture whatever sits between value=" and "><input (non-greedy)
a = re.compile('value="(.+?)"><input')
result = a.findall(con)

findall already returns a list, so join its items into a single string, one result per line:

output = '\n'.join(result)  # avoid naming the variable "list", which shadows the built-in
print output
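
The extract-and-join steps can be tried offline on a sample string. The markup below is hypothetical, written only to have the same value="..."><input shape as the pattern expects; it is not the real response from the site. The sketch assumes Python 3.

```python
import re

# Hypothetical markup shaped like the pattern, not the site's real response.
sample_html = (
    '<input name="d1" value="http://a.example.com"><input name="d2" '
    'value="http://b.example.com"><input type="submit">'
)

# Non-greedy group: stop at the first '"><input' after each value="
pattern = re.compile(r'value="(.+?)"><input')
result = pattern.findall(sample_html)  # a list of captured strings

output = "\n".join(result)             # one result per line
print(output)
```

Because the pattern depends on the exact characters that follow each value, it is fragile: if the page writes "/> instead of "> before the next input, nothing will match.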

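Let's keep improving the code. Isn't editing the source for every query a bit of a hassle?

Here we use the sys library to read the domain from the command-line arguments (sys.argv), adjust the code a little, and build the payload with the format function.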

payload = ("domain={domain}&b2=1&b3=1&b4=1".format(domain=domain))

Then define a get function that takes domain as its parameter.

# -*- coding: utf-8 -*-
import re
import sys

import requests


def get(domain):
    url = 'http://i.links.cn/subdomain/'
    payload = "domain={domain}&b2=1&b3=1&b4=1".format(domain=domain)
    r = requests.post(url=url, params=payload)
    # recover the page's original bytes (see the encoding fix above)
    con = r.text.encode('ISO-8859-1')
    a = re.compile('value="(.+?)"><input')
    result = a.findall(con)
    output = '\n'.join(result)
    print output


if __name__ == '__main__':
    # everything after the script name is treated as the domain
    command = sys.argv[1:]
    f = "".join(command)
    get(f)
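
One thing to watch in the script above: "".join(sys.argv[1:]) quietly glues several arguments together and passes an empty string when none is given. A slightly stricter variant might look like the sketch below (Python 3). parse_domain and build_payload are names made up for this example, and the usage message is my own wording.

```python
def parse_domain(argv):
    """Return the single domain argument, or raise ValueError with a usage hint."""
    args = argv[1:]
    if len(args) != 1 or not args[0].strip():
        raise ValueError("usage: {} <domain>".format(argv[0]))
    return args[0].strip()


def build_payload(domain):
    # Same format string as in the article, wrapped in a function.
    return "domain={domain}&b2=1&b3=1&b4=1".format(domain=domain)


# Simulated command line: python demo.py ichunqiu.com
print(build_payload(parse_domain(["demo.py", "ichunqiu.com"])))

# A missing argument is reported instead of silently querying an empty domain.
try:
    parse_domain(["demo.py"])
except ValueError as err:
    print(err)
```

In a real script, sys.argv would be passed to parse_domain and the program would exit with a non-zero status on error.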

That's it. Let's try it out.

That's all for today. Did it all make sense? If you enjoyed this article, remember to give it a like!