Section 14.9: Using urllib.request + BeautifulSoup in Python to Get Basic Information About a URL Access

Time: 2023-09-27 14:26:58

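After reading the content of a URL with urllib.request and parsing it with BeautifulSoup, you can print basic information about the HTML document and about the access itself through a few attributes of the BeautifulSoup, request, and response objects. Taking a visit to the blog post "Section 14.6: Implementing Simulated Browser Access to Web Pages with Python urllib.request" as the example, the code for reading and parsing is as follows: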

>>> from bs4 import BeautifulSoup
>>> import urllib.request
>>> def getURLinf(url):
...     # Present a browser User-Agent to simulate a browser visit
...     header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'}
...     req = urllib.request.Request(url=url, headers=header)
...     resp = urllib.request.urlopen(req, timeout=5)
...     # decode() defaults to UTF-8, which matches the charset this page declares
...     html = resp.read().decode()
...     # The 'lxml' parser requires the third-party lxml package
...     soup = BeautifulSoup(html, 'lxml')
...     return (soup, req, resp)
...
>>> soup, req, resp = getURLinf(r'https://blog.csdn.net/LaoYuanPython/article/details/100629947')
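
The call to `resp.read().decode()` above assumes the page is UTF-8. As a hedged sketch of a more defensive variant, the helper below reads the charset declared in the `Content-Type` header and falls back to UTF-8. The name `decode_body` and the sample data are made up for illustration. It relies on `resp.headers` being an `http.client.HTTPMessage`, a subclass of `email.message.Message`, so a hand-built `Message` stands in for a real response here.

```python
from email.message import Message


def decode_body(raw, headers, default='utf-8'):
    """Decode response bytes using the charset declared in Content-Type."""
    # get_content_charset() returns the lower-cased charset parameter, or None
    charset = headers.get_content_charset() or default
    # errors='replace' keeps a stray bad byte from aborting the whole decode
    return raw.decode(charset, errors='replace')


# Stand-ins for resp.headers and resp.read(); no network access is needed
sample_headers = Message()
sample_headers['Content-Type'] = 'text/html; charset=GBK'
sample_body = '<title>中文标题</title>'.encode('gbk')

decoded = decode_body(sample_body, sample_headers)
print(decoded)

# With no charset declared, the helper falls back to UTF-8
plain_headers = Message()
plain_headers['Content-Type'] = 'text/html'
fallback = decode_body('héllo'.encode('utf-8'), plain_headers)
print(fallback)
```

Inside `getURLinf`, this would mean replacing the `decode()` line with `decode_body(resp.read(), resp.headers)`.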

The basic information available includes:
1. Document title

>>> soup.title
<title>第14.6节 使用Python urllib.request模拟浏览器访问网页的实现代码 - 老猿Python - CSDN博客</title>

2. Whether the document is an XML document

>>> soup.is_xml
False
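
`soup.title` returns the whole `<title>` tag rather than just its text. The minimal sketch below shows how to get only the title text and what `is_xml` reports for an HTML parse. It uses a made-up inline HTML string and the built-in `'html.parser'` instead of `'lxml'`, so it runs without a network connection or the lxml package:

```python
from bs4 import BeautifulSoup

# Made-up document standing in for the downloaded page
sample_html = '<html><head><title>Demo page</title></head><body><p>hi</p></body></html>'
soup = BeautifulSoup(sample_html, 'html.parser')

# soup.title is the Tag object, or None when the document has no <title>
title_tag = soup.title
# .string is just the text inside the tag
title_text = title_tag.string if title_tag is not None else None

print(title_tag)
print(title_text)
print(soup.is_xml)  # False, because an HTML parser was used
```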

3. The document's URL

>>> req.full_url
'https://blog.csdn.net/LaoYuanPython/article/details/100629947'
>>> resp.geturl()
'https://blog.csdn.net/LaoYuanPython/article/details/100629947'
>>> resp.url
'https://blog.csdn.net/LaoYuanPython/article/details/100629947'
>>>
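
The three expressions return the same string here because the request was not redirected. `req.full_url` is the URL that was requested, while `resp.geturl()` and `resp.url` give the URL that was finally retrieved, so they can differ after a redirect. The sketch below illustrates this with a throwaway local server. The handler class, the `/old` and `/new` paths, and the empty `ProxyHandler` (used only to keep local traffic away from any configured proxy) are assumptions for the demo, not part of the original example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectDemoHandler(BaseHTTPRequestHandler):
    """Answers /old with a 302 redirect to /new, and any other path with a tiny page."""

    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)
            self.send_header('Location', '/new')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            body = b'<html><head><title>demo</title></head></html>'
            self.send_response(200)
            self.send_header('Content-Type', 'text/html; charset=utf-8')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(('127.0.0.1', 0), RedirectDemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# An opener with an empty proxy map talks to the local server directly
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
req = urllib.request.Request('http://127.0.0.1:%d/old' % server.server_port)
try:
    with opener.open(req, timeout=5) as resp:
        requested_url = req.full_url  # what was asked for
        final_url = resp.geturl()     # where the response actually came from
        status = resp.getcode()
finally:
    server.shutdown()
    server.server_close()

print(requested_url)
print(final_url)
print(status)
```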

4. The host where the document resides

>>> req.host
'blog.csdn.net'

5. Request header information

>>> req.header_items()
[('Host', 'blog.csdn.net'), ('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36')]
>>>
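
Note that the `Host` entry in the output above was added by urllib when the request was sent. A `Request` that has not been opened yet lists only the headers passed in. The following sketch builds a request without sending it (the short `demo-agent/1.0` value is a placeholder) and also shows that urllib stores header names capitalized, which is why the key appears as `User-agent`:

```python
import urllib.request

url = 'https://blog.csdn.net/LaoYuanPython/article/details/100629947'
# Placeholder User-Agent; the request is only constructed, never sent
req = urllib.request.Request(url=url, headers={'User-Agent': 'demo-agent/1.0'})

print(req.full_url)                  # the URL as given
print(req.type)                      # URL scheme
print(req.host)                      # host part of the URL
print(req.selector)                  # path sent in the request line
print(req.get_method())              # 'GET' because no data was attached
print(req.header_items())            # only the headers supplied so far
print(req.get_header('User-agent'))  # lookup uses the capitalized key
```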

6. Response status code

>>> resp.getcode()
200
>>>
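
`resp.getcode()` is only reached when the request succeeds: for 4xx and 5xx responses, `urllib.request.urlopen` raises `urllib.error.HTTPError`, whose `code` attribute carries the status. A minimal sketch of handling both cases follows. The `fetch_status` helper and its `open_func` parameter are hypothetical names, and a fake opener simulates a 404 so the example runs offline.

```python
import urllib.error
import urllib.request


def fetch_status(url, open_func=urllib.request.urlopen):
    """Return the HTTP status code for url, including error statuses."""
    try:
        with open_func(url, timeout=5) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is on the exception
        return err.code


def fake_not_found(url, timeout=None):
    """Stand-in for urlopen that always fails with 404, so no network is needed."""
    raise urllib.error.HTTPError(url, 404, 'Not Found', None, None)


status = fetch_status('https://example.invalid/missing', open_func=fake_not_found)
print(status)
```

With the default `open_func`, the same helper would perform a real request, for example `fetch_status(req)` with the `req` object built earlier.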

7. Response HTTP header information

>>> resp.headers.items()
[('Date', 'Sun, 08 Sep 2019 15:07:12 GMT'), ('Content-Type', 'text/html; charset=UTF-8'), ('Transfer-Encoding', 'chunked'), ('Connection', 'close'), ('Set-Cookie', 'acw_tc=2760828215679552322374611eb7315abdcfe4ee6f7af5d157db5621c4267d;path=/;HttpOnly;Max-Age=2678401'), ('Server', 'openresty'), ('Vary', 'Accept-Encoding'), ('Set-Cookie', 'uuid_tt_dd=10_19729129290-1567955232238-614052; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;'), ('Set-Cookie', 'dc_session_id=10_1567955232238.557324; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;'), ('Vary', 'Accept-Encoding'), ('Strict-Transport-Security', 'max-age=86400')]
>>>
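
The list above contains `Set-Cookie` three times and `Vary` twice, so converting it to a `dict` would silently drop entries. Because `resp.headers` is an `http.client.HTTPMessage`, repeated headers can be read with `get_all()`, and lookups are case-insensitive. The sketch below builds such a message by hand, with shortened, made-up cookie values, so it runs without a network connection:

```python
from http.client import HTTPMessage

# Hand-built stand-in for resp.headers; values are shortened placeholders
headers = HTTPMessage()
headers['Content-Type'] = 'text/html; charset=UTF-8'
headers['Set-Cookie'] = 'acw_tc=abc; path=/; HttpOnly'
headers['Set-Cookie'] = 'uuid_tt_dd=10_123; Path=/; Domain=.csdn.net'
headers['Vary'] = 'Accept-Encoding'

first_cookie = headers['set-cookie']         # case-insensitive, first match only
all_cookies = headers.get_all('Set-Cookie')  # every value, in order
content_type = headers.get_content_type()    # media type without parameters
charset = headers.get_content_charset()      # lower-cased charset parameter
as_dict = dict(headers.items())              # duplicates collapse to the last value

print(first_cookie)
print(all_cookies)
print(content_type, charset)
print(as_dict['Set-Cookie'])
```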

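This section showed that, after reading a URL with urllib.request and parsing it with BeautifulSoup, some basic information about the URL access is easy to obtain. Together, these values give a brief overview of the visit.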

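LaoYuanPython (老猿Python): learn Python with Lao Yuan!
Blog: https://blog.csdn.net/LaoYuanPython
LaoYuanPython blog article index: https://blog.csdn.net/LaoYuanPython/article/details/98245036
Your support is appreciated: please like, comment, and follow. Thank you!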
