php抓取网页body内容,并过滤网页标签
2023-09-14 08:57:34 时间
php只抓取网页文字内容,并过滤其标签,说干就干,开始!
<?php function curl_request ( $url , $post = '' , $cookie = '' , $returnCookie = 0 ) { $ua = $ua==''?$_SERVER ['HTTP_USER_AGENT']:'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)' ; $curl = curl_init ( ) ; curl_setopt ( $curl , CURLOPT_URL , $url ) ; curl_setopt ( $curl , CURLOPT_USERAGENT , $ua ) ; curl_setopt ( $curl , CURLOPT_FOLLOWLOCATION , 1 ) ; curl_setopt ( $curl , CURLOPT_AUTOREFERER , 1 ) ; curl_setopt ( $curl , CURLOPT_REFERER , "https://www.baidu.com" ) ; if ( $post ) { curl_setopt ( $curl , CURLOPT_POST , 1 ) ; curl_setopt ( $curl , CURLOPT_POSTFIELDS , http_build_query ( $post ) ) ; } if ( $cookie ) { curl_setopt ( $curl , CURLOPT_COOKIE , $cookie ) ; } curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false); curl_setopt ( $curl , CURLOPT_HEADER , $returnCookie ) ; curl_setopt ( $curl , CURLOPT_TIMEOUT , 10 ) ; curl_setopt ( $curl , CURLOPT_RETURNTRANSFER , 1 ) ; $data = curl_exec ( $curl ) ; if ( curl_errno ( $curl ) ) { return curl_error ( $curl ) ; } curl_close ( $curl ) ; if ( $returnCookie ) { list ( $header , $body ) = explode ( "\r\n\r\n" , $data , 2 ) ; preg_match_all ( "/Set\-Cookie:([^;]*);/" , $header , $matches ) ; $info [ 'cookie' ] = substr ( $matches [ 1 ] [ 0 ] , 1 ) ; $info [ 'content' ] = $body ; return $info ; } else { //return $data ; $data=mb_convert_encoding($data, 'UTF-8', 'UTF-8,GBK,GB2312,BIG5'); preg_match("/<body.*?>(.*?)<\/body>/is",$data,$match); $str= trim($match[1]); $html = strip_tags($str); $html_len = mb_strlen($html,'UTF-8'); $html = mb_substr($html, 0, strlen($html), 'UTF-8'); $search = array(" "," ","\n","\r","\t"); $replace = array("","","","",""); echo str_replace($search, $replace, $html); } } curl_request ( $url, $post = '' , $cookie = '' , $returnCookie = 0 ); ?>
相关文章
- PHP IE9 AJAX success 返回 undefined 问题解决
- php 一个文件搞定支付宝支付,微信支付
- 使用XHProf分析PHP性能瓶颈(一)
- PHP判断远程文件是否存在
- PHP计算两个时间段是否有交集(边界重叠不算)
- PHP Socket编程 之 php中连接tcp服务的三种方式
- Linux 上安装 PHP 扩展
- php:如何使用PHP排序, key为字母+数字的数组(多维数组)
- 解决PHP中file_get_contents抓取网页中文乱码问题
- php分享十八:网页抓取
- PHP中获取当前页面的完整URL & php $_SERVER中的SERVER_NAME 和HTTP_HOST的区别
- PHP中MySQL、MySQLi和PDO的用法和区别
- PHP读取配置文件类(php,ini,yaml,xml)
- PHP 正则匹配中文
- [php-src] Php扩展的多文件编译
- Atitit 前端测试最简化内嵌web服务器 php 与node.js 目录 1.1. php内置Web Server1 1.2. Node的2 Node的比较麻烦些。。Php更加简单
- atitit.软件开发GUI 布局管理优缺点总结java swing wpf web html c++ qt php asp.net winform
- PHP 5 SimpleXML 函数
- PHP面试题:你所知道的php数组相关的函数?
- PHP+FastCGI+Nginx动态请求处理配置
- CentOS下yum安装PHP,配置php-fpm服务
- php使用solr全文搜索引擎
- PHP PSR-2 代码风格规范 (中文版)