jsoup HTML parser hello world examples (repost)

Date: 2023-09-11 14:21:40

Original article: http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/

jsoup is an HTML parser whose “jquery-like” and “regex” selector syntax is easy to use and flexible enough to extract most of what you need from a page. The examples below show how to use jsoup to get links, images, the page title, meta elements, “div” element content, form inputs and the favicon from an HTML page.

Download jsoup
jsoup is available in the Maven Central repository. Non-Maven users can download the jar from the jsoup website.

pom.xml

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>

1. Grab All Hyperlinks

This example shows how to use jsoup to get a page’s title and grab all links from “google.com”.

HTMLParserExample1.java

package com.mkyong;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class HTMLParserExample1 {

    public static void main(String[] args) {

        Document doc;
        try {

            // need http protocol
            doc = Jsoup.connect("http://google.com").get();

            // get page title
            String title = doc.title();
            System.out.println("title : " + title);

            // get all links
            Elements links = doc.select("a[href]");
            for (Element link : links) {

                // get the value from href attribute
                System.out.println("\nlink : " + link.attr("href"));
                System.out.println("text : " + link.text());

            }

        } catch (IOException e) {
            e.printStackTrace();
        }

    }

}

Output

title : Google

link : http://www.google.com.my/imghp?hl=en&tab=wi
text : Images

link : http://maps.google.com.my/maps?hl=en&tab=wl
text : Maps

//omitted for readability
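
The href values printed above are whatever the page author wrote, so on other pages they may be relative paths. The following is a minimal sketch, using only the JDK's java.net.URI rather than jsoup, of how such values can be resolved against the address of the page they came from. The class name LinkResolver, the page address and the sample paths are made up for illustration. jsoup can do a similar resolution itself through the “abs:href” attribute key when the document has a base URI.

```java
import java.net.URI;

public class LinkResolver {

    // Resolve an href (absolute or relative) against the URL of the page it was found on.
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {

        // hypothetical page address, not taken from the article
        String page = "http://example.com/news/index.html";

        // root-relative path: prints http://example.com/imghp?hl=en
        System.out.println(resolve(page, "/imghp?hl=en"));

        // path relative to the page's directory: prints http://example.com/news/today.html
        System.out.println(resolve(page, "today.html"));

        // already absolute, returned unchanged: prints http://maps.example.com/
        System.out.println(resolve(page, "http://maps.example.com/"));
    }

}
```

URI.create throws IllegalArgumentException on malformed input, so real-world href values may need cleaning or a try/catch around the call.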

Note
It is recommended to set a “userAgent” on the jsoup connection, because some servers answer requests without a browser-like user agent with an HTTP 403 error.

Document doc = Jsoup.connect("http://anyurl.com")
        .userAgent("Mozilla")
        .get();

2. Grab All Images

The second example shows how to use the jsoup regex attribute selector to grab all image elements whose source is a png, jpg or gif file from “yahoo.com”.

HTMLParserExample2.java

package com.mkyong;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class HTMLParserExample2 {

    public static void main(String[] args) {

        Document doc;
        try {

            //get all images
            doc = Jsoup.connect("http://yahoo.com").get();
            Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
            for (Element image : images) {

                System.out.println("\nsrc : " + image.attr("src"));
                System.out.println("height : " + image.attr("height"));
                System.out.println("width : " + image.attr("width"));
                System.out.println("alt : " + image.attr("alt"));

            }

        } catch (IOException e) {
            e.printStackTrace();
        }

    }

}

Output

src : http://l.yimg.com/a/i/mntl/ww/events/p.gif
height : 50
width : 202
alt : Yahoo!

src : http://l.yimg.com/a/i/ww/met/intl_flag_icons/20111011/my_flag.gif
height : 
width : 
alt :

//omitted for readability
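
To see which src values the expression in the selector accepts, the regex can be tried on its own with java.util.regex. This is a small sketch under the assumption that jsoup applies the [attr~=regex] pattern with “find” semantics, meaning a match anywhere in the attribute value. The class name ImageSrcFilter and the sample values are invented for illustration.

```java
import java.util.regex.Pattern;

public class ImageSrcFilter {

    // Same expression as in the selector img[src~=(?i)\.(png|jpe?g|gif)]
    private static final Pattern ANYWHERE = Pattern.compile("(?i)\\.(png|jpe?g|gif)");

    // Stricter variant: the extension must be the very end of the value
    private static final Pattern AT_END = Pattern.compile("(?i)\\.(png|jpe?g|gif)$");

    // True if an image extension appears anywhere in the value
    public static boolean containsImageExtension(String src) {
        return ANYWHERE.matcher(src).find();
    }

    // True only if the value ends with an image extension
    public static boolean endsWithImageExtension(String src) {
        return AT_END.matcher(src).find();
    }

    public static void main(String[] args) {

        // hypothetical src values
        String[] samples = {
            "http://example.com/logo.PNG",
            "/img/photo.jpeg?v=2",
            "/img/banner.svg",
            "/files.gif/readme.txt"
        };

        for (String src : samples) {
            System.out.println(src
                    + " anywhere=" + containsImageExtension(src)
                    + " atEnd=" + endsWithImageExtension(src));
        }
    }

}
```

The unanchored form is case-insensitive and tolerates query strings such as “?v=2”, but it also accepts a value like “/files.gif/readme.txt”. The anchored form rejects that value, at the cost of also rejecting sources that carry a query string.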

3. Get Meta elements

This example builds an HTML page in memory, so no network access is needed, and uses jsoup to parse the content. It grabs the “meta” keywords and description, and also the text of the div element with the id of “color”.

HTMLParserExample3.java

package com.mkyong;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HTMLParserExample3 {

    public static void main(String[] args) {

        StringBuilder html = new StringBuilder();

        html.append("<!DOCTYPE html>");
        html.append("<html lang=\"en\">");
        html.append("<head>");
        html.append("<meta charset=\"UTF-8\" />");
        html.append("<title>Hollywood Life</title>");
        html.append("<meta name=\"description\" content=\"The latest entertainment news\" />");
        html.append("<meta name=\"keywords\" content=\"hollywood gossip, hollywood news\" />");
        html.append("</head>");
        html.append("<body>");
        html.append("<div id='color'>This is red</div>");
        html.append("</body>");
        html.append("</html>");

        Document doc = Jsoup.parse(html.toString());

        //get meta description content
        String description = doc.select("meta[name=description]").get(0).attr("content");
        System.out.println("Meta description : " + description);

        //get meta keyword content
        String keywords = doc.select("meta[name=keywords]").first().attr("content");
        System.out.println("Meta keyword : " + keywords);

        //get the text of the div with id "color", via two equivalent lookups
        String color1 = doc.getElementById("color").text();
        String color2 = doc.select("div#color").get(0).text();

        System.out.println(color1);
        System.out.println(color2);

    }

}

Output

Meta description : The latest entertainment news
Meta keyword : hollywood gossip, hollywood news
This is red
This is red

4. Grab Form Inputs

This code snippet shows how to use jsoup to grab HTML form inputs (name and value). For detailed usage, refer to the article on automating a website login with Java.

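import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public List<String> getFormParams(String html) {

	Document doc = Jsoup.parse(html);

	// look up the form by its HTML id; replace with the real form id
	Element loginform = doc.getElementById("your_form_id");
	if (loginform == null) {
		// no such form on the page
		return new ArrayList<String>();
	}

	Elements inputElements = loginform.getElementsByTag("input");

	// collect every input as a name=value pair
	List<String> paramList = new ArrayList<String>();
	for (Element inputElement : inputElements) {
		String key = inputElement.attr("name");
		String value = inputElement.attr("value");
		paramList.add(key + "=" + value);
	}

	return paramList;

}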
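
Input names and values grabbed from a form are typically sent back to the server as a URL-encoded request body, for example when automating a login. The sketch below shows one way to build such a body with the JDK's URLEncoder. It is independent of jsoup, and the class name FormParamEncoder and the field names and values are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormParamEncoder {

    // Join the fields as name=value pairs separated by '&', URL-encoding both parts.
    public static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encodePart(field.getKey()))
                .append('=')
                .append(encodePart(field.getValue()));
        }
        return body.toString();
    }

    private static String encodePart(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a standard JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {

        // LinkedHashMap keeps the fields in insertion order; names and values are made up
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("user", "mkyong");
        fields.put("pass", "a b&c");

        // prints user=mkyong&pass=a+b%26c
        System.out.println(encode(fields));
    }

}
```

Encoding each part matters because characters such as “&” and “=” inside a value would otherwise be read as separators by the server.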

5. Get Fav Icon

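This code shows how to use jsoup to get a page’s favicon. It first looks for a “link” element whose href points to an .ico or .png file, and falls back to a “meta” element with itemprop=“image” if none is found.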

jSoupExample.java

package com.mkyong;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class jSoupExample {

    public static void main(String[] args) {

	StringBuilder html = new StringBuilder();

	html.append("<html lang=\"en\">");
	html.append("<head>");
	html.append("<link rel=\"icon\" href=\"http://example.com/image.ico\" />");
	//html.append("<meta content=\"/images/google_favicon_128.png\" itemprop=\"image\">");
	html.append("</head>");
	html.append("<body>");
	html.append("something");
	html.append("</body>");
	html.append("</html>");

	Document doc = Jsoup.parse(html.toString());

	String fav = "";

	Element element = doc.head().select("link[href~=.*\\.(ico|png)]").first();
	if(element==null){

		element = doc.head().select("meta[itemprop=image]").first();
		if(element!=null){
			fav = element.attr("content");
		}
	}else{
		fav = element.attr("href");
	}
	System.out.println(fav);
  }

}

Output

http://example.com/image.ico
