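If you only want to (and I quote) "extract tags", which I take to mean the opening nodes inside the body of your HTML text, you can use the solution below.
Note that this is crude. You should not "parse" HTML with regular expressions (I know you know that, but other readers might not).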
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// simple html file, has head/body and line breaks
String html = "<html>\r\n<head>\r\n<title>Foo</title>\r\n</head>\r\n" +
        "<body>\r\n<h1>Blah</h1>\r\n<h3>Meh</h3>\r\n</body>\r\n</html>";
// the pattern only matches opening nodes without attributes, e.g. <h1>
Pattern tagsWithinBody = Pattern.compile("<\\p{Alnum}+>");
// the matcher is applied to the text strictly between "<body>" and "</body>"
int bodyStart = html.indexOf("<body>") + "<body>".length();
Matcher matcher = tagsWithinBody.matcher(html.substring(bodyStart, html.indexOf("</body>")));
// iterates for as long as the matcher finds another opening node
while (matcher.find()) {
    System.out.println(matcher.group());
}
Output:
<h1>
<h3>
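
The pattern above only matches bare opening nodes such as <h1>. If the body may carry attributes or self-closing nodes, a slightly looser variant could look like the sketch below. It is still a regex hack, not a parser: the class name BodyTagExtractor and the method openingTagsInBody are hypothetical names chosen for illustration, and the code assumes a single body element whose attribute values contain no ">" character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BodyTagExtractor {
    // Opening node: '<', a name starting with a letter, optional attributes, '>'.
    // Closing nodes start with "</", so they never match.
    private static final Pattern OPENING_TAG =
            Pattern.compile("<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");

    // Returns the names of all opening nodes found between <body ...> and </body>.
    static List<String> openingTagsInBody(String html) {
        List<String> names = new ArrayList<>();
        int open = html.indexOf("<body");
        int close = html.indexOf("</body>");
        if (open < 0 || close < open) {
            return names; // no usable body element
        }
        // Skip past the end of the opening body node, attributes included.
        int start = html.indexOf('>', open);
        if (start < 0 || start > close) {
            return names;
        }
        Matcher m = OPENING_TAG.matcher(html.substring(start + 1, close));
        while (m.find()) {
            names.add(m.group(1)); // group 1 is the node name only
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Foo</title></head>"
                + "<body class=\"x\"><h1 id=\"a\">Blah</h1><br/><h3>Meh</h3></body></html>";
        System.out.println(openingTagsInBody(html)); // prints [h1, br, h3]
    }
}
```

Self-closing nodes such as <br/> are reported as well, and comments, scripts, or CDATA sections containing "<" will confuse it. For anything beyond a quick hack, a real HTML parser is the safer choice.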