java - Checking for the presence of HTML tags using Jsoup

Question

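With Jsoup it is easy to count how many times a particular tag occurs in a text. For example, I am trying to find out how many anchor tags are present in a given text.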

    String content = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(content);
    Elements links = doc.select("a[href]"); // a with href
    System.out.println(links.size());

This gives me a count of 4. If I have a sentence and want to know whether it contains any HTML tags at all, is that possible with Jsoup? Thank you.

score 1 · Accepted Answer

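You are possibly better off with a regular expression, but if you really want to use JSoup, you can try to match all elements and then subtract 4, because JSoup automatically adds four elements: first the root element, and then the <html>, <head> and <body> elements.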

This might look roughly like:

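// attempt to count html elements in string - incorrect code, see below
public static int countHtmlElements(String content) {
    Document doc = Jsoup.parse(content);
    // "*" matches every element, including the ones JSoup adds itself
    Elements elements = doc.select("*");
    // subtract the root, <html>, <head> and <body> elements
    return elements.size() - 4;
}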

However, this gives a wrong result if the text itself contains <html>, <head> or <body>. Compare the results of the following:

// gives a correct count of 2 html elements
System.out.println(countHtmlElements("some <b>text</b> with <i>markup</i>"));
// incorrectly counts 0 elements, as the body is subtracted 
System.out.println(countHtmlElements("<body>this gives a wrong result</body>"));

So to make this work, you would have to check for the "magic" tags separately; this is why I feel a regular expression might be simpler.

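More failed attempts to make this work: using parseBodyFragment instead of parse does not help, because JSoup sanitizes the input in the same way. Likewise, counting with doc.select("body *") saves you the trouble of subtracting 4, but it still yields a wrong count if a <body> is involved. It might work under that restriction only if you have an application where you are sure that no <html>, <head> or <body> elements are present in the strings to be checked.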
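As a rough illustration of the regular-expression route mentioned above, here is a minimal sketch that uses only java.util.regex. The class name HtmlTagCheck and the methods containsHtmlTag and countOpeningTags are hypothetical names chosen for this example, and the patterns are a simple heuristic rather than an HTML parser: they assume tag names start with an ASCII letter and that attribute values contain no '<' or '>' characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a heuristic sketch, not a full HTML parser.
final class HtmlTagCheck {

    // Matches an opening, closing or self-closing tag such as <b>, </b>, <br/>.
    // Assumption: attribute values do not contain '<' or '>'.
    private static final Pattern ANY_TAG =
            Pattern.compile("</?[A-Za-z][A-Za-z0-9-]*(?:\\s[^<>]*)?/?>");

    // Matches only opening (or self-closing) tags, so each element counts once.
    private static final Pattern OPENING_TAG =
            Pattern.compile("<[A-Za-z][A-Za-z0-9-]*(?:\\s[^<>]*)?/?>");

    // Returns true if the text contains at least one tag-like token.
    static boolean containsHtmlTag(String text) {
        return ANY_TAG.matcher(text).find();
    }

    // Counts opening tags; <html>, <head> and <body> are counted like any other.
    static int countOpeningTags(String text) {
        Matcher m = OPENING_TAG.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(containsHtmlTag("some <b>text</b> with <i>markup</i>")); // true
        System.out.println(countOpeningTags("some <b>text</b> with <i>markup</i>")); // 2
        System.out.println(countOpeningTags("<body>this gives a wrong result</body>")); // 1
        System.out.println(containsHtmlTag("a < b and c > d")); // false
    }
}
```

Because nothing is added to the input, no "magic" tags have to be subtracted. The trade-off is that anything shaped like a tag, for example "x<y>z", is reported as markup.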
