java - Find out whether HTML code represents visible text or an image

Question

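I have a string containing some HTML code, and I want to find out whether that HTML represents visible text or an image. I approached this in Java with the following regular expressions (I know you can't parse HTML with regexes, but I thought the ones I have would be enough).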

public static String regex_html_tags_1 = "<\\s*br\\s*[/]?>";
public static String regex_html_tags_2 = "<\\s*([a-zA-Z0-9]+)\\s*([^=/>]+\\s*=\\s*[^/>]+\\s*)*\\s*/>"; 
public static String regex_html_tags_3 = "<\\s*([a-zA-Z0-9]+)\\s*([^=>]+\\s*=\\s*[^>]+\\s*)*\\s*>\\s*</\\s*\\1\\s*>"; 

public static String[] HTMLWhiteSpaces = {"&nbsp;", "&#160;"};

The code using these regexes works for strings such as

<h2></h2>

or similar. But a string like

<img src="someImage.png"></img>

is also considered empty.
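The false positive can be reproduced in isolation. The following is a minimal sketch, assuming the third pattern is applied to the whole string with `Pattern.matches` (the question does not show how the patterns are actually applied); the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

public class EmptyTagRegexDemo {

    // Third pattern from the question, copied unchanged.
    static final String REGEX_HTML_TAGS_3 =
            "<\\s*([a-zA-Z0-9]+)\\s*([^=>]+\\s*=\\s*[^>]+\\s*)*\\s*>\\s*</\\s*\\1\\s*>";

    // Assumption: "empty" means the whole string matches the pattern.
    static boolean looksEmpty(String html) {
        return Pattern.matches(REGEX_HTML_TAGS_3, html);
    }

    public static void main(String[] args) {
        // An element with no content matches, as intended.
        System.out.println(looksEmpty("<h2></h2>"));
        // The image also matches, because the pattern only looks at the
        // (missing) content between the tags and ignores what the tag is.
        System.out.println(looksEmpty("<img src=\"someImage.png\"></img>"));
    }
}
```

Both calls print `true`: the pattern has no notion that an `img` tag is visible by itself, regardless of what sits between its opening and closing tags.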

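Does anyone have a better idea than regexes for determining whether some HTML code actually represents human-readable content when interpreted by a browser? Or do you think my approach will eventually work?

Thanks a lot in advance.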

score 2 · Accepted Answer

Try jsoup. It lets you parse an HTML document and query it with CSS selectors (jQuery style).

A very simple example that selects all non-empty elements:

Document doc = Jsoup.connect("http://my.awesome.site.com").get();
Elements nonEmpties = doc.select(":not(:empty)");

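A full-fledged solution will of course take some extra work, such as

iterating over the list of elements,
checking the CSS styles (display, visibility, size, or overlapping elements),
checking the src attribute of images,
etc.

But it is definitely worth it. You will learn a new framework, discover the ways content can be "hidden" in HTML/CSS and, most importantly, stop using regular expressions for HTML parsing ;-)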
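Some of the extra checks listed above can be sketched as follows. This is only a rough illustration under stated assumptions, not a complete visibility test: jsoup does not evaluate stylesheets, so the sketch inspects inline `style` attributes only and ignores size and overlapping elements. The class name `VisibilitySketch` and both method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class VisibilitySketch {

    // Hypothetical helper: true if the element or one of its ancestors
    // carries an inline style with display:none or visibility:hidden.
    static boolean hiddenByInlineStyle(Element element) {
        for (Element e = element; e != null; e = e.parent()) {
            String style = e.attr("style").toLowerCase().replaceAll("\\s+", "");
            if (style.contains("display:none") || style.contains("visibility:hidden")) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical check: visible text, or an image with a non-empty src.
    static boolean looksVisible(String html) {
        Document doc = Jsoup.parse(html);

        // Iterate over all elements in the body and look at the text each one owns directly.
        for (Element e : doc.body().getAllElements()) {
            // Treat non-breaking spaces like ordinary whitespace.
            String ownText = e.ownText().replace('\u00a0', ' ').trim();
            if (!ownText.isEmpty() && !hiddenByInlineStyle(e)) {
                return true;
            }
        }

        // Images count only if they have a non-empty src attribute.
        for (Element img : doc.select("img[src]")) {
            if (!img.attr("src").trim().isEmpty() && !hiddenByInlineStyle(img)) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, `looksVisible("<h2></h2>")` is false, while `looksVisible("<img src=\"someImage.png\"></img>")` is true. Content hidden through an external stylesheet or a class rule would still be reported as visible.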

score 1 · Accepted Answer

I came up with the following code, which works well in my setting, where I don't have to account for invisible elements.

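import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
// StringUtil is jsoup's own helper class; its package depends on the jsoup version
// (org.jsoup.internal.StringUtil in recent releases, org.jsoup.helper.StringUtil in older ones)

// HTML white spaces that might occur in between tags; this list probably needs to be extended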
public static String[] HTML_WHITE_SPACES = {"&nbsp;", "&#160;"};

/**
 * Check whether the given HTML text contains visible text or images.
 *
 * @param htmlText the HTML text that is checked for visibility
 * @return true if htmlText contains visible text or at least one img element,
 *         false otherwise
 */
public static boolean containsVisibleElements(String htmlText) {

    // do not analyze the HTML text if it is blank already
    if (StringUtil.isBlank(htmlText)) {
        return false;
    }

    // the string from which all whitespaces are removed
    String htmlTextRemovedWhiteSpaces = htmlText; 

    // first, remove white spaces from the string
    for (String whiteSpace: HTML_WHITE_SPACES) {
        // replace() treats the entity as a literal, whereas replaceAll() would interpret it as a regex
        htmlTextRemovedWhiteSpaces = htmlTextRemovedWhiteSpaces.replace(whiteSpace, "");
    }

    // the HTML text is blank 
    if (StringUtil.isBlank(htmlTextRemovedWhiteSpaces)) {
        return false;
    }

    // parse the HTML text from which the white space have been removed
    Document doc = Jsoup.parse(htmlTextRemovedWhiteSpaces);

    // find real text within the body (and its children)
    String text = doc.body().text(); 

    // there exists visible text
    if (!StringUtil.isBlank(text.trim())) {
        return true;
    }

    // now we know that there does not exist visible text and that the string 
    // htmlTextRemovedWhiteSpaces is not blank

    // no visible text was found; fall back to images, which are visible but not text ;-)
    // any <img> element counts as visible here; its src attribute is not checked
    Elements images = doc.select("img");
    return !images.isEmpty();
}
