java - How to remove all lines in a text before a line containing certain words

Question

I have a large HTML string that contains some lines before the actual HTML content. These lines are empty HTML and are not really needed.

messageContent will contain the following:

        <td width="35"><br /> </td> 
        <td width="1"><br /> </td> 
        <td width="18"><br /> </td> 
        <td width="101"><br /> </td> 
        <td width="7"><br /> </td> 
        <td rowspan="21" colspan="16" width="689">Geachte&nbsp;heer/mevrouw,<br /> &nbsp;<br /> Wij&nbsp;hebben&nbsp;uw&nbsp;inzending&nbsp;ontvangen&nbsp;en&nbsp;gecontroleerd.&nbsp;Hierbij&nbsp;het&nbsp;verslag&nbsp;van&nbsp;de&nbsp;controle.<br /> &nbsp;<br />

I want to remove/replace everything before the line containing "Geachte", "heer" and "mevrouw".

As output, I only want to keep:

        <td rowspan="21" colspan="16" width="689">Geachte&nbsp;heer/mevrouw,<br /> &nbsp;<br /> Wij&nbsp;hebben&nbsp;uw&nbsp;inzending&nbsp;ontvangen&nbsp;en&nbsp;gecontroleerd.&nbsp;Hierbij&nbsp;het&nbsp;verslag&nbsp;van&nbsp;de&nbsp;controle.<br /> &nbsp;<br />

I figured I would use a BufferedReader to walk through the text line by line:

BufferedReader reader = new BufferedReader(new StringReader(messageContent));
String string;

try {
    while ((string = reader.readLine()) != null) {
        if ((string.length() > 0) && (string.contains("Geachte"))) {
            // remove all lines before this string
        }
    }
} catch (IOException e) { }

How can I achieve this?

score 2 · Accepted Answer

This code will do it.

// Requires: java.io.BufferedReader, java.io.IOException, java.io.StringReader
public String cutText(String messageContent) {
    boolean matchFound = false;
    StringBuilder output = new StringBuilder();
    // try-with-resources closes the reader when the loop is done
    try (BufferedReader reader = new BufferedReader(new StringReader(messageContent))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // The first line containing "Geachte" switches output on
            if (line.contains("Geachte")) {
                matchFound = true;
            }
            // From the matching line onward, keep every line
            if (matchFound) {
                output.append(line).append("\n");
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return output.toString();
}
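The same idea can also be sketched without a reader, and with a check for all three words from the question rather than only "Geachte". This is a minimal sketch, not the answerer's code: the class name `LineCutter`, the method `cutBefore`, and the sample input are made up for illustration, and it assumes lines are separated by `\n`.

```java
import java.util.Arrays;

class LineCutter {
    /**
     * Returns the text starting at the first line that contains all of the
     * given words, or an empty string if no single line contains them all.
     */
    static String cutBefore(String text, String... words) {
        int lineStart = 0;
        while (lineStart <= text.length()) {
            // Find the end of the current line (or the end of the text)
            int lineEnd = text.indexOf('\n', lineStart);
            if (lineEnd < 0) {
                lineEnd = text.length();
            }
            String line = text.substring(lineStart, lineEnd);
            // Keep everything from this line onward once all words are present
            if (Arrays.stream(words).allMatch(line::contains)) {
                return text.substring(lineStart);
            }
            lineStart = lineEnd + 1;
        }
        return "";
    }

    public static void main(String[] args) {
        // Shortened, hypothetical stand-in for messageContent
        String html = "<td width=\"35\"><br /> </td>\n"
                + "<td width=\"1\"><br /> </td>\n"
                + "<td colspan=\"16\">Geachte&nbsp;heer/mevrouw,<br />\n"
                + "Wij&nbsp;hebben&nbsp;uw&nbsp;inzending&nbsp;ontvangen.<br />";
        System.out.println(cutBefore(html, "Geachte", "heer", "mevrouw"));
    }
}
```

Unlike the reader-based version, this keeps the original line endings of the retained part untouched, because it returns a substring instead of rebuilding the text line by line.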

score 1 · Accepted Answer

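The simplest way is to use XPath. First, you need to know the correct path of the tr you want to remove. You can use Chrome Developer Tools (F12 on Linux/Windows, Cmd+Alt+I on Mac): in the Elements tab, select the element you want (using the magnifying glass), right-click it, and choose Copy XPath.

Since your content is a string (not a file), you only need to copy-paste it once (for example while debugging) into an HTML file and open that file in Chrome. It is safer to give the parent of the faulty block a unique id, because the XPath will be shorter and less likely to change.

This will give you something like: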

//*[@id="answers-header"]/div/h2

First, you need to convert the string into a Document:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader("your string")));

Then apply the XPath to the document:

XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
// Replace the placeholder with the XPath copied from Chrome
XPathExpression expr = xpath.compile("<xpath_expression>");
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

And remove the unwanted nodes:

for (int i = 0; i < nodes.getLength(); i++) {
    Node node = nodes.item(i);
    // Detach each matched node from its parent
    node.getParentNode().removeChild(node);
}
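For the question's case, the parse, select, and remove steps could be combined roughly as follows. This is a sketch under assumptions, not a drop-in solution: the class `PrecedingCellRemover`, its methods, and the XPath expression are made up for illustration. `DocumentBuilder` is an XML parser, so the input must be well-formed XML; the original HTML with `&nbsp;` entities would likely be rejected unless it is cleaned up first, which is why the sample below uses plain spaces.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class PrecedingCellRemover {
    /** Parses a well-formed XML string into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /**
     * Removes every td that is a preceding sibling of a td whose text
     * contains the marker. Returns the number of removed cells.
     * Assumption: the marker contains no single quote.
     */
    static int removeCellsBefore(Document doc, String marker) {
        String expression = "//td[contains(., '" + marker + "')]/preceding-sibling::td";
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                node.getParentNode().removeChild(node);
            }
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, well-formed stand-in for the question's HTML
        Document doc = parse("<tr><td width=\"35\"><br /></td><td width=\"1\"><br /></td>"
                + "<td colspan=\"16\">Geachte heer/mevrouw,<br />Wij hebben uw inzending ontvangen.</td></tr>");
        int removed = removeCellsBefore(doc, "Geachte");
        System.out.println(removed + " cells removed, "
                + doc.getElementsByTagName("td").getLength() + " left");
    }
}
```

Selecting by text content avoids depending on an absolute path copied from the browser, which can break when the surrounding markup changes.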

Then you need to convert the document back to a string.
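One common way to do this last step is the identity `Transformer` from `javax.xml.transform`. The sketch below shows the idea; the class `DocumentSerializer` and its method names are made up for illustration, and the exact whitespace of the output can vary between transformer implementations.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DocumentSerializer {
    /** Parses a well-formed XML string into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Serializes a DOM Document to a string without the XML declaration. */
    static String serialize(Document doc) {
        try {
            // A Transformer created without a stylesheet copies input to output
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<tr><td>Geachte heer/mevrouw,</td></tr>");
        System.out.println(serialize(doc));
    }
}
```

Omitting the XML declaration keeps the result usable as an HTML fragment rather than a standalone XML document.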
