java - Neko parser is stripping tags when parsing an HTML string

Question

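I have a string that I want to convert to a DocumentFragment. The problem is that the children of <ul><li>...</li></ul> are stripped out completely. I don't know why this happens.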

Do I need to add or update any configuration?

Input

<div class="faq-content-area">
<p>You can receive Preferred Rewards benefits on your existing accounts, but you'll need:</p>
<ul>
<li>A <a target="_self" href="/deposits/savings/rewards-money-market-savings-account.go" id="rmms-prtfaq" name="">Rewards Money Market Savings account</a> to receive the money market savings interest rate booster</li>
<li>An eligibile <a target="_self" href="/credit-cards/overview.go" id="creditcard-prtfaq" name="">Bank of America credit card</a>, such as BankAmericard Cash Rewards&trade; or BankAmericard Travel Rewards<sup>&reg;</sup>, to receive the credit card rewards bonus</li>
</ul>
<p>After you enroll in Preferred Rewards, you can talk to a specialist to convert your existing money market savings account to a Rewards Money Market Savings account or to open a new credit card account that’s eligible for the rewards bonus.</p>
<p>If you already have a Rewards Money Market Savings account or an eligible credit card, you’ll automatically receive Preferred Rewards benefits after you enroll.</p>
</div>

The output is as follows

<DIV class="faq-content-area hide">
<P>You can receive Preferred Rewards benefits on your existing accounts, but you'll need:</P>    
<UL>
</UL>
</DIV>

Everything after the opening <UL> tag is missing from the output, and I don't know why.

Java program

InputStream is = null;
    try {
      is = ClassLoader.getSystemResourceAsStream("test.txt");
      InputSource iss = new InputSource(is);
      DocumentFragment documentFragment = qaParser.parse(iss);
      // Serialize once and reuse the result for both the console and the file,
      // instead of parsing a second, undefined "content" variable.
      String html = qaParser.serialize(documentFragment);
      System.out.println(html);
      try {
        Path path = Paths.get("./qaAnswers.txt");
        Files.write(path, html.getBytes(StandardCharsets.UTF_8));
      } catch (IOException e) {
        e.printStackTrace();
      }
    } finally {
      // The unused BufferedReader was removed; only the stream needs closing.
      if (is != null) {
        is.close();
      }
    }
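
Since the goal is to parse a string rather than a classpath resource, the InputSource could also be built directly from the string. The following is only a minimal sketch: the class name HtmlInputs and the method fromString are hypothetical, and it assumes the parser reads the character stream of an InputSource, as SAX-style consumers normally do.

```java
import java.io.StringReader;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original program.
final class HtmlInputs {

  /** Wraps an in-memory HTML string in an InputSource backed by a character stream. */
  static InputSource fromString(String html) {
    // A character stream avoids any byte-to-char decoding, so the
    // default-encoding setting should not matter for this input.
    return new InputSource(new StringReader(html));
  }
}
```

With this helper, the call in the program above would become qaParser.parse(HtmlInputs.fromString(html)), where html is the string to convert.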

Creating the DocumentFragment object:

DocumentFragment parse(InputSource input) throws Exception {
    DOMFragmentParser parser = new DOMFragmentParser();
    try {
      parser.setFeature("http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe",
              true);
      parser.setFeature("http://cyberneko.org/html/features/augmentations",
              true);
      parser.setProperty("http://cyberneko.org/html/properties/default-encoding",
              defaultCharEncoding);
      parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset",
              true);
      parser.setFeature("http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
              false);
      parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment",
              true);
      parser.setFeature("http://cyberneko.org/html/features/report-errors",
              LOG.isTraceEnabled());
    } catch (SAXException e) {
      // Log configuration failures instead of swallowing them silently.
      LOG.warn("Could not configure the parser: ", e);
    }
    // convert Document to DocumentFragment
    HTMLDocumentImpl doc = new HTMLDocumentImpl();
    doc.setErrorChecking(false);
    DocumentFragment res = doc.createDocumentFragment();
    DocumentFragment frag = doc.createDocumentFragment();
    parser.parse(input, frag);
    res.appendChild(frag);

    try {
      // Keep parsing the same input until a pass yields no more nodes.
      while (true) {
        frag = doc.createDocumentFragment();
        parser.parse(input, frag);
        if (!frag.hasChildNodes()) break;
        if (LOG.isInfoEnabled()) {
          LOG.info(" - new frag, " + frag.getChildNodes().getLength() + " nodes.");
        }
        res.appendChild(frag);
      }
    } catch (Exception e) {
      LOG.error("Error: ", e);
    }
    return res;
  }
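
To narrow down whether the <li> elements are lost during parsing or only during serialization, one option is to count them in the returned fragment before it is serialized. The following is a sketch using only the JDK DOM API; the class FragmentInspector and both of its methods are hypothetical names, and sampleFragment only builds a stand-in tree for trying the counter without NekoHTML.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical diagnostic helper, not part of the original program.
final class FragmentInspector {

  /** Counts element nodes under root (root included) whose name matches tagName, ignoring case. */
  static int countElements(Node root, String tagName) {
    int count = 0;
    if (root.getNodeType() == Node.ELEMENT_NODE
        && root.getNodeName().equalsIgnoreCase(tagName)) {
      count++;
    }
    // Recurse into every child; fragments and elements are walked the same way.
    for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
      count += countElements(child, tagName);
    }
    return count;
  }

  /** Builds a stand-in fragment (one UL holding two LI elements) for trying the counter. */
  static DocumentFragment sampleFragment() {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      DocumentFragment frag = doc.createDocumentFragment();
      Element ul = doc.createElement("UL");
      for (String text : new String[] {"first", "second"}) {
        Element li = doc.createElement("LI");
        li.setTextContent(text);
        ul.appendChild(li);
      }
      frag.appendChild(ul);
      return frag;
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

In the program above, this would be called as FragmentInspector.countElements(documentFragment, "li") right after qaParser.parse(iss). A count of 0 for the sample input would point at the parsing step, while a count of 2 would point at serialization.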

Serialization function

// Custom method to serialize HTML.
String serialize(Node node) {
  try {
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer();
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    transformer.setOutputProperty(OutputKeys.METHOD, "html");
    StringWriter sw = new StringWriter();
    transformer.transform(new DOMSource(node), new StreamResult(sw));
    return sw.toString();
  } catch (Exception e) {
    e.printStackTrace();
    return null;
  }
}
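
The serializer can also be exercised on its own, without NekoHTML, by feeding the same Transformer settings a fragment built by hand. The following is a sketch under that assumption; the class SerializerCheck and the method serializeSample are hypothetical, and the tree is created with the JDK's default DOM implementation rather than HTMLDocumentImpl, so it only shows how the identity Transformer treats list items in general.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;

// Hypothetical stand-alone check, not part of the original program.
final class SerializerCheck {

  /** Serializes a hand-built ul/li fragment with the same output properties as serialize(). */
  static String serializeSample() {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      DocumentFragment frag = doc.createDocumentFragment();
      Element ul = doc.createElement("ul");
      Element li = doc.createElement("li");
      li.setTextContent("item");
      ul.appendChild(li);
      frag.appendChild(ul);

      // Identity transform configured exactly like the question's serializer.
      Transformer transformer = TransformerFactory.newInstance().newTransformer();
      transformer.setOutputProperty(OutputKeys.INDENT, "yes");
      transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      transformer.setOutputProperty(OutputKeys.METHOD, "html");
      StringWriter sw = new StringWriter();
      transformer.transform(new DOMSource(frag), new StreamResult(sw));
      return sw.toString();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}
```

If the string returned by SerializerCheck.serializeSample() still contains the li element, the serialization settings alone are unlikely to explain the missing list items, which would leave the parser configuration or the parsed input as the more likely place to look.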
