java - HTML text extraction in J2ME

Question

I have a string taken from an HTML web page that looks like this:

String htmlString =

<span style="mso-bidi-font-family:Gautami;mso-bidi-theme-font:minor-bidi">President Pranab pay great 
tributes to Motilal Nehru on occasion of 
</span>
150th birth anniversary. Pranab said institutions evolved by 
leaders like him should be strengthened instead of being destroyed. 
<span style="mso-spacerun:yes">&nbsp;
</span>
He listed his achievements like his role in evolving of Public Accounts Committee and protecting independence of 
Legislature from the influence of the Executive by establishing a separate cadre for the Central Legislative Assembly,   
the first set of coins and postal stamps released at the function to commemorate the event.
</p>

I need to extract the text from the string above. After extraction, my output should look like this:

Output:

President Pranab pay great tributes to Motilal Nehru on occasion of 150th birth anniversary. Pranab said institutions evolved by leaders like him should be strengthened instead of being destroyed.  He listed his achievements like his role in evolving of Public Accounts Committee and protecting independence of Legislature from the influence of the Executive by establishing a separate cadre for the Central Legislative Assembly, now Parliament. Calling himself a student of history, he said Motilal's Swaraj Party acted as a disciplined assault force in the Legislative Assembly and he was credited with evolving the system of a Public Accounts Committee which is now one of the most effective watchdogs over executive in matters of money and finance. Mukherjee also received the first set of coins and postal stamps released at the function to commemorate the event.

To do this, I used the following logic:

int spanIndex = content.indexOf("<span");
spanIndex = content.indexOf(">", spanIndex);
int endspanndex = content.indexOf("</span>", spanIndex);
content = content.substring(spanIndex  + 1, endspanndex);

My resulting output is:

President Pranab pay great tributes to Motilal Nehru on occasion of

I have tried several HTML parsers, but they do not work on J2ME.

Can anyone help me get the full description text? Thanks.

score 2 · Accepted Answer

If you are using BlackBerry OS 5.0 or later you can use the BrowserField to parse HTML into a DOM document.

score 1 · Accepted Answer

Since J2ME does not support the usual HTML parsers, the text can be extracted by hand as follows:

private String removeHtmlTags(String content) {

        int beginTag;

        // Remove one tag per iteration until no '<' is left.
        while ((beginTag = content.indexOf("<")) != -1) {

            // Look for the closing '>' only after the opening '<'.
            int endTag = content.indexOf(">", beginTag);
            if (endTag == -1) {
                // Unterminated tag: stop instead of looping forever.
                break;
            }
            // substring(0, 0) is empty, so a tag at index 0 needs no special case.
            content = content.substring(0, beginTag)
                    + content.substring(endTag + 1);
        }
        return content;
    }
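
The loop above only removes tags, so the `&nbsp;` entity and the line breaks from the original string remain in the result, while the desired output is a single line. Below is a minimal sketch of a follow-up pass, assuming `&nbsp;` is the only entity that needs decoding. The names `TextCleanup`, `decodeNbsp` and `collapseWhitespace` are made up for this illustration, and the code sticks to `String` and `StringBuffer` methods that should be available on CLDC.

```java
class TextCleanup {

    // Replaces every occurrence of "&nbsp;" with a plain space.
    // Assumption: no other HTML entities occur in the input.
    public static String decodeNbsp(String s) {
        final String entity = "&nbsp;";
        StringBuffer out = new StringBuffer();
        int from = 0;
        int at;
        while ((at = s.indexOf(entity, from)) != -1) {
            out.append(s.substring(from, at)).append(' ');
            from = at + entity.length();
        }
        out.append(s.substring(from));
        return out.toString();
    }

    // Collapses each run of spaces, tabs and line breaks into one space
    // and drops leading and trailing whitespace.
    public static String collapseWhitespace(String s) {
        StringBuffer out = new StringBuffer();
        boolean pendingSpace = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                // Remember the gap, but never emit a leading space.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

It would be called on the tag-free string, for example `TextCleanup.collapseWhitespace(TextCleanup.decodeNbsp(removeHtmlTags(htmlString)))`.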

score 1 · Accepted Answer

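You can carry on the way you suggested and process the rest of the string in the same manner. Alternatively, a simple finite-state automaton would solve this. I have seen such a solution in the moJab project (its source code is available for download). Its mojab.xml package contains a minimalistic XML parser designed for J2ME, and I believe it would parse your example as well. Take a look at the source: it is just three simple classes, and it appears to be usable without modification.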
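
To illustrate the finite-state idea, here is a minimal sketch with two states: outside a tag and inside a tag. It is not the moJab parser, and the names `TagStripper` and `strip` are hypothetical. It assumes that `<` and `>` appear only as tag delimiters, so a `>` inside an attribute value, comments and script blocks are not handled.

```java
class TagStripper {

    // The two states of the automaton.
    private static final int TEXT = 0; // outside any tag: characters are kept
    private static final int TAG = 1;  // between '<' and '>': characters are dropped

    // Returns the input with everything between '<' and '>' removed.
    public static String strip(String html) {
        StringBuffer out = new StringBuffer(html.length());
        int state = TEXT;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (state == TEXT) {
                if (c == '<') {
                    state = TAG;       // a tag starts, stop copying
                } else {
                    out.append(c);     // ordinary text, copy it
                }
            } else if (c == '>') {
                state = TEXT;          // the tag ends, resume copying
            }
        }
        return out.toString();
    }
}
```

The string is scanned once and copied into a single `StringBuffer`, which avoids building a new `String` for every tag removed. An unterminated `<` simply discards the rest of the input.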

score 0 · Accepted Answer

JSoup is a very popular library for extracting text from HTML documents; the original answer linked to an example.
