java - How can I write a regex for this crazy thing?

Question

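I'm a beginner with regular expressions, so I'm having trouble.

Given the string below, how can I write a regex that matches only "69144"? Some surrounding text is fine too, as long as I can narrow it down.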

Citations</a></td><td class="cit-borderleft cit-data">69144</td><td class="cit-borderleft
cit data">22047</td></tr><tr class="cit-borderbottom"><td class="cit-caption"><a href="#"
class="cit-dark-link" onclick="return citToggleIndexDef('h_index_definition')" title='
h-index is the largest number h such that h publications have at least h citations. 
The second column has the &quot;recent&quot; version of this metric which is the largest 
number h such that h publications have at least h new citations in the last 5 years.
 '>h-index</a></td><td class="cit-borderleft cit-data">88</td>

I apologize that the string is so hard to read.

score 0 · Accepted Answer

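Assuming you are trying to extract the number in the first td cell, searching for the start and end of the tag and pulling out the content with substring is easier than using a regex.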

// text contains the HTML from your question

int tdIndex = text.indexOf("<td");
int endTdIndex = text.indexOf(">", tdIndex + 1);
int endTdTagIndex = text.indexOf("</td>", endTdIndex + 1);

// take everything between the '>' that closes the opening tag and the "</td>" that follows it
String numString = text.substring(endTdIndex + 1, endTdTagIndex);

// numString now contains 69144
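For comparison, here is a rough sketch of the regex route the question asks about. It assumes the value always sits in a td whose class attribute is exactly "cit-borderleft cit-data" and consists only of digits; the class name `RegexTdExample`, the constant `SAMPLE`, and the method `firstCitData` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTdExample {
    // Shortened copy of the markup from the question (assumed representative).
    static final String SAMPLE =
        "Citations</a></td><td class=\"cit-borderleft cit-data\">69144</td>"
        + "<td class=\"cit-borderleft cit-data\">22047</td></tr>";

    // Group 1 captures the digits inside the first matching td cell.
    private static final Pattern CIT_DATA =
        Pattern.compile("<td class=\"cit-borderleft cit-data\">(\\d+)</td>");

    // Returns the first captured number, or null if nothing matches.
    static String firstCitData(String text) {
        Matcher m = CIT_DATA.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstCitData(SAMPLE));
    }
}
```

This only holds up while the markup keeps exactly this shape; an extra attribute or different whitespace inside the tag would make the pattern miss.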

If you need the content of a td cell deeper in the HTML, you can search for later td tags by running the following in a loop:

tdIndex = text.indexOf("<td",tdIndex+1);

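You have to know which td tag you want (for example, "the third td") and that it is always preceded by the same number of td tags. Given those two assumptions, this code should work for you with minimal changes.

If you can't make assumptions about the format of the markup, then I second Reimeus's answer that an HTML parser can prove very useful.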
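Putting the loop and the substring step together, a minimal sketch might look like the following. The class name `TdExtractor`, the method `nthTdContent`, and the 1-based parameter `n` are made up for illustration, and the code assumes every opening td tag is eventually followed by a closing `</td>`.

```java
public class TdExtractor {
    // Shortened copy of the markup from the question (assumed representative).
    static final String SAMPLE =
        "Citations</a></td><td class=\"cit-borderleft cit-data\">69144</td>"
        + "<td class=\"cit-borderleft cit-data\">22047</td></tr>";

    // Returns the text inside the n-th (1-based) td cell, or null if there is no such cell.
    static String nthTdContent(String text, int n) {
        int tdIndex = -1;
        // Advance to the n-th occurrence of "<td".
        for (int i = 0; i < n; i++) {
            tdIndex = text.indexOf("<td", tdIndex + 1);
            if (tdIndex < 0) {
                return null;
            }
        }
        // End of the opening tag, then start of the closing tag.
        int endTdIndex = text.indexOf('>', tdIndex + 1);
        if (endTdIndex < 0) {
            return null;
        }
        int endTdTagIndex = text.indexOf("</td>", endTdIndex + 1);
        if (endTdTagIndex < 0) {
            return null;
        }
        return text.substring(endTdIndex + 1, endTdTagIndex);
    }

    public static void main(String[] args) {
        System.out.println(nthTdContent(SAMPLE, 1));
        System.out.println(nthTdContent(SAMPLE, 2));
    }
}
```

Returning null when the cell is missing avoids the StringIndexOutOfBoundsException that substring would otherwise throw on a -1 index.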

score 0 · Accepted Answer

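One way to parse HTML is to use XPath, which ships with Java. XPath walks the "tree" of an XML/HTML document and retrieves the values of nodes (the content inside tags). It is easy to use and easy to learn, and there is no extra library to download. For more on this topic, see the New Think Tank XPath tutorial.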
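As a rough illustration, the sketch below uses the standard `javax.xml.xpath` package. It assumes the markup has first been turned into well-formed XML, because the JDK's XML parser rejects a fragment like the one in the question, which starts in the middle of a cell. The class name `XPathTdExample`, the constant `SAMPLE_ROW`, and the method `firstCitData` are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathTdExample {
    // A cleaned-up, well-formed version of the table row (assumption).
    static final String SAMPLE_ROW =
        "<tr><td class=\"cit-caption\">Citations</td>"
        + "<td class=\"cit-borderleft cit-data\">69144</td>"
        + "<td class=\"cit-borderleft cit-data\">22047</td></tr>";

    // Returns the text of the first td whose class attribute is exactly "cit-borderleft cit-data".
    static String firstCitData(String wellFormedXml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wellFormedXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) returns the string value of the first selected node.
            return xpath.evaluate("(//td[@class='cit-borderleft cit-data'])[1]", doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstCitData(SAMPLE_ROW));
    }
}
```

Real-world HTML is often not well-formed XML, so in practice the page would likely need to go through a tolerant HTML parser before an XPath query like this can run.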
