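I am creating an HTML parser that fetches the HTML from a given URL, finds the navigation menu HTML, and puts it into a String. The URLs in the HTML copied into the String need the missing part of the URL added (the "www.stackoverflow.com" part). How can I find the existing URLs in the String and add the missing part to them so that they work?

The URLs in the String have the following form: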

<a href="/questions/11744851.cfm">

I need to turn them into the following form:

<a href="www.stackoverflow.com/questions/11744851.cfm">

2 Answers

If the XHTML is valid XML, the easiest way is to parse it as XML and use XPath (for example /html/body/div/a/@href, where /html/body/div is the path to the menu section in the HTML). There is also a project called HTMLParser (http://htmlparser.sourceforge.net/) that you may want to try; according to its page it supports 'link extraction, for crawling through web pages or harvesting email addresses', but I've never used it, so I can't help much there. If, on the other hand, the HTML is anything but valid, you may want to use TagSoup (http://ccil.org/~cowan/XML/tagsoup/). It might work or it might not; on the websites we've tried, it did pretty well.

Edit: adding the missing part can be done with simple string concatenation once you have found the relevant href values.
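
The XPath-plus-concatenation idea might look roughly like the sketch below. It uses only the JDK's built-in XML and XPath classes. The class name MenuLinkPrefixer, the method prefixLinks, and the sample markup are made up for illustration, and it assumes the input is well-formed XHTML with no namespace declaration and no DOCTYPE to resolve.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class MenuLinkPrefixer {

    // Selects href attributes with the given XPath, prepends `base` to every
    // value that starts with "/", and returns the resulting href values.
    static List<String> prefixLinks(String xhtml, String hrefXPath, String base) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs = (NodeList) xpath.evaluate(hrefXPath, doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                Attr href = (Attr) hrefs.item(i);
                if (href.getValue().startsWith("/")) {
                    // Plain concatenation, as suggested in the edit above.
                    href.setValue(base + href.getValue());
                }
                result.add(href.getValue());
            }
            return result;
        } catch (Exception e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String menu = "<html><body><div>"
                + "<a href=\"/questions/11744851.cfm\">Question</a>"
                + "</div></body></html>";
        System.out.println(prefixLinks(menu, "/html/body/div/a/@href", "www.stackoverflow.com"));
    }
}
```

Links that are already absolute are left alone because they do not start with "/". For real-world pages that are not well-formed, this parser will throw, which is where something like TagSoup would come in.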

answered 2012-07-31T16:43:32.973

Try using this regular expression with the replaceAll() method:

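// Match only hrefs that start with "/" and stop at the closing quote, so several links on one line are handled separately
str = subString.replaceAll("<a href=\"(/[^\"]*)\">", "<a href=\"http://www.stackoverflow.com$1\">");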
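
A slightly more defensive variant of this replacement, as a sketch: it only rewrites root-relative links (href values starting with a single "/"), so links that are already absolute, or protocol-relative ("//host/..."), are left untouched. The class name HrefRewriter and the method addHost are hypothetical names for illustration, and the pattern assumes double-quoted href attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class HrefRewriter {

    // href="/..." but not href="//..."; [^"]* stops at the closing quote,
    // so each link is matched on its own even when many share one line.
    private static final Pattern ROOT_RELATIVE =
            Pattern.compile("href=\"/(?!/)([^\"]*)\"");

    static String addHost(String html, String host) {
        Matcher m = ROOT_RELATIVE.matcher(html);
        // quoteReplacement guards against '$' or '\' appearing in the host string.
        return m.replaceAll("href=\"" + Matcher.quoteReplacement(host) + "/$1\"");
    }

    public static void main(String[] args) {
        String menu = "<a href=\"/questions/11744851.cfm\"><a href=\"http://example.com/a\">";
        System.out.println(addHost(menu, "http://www.stackoverflow.com"));
    }
}
```

Regex-based rewriting is fine for a small, predictable menu snippet like this one; for arbitrary pages an HTML parser is usually the safer choice.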
answered 2012-07-31T17:15:20.450