java - Correcting parsed URLs in Java

Question

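I am building an HTML parser that fetches the HTML from a given URL, finds the navigation menu markup, and stores it in a String. The URLs in the copied HTML are relative, so the host part ("www.stackoverflow.com") has to be added to them. How can I find the existing URLs in the String and prepend the missing part so that the links work?

The URLs in the String have the following form: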

<a href="/questions/11744851.cfm">

I need to turn them into the following form:

<a href="www.stackoverflow.com/questions/11744851.cfm">

score 1 · Accepted Answer

If the XHTML is well-formed XML, the easiest way is to parse it as XML and use XPath (for example /body/div/a/@href, where /body/div is the path to the menu section in the HTML). There is also a project called HTMLParser (http://htmlparser.sourceforge.net/) that you may want to try. According to its page, it supports 'link extraction, for crawling through web pages or harvesting email addresses', but I've never used it, so I can't help much. If, on the other hand, the HTML is anything but valid, you may want to use TagSoup (http://ccil.org/~cowan/XML/tagsoup/). It might work or it might not, but on the websites we've tried it did pretty well.

Edit: once the relevant href values have been found, the missing part can be added with simple string concatenation.
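
As a rough illustration of the XPath route, here is a minimal sketch using only the JDK's built-in XML and XPath classes. It assumes the menu fragment is well-formed XML with body as the root element, as in the example path above. The class name MenuLinks, the helper absoluteHrefs, and the base value "http://www.stackoverflow.com" are made up for this example and do not come from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MenuLinks {
    // Hypothetical helper: collects every href under /body/div and prefixes
    // the site-relative ones (those starting with "/") with the given base.
    static List<String> absoluteHrefs(String xhtml, String base) {
        try {
            // Parse the fragment as XML; this fails if it is not well-formed.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Select the href attribute nodes of the menu links.
            NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/body/div/a/@href", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                String href = attrs.item(i).getNodeValue();
                // Simple concatenation, as suggested in the edit above.
                result.add(href.startsWith("/") ? base + href : href);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<body><div><a href=\"/questions/11744851.cfm\">Q</a></div></body>";
        System.out.println(absoluteHrefs(xhtml, "http://www.stackoverflow.com"));
    }
}
```

Real pages are rarely this clean, so in practice the input would probably need to go through something like TagSoup first. The XPath expression would also have to match the actual location of the menu.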

score 1 · Accepted Answer

Try using this regular expression with the replaceAll() method:

str = subString.replaceAll("<a href=\"/([^\"]*)\">", "<a href=\"http://www.stackoverflow.com/$1\">");
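
A slightly fuller sketch of the regular-expression approach, written as a runnable class. The names HrefFixer and addHost and the "http://" scheme are assumptions made for illustration. The pattern uses [^"]* rather than .* so that each match stops at the closing quote, which matters when several links sit on one line. It only touches hrefs that begin with a single "/", so absolute and protocol-relative URLs are left alone.

```java
import java.util.regex.Matcher;

class HrefFixer {
    // Hypothetical helper: prefixes every site-relative href ("/...") with base.
    static String addHost(String html, String base) {
        // quoteReplacement guards against "$" or "\" appearing in base.
        String replacement = "href=\"" + Matcher.quoteReplacement(base) + "/$1\"";
        // "/(?!/)" matches a leading slash that is not followed by a second one;
        // the group captures the rest of the path up to the closing quote.
        return html.replaceAll("href=\"/(?!/)([^\"]*)\"", replacement);
    }

    public static void main(String[] args) {
        String menu = "<a href=\"/questions/11744851.cfm\">A</a> <a href=\"/tags\">B</a>";
        System.out.println(addHost(menu, "http://www.stackoverflow.com"));
    }
}
```

This only handles double-quoted href attributes written exactly as href="...". Markup with single quotes or extra whitespace would need a looser pattern or a real parser.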
