java - Java regex: href without a hash

Question

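I'm trying to build a sitemap by parsing the HTML body for hrefs that do not contain a # (the ones with a hash are just links to subsections inside a content page's HTML).

My current regex:

<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>

I think I should use [^#] or !# to exclude # from the hrefs, but I couldn't work it out by trial and error or by Googling. Thanks in advance for your help!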

score 1 · Accepted Answer

Solved it. I just added # to the [^\"] character class, so the captured href value can no longer contain a hash. :D

<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>
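
As a quick check of what this pattern accepts, here is a minimal sketch using java.util.regex. The class name HashlessHrefs, the extract helper, and the sample HTML are made up for illustration; only the pattern itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern is taken from the answer above.
class HashlessHrefs {
    // [^\"#]* means the href value may contain neither a quote nor a '#',
    // so any link with a hash anywhere in its href fails to match.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>");

    // Returns the href values (capture group 1) of all matching links.
    static List<String> extract(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        String html = "<a href=\"/about.html\">About</a> "
                + "<a href=\"#intro\">Intro</a> "
                + "<a href=\"/docs.html#api\">API</a>";
        System.out.println(extract(html)); // prints [/about.html]
    }
}
```

Note that this drops every link whose href contains a hash, including /docs.html#api, not only the pure in-page links such as #intro.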

score 1 · Accepted Answer

You should not use regular expressions to parse HTML.

It is better to use an HTML parser such as jsoup (http://jsoup.org), and then:

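import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.parse(input);
Elements links = doc.select("a[href]");  // only <a> elements that have an href

for (Element each : links) {
    // skip in-page links such as "#section"
    if (each.attr("href").startsWith("#")) continue;
    ...
}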

Much less of a hassle than using a regex, eh!
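
One thing to watch: the startsWith("#") check only skips pure in-page links, so an href like /docs.html#api still gets through. If the sitemap should list each page once, one option is to cut the fragment off and de-duplicate. The sketch below uses only the standard library and assumes the hrefs have already been collected as strings (for example from each.attr("href")); the names FragmentFilter, stripFragment, and pages are hypothetical and not part of jsoup.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of jsoup or of the original answer.
class FragmentFilter {
    // Removes everything from the first '#' onward; "#intro" becomes "".
    static String stripFragment(String href) {
        int hash = href.indexOf('#');
        return hash < 0 ? href : href.substring(0, hash);
    }

    // Keeps one entry per page, in first-seen order, and drops
    // links that pointed only at a fragment of the current page.
    static Set<String> pages(List<String> hrefs) {
        Set<String> pages = new LinkedHashSet<>();
        for (String href : hrefs) {
            String page = stripFragment(href.trim());
            if (!page.isEmpty()) {
                pages.add(page);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Sample hrefs invented for this sketch.
        List<String> hrefs = List.of("#intro", "/docs.html#api", "/docs.html", "/about.html");
        System.out.println(pages(hrefs)); // prints [/docs.html, /about.html]
    }
}
```

Whether /docs.html#api should count as /docs.html or be dropped entirely depends on the sitemap's requirements; the regex answer above drops it, while this sketch keeps the page.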
