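I'm retrieving the body column of my website content from my database.

For some reason, some hrefs come back without their leading slash (it may be getting escaped), so href="/my-page" comes back as href="my-page".

I need to know how to alter the body column to find each href="<some value>" and add a leading slash to it, but only if the value doesn't already start with a slash and doesn't already have http: or www. in front of it.

Any ideas on how to parse through the HTML to do this?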

2 Answers

You may try this for some rough processing:

  1. Use href="([^"]+)" to find every link that actually points to some resource.
  2. Iterate over each found resource (group 1 of each match) and check whether it starts with "/", "http://" or "www.". If it doesn't, add a leading / and replace the full match (group 0) in the original text with the modified value.
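
The two steps above can be sketched in C# with a MatchEvaluator callback. This is only a rough illustration: the class and method names (HrefFixer, AddLeadingSlash) are made up for the example, and the extra check for "https://" is an assumption that goes beyond the prefixes listed above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefFixer
{
    // Step 1: match href="..." and capture the attribute value in group 1.
    private static readonly Regex HrefPattern = new Regex("href=\"([^\"]+)\"");

    // Hypothetical helper: returns the HTML with a leading slash added
    // to every double-quoted href value that lacks a known prefix.
    public static string AddLeadingSlash(string html)
    {
        // Step 2: inspect each captured value and rewrite the full match if needed.
        return HrefPattern.Replace(html, match =>
        {
            string target = match.Groups[1].Value;
            bool hasKnownPrefix =
                target.StartsWith("/", StringComparison.Ordinal) ||
                target.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                target.StartsWith("https://", StringComparison.OrdinalIgnoreCase) || // assumption: not in the original list
                target.StartsWith("www.", StringComparison.OrdinalIgnoreCase);

            // Leave prefixed links untouched; otherwise rebuild the attribute with a slash.
            return hasKnownPrefix ? match.Value : "href=\"/" + target + "\"";
        });
    }
}
```

Calling HrefFixer.AddLeadingSlash on a body that contains href="my-page" should give back href="/my-page", while href="/my-page" and href="http://example.com" stay as they are. Single-quoted attributes are not matched by this pattern.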

answered 2013-07-14T21:13:25.993

This may be something better addressed in your link retrieval, but I think this should do what you're after:

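yourString = Regex.Replace(yourString, @"(href="")(?!/|www|http)", "$1/");

It will match and capture any href=" that is not followed by /, www, or http, and then insert a / after the captured group. It may well be a tad flaky with more complex strings: single-quoted attributes are not matched at all, and values such as #anchor or mailto: links would also get a slash.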
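
As a rough sketch of the lookahead approach, the call can be wrapped in a small helper. The class and method names (HrefLookahead, Fix) are hypothetical, and the sketch assumes "/" is excluded in the lookahead as well, so that links which already start with a slash are left alone.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefLookahead
{
    // Hypothetical helper: inserts "/" right after href=" unless the value
    // already starts with "/", "www" or "http" (negative lookahead).
    public static string Fix(string html)
    {
        // Group 1 captures href=" and "$1/" writes it back followed by a slash.
        return Regex.Replace(html, @"(href="")(?!/|www|http)", "$1/");
    }
}
```

Passing a string containing href="my-page" to HrefLookahead.Fix should yield href="/my-page", and values beginning with /, www or http are returned unchanged. Regex.Replace returns a new string, so the result has to be assigned or written back to wherever the body is stored.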

answered 2013-07-14T21:15:09.877