0

I want to parse the second div out of the following HTML:

<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>

That is, this value: <div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'>

The id can contain any number.

This is what I am trying:

Regex rgx = new Regex(@"'post-body-\d*'");
var res = rgx.Replace("<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>", "");

I expect the result <div kubedfiuabefiudsabiubfg><div kubedfiuabefiudsabiubfg>, but that is not what I get.

4

3 Answers

1

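If you are 100% sure that the text before and after the number will always be the same, you can use the String class's .IndexOf and .Substring methods to split the string into pieces.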

string original = @"<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

// IndexOf returns the position in the string where the piece we are looking for starts
int startIndex = original.IndexOf(@"<div class='post-body entry-content' id='post-body-");
// For the endIndex, add the number of characters in the string that you are looking for
int endIndex = original.IndexOf(@"' itemprop='articleBody'>") + 25;

// this substring will retrieve just the inner part that you are looking for
string newString = original.Substring(startIndex, endIndex - startIndex);

// newString should now equal "<div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'>"


// or, if you want to just remove the inner part, build a different string like this:
// First, get everything leading up to the startIndex
string divString = original.Substring(0, startIndex);
// then, add everything after the endIndex
divString += original.Substring(endIndex);

// divString should now equal "<div kubedfiuabefiudsabiubfg><div kubedfiuabefiudsabiubfg>"
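
The snippet above assumes both markers are always present; IndexOf returns -1 when they are not, and Substring then throws. A minimal sketch of a guarded variant follows. The class name TagSlicer and its methods are hypothetical names chosen for illustration, and the end marker's length is taken from the string itself instead of the hard-coded 25.

```csharp
using System;

public static class TagSlicer
{
    // Hypothetical helper: locates the first span that begins with startMarker
    // and ends with endMarker. Returns false if either marker is missing.
    public static bool TryFindSpan(string text, string startMarker, string endMarker,
                                   out int start, out int length)
    {
        length = 0;
        start = text.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return false;

        // Look for the end marker only after the start marker.
        int endMarkerAt = text.IndexOf(endMarker, start + startMarker.Length, StringComparison.Ordinal);
        if (endMarkerAt < 0) return false;

        length = endMarkerAt + endMarker.Length - start;
        return true;
    }

    // Returns the span including both markers, or null when it is not found.
    public static string Extract(string text, string startMarker, string endMarker)
    {
        int start, length;
        return TryFindSpan(text, startMarker, endMarker, out start, out length)
            ? text.Substring(start, length)
            : null;
    }

    // Returns the text with the span removed, or the text unchanged when it is not found.
    public static string Remove(string text, string startMarker, string endMarker)
    {
        int start, length;
        return TryFindSpan(text, startMarker, endMarker, out start, out length)
            ? text.Remove(start, length)
            : text;
    }
}
```

With the markers from the answer, TagSlicer.Extract(original, "<div class='post-body entry-content' id='post-body-", "' itemprop='articleBody'>") returns the inner tag, and TagSlicer.Remove with the same arguments returns the two outer divs.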

Hope this helps...

answered 2012-08-03T14:11:44.230
1

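The reason you are not getting the expected result is that your regex only searches for 'post-body-\d*', not the rest of the div tag. Also, Regex.Replace replaces the text you are searching for rather than returning it, so you end up with everything except the matched text.

Try replacing your regex string with @"<div class='post-body entry-content' id='post-body-\d*' itemprop='articleBody'>" and using Regex.Matches (or Regex.Match if you only care about the first occurrence), then process the matches.

For example: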

string htmlText = @"<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

Regex rgx = new Regex(@"<div class='post-body entry-content' id='post-body-\d*' itemprop='articleBody'>");
foreach (Match match in rgx.Matches(htmlText))
{
    // Process matches
    Console.WriteLine(match.ToString());
}
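
Since the question expects the outer divs with the inner tag stripped out, the same full-tag pattern can also be passed to Regex.Replace. The following is a minimal sketch under that assumption; the class name PostBodyTag and its methods are hypothetical, and \d+ is used instead of \d* on the assumption that the id always contains at least one digit.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PostBodyTag
{
    // Matches the whole opening tag, not just the quoted id value.
    private static readonly Regex Pattern = new Regex(
        @"<div class='post-body entry-content' id='post-body-\d+' itemprop='articleBody'>");

    // Returns the first matching tag, or null when there is none.
    public static string Find(string html)
    {
        Match match = Pattern.Match(html);
        return match.Success ? match.Value : null;
    }

    // Returns the input with every matching tag removed.
    public static string Strip(string html)
    {
        return Pattern.Replace(html, "");
    }
}
```

Here PostBodyTag.Find(htmlText) gives the tag itself, while PostBodyTag.Strip(htmlText) gives the string the question was expecting.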
answered 2012-08-03T14:15:10.957
0

You could parse your HTML fragment as an XML fragment and extract the id attribute directly, e.g.

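// Requires: using System.Xml.Linq;
// Note: XElement.Parse throws XmlException unless the fragment is well-formed XML (closed tags, attributes with values).
var html = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";
var data = XElement.Parse(html).Element("div").Attribute("id");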
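
The XML approach only works on well-formed input, which the fragment in the question is not (the divs are never closed and kubedfiuabefiudsabiubfg has no value). A minimal sketch, assuming the markup has already been cleaned up into well-formed XML; the class name XmlIdReader and the sample input in the usage note are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class XmlIdReader
{
    // Returns the id of the first child <div> of the root element,
    // or null if the child or the attribute is absent.
    // Throws XmlException if the input is not well-formed XML.
    public static string ReadInnerDivId(string wellFormedXml)
    {
        XElement root = XElement.Parse(wellFormedXml);
        XElement inner = root.Element("div");
        XAttribute id = inner == null ? null : inner.Attribute("id");
        return id == null ? null : id.Value;
    }
}
```

For example, XmlIdReader.ReadInnerDivId("<div><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'></div></div>") returns the id value of the inner div.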
answered 2012-08-03T14:03:20.080