-6

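I am trying to scrape specific HTML tags, including their data, from a Google product page. I want to get all the <li> tags in this ordered list and put them in a list.

Here is the code: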

   <td valign="top">
        <div id="center_col">
          <div id="res">
            <div id="ires">
              <ol>
                   <li class="g">
                  <div class="pslires">
                    <div class="psliimg">
                      <a href=
                      "https://www.google.com">
                     </a>
                    </div>

                    <div class="psliprice">
                      <div>
                        <b>$59.99</b> used
                      </div><cite>google auctions</cite>
                    </div>

                    <div class="pslimain">
                      <h3 class="r"><a href=
                      "https://www.google.com">
                      google</a></h3>

                      <div>
                 dummy data     </div>
                    </div>
                  </div>
                </li>

                 <li class="g">
                  <div class="pslires">
                    <div class="psliimg">
                      <a href=
                      "https://www.google.com">
                     </a>
                    </div>

                    <div class="psliprice">
                      <div>
                        <b>$59.99</b> used
                      </div><cite>google auctions</cite>
                    </div>

                    <div class="pslimain">
                      <h3 class="r"><a href=
                      "https://www.google.com">
                      google</a></h3>

                      <div>
                 dummy data     </div>
                    </div>
                  </div>
                </li>

              <li class="g">
                  <div class="pslires">
                    <div class="psliimg">
                      <a href=
                      "https://www.google.com">
                     </a>
                    </div>

                    <div class="psliprice">
                      <div>
                        <b>$59.99</b> used
                      </div><cite>google auctions</cite>
                    </div>

                    <div class="pslimain">
                      <h3 class="r"><a href=
                      "https://www.google.com">
                      google</a></h3>

                      <div>
                 dummy data     </div>
                    </div>
                  </div>
                </li>
                <li class="g">
                  <div class="pslires">
                    <div class="psliimg">
                      <a href=
                      "https://www.google.com">
                     </a>
                    </div>

                    <div class="psliprice">
                      <div>
                        <b>$59.99</b> used
                      </div><cite>google auctions</cite>
                    </div>

                    <div class="pslimain">
                      <h3 class="r"><a href=
                      "https://www.google.com">
                      google</a></h3>

                      <div>
                 dummy data     </div>
                    </div>
                  </div>
                </li>
              </ol>
            </div>
          </div>
        </div>

        <div id="foot">
          <p class="flc" id="bfl" style="margin:19px 0 0;text-align:center"><a href=
          "/support/websearch/bin/answer.py?answer=134479&amp;hl=en">Search Help</a>
          <a href=
          "/quality_form?q=Pioneer+Automotive+PF-555-2000&amp;hl=en&amp;tbm=shop">Give us
          feedback</a></p>

          <div class="flc" id="fll" style="margin:19px auto 19px auto;text-align:center">
            <a href="/">Google&nbsp;Home</a> <a href=
            "/intl/en/ads">Advertising&nbsp;Programs</a> <a href="/services">Business
            Solutions</a> <a href="/intl/en/policies/">Privacy &amp; Terms</a> <a href=
            "/intl/en/about.html">About Google</a>
          </div>
        </div>
      </td>

I want to get all the tags and data inside each <li class="g"> tag. Is that possible?


3 Answers

2

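Instead of a regular expression, something like an XML parser might be more useful for your situation. Load the markup into an XML document and then use something like SelectNodes to get the data you are looking for.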

http://msdn.microsoft.com/en-us/library/4bektfx9.aspx
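
As a rough illustration of the suggestion above, the following sketch loads markup into an XmlDocument and queries it with SelectNodes. The sample string is a hypothetical, trimmed fragment of the question's markup, and it assumes the input is already well-formed XML; a real page containing entities such as &nbsp; or unclosed tags would likely need cleaning first.

```csharp
using System;
using System.Xml;

// Hypothetical sample: a trimmed, well-formed fragment of the markup in the question.
string xhtml = @"<ol>
  <li class=""g""><b>$59.99</b> used</li>
  <li class=""g""><b>$59.99</b> used</li>
</ol>";

var doc = new XmlDocument();
doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed XML

// XPath: every <li> whose class attribute is exactly "g".
XmlNodeList items = doc.SelectNodes("//li[@class='g']");

foreach (XmlNode item in items)
{
    // OuterXml includes the <li> element itself plus everything nested inside it.
    Console.WriteLine(item.OuterXml);
}
```
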

answered 2012-05-21T15:18:01.313
1

I wouldn't use regex for this particular problem.

Instead I would attack it thus:

1) Save off the page as an HTML string.
2) Use the aforementioned HtmlAgilityPack or HTML Tidy (my preference) to convert it to XML.
3) Use XDocument to navigate through the DOM by tag and save the data.

Trying to create a regex to extract data from a possibly fluid HTML page will break your heart.
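
Step 3 of the list above might look something like the sketch below. It assumes steps 1 and 2 have already produced well-formed XML; the literal string here is a hypothetical stand-in for that output, and the variable names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Assumption: steps 1 and 2 already produced well-formed XML; this literal stands in for it.
string xml = @"<ol>
  <li class=""g""><div class=""psliprice""><div><b>$59.99</b> used</div><cite>google auctions</cite></div></li>
  <li class=""g""><div class=""psliprice""><div><b>$59.99</b> used</div><cite>google auctions</cite></div></li>
</ol>";

XDocument doc = XDocument.Parse(xml);

// Step 3: collect every <li class="g"> element into a list.
List<XElement> items = doc.Descendants("li")
    .Where(li => (string)li.Attribute("class") == "g")
    .ToList();

foreach (XElement li in items)
{
    // Read the price and seller from each item; either is null if the element is missing.
    string price = li.Descendants("b").FirstOrDefault()?.Value;
    string seller = li.Descendants("cite").FirstOrDefault()?.Value;
    Console.WriteLine($"{price} from {seller}");
}
```

One caveat: if the converter emits the XHTML namespace on the root element, Descendants("li") will not match, and the element names would need to be qualified with an XNamespace.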

answered 2012-05-21T15:41:07.353
0

You can use HtmlAgilityPack to parse the HTML instead of using a regular expression.

using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(html); // html holds the page markup as a string
var listItems = doc.DocumentNode.SelectNodes("//li"); // null when nothing matches

The code above gives you all the <li> items in the document. To add them to a list, simply iterate over the collection and add each item to the list.
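
The iteration described above could be sketched roughly as follows. This assumes the HtmlAgilityPack NuGet package is installed; the sample markup is a hypothetical, shortened version of the question's HTML, and the XPath is narrowed to li elements with class "g" to match what the question asks for.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET

// Hypothetical sample standing in for the downloaded page.
string html = @"<ol>
  <li class=""g""><h3 class=""r""><a href=""https://www.google.com"">google</a></h3></li>
  <li class=""g""><h3 class=""r""><a href=""https://www.google.com"">google</a></h3></li>
</ol>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// SelectNodes returns null when nothing matches, so guard before iterating.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//li[@class='g']");

var listItems = new List<HtmlNode>();
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        listItems.Add(node);
    }
}

foreach (HtmlNode li in listItems)
{
    // OuterHtml holds the <li> with all nested tags; InnerText holds only the text.
    Console.WriteLine(li.OuterHtml);
    Console.WriteLine(li.InnerText.Trim());
}
```
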

answered 2012-05-21T15:35:10.383