My goal is to parse a block of HTML like the one below and obtain the text, comments and replies fields as separate parts of the block:

<div id='fooID' class='foo'>
<p>
    This is the top caption of picture's description</p>
<p>
    T=<img src="http://www.mysite.com/images/img23.jpg" alt="" width="64" height="108"/>       </p>
<p>
    And here is more text to describe the photo.</p> 
<div class=comments>(3 comments)</div>
<div id='reply13' class='replies'>
   <a href=javascript:getReply('13',1)>Show reply </a></div>
</div>

My problem is that Selenium's WebDriver does not seem to support attribute values that are not double-quoted (notice that the class attribute in the HTML is 'foo' as opposed to "foo"). In all the examples I have seen, both in the Selenium docs and in other SO posts, the double-quoted form is what WebDriver commonly expects.

Here is the relevant part of my Java code with my various (unsuccessful) attempts:

// Each line below is a separate attempt, tried one at a time.
java.util.List<WebElement> elementList = driver.findElements(By.xpath("//div[@class='foo']"));
java.util.List<WebElement> elementList = (List<WebElement>) ((JavascriptExecutor) driver).executeScript("return $('.foo')[0]");
java.util.List<WebElement> elementList = driver.findElements(By.xpath("//div[contains(@class, 'foo')]"));
// In the next attempt, foo_tag = "'foo'".replace("'", "\'");
java.util.List<WebElement> elementList = driver.findElements(By.cssSelector("div." + foo_tag));
java.util.List<WebElement> elementList = driver.findElements(By.cssSelector("'foo'"));

Is there a sure way of handling this? Or is there an alternative, better way of extracting the above fields? Other info:

  1. I'm an HTML noob, but have made efforts to understand the structure of the HTML code/tags
  2. Using Firefox (and, accordingly, FirefoxDriver)

Your help/suggestions greatly appreciated!


2 Answers


This is invalid HTML, so Selenium doesn't stand a chance. You should get it fixed.

You will have a better chance using HTMLAgilityPack:

http://htmlagilitypack.codeplex.com/

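It copes a little better with badly formed HTML (which this is).

Here is an SO post with a few different options, in a few different languages, for tools like HTMLAgilityPack. You should find one that suits you:

Options for HTML scraping?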
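
For a Java-only route, here is a minimal sketch of the "fix the markup, then parse it" idea. It assumes the block from the question has already been repaired into well-formed XML (the unquoted class=comments and href values quoted); that cleaned copy is my own assumption, not something taken from the real page. It uses the JDK's built-in XML parser and XPath rather than Selenium or HTMLAgilityPack, and the names FooBlockExtractor, CLEANED_MARKUP, parse and texts are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FooBlockExtractor {

    // Hypothetical cleaned-up copy of the block from the question.
    // Assumption: every attribute value is quoted so the XML parser accepts it.
    static final String CLEANED_MARKUP =
          "<div id='fooID' class='foo'>"
        + "<p>This is the top caption of picture's description</p>"
        + "<p>T=<img src=\"http://www.mysite.com/images/img23.jpg\" alt=\"\" width=\"64\" height=\"108\"/></p>"
        + "<p>And here is more text to describe the photo.</p>"
        + "<div class=\"comments\">(3 comments)</div>"
        + "<div id='reply13' class='replies'>"
        + "<a href=\"javascript:getReply('13',1)\">Show reply </a></div>"
        + "</div>";

    // Parses well-formed markup into a DOM document.
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Markup is not well-formed XML", e);
        }
    }

    // Returns the trimmed text content of every node matched by the XPath expression.
    static List<String> texts(Document doc, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(CLEANED_MARKUP);
        // The three fields the question asks for, each selected by class.
        System.out.println("text:     " + texts(doc, "//div[@class='foo']/p"));
        System.out.println("comments: " + texts(doc, "//div[@class='foo']/div[@class='comments']"));
        System.out.println("replies:  " + texts(doc, "//div[@class='foo']/div[@class='replies']"));
    }
}
```

In this sketch the parser hands XPath the attribute value without its quotes, so //div[@class='foo'] matches whether the source wrote class='foo' or class="foo". A real page would still need a cleanup step (or a lenient HTML parser) before this XML-based approach could be applied.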

Answered 2013-02-01T09:26:58.920

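The problem, as far as I know, is that the HTML specification doesn't recognize single quotation marks. So the problem isn't with Selenium WebDriver; it's in the HTML. Do you have any chance of editing the HTML code?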

Answered 2013-02-01T05:33:43.943