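I am scraping a website and getting back an HTML page.

The page contains tables whose rows have the form

(actor -> role)

For example:

(actor = Jason Priestley -> role = Brandon Walsh)

Sometimes a row is missing either the actor or the role

(a row with 1 column where 2 are expected).

Sample file: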

<div id="90210">
      <h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
      <table class="actors">
        <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
        <tr><td class="actor">Shannen Doherty</td></tr>
        <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
      </table>
</div>

I cannot find a clean way to filter out the rows that have only 1 column.

My code:

  def beverlyHillsParser(page: xml.NodeSeq) : Map[String, String] = {
    // Attributes are selected with an "@" prefix; a bare "id" would look for a child element named id.
    val beverlyHillsData = page \\ "div" find ((node: xml.Node) => (node \ "@id").text == "90210")
    beverlyHillsData match {
      case Some(data) => {
        val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )
        val actors = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "actor") map { _.text }
        val roles  = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "role")  map {_.text}
        (actors zip roles).toMap
      }
      case None => Map()
    }
  }

The line I am mainly concerned about is this one:

val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )

How can I filter out the bad rows more precisely (without _.toString())?

Any suggestions?

1 Answer

You can use:

def actorWithRole(n: Node) = n \\ "@class" xml_sameElements(List("actor", "role"))

val goodRows = data \\ "tr" filter actorWithRole
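
To see what this predicate is checking, it may help to look at the class attribute values that `\\ "@class"` collects for each row. The following is only a sketch: it assumes the scala-xml module is on the classpath, the object name `ClassFilterSketch` and the helper `classValues` are made up for illustration, and it compares plain strings with `==` rather than calling `xml_sameElements`.

```scala
import scala.xml.{Elem, Node, XML}

object ClassFilterSketch {
  // Sample table from the question, parsed as well-formed XML.
  val data: Elem = XML.loadString(
    """<table class="actors">
      |  <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
      |  <tr><td class="actor">Shannen Doherty</td></tr>
      |  <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
      |</table>""".stripMargin)

  // Hypothetical helper: all class attribute values found in a row, in document order.
  def classValues(row: Node): List[String] =
    (row \\ "@class").toList.map(_.text)

  // A row is kept only if its cells are exactly one "actor" followed by one "role".
  def actorWithRole(row: Node): Boolean =
    classValues(row) == List("actor", "role")

  val goodRows: List[Node] = (data \\ "tr").toList.filter(actorWithRole)

  def main(args: Array[String]): Unit = {
    // Show the class values seen for every row, then how many rows pass the filter.
    (data \\ "tr").foreach(row => println(classValues(row)))
    println(goodRows.size)
  }
}
```

The row that has only an actor cell yields a single class value, so it fails the comparison and is dropped.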

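I would also change the data extraction so that each actor/role pair stays intact. I need more time to work out a clean solution.

My suggestion is: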

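import scala.xml.Node

def beverlyHillsParser(page: xml.NodeSeq) : Map[String, String] = {

  // Keeps only rows whose class attributes are exactly "actor" followed by "role".
  def actorWithRole(n: Node) = n \\ "@class" xml_sameElements(List("actor", "role"))

  // Builds an (actor -> role) pair; only called on rows that passed actorWithRole.
  def rowToEntry(r: Node) =
    (r \ "td").toList map (_.text) match {
      case actor :: role :: Nil => (actor -> role)
    }

  // whereId was not defined in the original answer; this definition is an assumption.
  def whereId(id: String): Node => Boolean = n => (n \ "@id").text == id

  val beverlyHillsData = page \\ "div" find whereId("90210")

  beverlyHillsData match {
    case Some(data) => {
      val goodRows = data \\ "tr" filter actorWithRole
      val entries = goodRows map rowToEntry
      entries.toMap
    }
    case None => Map()
  }
}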
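
The filter-then-map steps above could also be folded into a single pass, so that a row is either turned into a pair or skipped. This is a sketch of that variation, not the answer's method: it assumes scala-xml is on the classpath, the object name `SinglePassSketch` is made up, and `rowToEntry` here returns an `Option` instead of relying on a prior filter.

```scala
import scala.xml.{Elem, Node, NodeSeq, XML}

object SinglePassSketch {
  // Sample document from the question, parsed as well-formed XML.
  val page: Elem = XML.loadString(
    """<div id="90210">
      |  <h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
      |  <table class="actors">
      |    <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
      |    <tr><td class="actor">Shannen Doherty</td></tr>
      |    <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
      |  </table>
      |</div>""".stripMargin)

  // Some(actor -> role) for a row with exactly an "actor" cell then a "role" cell, None otherwise.
  def rowToEntry(row: Node): Option[(String, String)] =
    (row \ "td").toList match {
      case List(a, r) if (a \ "@class").text == "actor" && (r \ "@class").text == "role" =>
        Some(a.text -> r.text)
      case _ => None
    }

  def beverlyHillsParser(page: NodeSeq): Map[String, String] =
    (page \\ "div").toList
      .find(div => (div \ "@id").text == "90210")
      .map(div => (div \\ "tr").toList.flatMap(row => rowToEntry(row)).toMap)
      .getOrElse(Map.empty[String, String])

  def main(args: Array[String]): Unit =
    println(beverlyHillsParser(page))
}
```

Because incomplete rows map to `None`, there is no partial match that could throw a `MatchError` if the filter and the extraction ever drift apart.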
answered 2013-11-06T14:31:21.543