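I am scraping a website and receive an HTML page.
The page has some tables whose rows have the form
(actor -> role)
For example:
(actor = Jason Priestley -> role = Brandon Walsh)
Sometimes a row is missing either the "actor" or the "role"
(a row with 1 column where 2 are expected)
Sample of the file: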
<div id="90210">
<h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
<table class="actors">
<tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
<tr><td class="actor">Shannen Doherty</td></tr>
<tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
</table>
</div>
I cannot cleanly filter out the rows that have only 1 column.
My code:
def beverlyHillsParser(page: xml.NodeSeq) : Map[String, String] = {
  // Attributes are selected with the "@" prefix; \ "id" would match a child element named "id"
  val beverlyHillsData = page \\ "div" find ((node: xml.Node) => (node \ "@id").text == "90210")
  beverlyHillsData match {
    case Some(data) => {
      val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )
      val actors = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "actor") map { _.text }
      val roles = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "role") map {_.text}
      (actors zip roles).toMap
    }
    case None => Map()
  }
}
My main concern is this line:
val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )
How can I filter out the bad rows more precisely (without _.toString())?
Any suggestions?
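
To make the goal concrete, here is a minimal sketch of the kind of structural filter I have in mind: it inspects each row's td cells by their class attribute instead of searching the row's string form. The names RowFilter, cell and actorRoles are made up for illustration, and it assumes the scala-xml module is on the classpath and that the page has already been parsed as well-formed XML.

```scala
import scala.xml.{Node, NodeSeq, XML}

object RowFilter {
  // Text of the first direct <td> child of `row` whose class attribute equals `cls`, if any.
  def cell(row: Node, cls: String): Option[String] =
    (row \ "td").find(td => (td \ "@class").text == cls).map(_.text)

  // Builds actor -> role pairs, keeping only rows that contain both cells.
  def actorRoles(table: NodeSeq): Map[String, String] =
    (table \\ "tr").toList.flatMap { row =>
      for {
        actor <- cell(row, "actor")
        role  <- cell(row, "role")
      } yield actor -> role
    }.toMap
}

object RowFilterDemo extends App {
  // Same rows as the sample above, parsed from a string.
  val table = XML.loadString(
    """<table class="actors">
      |<tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
      |<tr><td class="actor">Shannen Doherty</td></tr>
      |<tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
      |</table>""".stripMargin)

  // The Shannen Doherty row is dropped because it has no role cell.
  println(RowFilter.actorRoles(table))
}
```

Pairing actor and role inside the same row also avoids relying on zip to line up two separately filtered lists, which would go out of step if a bad row slipped through.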