
I want to process a (very poorly structured) HTML file in which the information is grouped in pairs of rows, like this:

<html>
<body>
<table>
 <tr>
     <td>
         <font >
         <a href="a">ABC</a></font>
     </td>
 </tr>
 <tr>
     <td height="50">
         <font>When:</font><font>19-1-2013</font>
          <b><font>&nbsp; </font></b>
         <font>Where:</font><font>Here</font>
         <font>Who:</font><font>Me</font>
     </td>
 </tr>
 <tr>
     <td>
        <font >
             <a href="b">EFG</a>
        </font>
     </td>
 </tr>
 <tr>
     <td height="50">
         <font>When:</font><font>19-2-2013</font>
         <b><font>&nbsp; </font></b>
         <font>Where:</font><font>There</font>
         <font>Who:</font><font>You</font>
     </td>
 </tr>
 <tr>
     <td>
        <font >
            <a href="c">HIJ</a>
        </font>
     </td>
 </tr>
 <tr>
     <td height="50">
         <font>When:</font><font>19-3-2013</font><b>
         <font>&nbsp; </font></b>
         <font>Where:</font><font>Far away</font>
         <font>Who:</font><font>Him</font>
     </td>
 </tr>
</table>
</body>
</html>

After several iterations, I arrived at the following code, which does what I want:

import Data.List
import Control.Arrow.ArrowNavigatableTree
import Text.XML.HXT.Core
import Text.HandsomeSoup

group2 [] = []
group2 (x0:x1:xs) = [x0,x1]:(group2 xs)

countRows html = html >>> deep (hasName "tr") >. length

parsePage sz html = let
  n x = deep (hasName "tr") >. ((\a -> a !! x) . group2 ) >>> unlistA
  m = deep (hasName "td") >>> css "a" /> getText
  o = deep (hasName "td") >>> hasAttr "height" >>> (css "font" >. (take 1 . drop 4)) >>> unlistA /> getText
  p x = (((n x) >>> m) &&& ((n x) >>> o))
  in html >>> catA [p x | x <- [0..sz]]

main = do
    dt <- readFile "test.html"
    let html = parseHtml dt
    count <- (runX . countRows) html
    let cnt = ((head count) `div` 2) - 1
    prcssd <- (runX . (parsePage cnt)) html
    print prcssd

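The result is: [("ABC","Here"),("EFG","There"),("HIJ","Far away")]

However, I don't think this is a very good approach, because it has to count the rows first. Is there a better way to do this kind of grouping with HXT? I have tried the &&& operator, without much luck.

I believe the question "extract multiple html tables with hxt", although useful, presents a simpler case.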


2 Answers


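A few weeks ago I did some HTML parsing with hxt and found XPath very handy. Unfortunately, I did not come up with a perfect solution to your problem, but this might be a starting point for a new attempt.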

import Text.XML.HXT.Core
import Text.XML.HXT.XPath.Arrows

type XmlTreeValue a = a XmlTree String
type ParsedXmlTree a = a XmlTree XmlTree
type IOXmlTree = IOSArrow XmlTree XmlTree

-- parses a given .html file
parseHtml :: FilePath -> IOStateArrow s b XmlTree
parseHtml path = readDocument [withParseHTML yes, withWarnings no] path

-- "" for stdout
saveHtml :: IOXmlTree
saveHtml = writeDocument [withIndent yes] ""

extract :: IOXmlTree
extract = processChildren (process `when` isElem)

-- main processing function
processHtml :: FilePath -> IO ()
processHtml src =
  runX (parseHtml src >>> extract >>> saveHtml)
   >> return ()

-- process the html structure
process :: ArrowXml cat => ParsedXmlTree cat
process =
  -- create tag <structure> for the expression given next
  selem "structure"
    -- navigate to <html><body><table><tr>...
    [(getXPathTrees "/html/body/table/tr")
      -- then combine the results
      >>> (getTheName <+> getWhere)]

-- selects text at path <td><font><a...> </a></font></td> and creates a <name> tag
-- (// means that all <td> tags are analysed,
--  but I'm not quite sure why this is relevant here)
getTheName :: ArrowXml cat => ParsedXmlTree cat
getTheName = selem "name" [getXPathTrees "//td/font/a/text()"]

-- selects text at path <td><font> </font></td>
-- (where the fourth <font> tag is taken) and creates a <where> tag
getWhere  :: ArrowXml cat => ParsedXmlTree cat
getWhere = selem "where" [getXPathTrees "//td/font[4]/text()"]

The result looks like this:

*Main> processHtml "test.html"
<?xml version="1.0" encoding="UTF-8"?>
<structure>
 <name>ABC</name>
 <where/>
 <name/>
 <where>Here</where>
 <name>EFG</name>
 <where/>
 <name/>
 <where>There</where>
 <name>HIJ</name>
 <where/>
 <name/>
 <where>Far away</where>
</structure>

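Like I said, it is not perfect, but hopefully it is a start.

EDIT: Maybe this looks more like your approach. Instead of dropping the elements you don't care about, we first select all elements that fit and then filter the results. I find it quite interesting that there is no general approach for a problem like this. Somehow the font[4] selection does not work with my other approach, but maybe I'm just not a good XPath user.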

processHtml :: FilePath -> IO [(String,String)]
processHtml src = do
  names <- runX (parseHtml src >>> process1)
  fontTags <- runX (parseHtml src >>> process2)
  let wheres = filterAfterWhere fontTags
  let result = zip names wheres
  return result
 where filterAfterWhere [] = []
       filterAfterWhere xs = case dropWhile (/= "Where:") xs of
                               []     -> []
                               [x]    -> [x]
                               _:y:ys -> y : filterAfterWhere ys

process1 :: ArrowXml cat => XmlTreeValue cat
process1 = textNodeToText getTheName

process2 :: ArrowXml cat => XmlTreeValue cat
process2 =  textNodeToText getWhere

getTheName :: ArrowXml cat => ParsedXmlTree cat
getTheName = getXPathTrees "//td/font/a/text()"

getWhere  :: ArrowXml cat => ParsedXmlTree cat
getWhere = getXPathTrees "//td/font/text()"

-- neat function to select a value within an XmlTree as a String
textNodeToText :: ArrowXml cat => ParsedXmlTree cat -> XmlTreeValue cat
textNodeToText selector = selector `when` isElem >>> getText

This way you get the result shown in the question:

*Main> processHtml "test.html"
[("ABC","Here"),("EFG","There"),("HIJ","Far away")]
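
The filtering step can be tried without HXT on a plain list. The following is only a sketch: the input list is written by hand from the sample HTML (it is not produced by the parser), and valuesAfterWhere is a hypothetical name. Unlike filterAfterWhere, it drops a trailing "Where:" label that has no value after it.

```haskell
-- Keep the element that directly follows each "Where:" label.
valuesAfterWhere :: [String] -> [String]
valuesAfterWhere xs = case dropWhile (/= "Where:") xs of
  _ : value : rest -> value : valuesAfterWhere rest
  _                -> []   -- no label left, or a label without a value

-- Hand-written stand-ins for the results of the two XPath queries.
names :: [String]
names = ["ABC", "EFG", "HIJ"]

fontTexts :: [String]
fontTexts =
  [ "When:", "19-1-2013", "Where:", "Here", "Who:", "Me"
  , "When:", "19-2-2013", "Where:", "There", "Who:", "You"
  , "When:", "19-3-2013", "Where:", "Far away", "Who:", "Him"
  ]

main :: IO ()
main = print (zip names (valuesAfterWhere fontTexts))
-- prints [("ABC","Here"),("EFG","There"),("HIJ","Far away")]
```

Because zip stops at the shorter list, a row pair that lacks a "Where:" entry would silently shift all later pairs, so this approach assumes every block has one.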

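EDIT 2:

Fun fact: it looks like the hxt-xpath library does not handle this kind of index selection quite right. An online XPath evaluator, by contrast, evaluates //td/font[4]/text() as expected.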

answered 2013-02-19T18:58:28.583

Here is a simpler implementation.

import Text.XML.HXT.Core
import Text.HandsomeSoup

group2 :: [a] -> [(a, a)]
group2 (x0:x1:xs) = (x0, x1) : group2 xs
group2 _ = []  -- empty list, or a trailing row without a partner

parsePage :: ArrowXml a => a XmlTree (String, String)
parsePage = let
    trPairs    = deep (hasName "tr") >>. group2
    insideLink = deep (hasName "a") /> getText
    insideFont = deep (hasName "font") >>. (take 1 . drop 4) /> getText

    in trPairs >>> (insideLink *** insideFont)


main = do
    dt <- readFile "test.html"
    let html = parseHtml dt
    prcssd <- runX $ html >>> parsePage
    print prcssd

The >>. operator can be used instead of >. so that you don't need to call unlistA afterwards.
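
To see why the unlistA step becomes unnecessary, here is a small model that needs no HXT at all. It is a sketch under a simplifying assumption: a list arrow is treated as a plain function from one input to a list of results. The names ListArrow, mapResults, collapseResults, unlist, andThen, pairUp and rows are all made up for this illustration; they stand in for >>., >., unlistA, >>>, group2 and deep (hasName "tr") respectively.

```haskell
-- Simplified model: one input, many results.
type ListArrow a b = a -> [b]

-- Models (>>.): rewrite the whole result list; it stays a list of results.
mapResults :: ListArrow a b -> ([b] -> [c]) -> ListArrow a c
mapResults f g = g . f

-- Models (>.): collapse the whole result list into one single result.
collapseResults :: ListArrow a b -> ([b] -> c) -> ListArrow a c
collapseResults f g = \x -> [g (f x)]

-- Models unlistA: spread a list-valued input into separate results again.
unlist :: ListArrow [b] b
unlist = id

-- Models (>>>) for list arrows.
andThen :: ListArrow a b -> ListArrow b c -> ListArrow a c
andThen f g = concatMap g . f

-- Pair up consecutive elements; a trailing odd element is dropped.
pairUp :: [a] -> [(a, a)]
pairUp (x : y : rest) = (x, y) : pairUp rest
pairUp _              = []

-- Stand-in for the arrow that yields all rows of a document.
rows :: ListArrow [String] String
rows = id

viaCollapse, viaMap :: ListArrow [String] (String, String)
viaCollapse = collapseResults rows pairUp `andThen` unlist
viaMap      = mapResults rows pairUp

main :: IO ()
main = do
  let doc = ["tr1", "tr2", "tr3", "tr4"]
  print (viaCollapse doc)  -- prints [("tr1","tr2"),("tr3","tr4")]
  print (viaMap doc)       -- prints [("tr1","tr2"),("tr3","tr4")]
```

Both variants give the same pairs; the second one simply never wraps the list into a single result, so there is nothing to unwrap.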

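I changed the group2 function to return a list of pairs, because that maps better to what we want to achieve and is easier to work with.

The type of trPairs is

trPairs :: ArrowXml a => a XmlTree (XmlTree, XmlTree)

i.e. it is an arrow that takes a node and outputs pairs of nodes (the paired <tr> nodes). Now we can use the *** operator from Control.Arrow to apply a transformation to each element of the pair: insideLink to the first element and insideFont to the second. This way we collect and group everything we need in a single traversal of the HTML tree.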
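
As a small illustration of what *** does, here it is applied to ordinary functions instead of HXT arrows. This is only a sketch: HeaderRow, DetailRow, pickName, pickWhere and the sample data are hypothetical stand-ins for the paired <tr> nodes and for insideLink and insideFont.

```haskell
import Control.Arrow ((***))

-- Hypothetical stand-ins for the two rows of each pair.
type HeaderRow = String     -- the link text of the first row
type DetailRow = [String]   -- the <font> texts of the second row

-- Plays the role of insideLink.
pickName :: HeaderRow -> String
pickName = id

-- Plays the role of insideFont: the fifth text, or "" if it is missing.
pickWhere :: DetailRow -> String
pickWhere = concat . take 1 . drop 4

rowPairs :: [(HeaderRow, DetailRow)]
rowPairs =
  [ ("ABC", ["When:", "19-1-2013", " ", "Where:", "Here", "Who:", "Me"])
  , ("EFG", ["When:", "19-2-2013", " ", "Where:", "There", "Who:", "You"])
  ]

main :: IO ()
main = print (map (pickName *** pickWhere) rowPairs)
-- prints [("ABC","Here"),("EFG","There")]
```

For plain functions, (f *** g) (x, y) is just (f x, g y); with list arrows each side may also yield zero or several results, which this sketch does not model.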

answered 2013-02-20T11:08:16.820