haskell - HXT, getting the first element: refactoring a weird arrow

Question

I need to get the text content of the first <p> child of <div class="about">, and wrote the following code:

tagTextS :: IOSArrow XmlTree String
tagTextS = getChildren >>> getText >>> arr stripString

parseDescription :: IOSArrow XmlTree String
parseDescription =
  (
   deep (isElem >>> hasName "div" >>> hasAttrValue "id" (== "company_about_full_description"))
   >>> (arr (\x -> x) /> isElem  >>> hasName "p") >. (!! 0) >>> tagTextS
  ) `orElse` (constA "")

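Look at that arr (\x -> x): without it I could not get the result.

Is there a better way to write parseDescription?
Another question: why do I need the parenthesis before arr and the one after hasName "p"? (I actually found this solution here)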

score 4 · Accepted Answer

With XPath it could look like this:

import "hxt-xpath" Text.XML.HXT.XPath.Arrows (getXPathTrees)

...

xp = "//div[@class='about']/p[1]"

parseDescription = getXPathTrees xp >>> getChildren >>> getText

score 2 · Accepted Answer

Another suggestion, this time using only the hxt core, as you asked.

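Selecting the first child cannot be done on the output of getChildren, because (>>>) for hxt arrows applies the following arrow to each item of the preceding output list, not to the list as a whole. This is explained on the HaskellWiki HXT page, although the definition shown there is an old one; (>>>) is now derived from the Category (.) composition.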
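To make that concrete, here is a toy model of the composition rule. It is not HXT itself: ListArrow, Node, children, label and wholeList are names made up for this illustration, and the model ignores the state and IO that IOSArrow carries. It only assumes that a list arrow behaves like a function from one input to a list of outputs, with wholeList playing the role that HXT's (>>.) plays.

```haskell
import Control.Monad ((>=>))

-- Toy stand-in for an HXT list arrow: one input, a list of outputs.
type ListArrow a b = a -> [b]

-- Minimal rose tree, only for this illustration.
data Node = Node String [Node]

children :: ListArrow Node Node
children (Node _ cs) = cs

label :: ListArrow Node String
label (Node l _) = [l]

-- Post-process the whole result list of an arrow.
wholeList :: ListArrow a b -> ([b] -> [c]) -> ListArrow a c
wholeList f g = g . f

doc :: Node
doc = Node "div" [Node "p1" [], Node "p2" []]

main :: IO ()
main = do
  -- Composition feeds each child to `label` separately, so every child survives.
  print ((children >=> label) doc)                     -- ["p1","p2"]
  -- Cutting per item has no effect: each call only sees a one-element list.
  print ((children >=> (take 1 . label)) doc)          -- ["p1","p2"]
  -- Cutting the whole list before composing keeps only the first child.
  print ((wholeList children (take 1) >=> label) doc)  -- ["p1"]
```

The third line is the shape to aim for: the selection has to happen on the complete children list, before the next arrow is composed onto it.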

A getNthChild can be hacked together from the getChildren of Control.Arrow.ArrowTree:

import Data.Tree.Class (Tree)
import qualified Data.Tree.Class as T

-- if the nth element does not exist it will return an empty children list

getNthChild :: (ArrowList a, Tree t) => Int -> a (t b) (t b)
getNthChild n = arrL (take 1 . drop n . T.getChildren)

Your parseDescription could then take the following form:

-- importing Text.XML.HXT.Arrow.XmlArrow (hasName, hasAttrValue)

parseDescription = 
    deep (isElem >>> hasName "div" >>> hasAttrValue "class" (== "about") 
          >>> getNthChild 0 >>> hasName "p"
          ) 
    >>> getChildren >>> getText

Update: I found another way, using changeChildren:

getNthChild :: (ArrowTree a, Tree t) => Int -> a (t b) (t b)
getNthChild n = changeChildren (take 1 . drop n) >>> getChildren

Update: to skip the whitespace text nodes that sit between elements, filter out the non-element children:

import qualified Text.XML.HXT.DOM.XmlNode as XN

getNthChild :: (ArrowTree a, Tree t, XN.XmlNode b) => Int -> a (t b) (t b)
getNthChild n = changeChildren (take 1 . drop n . filter XN.isElem) >>> getChildren
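For a quick check, the last version can be tried on a small in-memory document. This is only a sketch under a few assumptions: the hxt package is installed, the arrows are specialised to the pure LA arrow so that runLA can be used without IO, and the sample string is invented for the test.

```haskell
import Text.XML.HXT.Core
import qualified Text.XML.HXT.DOM.XmlNode as XN

-- Same definition as above, specialised to LA XmlTree for simplicity.
getNthChild :: Int -> LA XmlTree XmlTree
getNthChild n = changeChildren (take 1 . drop n . filter XN.isElem) >>> getChildren

parseDescription :: LA XmlTree String
parseDescription =
    deep (isElem >>> hasName "div" >>> hasAttrValue "class" (== "about")
          >>> getNthChild 0 >>> hasName "p")
    >>> getChildren >>> getText

-- Invented sample input; note the whitespace between the elements.
sample :: String
sample = "<div class=\"about\">\n  <p>first</p>\n  <p>second</p>\n</div>"

main :: IO ()
main = print (runLA (xread >>> parseDescription) sample)
```

Because of the whitespace between the tags, the first child of the div is a text node, which is why the filter XN.isElem step matters here.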
