xidel - 我们可以使用 Xidel 将整个站点的数据提取到搜索文件中吗？

Question

背景：我们正在汇总来自某些网站的内容（经许可），以用于其他应用程序的补充搜索功能。一个例子是https://centenary.bahai.us的新闻部分。我们考虑为此目的使用 xidel，因为模板文件范例似乎是从 html 中提取数据的一种优雅方式，例如对于模板：

<h1 class="title">{$title}</h1>?
<div class="node build-mode-full">
  {$url:=$url}
  <div class="field-field-audio">?
    <audio src="{$audio:='https://' || $host || .}"></audio>?
  </div>?
  <div class="field-field-clip-img">
    <a href="{$image:='https://' || $host || .}" class="imagefield-field_clip_img"></a>*
  </div>?
  <div class="field-field-pubname">{$publication}</div>?
  <div class="field-field-historical-date">{$date}</div>?
  <div class="location"><div class="adr">{$location}</div>?</div>?
  <div class="node-body">{$text}</div>
</div>?

...我们可以运行如下命令：

xidel "https://centenary.bahai.us" -e "$(< template.html)" -f "//a[contains(@href, '/news/')]" --silent --color=never --output-format=json-wrapped > index.json

...这将为我们提供来自 centenary.bahai.us 上所有新闻页面的 json 格式数据。示例文章如下所示：

{
"title": "Bahá’ísm the Religion of Brotherhood", 
"url": "https://centenary.bahai.us/news/bahaism-religion-brotherhood", 
"audio": "https://centenary.bahai.us/sites/default/files/453_0.mp3", 
"image": "https://centenary.bahai.us/sites/default/files/imagecache/lightbox-large/images/press_clippings/03-31-1912_NYT_Bahaism_the_Religion_of_Brotherhood.png", 
"publication": "The New York Times", 
"date": "March 31, 1912", 
"location": "New York, NY", 
"text": "A posthumous volume of “Essays in Radical Empiricism,” by William James, will be published in April by Longmans, Green & Co. This house will also bring out “Leo XIII, and Anglican Orders,” by Viscount Halifax, and “Bahá’ísm, the Religion of Brotherhood, and Its Place in the Evolution of Creeds,” by Francis H. Skrine. In the latter an analysis is made of the Gospel of Bahá’u’lláh and his successor. ‘Abdu’l-Bahá — whose arrival in this country is expected early in April — and a forecast is attempted of its influence on civilization."
},

这很漂亮，而且比 httrack 和 pup 或（上帝保佑）sed 和 regex 的一些混搭要容易得多，但有一些问题：

我们希望每个文档都有单独的文件，而这给了我们一个大的 json 文件。
即使使用该--silent标志，我们仍然会在输出中获得使 json 无效的状态消息，例如**** Retrieving (GET): https://centenary.bahai.us ****or**** Processing: https://centenary.bahai.us/ ****或** Current variable state: **
这个过程似乎太脆弱了；如果模板和实际的 html 之间有任何差异，整个过程就会出错，我们什么也得不到。我们希望它只输出一个页面的错误，然后继续下一个 URL。

Xidel 似乎是一个改变游戏规则的工具，它应该可以通过一行命令和一个简单的提取模板文件来完成这项工作；我在这里想念什么？

score 0 · Accepted Answer

您可以在 Xidel 中构建自己的输出。

您可以使用XQuery 中的文件模块将 JSON 保存到文件中，例如：

file:write-text("/tmp/test.json", serialize-json({"a": 1}))

您可以将任何 JSON 传递给 serialize-json

 { "title": $title, "url": $url, "audio": $audio}

或者从变量列表中构建它：

 {| ("title", "url", "audio") ! {.: get(.)}  |}

有多种方法可以捕获错误：

在 bash 中多次调用 Xidel

老派的方式：

xidel --silent "https://centenary.bahai.us/news" -e "//a[contains(@href, '/news/')]/resolve-html(.)"  | sort | uniq | 
while read -r u; do 
  xidel $u -e @template.html >> output.json
done

使用多页模板

您可以将模板模式放在模板 xml 文件中，以获得比命令行更大的控制：

  <action>
    <page url="https://centenary.bahai.us/news"/>
    <s>$temp := distinct-values( //a[contains(@href, '/news/')]/resolve-html(.) )</s>

    <loop var="u" list="$temp">
      <page url="{$u}"/>
      <try>
        <pattern><body>
          <h1 class="title">{$title}</h1>?
          <div class="node build-mode-full">
            {$url:=$url}
            <div class="field-field-audio">?
              <audio src="{$audio:='https://' || $host || .}"></audio>?
            </div>?
            <div class="field-field-clip-img">
              <a href="{$image:='https://' || $host || .}" class="imagefield-field_clip_img"></a>*
            </div>?
            <div class="field-field-pubname">{$publication}</div>?
            <div class="field-field-historical-date">{$date}</div>?
            <div class="location"><div class="adr">{$location}</div>?</div>?
            <div class="node-body">{$text}</div>
          </div>?
        </body></pattern>
        <catch>
        </catch>
      </try>
    </loop>

  </action>

<loop>重复某事并<try><catch>捕获所有错误。

然后用xidel --template-file action.xml

从 XQuery 调用模式匹配

您也可以使用函数从 XQuery 调用模式匹配pxp:match，尽管您不能使用特殊的内置变量（如 $url 或 $host。顺便说一句，避免修改这些变量，不确定如果这样做会发生什么。）：

 xidel  'https://centenary.bahai.us/news' --xquery '
    let $pattern := doc("file://./template.html")
    for $u in distinct-values( //a[contains(@href, "/news/")]/resolve-html(.) )
    return try {
      pxp:match($pattern, doc($u))
    } catch * { 
       "error"
    }
 '

score 0 · Accepted Answer

从使用情况来看，$(< template.html)我猜你是在 Linux 发行版上。在这种情况下，您的报价是错误的。请参见此处的#9 和 #10 。

由于您使用的是提取模板文件，我会说--extract-file=template.html这是要使用的参数，但您-e "$(< template.html)"似乎也可以使用。这对我来说是新的。谢谢。
多亏了贝尼贝拉的回答，我才知道-e @template.html也有效。

接下来是参数的错误顺序。我不得不承认，Xidel 的自述文件对此并不太清楚。
紧随其后，您显然必须先“关注”一个网址，然后才能进行提取xidel。--silent --color=never所以这应该工作：

$ xidel --silent --color=never "https://centenary.bahai.us" \
  -f '//div[@class="views-field-title"]/span/a[starts-with(@href,"/news/")]/@href' \
  --extract-file=template.html \
  --output-format=json-wrapped \
  > index.json

我自己几乎从不使用模板，所以我会通过自己构建 json 来做一些不同的事情：

$ xidel -s --no-check-certificate "https://centenary.bahai.us" -e '
  for $x in //div[@class="views-field-title"]/span/a[starts-with(@href,"/news/")]/@href return
  file:write(
    substring-after($x,"/news/")||".json",
    doc(resolve-uri($x))/{
      "title"://h1/text(),
      "url":$url,
      "audio"://audio/resolve-uri(@src),
      "image"://div[ends-with(@class,"clip-img")]//img/resolve-uri(@src),
      "publication"://div[ends-with(@class,"pubname")]/div/normalize-space(div[@class="field-item odd"]),
      "date"://div[ends-with(@class,"historical-date")]//span/text(),
      "location"://span[@class="locality"]/text(),
      "text":string-join(//div[@class="node-body"]//text())
    },
    {"method":"json","indent":true()}
  )
'

--no-check-certificate由于证书无效，根本需要打开 url。
//div[@class="views-field-title"]/span/a[starts-with(@href,"/news/")]/@href返回当前新闻文章的相对路径：

/news/visit-abdul-baha-abbas
/news/abdul-baha-prays-ascension-church
/news/bahaist-leader-here-interest-world-peace
/news/abdul-baha-abbas-coming-lewis-g-gregory

对于每篇新闻文章，都会打开 url，解析 html 源并将提取的信息保存为缩进/美化 JSON 文件。第一个，visit-abdul-baha-abbas.json例如：

{
  "title": "A Visit to ‘Abdu’l-Bahá Abbas",
  "url": "https:\/\/centenary.bahai.us\/",
  "audio": "https:\/\/centenary.bahai.us\/sites\/default\/files\/257_0.mp3",
  "image": "https:\/\/centenary.bahai.us\/sites\/default\/files\/imagecache\/page-secondary-images\/images\/press_clippings\/04-17-1912%20Utica%20NY%20Press%20A%20Visit%20to%20Abdul%20Baha%20Abbas.png",
  "publication": "Utica New York Press",
  "date": "April 17, 1912",
  "location": "Acca",
  "text": "An American Girl Tells of a Memorable Experience in Her Life.[...]"
}

xidel - 我们可以使用 Xidel 将整个站点的数据提取到搜索文件中吗？

2 回答 2

在 bash 中多次调用 Xidel

使用多页模板

从 XQuery 调用模式匹配

Related

Reference