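Background: we are aggregating content from certain websites (with permission) to feed a supplementary search feature in other applications. One example is the news section of https://centenary.bahai.us. We are considering xidel for this, because its template-file paradigm looks like an elegant way to extract data from HTML. For example, given the template: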
<h1 class="title">{$title}</h1>?
<div class="node build-mode-full">
{$url:=$url}
<div class="field-field-audio">?
<audio src="{$audio:='https://' || $host || .}"></audio>?
</div>?
<div class="field-field-clip-img">
<a href="{$image:='https://' || $host || .}" class="imagefield-field_clip_img"></a>*
</div>?
<div class="field-field-pubname">{$publication}</div>?
<div class="field-field-historical-date">{$date}</div>?
<div class="location"><div class="adr">{$location}</div>?</div>?
<div class="node-body">{$text}</div>
</div>?
...we can run a command like this:
xidel "https://centenary.bahai.us" -e "$(< template.html)" -f "//a[contains(@href, '/news/')]" --silent --color=never --output-format=json-wrapped > index.json
...which gives us JSON-formatted data from all of the news pages on centenary.bahai.us. A sample article looks like this:
{
"title": "Bahá’ísm the Religion of Brotherhood",
"url": "https://centenary.bahai.us/news/bahaism-religion-brotherhood",
"audio": "https://centenary.bahai.us/sites/default/files/453_0.mp3",
"image": "https://centenary.bahai.us/sites/default/files/imagecache/lightbox-large/images/press_clippings/03-31-1912_NYT_Bahaism_the_Religion_of_Brotherhood.png",
"publication": "The New York Times",
"date": "March 31, 1912",
"location": "New York, NY",
"text": "A posthumous volume of “Essays in Radical Empiricism,” by William James, will be published in April by Longmans, Green & Co. This house will also bring out “Leo XIII, and Anglican Orders,” by Viscount Halifax, and “Bahá’ísm, the Religion of Brotherhood, and Its Place in the Evolution of Creeds,” by Francis H. Skrine. In the latter an analysis is made of the Gospel of Bahá’u’lláh and his successor. ‘Abdu’l-Bahá — whose arrival in this country is expected early in April — and a forecast is attempted of its influence on civilization."
},
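This is beautiful, and much easier than some mashup of httrack and pup or (god forbid) sed and regex, but there are a few problems:
- We want a separate file for each document, and this gives us one big JSON file.
- Even with the --silent flag, we still get status messages in the output that make the JSON invalid, such as "**** Retrieving (GET): https://centenary.bahai.us ****", "**** Processing: https://centenary.bahai.us/ ****", or "** Current variable state: **".
- The process seems too brittle; if there is any discrepancy between the template and the actual HTML, the whole run errors out and we get nothing. We would prefer it to report an error for just that page and move on to the next URL.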
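For what it's worth, the only workaround I can picture for these three issues is a wrapper script rather than a one-liner. The sketch below is untested against the real site and rests on several assumptions: the function names (strip_status_lines, url_to_filename, scrape_each) are hypothetical, the article URLs have already been collected one per line, the status messages arrive on stdout, and xidel exits with a non-zero status when the template fails to match a page.

```shell
# Sketch only: function names, arguments, and file layout are hypothetical.

# Remove status lines such as "**** Processing: ... ****" or
# "** Current variable state: **" from the stream on stdin.
strip_status_lines() {
  grep -v -E '^\*\*+ .* \*\*+$' || true   # grep exits 1 when nothing is left
}

# Map an article URL to a file name: .../news/some-slug -> some-slug.json
url_to_filename() {
  local url="${1%/}"            # drop one trailing slash, if any
  printf '%s.json\n' "${url##*/}"
}

# Read URLs from stdin (one per line) and write one JSON file per URL.
# Usage: scrape_each OUTPUT_DIR TEMPLATE_FILE < urls.txt
scrape_each() {
  local outdir="$1" template url out
  template="$(cat "$2")" || return 1
  mkdir -p "$outdir"
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    out="$outdir/$(url_to_filename "$url")"
    # stdin is redirected so xidel cannot swallow the remaining URL list.
    if xidel "$url" -e "$template" --silent --color=never \
         --output-format=json-wrapped < /dev/null > "$out.tmp"; then
      strip_status_lines < "$out.tmp" > "$out"
    else
      # Assumption: a template mismatch makes xidel exit non-zero.
      echo "failed, skipping: $url" >&2
    fi
    rm -f "$out.tmp"
  done
}
```

With a hypothetical urls.txt holding the news links, this would be called as `scrape_each out template.html < urls.txt`. Running xidel once per URL is slower than a single invocation with -f, but it keeps one failing page from discarding everything else.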
Xidel seems like a game-changing tool that should be able to do this job with a one-line command and a simple extraction template file; what am I missing here?