html - Scraping HTML in Lisp

Question

My question is related to another question found here: Scraping an HTML table in Common Lisp?

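I am trying to extract data from a web page in Common Lisp. I am currently using drakma to send the HTTP request, and I am trying to use chtml to extract the data I am looking for. The page I want to scrape is http://erg.delph-in.net/logon, and here is my code: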

(defun send-request (sentence)
 "Sends SENTENCE in an HTTP request to logon for parsing, and receives
  back the web page containing the MRS output."
 (drakma:http-request "http://erg.delph-in.net/logon" 
                   :method :post 
                   :parameters `(("input" . ,sentence)
                                 ("task" . "Analyze")
                                 ("roots" . "sentences")
                                 ("output" . "mrs")
                                 ("exhaustivep" . "best")
                                 ("nresults" . "1"))))

This is the function I am having trouble with:

(defun get-mrs (sentence)
  (let* ((str (send-request sentence))
         (document (chtml:parse str (cxml-stp:make-builder))))
    (stp:filter-recursively (stp:of-name "mrsFeatureTop") document)))

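Basically, all the data I need to extract is in an HTML table, but it is too big to paste here. In my get-mrs function I am just trying to get the tags named mrsFeatureTop, but I am not sure this is right, because I get an error: not an NCName 'onclick. Any help with scraping the table would be greatly appreciated. Thank you.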

score 3 · Accepted Answer

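An ancient question, I know. But one that had me defeated for a long time. It's true that a lot of web pages are junk, but nearly all of Web 2.0 is built on screen scraping, integrating heterogeneous websites with hack on top of hack. That should be an ideal application for Lisp!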

The key (besides drakma) is lquery, which lets you access page content using a lispy transliteration of CSS selectors (the ones jQuery uses).
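To make the selector idea concrete before touching a live site, here is a minimal offline sketch. The HTML string, the `links` id, the `ext` class, and the variable names are all invented for illustration; only `lquery:$`, `initialize`, `text`, and `attr` come from lquery itself.

```lisp
;; Sketch only: the markup below is invented for illustration.
(ql:quickload :lquery)

(defparameter *sample-page*
  "<html><body>
     <ul id=\"links\">
       <li><a class=\"ext\" href=\"http://example.com/a\">First</a></li>
       <li><a class=\"ext\" href=\"http://example.com/b\">Second</a></li>
       <li><a href=\"/local\">Third</a></li>
     </ul>
   </body></html>")

;; Parse the string and make it the document that later ($ ...) forms query.
(lquery:$ (initialize *sample-page*))

;; A CSS selector comes first, then a chain of operations on the matched nodes.
;; Each query returns a vector with one entry per matched node.
(defparameter *link-texts* (lquery:$ "#links a.ext" (text)))
(defparameter *link-urls*  (lquery:$ "#links a.ext" (attr :href)))
```

The third link has no `ext` class, so the selector skips it and both vectors should hold two entries.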

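Let's grab the links from the media strip on the Google News page! If you open https://news.google.com in a browser and view the source, the complexity of the page will overwhelm you. But if you look at the page in the browser's developer panel (Firefox: F12, Inspector), you will see that there is some logic to it. Use the search box to find .media-strip-table; that element contains the images we want. Now open your favorite REPL. (OK, let's be honest: Emacs, M-x slime.)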

(ql:quickload '(:drakma :lquery))

;;; Get the links from the media strip on Google's news page.
(defparameter response  (drakma:http-request "https://news.google.com/"))

;;; lquery parses the page and gets it ready to be queried.
(lquery:$ (initialize response))

Now let's take a look at the result:

;;; Package-qualified '$' operator. Barbaric!
;;; Use (use-package :lquery) to omit the package prefix.
(lquery:$ ".media-strip-table" (html))

Whoa! That is just a small part of the page? OK, how about just the first element?

(elt (lquery:$ ".media-strip-table" (html)) 0)

OK, that is more manageable. Let's see whether there is an image tag in there. Emacs: C-s img. Yay! There it is.

(lquery:$ ".media-strip-table img" (html))

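Hmm... it is finding something, but only returning empty text... Oh yeah, image tags are supposed to be empty! What we want is their src attribute: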

(lquery:$ ".media-strip-table img" (attr :src))

Holy cow! GIFs aren't just for unfunny, grainy animations?
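The same approach might carry over to the page from the question. The following is only a sketch under stated assumptions: it guesses that mrsFeatureTop is a CSS class on the table cells (check this in the inspector), the helper name `extract-mrs-cells` is hypothetical, and the HTML string is an invented stand-in for the real response, including the made-up `mrsFeatureIndex` class. lquery parses with plump, which is designed to be lenient with messy markup, so it may sidestep the NCName error, though that is not verified against the real page.

```lisp
(ql:quickload :lquery)

;; Hypothetical helper: PAGE is an HTML string, for example the value
;; returned by the question's send-request function.
(defun extract-mrs-cells (page)
  "Return a vector with the text of every element of class mrsFeatureTop in PAGE.
The class name is an assumption; confirm it in the browser inspector."
  (lquery:$ (initialize page) ".mrsFeatureTop" (text)))

;; Invented stand-in for the real response, so the sketch runs offline.
(defparameter *fake-response*
  "<table><tr>
     <td class=\"mrsFeatureTop\" onclick=\"toggle()\">h1</td>
     <td class=\"mrsFeatureIndex\">e2</td>
   </tr></table>")

(defparameter *cells* (extract-mrs-cells *fake-response*))
```

Against the live service the call would presumably look like (extract-mrs-cells (send-request "Kim sleeps.")), where the sentence is just an example input.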
