clojure - Re-scraping data with Enlive

Question

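I am trying to write a function that scrapes and tags data from an HTML page whose URL I pass in, and that part works as it should. I get a sequence of <h3> and <table> elements, but when I try to use the select function to extract only the table or h3 tags from the resulting sequence, I get (), and if I try to map over those tags, I get (nil nil nil ...).

Can you help me solve this, or explain what I am doing wrong?

Here is the code: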

(ns Test2
  (:require [net.cgrand.enlive-html :as html]
            [clojure.string :as string]))

(defn get-page 
  "Gets the html page from passed url" 
  [url] 
  (html/html-resource (java.net.URL. url))) 

(defn h3+table
  "returns sequence of <h3> and <table> tags"
  [url]
  (html/select (get-page url)
               {[:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :h3]
                [:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :table]}))

(def url "http://www.belex.rs/trgovanje/prospekt/VZAS/show")

This is the line that gives me a headache:

(html/select (h3+table url) [:table])

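Can you tell me what I am doing wrong?

Just to clarify my question: is it possible to use Enlive's select function to extract only the table tags from the result of (h3+table url)?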

score 2 · Accepted Answer

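As @Julien pointed out, you will probably have to work with the deeply nested tree structure that you get from applying (html/select raw-html selectors) to the raw HTML. It looks like you are trying to apply html/select several times, but that does not work. html/select parses the HTML into a Clojure data structure, so you cannot apply it to that data structure again.

I found that parsing the website is actually a bit involved, but I thought it might be a good use case for multimethods, so I hacked something together; maybe it will get you started:

(The code formatting here is ugly; you can also have a look at this gist.)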

(ns tutorial.scrape1
  (:require [net.cgrand.enlive-html :as html]))

;; Plain name: by convention, earmuffs such as *url* are reserved for dynamic vars.
(def url "http://www.belex.rs/trgovanje/prospekt/VZAS/show")

(defn get-page [url] 
  (html/html-resource (java.net.URL. url))) 

(defn content->string [content]
  (cond
   (nil? content)    ""
   (string? content) content
   (map? content)    (content->string (:content content))
   (coll? content)   (apply str (map content->string content))
   :else             (str content)))

(derive clojure.lang.PersistentStructMap ::Map)
(derive clojure.lang.PersistentArrayMap  ::Map)
(derive java.lang.String                 ::String)
(derive clojure.lang.ISeq                ::Collection)
(derive clojure.lang.PersistentList      ::Collection)
(derive clojure.lang.LazySeq             ::Collection)

(defn tag-type [node]
  (case (:tag node) 
   :tr    ::CompoundNode
   :table ::CompoundNode
   :th    ::TerminalNode
   :td    ::TerminalNode
   :h3    ::TerminalNode
   :tbody ::IgnoreNode
   ::IgnoreNode))

(defmulti parse-node
  (fn [node]
    (let [cls (class node)] [cls (if (isa? cls ::Map) (tag-type node) nil)])))

(defmethod parse-node [::Map ::TerminalNode] [node]
  (content->string (:content node)))
(defmethod parse-node [::Map ::CompoundNode] [node]
  (map parse-node (:content node)))
(defmethod parse-node [::Map ::IgnoreNode] [node]
  (parse-node (:content node)))
(defmethod parse-node [::String nil] [node]
  node)
(defmethod parse-node [::Collection nil] [node]
  (map parse-node node))

(defn h3+table [url]
  (let [ws-content (get-page url)
        h3s+tables (html/select ws-content #{[:div#prospekt_container :h3]
                                             [:div#prospekt_container :table]})]
    (for [node h3s+tables] (parse-node node))))

A few words about what is going on:

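content->string takes a data structure, collects its content into a string, and returns it, so that you can apply it to content that may still contain nested subtags (such as <br/>) that you want to ignore.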
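To make that concrete, here is a small sketch that runs content->string on a hand-written node. The sample-td map is a made-up example in the {:tag :attrs :content} shape that Enlive nodes have; it is not taken from the real page.

```clojure
;; Copy of content->string from the answer, repeated so the snippet runs on its own.
(defn content->string [content]
  (cond
    (nil? content)    ""
    (string? content) content
    (map? content)    (content->string (:content content))
    (coll? content)   (apply str (map content->string content))
    :else             (str content)))

;; Hypothetical <td> node containing text, a nested <br/>, and more text.
(def sample-td
  {:tag :td
   :attrs nil
   :content ["Line one"
             {:tag :br :attrs nil :content nil}
             "Line two"]})

;; The <br/> node has nil content, so it contributes an empty string.
(def flattened (content->string (:content sample-td)))

(println flattened) ; prints Line oneLine two
```

Note that the two text fragments are joined without a separator, because the nested tag is simply dropped.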

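The derive statements set up an ad hoc hierarchy that we use later in the parse-node multimethod. This is handy because we never know which data structures we are going to encounter, and we can easily add more cases later.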

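The tag-type function is really a hack that mimics the hierarchy statements. As far as I know, you cannot build a hierarchy out of keywords that are not namespace-qualified, so I did it this way.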
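The restriction mentioned here can be checked at the REPL. The following is a minimal sketch; the keywords ::Square, ::Shape, and ::Text are made-up names used only for illustration.

```clojure
;; Namespace-qualified keywords can be placed in the global hierarchy.
(derive ::Square ::Shape)
(def square-is-shape? (isa? ::Square ::Shape))

;; Classes can be children as well, which is what the derive calls in the answer rely on.
(derive java.lang.String ::Text)
(def string-is-text? (isa? (class "abc") ::Text))

;; An unqualified keyword such as :table is rejected by an assertion inside derive.
(def unqualified-result
  (try
    (derive :square :shape)
    :accepted
    (catch AssertionError _ :rejected)))

(println square-is-shape? string-is-text? unqualified-result)
```

Since Enlive tags such as :table and :td are unqualified, a lookup function that maps them to qualified keywords is one way around the assertion.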

The parse-node multimethod dispatches on the class of the node and, if the node is a map, additionally on its tag-type.
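A stripped-down sketch of this two-part dispatch may help. It is not the answer's code: describe and node-kind are hypothetical names, and only two node categories are modelled.

```clojure
(derive clojure.lang.PersistentArrayMap ::Map)
(derive java.lang.String ::String)

;; Simplified stand-in for tag-type: just two categories.
(defn node-kind [node]
  (if (#{:td :th :h3} (:tag node)) ::Terminal ::Other))

;; The dispatch value is a vector [class kind]; kind is nil for anything that is not a map.
(defmulti describe
  (fn [x]
    (let [cls (class x)]
      [cls (when (isa? cls ::Map) (node-kind x))])))

;; Method keys are matched with isa?, element by element, so the concrete
;; class PersistentArrayMap matches ::Map and java.lang.String matches ::String.
(defmethod describe [::Map ::Terminal] [_] "terminal element")
(defmethod describe [::Map ::Other]    [_] "other element")
(defmethod describe [::String nil]     [_] "plain text")

(def results
  [(describe {:tag :td :content ["x"]})
   (describe {:tag :div :content []})
   (describe "hello")])

(println results)
```

A value whose class has not been derived into the hierarchy has no matching method and makes the call throw, so new cases have to be added with derive as they turn up.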

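Now all we have to do is define the appropriate methods: at a terminal node we convert the content to a string; otherwise we either recurse on the content or map the parse-node function over the collection we are dealing with. The method for ::String is actually never even used, but I left it in to be safe.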

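The h3+table function is almost the same as the one you had before. I simplified the selectors a bit and put them into a set; I am not sure whether putting them into a map works as intended.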
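The set form can be tried without hitting the network. This sketch assumes Enlive is on the classpath and uses a made-up HTML snippet in place of the real page; html/html-snippet parses a string into nodes.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Made-up markup standing in for the real page.
(def sample-nodes
  (html/html-snippet
    (str "<div id=\"prospekt_container\">"
         "<h3>Title</h3>"
         "<table><tr><td>1</td></tr></table>"
         "<p>skip</p>"
         "</div>")))

;; A set of selectors acts as a union: a node is selected if any member matches.
(def matched
  (html/select sample-nodes
               #{[:div#prospekt_container :h3]
                 [:div#prospekt_container :table]}))

;; Collect the tags of the selected nodes; the <p> element should not be among them.
(def matched-tags (set (map :tag matched)))

(println matched-tags)
```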

Happy scraping!

score 1 · Accepted Answer

Your question is hard to understand, but I think your last line should simply be

(h3+table url)

This returns a deeply nested data structure containing the scraped HTML, which you can then dig into with the usual Clojure sequence API. Good luck.
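As a rough sketch of what digging in with the sequence API could look like, the following walks a nested structure with tree-seq and keeps only nodes with a given tag. The scraped value is a hand-written stand-in for the result of (h3+table url), and nodes-with-tag is a hypothetical helper, not part of Enlive.

```clojure
;; Hypothetical stand-in for the scraped result: nested sequences of
;; Enlive-style node maps with :tag, :attrs and :content keys.
(def scraped
  [[{:tag :h3 :attrs nil :content ["Heading"]}
    {:tag :table :attrs nil
     :content [{:tag :tr :attrs nil
                :content [{:tag :td :attrs nil :content ["Name"]}
                          {:tag :td :attrs nil :content ["VZAS"]}]}]}]])

(defn nodes-with-tag
  "Walks nested sequences and node maps, returning every node whose :tag equals tag."
  [tag data]
  (->> (tree-seq (fn [x] (or (sequential? x) (map? x)))       ; what counts as a branch
                 (fn [x] (if (map? x) (:content x) (seq x)))   ; how to get its children
                 data)
       (filter #(and (map? %) (= tag (:tag %))))))

(def tables (nodes-with-tag :table scraped))
(def cells  (nodes-with-tag :td scraped))

(println (count tables) (map :content cells))
```

Because this only relies on maps and sequences, it works regardless of how deeply the nodes are nested in the result.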
