这是一个解析一些网站的程序。第一个站点是site1。解析该特定站点的所有逻辑都位于 (-> config :site1)
(ns program.core
(require [net.cgrand.enlive-html :as html]))
(def config
{:site1
{:site-url
["http://www.site1.com/page/1"
"http://www.site1.com/page/2"
"http://www.site1.com/page/3"
"http://www.site1.com/page/4"]
:url-encoding "iso-8859-1"
:parsing-index
{:date
{:selector
[[:td.PadMed (html/nth-of-type 1)] :table [:tr (html/nth-of-type 2)]
[:td (html/nth-of-type 3)] [:span]]
:trimming-fn
(comp first :content) ; (first) to remove extra parenthese
}
:title
{:selector
[[:td.PadMed (html/nth-of-type 1)] :table :tr [:td (html/nth-of-type 2)] [:a]]
:trimming-fn
(comp first :content first :content)
}
:url
{:selector
[[:td.PadMed (html/nth-of-type 1)] :table :tr [:td (html/nth-of-type 2)] [:a]]
:trimming-fn
#(str "http://www.site.com" (:href (:attrs %)))
}
}
}})
;=== Fetch fn ===;
(defn fetch-encoded-url
([url] (fetch-encoded-url url "utf-8"))
([url encoding] (-> url java.net.URL.
.getContent
(java.io.InputStreamReader. encoding)
html/html-resource)))
现在我想解析 (-> config :site1 :site-url) 中包含的页面。在这个例子中,我只使用了第一个 url,但是我如何设计它来真正for
为所有的 URL 做一个主人?
(defn parse-element [element]
(into [] (map (-> config :site1 :parsing-index element :trimming-fn)
(html/select
(fetch-encoded-url
(-> config :site1 :site-url first)
(-> config :site1 :url-encoding))
(-> config :site1 :parsing-index element :selector)))))
(def element-lists
(apply map vector
(map parse-element (-> config :site1 :parsing-index keys))))
(def tagged-lists
(into [] (for [element-list element-lists]
(zipmap [:date :title :url] element-list))))
;==== Fn call ====
(println tagged-lists)