I wrote a colly script to collect port authority information from a site.
func main() {
	// Temp variables
	var tcountry, tport string

	// Colly collector
	c := colly.NewCollector()
	// Ignore robots.txt
	c.IgnoreRobotsTxt = true
	// Time out after 20 seconds
	c.SetRequestTimeout(20 * time.Second)
	// Use random user agents during requests
	extensions.RandomUserAgent(c)

	// Set limits on colly's operation
	c.Limit(&colly.LimitRule{
		// Filter domains affected by this rule
		DomainGlob: "searates.com/*",
		// Set a delay between requests to these domains
		Delay: 1 * time.Second,
		// Add an additional random delay
		RandomDelay: 3 * time.Second,
	})

	// Find and visit all country links
	c.OnHTML("#clist", func(e *colly.HTMLElement) {
		// fmt.Println("Country List: ", e.ChildAttrs("a", "href"))
		e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tcountry = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Country: ", tcountry, link)
			e.Request.Visit(link)
		})
	})

	// Find and visit all port links
	c.OnHTML("#plist", func(h *colly.HTMLElement) {
		// fmt.Println("Port List: ", h.ChildAttrs("a", "href"))
		h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tport = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Port: ", tport, link)
			h.Request.Visit(link)
		})
	})

	// Read the port authority from each port info page
	c.OnHTML("div.row", func(e *colly.HTMLElement) {
		portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
		fmt.Println("Port Authority: ", portAuth)
	})

	c.Visit("https://www.searates.com/maritime/")
}
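A note on the links: the href values read in the callbacks above are site-relative paths such as /port/durres_al. As far as I understand colly, e.Request.Visit resolves such a path against the URL of the page being processed (Request.AbsoluteURL does the same on demand), whereas calling Visit directly on a collector, including a clone, expects an absolute URL and returns an error otherwise. That error is easy to miss when the return value is ignored, as it is here. Clone also copies a collector's settings but not its callbacks, so OnHTML handlers have to be registered on the clone itself. The sketch below is a stdlib-only illustration of the resolution step using net/url, not colly's own code; resolveLink is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink resolves an href found on a page against that page's URL.
// Hypothetical helper: it only illustrates the kind of resolution that
// a relative link needs before it can be fetched on its own.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference applies the RFC 3986 rules for relative references.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	abs, err := resolveLink("https://www.searates.com/maritime/albania", "/port/durres_al")
	if err != nil {
		fmt.Println("resolve error:", err)
		return
	}
	fmt.Println(abs) // https://www.searates.com/port/durres_al
}
```

With a cloned collector d, the equivalent would presumably be d.Visit(e.Request.AbsoluteURL(link)), with the returned error checked and logged.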
I have the following two questions:
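First, I am more or less forced to use e.Request.Visit, because d.Visit (where d is a clone of c) never gets executed. When I clone c as d and use d to fetch the "port info" section, that whole block is skipped. What am I doing wrong here, and why does it behave this way?

Second, in the current code, fmt.Println("Port Authority: ", portAuth) gets executed twice. I get output like the following: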
❯ go run .
Country: Albania /maritime/albania
Port: Durres /port/durres_al
Port Authority: Durres Port Authority
Port Authority:
Port: Sarande /port/sarande_al
Port Authority: Sarande Port Authority
Port Authority:
Port: Shengjin /port/shengjin_al
Port Authority: Shengjin Port Authority
Port Authority:
Again, I cannot understand why it gets printed twice. Please help :)