如何在 c.OnResponse 中获取 HTML.title - 或者是否有更好的替代方法来用 url/title/content 填充 Struct
- 最后,我需要填写以下结构并将其发布到 elasticsearch。
type WebPage struct {
Url string `json:"url"`
Title string `json:"title"`
Content string `json:"content"`
}
// Print the response
c.OnResponse(func(r *colly.Response) {
pageCount++
log.Println(r.Headers)
webpage := WebPage{
Url: r.Ctx.Get("url"), //- can be put in ctx c.OnRequest, and r.Ctx.Get("url")
Title: "my title", //string(r.title), // Where to get this?
Content: string(r.Body), //string(r.Body) - can be done in c.OnResponse
}
enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", " ")
enc.Encode(webpage) // SEND it to elasticsearch
log.Println(fmt.Sprintf("%d DONE Visiting : %s", pageCount, urlVisited))
})
我可以通过如下方法获得标题,但是 Ctx 不可用,因此我无法将“标题”值放入 Ctx。其他选择?
c.OnHTML("title", func(e *colly.HTMLElement) {
fmt.Println(e.Text)
e.Ctx.Put("title", e.Text) // NOT ACCESSIBLE!
})
日志
2020/05/07 17:42:37 7 DONE Visiting : https://www.coursera.org/learn/build-portfolio-website-html-css
{
"url": "https://www.coursera.org/learn/build-portfolio-website-html-css",
"title": "my page title",
"content": "page html body bla "
}
2020/05/07 17:42:37 8 DONE Visiting : https://www.coursera.org/learn/build-portfolio-website-html-css
{
"url": "https://www.coursera.org/browse/social-sciences",
"title": "my page title",
"content": "page html body bla "
}