go - How to connect go-colly to Elasticsearch?

Question

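What changes do I need to make to the code below so that go-colly output is indexed into Elasticsearch?

I would like to get the full text (strip HTML, strip JS, render if needed), then
make it conform to the Avro schema {pageurl: , title:, content:},
and bulk post it to a specific Elasticsearch index "mywebsiteindex-yyyymmdd", ideally read from a config file rather than hard-coded.

A code snippet would be great. Is there example go-colly code that shows "pipelined" output, crawl -> scraping -> yield to Elasticsearch (as in the Python Scrapy framework, for example)? In other words, is there pipeline framework support?

For inserting into Elasticsearch, I am considering https://github.com/olivere/elastic. Is that a reasonable choice?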

package main

import (
    "log"
    "sync/atomic"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("www.coursera.org"),
        colly.Async(true),
    )

    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
    })

    // Follow every link found on a page.
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    // Callbacks run concurrently in async mode, so the counter is updated atomically.
    var pageCount int64
    c.OnRequest(func(r *colly.Request) {
        r.Ctx.Put("url", r.URL.String())
    })

    // Set error handler
    c.OnError(func(r *colly.Response, err error) {
        log.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
    })

    // Log each response
    c.OnResponse(func(r *colly.Response) {
        n := atomic.AddInt64(&pageCount, 1)
        urlVisited := r.Ctx.Get("url")
        log.Printf("%d  DONE Visiting : %s", n, urlVisited)
    })

    baseURL := "https://www.coursera.org"
    c.Visit(baseURL)
    // With Async(true), Visit returns immediately; Wait blocks until all requests finish.
    c.Wait()
}

score 0 · Accepted Answer

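You are right that you need an additional library to store data in Elasticsearch. go-colly only handles the scraping part of the job. Depending on your scraping strategy, you will need to write the code that stores the scraped results in the index.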

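In general, you would use a library such as olivere/elastic to connect to Elasticsearch and initialize the index. You would then write a function that stores structured data in that index, and call it from the appropriate go-colly callback (e.g. c.OnHTML()) at the point where you have all the data you want to store (which point that is cannot be determined from the snippet provided). To read more about how to use olivere/elastic, see its godoc. Note that the API changed significantly in version 7, so some tutorials written for older versions may no longer work.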
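The store function described above could look roughly like the following. This is a minimal sketch that talks to Elasticsearch's `_bulk` REST endpoint using only the standard library, so that it stays self-contained; with olivere/elastic you would use that client's bulk support instead. The `Page` type, the function names `buildBulkBody` and `postBulk`, and the index name in `main` are assumptions made for illustration. Only the field names come from the schema in the question.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Page mirrors the {pageurl, title, content} schema from the question.
type Page struct {
	PageURL string `json:"pageurl"`
	Title   string `json:"title"`
	Content string `json:"content"`
}

// bulkAction is the action line that precedes each document in a _bulk request.
type bulkAction struct {
	Index struct {
		Index string `json:"_index"`
	} `json:"index"`
}

// buildBulkBody renders pages as newline-delimited JSON for the _bulk endpoint.
func buildBulkBody(index string, pages []Page) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode writes a trailing "\n" after each value
	var action bulkAction
	action.Index.Index = index
	for _, p := range pages {
		if err := enc.Encode(action); err != nil {
			return nil, err
		}
		if err := enc.Encode(p); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

// postBulk sends a prepared body to <baseURL>/_bulk.
// A 2xx status can still hide per-item failures; a fuller version would
// decode the response and check its "errors" field.
func postBulk(client *http.Client, baseURL string, body []byte) error {
	resp, err := client.Post(baseURL+"/_bulk", "application/x-ndjson", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("bulk request failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// Only builds and prints the body; call postBulk against a running cluster to send it.
	body, err := buildBulkBody("mywebsiteindex-20240101", []Page{
		{PageURL: "https://www.coursera.org", Title: "Coursera", Content: "page text"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body))
}
```

In a crawler, the go-colly callback would collect `Page` values and hand them to `buildBulkBody` and `postBulk` once enough have accumulated, rather than issuing one request per page.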

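There are many decisions to make depending on your specific use case: how to structure the data in the index, when to send data to Elasticsearch (that is, which go-colly callback to use), how you want to refresh pages that are already in the index, and so on.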
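Two of those decisions, where the settings live and how the dated index is named, can be sketched as follows. This assumes a JSON config file; the `Config` type, its field names, and the default bulk size of 100 are hypothetical and not taken from any library. The index name follows the "mywebsiteindex-yyyymmdd" pattern from the question.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// Config holds the settings the question wants kept out of the code.
// The field names are hypothetical.
type Config struct {
	ElasticURL  string `json:"elastic_url"`
	IndexPrefix string `json:"index_prefix"`
	BulkSize    int    `json:"bulk_size"`
}

// parseConfig decodes a JSON config, applying a default bulk size
// when the file does not set one.
func parseConfig(data []byte) (Config, error) {
	cfg := Config{BulkSize: 100}
	if err := json.Unmarshal(data, &cfg); err != nil {
		return Config{}, err
	}
	if cfg.ElasticURL == "" || cfg.IndexPrefix == "" {
		return Config{}, errors.New("config: elastic_url and index_prefix are required")
	}
	return cfg, nil
}

// IndexName returns "<prefix>-yyyymmdd" for the given day.
func (c Config) IndexName(day time.Time) string {
	return c.IndexPrefix + "-" + day.Format("20060102")
}

func main() {
	// In a real program these bytes would come from os.ReadFile on the config path.
	raw := []byte(`{"elastic_url": "http://localhost:9200", "index_prefix": "mywebsiteindex"}`)
	cfg, err := parseConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.IndexName(time.Now()), cfg.BulkSize)
}
```

Computing the index name once at startup keeps a single crawl in one index even if it runs past midnight; computing it per batch would split the crawl across days. Which behavior is right depends on how you plan to refresh the index.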

As for a framework, I am not aware of any end-to-end pipeline that goes from scraping to storing in Elasticsearch.
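Without such a framework, a Scrapy-style pipeline can be approximated with a channel between the crawler and a batching sink. The following is a rough sketch of that idea, not a go-colly feature: `Page`, `runSink`, and the `flush` callback are hypothetical names, and the loop in `main` stands in for the crawler.

```go
package main

import "fmt"

// Page is the item passed from the scraping stage to the storage stage.
type Page struct {
	PageURL string
	Title   string
	Content string
}

// runSink drains pages from in, groups them into batches of at most size,
// and passes each batch to flush. It keeps draining after a flush error so
// that producers never block, and returns the first error once in is closed.
func runSink(in <-chan Page, size int, flush func([]Page) error) error {
	if size < 1 {
		size = 1
	}
	var firstErr error
	batch := make([]Page, 0, size)
	send := func() {
		if err := flush(batch); err != nil && firstErr == nil {
			firstErr = err
		}
		batch = make([]Page, 0, size) // fresh slice, in case flush retained the old one
	}
	for p := range in {
		batch = append(batch, p)
		if len(batch) >= size {
			send()
		}
	}
	if len(batch) > 0 {
		send() // final partial batch
	}
	return firstErr
}

func main() {
	pages := make(chan Page)
	done := make(chan error, 1)
	go func() {
		done <- runSink(pages, 2, func(b []Page) error {
			// A real sink would issue one bulk request per batch here.
			fmt.Println("flushing", len(b), "pages")
			return nil
		})
	}()

	// Stand-in for the crawler: with go-colly, a callback would send on pages.
	for i := 1; i <= 5; i++ {
		pages <- Page{PageURL: fmt.Sprintf("https://example.com/%d", i)}
	}
	close(pages) // with go-colly, close only after c.Wait() has returned
	if err := <-done; err != nil {
		fmt.Println("sink error:", err)
	}
}
```

Because callbacks of an async collector run in several goroutines, sending on a channel also serializes access to the batch, so the sink needs no extra locking.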
