The Go Blog

golang gocolly/colly

bantana
20 July 2018

Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Features

  • Clean API
  • Fast (>1k request/sec on a single core)
  • Manages request delays and maximum concurrency per domain
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Caching
  • Automatic encoding of non-unicode responses
  • Robots.txt support
  • Distributed scraping
  • Configuration via environment variables
  • Extensions

Example

package main

import (
  "fmt"

  "github.com/gocolly/colly"
)

func main() {
  c := colly.NewCollector()

  // Find and visit all links
  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    e.Request.Visit(e.Attr("href"))
  })

  // Log every request before it is sent
  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
  })

  c.Visit("http://go-colly.org/")
}

See the examples folder for more detailed examples.

Installation

go get -u github.com/gocolly/colly/...

Usage

import "github.com/gocolly/colly"

c := colly.NewCollector()

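The callbacks are called in the following order:

1. OnRequest

Called before a request is sent.

2. OnError

Called if an error occurs during the request.

3. OnResponse

Called after a response is received.

4. OnHTML

Called right after OnResponse if the received content is HTML.

5. OnScraped

Called after the OnHTML callbacks.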
