For that stuff, as of a few months ago, you can use headless Chrome. I wrote a couple of Go packages to make that easy. It basically runs headless Chrome with a JavaScript REPL console you can use to interact with the session. https://GitHub.com/integrii/headlessChrome
I was also able to smash my whole scraper bot into a Docker container after working around a couple of bugs.
That looks cool! Would I be able to run Node scripts?
No. Chrome can't run Node scripts, and the same is true of headless Chrome.
I've looked into Headless Chrome, but I'd be interested to see a 'scraping framework' level abstraction for those sites.
Similarly, I've found that most sites worth scraping also have a mobile app, which you can run through a MITM proxy, and then simply write a scraper to call the API endpoints directly.
I've also used Selenium (via Python -- I used BeautifulSoup to parse the resulting HTML) in the past, precisely for the reasons you stated. Selenium uses "web drivers", which let you drive other browsers as well (Firefox, Opera, IE, etc.).
http://selenium-python.readthedocs.io/
All it took was a couple of lines of Python code...
Can you provide some example webpages so we can take a look?
Also, I agree with @asciimoo's point about endpoints. One could make an argument that, compared to the website design of the 1990s and 2000s, retrieving structured data from websites is actually getting easier, not more difficult. I recall a period when the trend was to design websites entirely in Macromedia/Adobe Flash.
Here is what is missing from this project (and many others like it): when providing software that performs text processing, one needs to provide not only example code but also example output. That lets a user quickly compare her current text-processing solution with the software on offer without having to install, review, and run unfamiliar software.
For example, without some sample output she cannot test whether her current solution produces the same output faster or with less code.
https://blogger.googleblog.com/
It's arguably a page of text, not a web app that needs user interaction.
I'm not trying to start an argument about JS: the consensus on HN seems to be that if you don't execute JavaScript you don't deserve to read webpages. I'm just saying that your website's clients are more diverse than 'normal' well-sighted humans. There may be machines reading the site, for all kinds of reasons.
And regarding endpoints: one could make the argument that with AJAX we now have richer APIs. I disagree. We have a well-understood protocol for getting hypertext (HTTP) in a well-understood format (HTML) that works/worked across all websites. Replacing that with a custom-built API for every website isn't apples-to-apples.
learnopengl.com
// URL to screenshot
service.prerender.cloud/screenshot/https://www.google.com/
// URL to pdf
service.prerender.cloud/pdf/https://www.google.com/
// URL to html (prerender)
service.prerender.cloud/https://www.google.com/
And it's not a "scraping framework" any more than headless Chrome is.
edit: Headless Chrome has very rich scripting integration with e.g. https://github.com/webfolderio/cdp4j
Here, unless you're parsing a large amount of already-downloaded files (a website dump, [re]parsing of a long-standing archive, etc.), you're not going to get a huge benefit from a fast parser, because the network is going to be the limiting factor.
Keep that in mind.
Also, there doesn't seem to be any length check when reading the response body. You'll want to limit how much you read.
For context, since I think a lot of people are still unaware of this, the HTML5 standard precisely specifies how HTML should be parsed: https://www.w3.org/TR/html5/syntax.html#parsing This is based on a survey of how the various browsers actually handled it, so it's not one of those theoretical things that everybody ignores; it's an algorithm extracted from the brutal pragmatism of many separate code bases over many years. In theory, all HTML parsing libraries should now be able to take the same input and produce the same DOM nodes. In practice I haven't used a wide variety of such libraries, nor have I fed them much pathological input, so I can't vouch that this is 100% true, but in theory there should no longer be any significant differences between HTML parsers in various languages as they come on board with HTML5 compliance.
I have not used it to attempt to parse all the HTML on the web.
My impression is that it's pretty good.
Secondarily, there is a lot of data on the internet stored only in HTML pages. For data with multiple sources, HTML is still usually a common format. HTML just has more punctuation and errata to filter out than JSON, XML, or CSV.
I listen to the radio a lot and have always wanted to make a website with radio schedules, like there is for TV. As this data is not available (at least not in France), I scrape [1] each radio station's website every day to get it.
[1] Using https://github.com/rchipka/node-osmosis for now
c := colly.NewCollector()
// this function creates a goroutine and returns a channel
ch := c.HTML("a")
e := <-ch
link := e.Attr("href")
// ...
I'm a bit rusty (ah!) with go, so bear with me if the above contains errors.