For that stuff, as of a few months ago, you can use headless Chrome. I wrote a couple of Go packages to make that easy. It basically runs headless Chrome with a JavaScript REPL console you can use to interact with the session. https://GitHub.com/integrii/headlessChrome
I was also able to smash my whole scraper bot into a Docker container after working around a couple of bugs.
That looks cool! Would I be able to run Node scripts?
No. Chrome can't run Node scripts, and the same is true of headless Chrome.
I've looked into Headless Chrome, but I'd be interested to see a 'scraping framework' level abstraction for those sites.
Similarly, I've found that most sites worth scraping also have a mobile app, which you can run through a MITM proxy, and then simply write a scraper to call the API endpoints directly.
I've also used Selenium (via Python -- I used BeautifulSoup to parse the resulting HTML) in the past, precisely for the reasons you stated. Selenium uses "web drivers", which let you drive other browsers as well (Firefox, Opera, IE, etc.).
http://selenium-python.readthedocs.io/
All it took was a couple of lines of Python code...
Can you provide some example webpages so we can take a look?
Also, I agree with @asciimoo's point about endpoints. One could make an argument that, compared to the website design of the 1990s and 2000s, retrieving structured data from websites is actually getting easier, not more difficult. I recall a period when the trend was to design websites entirely in Macromedia/Adobe Flash.
Here is what is missing from this project (and many others like it): when providing software that performs text processing, one needs to provide not only example code but also example output. That lets a user quickly compare her current text-processing solution with the software on offer without having to install, review, and run unfamiliar software.
For example, without some sample output she cannot test whether her current solution produces the same output faster or with less code.
https://blogger.googleblog.com/
It's arguably a page of text, not a web app that needs user interaction.
I'm not trying to start an argument about JS: the consensus on HN seems to be that if you don't execute JavaScript you don't deserve to read webpages. I'm just saying that your website's clients are more diverse than 'normal' well-sighted humans. There may be machines reading the site, for all kinds of reasons.
And regarding endpoints: one could make the argument that with AJAX we now have richer APIs. I disagree. We have a well-understood protocol for getting hypertext (HTTP) in a well-understood format (HTML) that works/worked across all websites. Replacing that with a custom-built API for every website isn't apples-to-apples.
learnopengl.com
// URL to screenshot
service.prerender.cloud/screenshot/https://www.google.com/
// URL to pdf
service.prerender.cloud/pdf/https://www.google.com/
// URL to html (prerender)
service.prerender.cloud/https://www.google.com/
And it's not a "scraping framework" any more than headless Chrome is.
edit: Headless Chrome has very rich scripting integration with e.g. https://github.com/webfolderio/cdp4j
Here, unless you're parsing a large amount of already-downloaded files (a website dump, [re]parsing of a long-standing archive, etc.), you're not going to get a huge benefit from a fast parser, because the network is going to be the limiting factor.
Keep that in mind.
Also, there doesn't seem to be any length check when reading the response body. You'll want to limit how much you read.
For context, since I think a lot of people are still unaware of this, the HTML5 standard precisely specifies how HTML should be parsed: https://www.w3.org/TR/html5/syntax.html#parsing This is based on a survey of how the various browsers actually handled it, so it's not one of those theoretical things that everybody ignores; it's an algorithm extracted from the brutal pragmatism of many separate code bases over many years. In theory, all HTML parsing libraries should now be able to take the same input and produce the same DOM nodes. In practice I haven't used a wide variety of such libraries, nor have I fed them much pathological input, so I can't vouch that this is 100% true, but in theory there should no longer be any significant differences between HTML parsers in various languages as they come on board with HTML5 compliance.
I have not used it to attempt to parse all the HTML on the web.
My impression is that it's pretty good.
Secondarily, there is a lot of data on the internet stored only in HTML pages. For data with multiple sources, HTML is still usually a common format. HTML just has more punctuation and errata to filter out than JSON, XML, or CSV.
I listen to the radio a lot and have always wanted to make a website with radio schedules, like there is for TV. As this data is not available (at least not in France), I scrape [1] each radio station's website every day to get it.
[1] Using https://github.com/rchipka/node-osmosis for now
c := colly.NewCollector()
// this function creates a goroutine and returns a channel
ch := c.HTML("a")
e := <-ch
link := e.Attr("href")
// ...
I'm a bit rusty (ah!) with go, so bear with me if the above contains errors.