All the dynamic JS and whatnot just plays havoc with these projects. In my experience you have to run through webdriver or something like phantomjs and parse the JS...
In the meantime, I've written Tampermonkey scripts that will scrape and embed multiple pages, all hack-like, but at least I get a good CSV of the data I need.
To me, the value in this tool is the user interface for creating the scrape logic. If this ran as an embeddable JS app that you could place inside any page and use local storage, you could scrape these dynamic sites by viewing the page first, and still get all of the cool gadgetry provided by this tool.
In essence, the value of this tool could be built as a bookmarklet. THAT SIR - I would use every, single, day.
Our (Diffbot) approach is to learn what news and product (and other) pages look like, and obviate the rules-management -- we also fully execute JS when rendering.
The web keeps evolving though, dang it. Tricky thing!
It works via Firefox, and it's load balanced and multithreaded. It takes care of all the thorny issues regarding async content... etc.
Domains running websites that are more like JavaScript frontend modules shouldn't be scraped at all; that screams for a public API.
But many content owners would never provide their data in this format, even if doing so would be trivial.
CasperJS is an open source navigation scripting & testing utility written in Javascript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). It eases the process of defining a full navigation scenario and provides useful high-level functions, methods & syntactic sugar for doing common tasks such as:
defining & ordering browsing navigation steps
filling & submitting forms
clicking & following links
capturing screenshots of a page (or part of it)
testing remote DOM
logging events
downloading resources, including binary ones
writing functional test suites, saving results as JUnit XML
scraping Web contents
Source is here: https://github.com/scrapinghub/portia
For my project I write all the scrapers manually (that is, in python, using requests and the amazing lxml), because there's always one source that will make you build all the architecture around it. Something I find is needed for public APIs is a domain-specific language that can work around building intermediate servers by explaining to the engine how to understand a data source:
An API producer wants to keep serving the data themselves (traffic, context and statistics), but someone wants a standard way of accessing more than one source (let's say, 140 different sources). Instead of making an intermediate service that provides this standardized version, one could provide templates that a client module would use to understand the data under the same abstraction.
The data consumer would access the source server directly, and the producer would not need to ban over 9000 different scrapers. Of course, this would only make sense for public APIs; (real) scraping should never be done on the client: it is slow, crash-prone and can breach security on the device.
There are a lot of accessible (though undocumented) sources, but there are also clear examples of how one should never provide a service! Some examples: [3, 4]
What I was referring to, though, was a way to avoid having to build an intermediate server that scrapes services which are perfectly usable (JSON, XML) just because we all prefer to build clients that understand one standard type of feed.
Maybe it's not about designing a language, but just about a new way of doing things. Let's say I provide the client with clear instructions on how to use a service: its format, and where the fields that the client understands are located, in an XPath-like syntax.
That should be enough to avoid periodically scraping good-player servers, while still being able to build client apps without having to implement all the differences between feeds. Besides, it would avoid getting banned for accessing a service too many times, and would give data providers insight into who is really using their data.
Let's say we want to unify the data in Feed A and Feed B. The model is about foos and bars:
Feed A:
{
  "status": "ok",
  "foobars": [
    {
      "name": "Foo",
      "bar": "Baz"
    }, ...
  ]
}
Feed B:
[{"n": "foo","info": {"b": "baz"}},...]
We could provide:
{
  "feeds": [
    {
      "name": "Feed A",
      "url": "http://feed.a",
      "format": "json",
      "fields": {
        "name": "/foobars//name",
        "bar": "/foobars//bar"
      }
    },
    {
      "name": "Feed B",
      "url": "http://feed.b",
      "format": "json",
      "fields": {
        "name": "//n",
        "bar": "//info/b"
      }
    }
  ]
}
Instead of providing a service ourselves that accesses Feed A and Feed B
every minute just because we want to ease things on the client.
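To make the idea concrete, here is a minimal Python sketch of the client-side engine such templates imply. The path grammar is my own guess at the semantics of the example above (a plain segment selects an object key; an empty segment, produced by "//", maps the rest of the path over a list), and the two payloads are inlined stand-ins for what fetching http://feed.a and http://feed.b would return:

```python
import json

# Inlined stand-ins for live responses from http://feed.a and http://feed.b
FEED_A = json.loads('{"status": "ok", "foobars": [{"name": "Foo", "bar": "Baz"}]}')
FEED_B = json.loads('[{"n": "foo", "info": {"b": "baz"}}]')

# Per-feed templates, using the field paths from the example above
TEMPLATES = {
    "Feed A": {"name": "/foobars//name", "bar": "/foobars//bar"},
    "Feed B": {"name": "//n", "bar": "//info/b"},
}

def resolve(node, segments):
    """Walk parsed JSON along path segments, returning a flat list of values.
    A plain segment selects an object key; an empty segment (produced by
    '//') maps the rest of the path over every element of a list."""
    if not segments:
        return [node]
    head, rest = segments[0], segments[1:]
    if head == "":
        values = []
        for item in node:
            values.extend(resolve(item, rest))
        return values
    return resolve(node[head], rest)

def normalize(doc, fields):
    """Turn one feed into a list of records keyed by the shared field names."""
    columns = {name: resolve(doc, path.split("/")[1:])
               for name, path in fields.items()}
    return [dict(zip(columns, row)) for row in zip(*columns.values())]

if __name__ == "__main__":
    for feed_name, doc in (("Feed A", FEED_A), ("Feed B", FEED_B)):
        print(feed_name, normalize(doc, TEMPLATES[feed_name]))
```

With something like this, the consumer hits feed.a and feed.b directly, and the per-source differences live in data rather than code, which is roughly the point above.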
Not sure if that's what you asked, though.
[1]: http://citybik.es
[2]: http://github.com/eskerda/pybikes
[3]: https://github.com/eskerda/PyBikes/blob/experimental/pybikes...
[4]: https://github.com/eskerda/PyBikes/blob/experimental/pybikes...
There is one small pain: the output is printed to the console, and piping the output to a file doesn't seem to work. But it did fetch all the pages and printed nice JSON.
UPDATE: there is a logfile setting to dump output to file
I imagine this will be useful when scraping sites like IMDB in case they don't have an API or their API is not useful enough.
Just want to point out a (commercial but reasonable) program that works really well for all our odd edge-case customer site issues.
Portia is more interesting because it is an open source scraping GUI - the GUIs tend to be very proprietary.
I'm still integrating the browser engine which I was able to procure for open source purposes.
The video is quite old.