All the dynamic JS and whatnot just plays havoc with these projects. In my experience you have to run through webdriver or something like phantomjs and parse the JS...
In the meantime, I've written Tampermonkey scripts that will scrape and embed multiple pages, all hack-like, but at least I get a good CSV of the data I need.
To me, the value in this tool is the user interface for creating the scrape logic. If this ran as an embeddable JS app that you could place inside any page and use local storage, you could scrape these dynamic sites by viewing the page first, and still get all of the cool gadgetry provided by this tool.
In essence, the value of this tool could be built as a bookmarklet. THAT SIR - I would use every, single, day.
Our (Diffbot) approach is to learn what news and product (and other) pages look like, and obviate the rules-management -- we also fully execute JS when rendering.
The web keeps evolving though, dang it. Tricky thing!
It works via Firefox, and it's load balanced and multithreaded. It takes care of all the thorny issues regarding async content... etc.
Domains running websites that are more like JavaScript frontend modules shouldn't be scraped at all; that screams for a public API.
But many content owners would never provide their data in this format, even if doing so would be trivial.
CasperJS is an open source navigation scripting & testing utility written in Javascript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). It eases the process of defining a full navigation scenario and provides useful high-level functions, methods & syntactic sugar for doing common tasks such as:
defining & ordering browsing navigation steps
filling & submitting forms
clicking & following links
capturing screenshots of a page (or part of it)
testing remote DOM
logging events
downloading resources, including binary ones
writing functional test suites, saving results as JUnit XML
scraping Web contents
Source is here: https://github.com/scrapinghub/portia
For my project I write all the scrapers manually (that is, in python, using requests and the amazing lxml), because there's always one source that will make you build all the architecture around it. Something I find is needed for public APIs is a domain-specific language that can work around building intermediate servers by explaining to the engine how to understand a data source:
An API producer wants to keep serving the data themselves (traffic, context and statistics), but someone wants a standard way of accessing more than one source (let's say, 140 different sources). Instead of making an intermediate service that provides this standardized version, one could provide templates that a client module would use to understand the data under the same abstraction.
The data consumer would access the source server directly, and the producer would not need to ban over 9000 different scrapers. Of course, this would only make sense for public APIs; (real) scraping should never be done on the client: it is slow, crash-prone and can breach security on the device.
There are a lot of accessible (though undocumented) sources, but there are also clear examples of how one should never provide a service! Some examples: [3, 4]
What I was referring to, though, was a way to avoid having to build an intermediate server that scrapes services which are perfectly usable (JSON, XML) just because we all prefer to build clients that understand one standard type of feed.
Maybe it's not about designing a language, but just about a new way of doing things. Let's say I provide the client with clear instructions on how to use a service: its format, and where the fields that the client understands are located, in an XPath-like syntax.
That should be enough to avoid periodically scraping good-player servers, while still being able to build client apps without having to implement all the differences between feeds. Besides, it would avoid getting banned for accessing a service too many times, and would give data providers insight into who is really using their data.
Let's say we want to unify the data in Feed A and Feed B. The model is about foos and bars:
Feed A:
{
  "status": "ok",
  "foobars": [
    {
      "name": "Foo",
      "bar": "Baz"
    }, ...
  ]
}
Feed B:
[{"n": "foo","info": {"b": "baz"}},...]
We could provide:
{
  "feeds": [
    {
      "name": "Feed A",
      "url": "http://feed.a",
      "format": "json",
      "fields": {
        "name": "/foobars//name",
        "bar": "/foobars//bar"
      }
    },
    {
      "name": "Feed B",
      "url": "http://feed.b",
      "format": "json",
      "fields": {
        "name": "//n",
        "bar": "//info/b"
      }
    }
  ]
}
Instead of providing a service ourselves that accesses Feed A and Feed B
every minute just because we want to ease things on the client.
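To make the idea concrete, here is a minimal Python sketch of the client-side engine such templates imply. The path grammar is my own guess at the semantics of the example above (a plain segment selects an object key; an empty segment, produced by "//", maps the rest of the path over a list), and the two payloads are inlined stand-ins for what fetching http://feed.a and http://feed.b would return:

```python
import json

# Inlined stand-ins for live responses from http://feed.a and http://feed.b
FEED_A = json.loads('{"status": "ok", "foobars": [{"name": "Foo", "bar": "Baz"}]}')
FEED_B = json.loads('[{"n": "foo", "info": {"b": "baz"}}]')

# Per-feed templates, using the field paths from the example above
TEMPLATES = {
    "Feed A": {"name": "/foobars//name", "bar": "/foobars//bar"},
    "Feed B": {"name": "//n", "bar": "//info/b"},
}

def resolve(node, segments):
    """Walk parsed JSON along path segments, returning a flat list of values.
    A plain segment selects an object key; an empty segment (produced by
    '//') maps the rest of the path over every element of a list."""
    if not segments:
        return [node]
    head, rest = segments[0], segments[1:]
    if head == "":
        values = []
        for item in node:
            values.extend(resolve(item, rest))
        return values
    return resolve(node[head], rest)

def normalize(doc, fields):
    """Turn one feed into a list of records keyed by the shared field names."""
    columns = {name: resolve(doc, path.split("/")[1:])
               for name, path in fields.items()}
    return [dict(zip(columns, row)) for row in zip(*columns.values())]

if __name__ == "__main__":
    for feed_name, doc in (("Feed A", FEED_A), ("Feed B", FEED_B)):
        print(feed_name, normalize(doc, TEMPLATES[feed_name]))
```

With something like this, the consumer hits feed.a and feed.b directly, and the per-source differences live in data rather than code, which is roughly the point above.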
Not sure if that's what you asked, though.
[1]: http://citybik.es
[2]: http://github.com/eskerda/pybikes
[3]: https://github.com/eskerda/PyBikes/blob/experimental/pybikes...
[4]: https://github.com/eskerda/PyBikes/blob/experimental/pybikes...
There is one small pain: the output is printed to the console, and piping the output to a file doesn't seem to work. But it did fetch all the pages and printed nice JSON.
UPDATE: there is a logfile setting to dump output to file
I imagine this will be useful when scraping sites like IMDB in case they don't have an API or their API is not useful enough.
Just want to point out a (commercial but reasonable) program that works really well for all our odd edge-case customer site issues.
Portia is more interesting because it is an open source scraping GUI - the GUIs tend to be very proprietary.
I'm still integrating the browser engine which I was able to procure for open source purposes.
The video is quite old.