Web Scraping a Javascript Heavy Website: Keeping Things Simple (opens in new tab)

(tubes.io)

49 pointskuhn12y ago15 comments

15 comments

I used to use the network tab for stuff like this, but now I almost exclusively use mitmproxy[0]. Once things get sufficiently complicated, the constant scrolling and clicking around in the network tab feels tedious. Plus it's difficult to capture activity if a site has popups or multiple windows. mitmproxy solves these problems and also has a ton more features like replaying requests and saving to files. My ideal tool involves something that translates mitmdump into code that performs the equivalent raw HTTP requests (e.g. using python's requests). Sort of like Selenium's IDE but for super lightweight scraping.

[0] http://mitmproxy.org/

gwu7812y ago

mitmproxy sounds like a lot of overhead if all you want is your own raw HTTP traffic. You can get this without Python, and without mitmproxy. Also, I thought mitmproxy was intended for HTTPS. Even in that case, I'm not sure installing Python and mitmproxy is necessary if all you want is to view your own traffic. You can just run your own CA and a proxy that can terminate SSL (e.g., haproxy).

Below is a simple, _lightweight_ ngrep solution. RE means a regular expression. This only saves packets with the RE you specify and does not save full packets, only the HTTP headers. 1024 is an arbitrary size to get all HTTP headers; adjust to taste. tcpdump is there only because ngrep does not work well with PPPoE. If you don't use PPPoE you don't need to include tcpdump.

     case $# in
     1)
     # capture HTTP headers to pcap file
     tcpdump -Ulvvvns1024 -w- tcp 2>/dev/null \
     |ngrep -O$1 -qtWbyline 'GET|POST|HEAD' >/dev/null 
     ;;
     2)
     # search HTTP headers in pcap file
     ngrep -Wbyline -qtI$1 $2
     ;;
     *)
     echo usage: $0 pcap-file \[RE\]
     esac

To dump your results, try

     $0 pcap-file . |less

And here's a little script to make URL's from your pcap file. unvis just decodes URL's from the specs in RFC's 1808 and 1866. It assumes http:// URL's (no ftp://). The awk script ensures all URL's (not just consecutive ones) are unique.

    case $# in
    [12])
    above-script $1 ${2-.} \
    |sed -n '
    /GET/p;
    /Host: /p;
    '  \
    |tr '\012' '\040' \
    |sed 's/GET/\
    &/g' \
    |awk '
    !($0 in a){a[$0];print "http://"$5$2}
    ' \
    |sed '
    s/%25/%/g;
    s/\.\//\//;
    ' \
    |unvis -hH \
    |sed '/^http:[/][/]./!d; 
    s/ /%20/g' \
    ;;
    *)
    echo usage: $0 pcap-file \[RE\] >&2
    esac

It's trivial to dump HTTP. You can feed this to netcat (using sed to modify the HTML to your liking), then open the result in your browser. Whatever you are aiming to do (I'm still not exactly sure - can you give an example?), I reckon it can be automated without Python and heaps of libraries.

PMan7412y ago

Wow I thought mitmproxy looked rough until I saw tcpdump/ngrep/awk? They both work of course but neither look especially easy to use.

We've been using http://www.charlesproxy.com/ for years, great tool (cheap albeit not free)

1 more reply

hazz12y ago

In many cases websites that load data asynchronously through an API are much nicer to scrape, as the data is already structured for you. You don't have to go through the pain of extracting data from a mess of tables, divs and spans.

bdcravens12y ago

I've done a lot of scraping. Some sites use heavy Javascript frameworks that generate session IDs and request IDs that the XHR requests use to "authenticate" the request. In these situations, the amount of work to reverse engineer that workflow is pretty large. In these situations, I lean on headless Selenium. I know there are some lighter solutions, but Selenium offers some distinct advantages:

1) lot of library support, in multiple languages

2) without having to fake UAs, etc, the requests look more like a regular user (all media assets downloaded, normal browser UA, etc)

3) simple clustering: setting up a Selenium grid is very easy, and switching from local instance of Selenium to using the grid requires very little code change (1 line in most cases)

cynwoody12y ago

HtmlUnit† is also effective in such cases. HtmlUnit is intended to automate testing of websites. However, the very facilities that enable it to be useful for that purpose also make it useful for scraping.

A few years ago, I wanted to analyze retail store customer feedback data collected by a third party company. The stores were franchises, and the third party was anointed by the franchising company. The data was presented to the user (franchisee store management) via a fancy web site with its own opinion about how the data should be analyzed. My opinion differed. I wanted the data in low-level, RDBMS-friendly form, so that I could recast it every which way (and come back and do it again a new way I thought of). However, such was not forthcoming (big company, little franchisee).

The solution was to make a robot that put the third party company's portal through its paces at the finest granularity, scraping the numbers into a DB as they tediously appeared. The robot was in JRuby††, allowing access to HtmlUnit's functionality without the tedium of Java coding. It was slow, but I didn't care — run it overnight once a month, then run reports off the DB generated.

The coding approach was simple: Pretend you are a user. Access each page, starting with the login page, and do what the user would do. Scrape the interesting numbers as they appear. Append appropriate rows to the DB.

†http://htmlunit.sourceforge.net/

††http://jruby.org/

hayksaakian12y ago

Before any naysayers complain about the idea of using undocumented endpoints, keep in mind that this is all in the context of web scraping.

timscott12y ago

I've recently been learning all this the hard way.

1. Documented API. Failing that...

2. HTTP client fetching structured data (XHR calls). Failing that...

3. HTTP client fetching and scraping HTML documents. Failing that...

4. Headless browser

I recently found myself pushed to #4 to handle sites with over-complex JS or anti-automation techniques.

wslh12y ago

If you liked this article, you might also be interested in "Scraping Web Sites which Dynamically Load Data" http://blog.databigbang.com/scraping-web-sites-which-dynamic...

corford12y ago

For JS heavy sites, I've found proxying the traffic through Fiddler is the easiest way to discover the API end points I need to hit.

cpayne12y ago

I'm getting a 404 - Page not found

cynwoody12y ago

http://tubes.io/blog/2013/08/29/web-scraping-javascript-heav...

rshomali12y ago

Still the link doesn't work

1 more reply

j / k navigate · click thread line to collapse

15 comments

rgarcia12y ago

[0] http://mitmproxy.org/

gwu7812y ago

     case $# in
     1)
     # capture HTTP headers to pcap file
     tcpdump -Ulvvvns1024 -w- tcp 2>/dev/null \
     |ngrep -O$1 -qtWbyline 'GET|POST|HEAD' >/dev/null 
     ;;
     2)
     # search HTTP headers in pcap file
     ngrep -Wbyline -qtI$1 $2
     ;;
     *)
     echo usage: $0 pcap-file \[RE\]
     esac

To dump your results, try

     $0 pcap-file . |less

    case $# in
    [12])
    above-script $1 ${2-.} \
    |sed -n '
    /GET/p;
    /Host: /p;
    '  \
    |tr '\012' '\040' \
    |sed 's/GET/\
    &/g' \
    |awk '
    !($0 in a){a[$0];print "http://"$5$2}
    ' \
    |sed '
    s/%25/%/g;
    s/\.\//\//;
    ' \
    |unvis -hH \
    |sed '/^http:[/][/]./!d; 
    s/ /%20/g' \
    ;;
    *)
    echo usage: $0 pcap-file \[RE\] >&2
    esac

PMan7412y ago

Wow I thought mitmproxy looked rough until I saw tcpdump/ngrep/awk? They both work of course but neither look especially easy to use.

We've been using http://www.charlesproxy.com/ for years, great tool (cheap albeit not free)

1 more reply

hazz12y ago

bdcravens12y ago

1) lot of library support, in multiple languages

2) without having to fake UAs, etc, the requests look more like a regular user (all media assets downloaded, normal browser UA, etc)

3) simple clustering: setting up a Selenium grid is very easy, and switching from local instance of Selenium to using the grid requires very little code change (1 line in most cases)

cynwoody12y ago

†http://htmlunit.sourceforge.net/

††http://jruby.org/

hayksaakian12y ago

Before any naysayers complain about the idea of using undocumented endpoints, keep in mind that this is all in the context of web scraping.

timscott12y ago

I've recently been learning all this the hard way.

1. Documented API. Failing that...

2. HTTP client fetching structured data (XHR calls). Failing that...

3. HTTP client fetching and scraping HTML documents. Failing that...

4. Headless browser

I recently found myself pushed to #4 to handle sites with over-complex JS or anti-automation techniques.

wslh12y ago

If you liked this article, you might also be interested in "Scraping Web Sites which Dynamically Load Data" http://blog.databigbang.com/scraping-web-sites-which-dynamic...

corford12y ago

For JS heavy sites, I've found proxying the traffic through Fiddler is the easiest way to discover the API end points I need to hit.

cpayne12y ago

I'm getting a 404 - Page not found

cynwoody12y ago

http://tubes.io/blog/2013/08/29/web-scraping-javascript-heav...

rshomali12y ago

Still the link doesn't work

1 more reply

j / k navigate · click thread line to collapse