Ok, here it goes.
The scrapers are written in Python 2, because that's what the guys who started the project were familiar with.
Most of the scraping is done by hand with XPath queries on the pages we fetch, so no BeautifulSoup or anything like that. Again, I think that's mostly because those who started the project weren't that familiar with the scraping ecosystem. It's not even that bad when I need to modify something (because a page changes, etc.), as the code is very well written.
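To give an idea, here's a minimal sketch of what one of these scrapers looks like, assuming requests + lxml; the URL and every XPath selector below are made up:

```python
# Hand-rolled scraping: one XPath query per field, no BeautifulSoup.
# The URL and the selectors are placeholders, not a real target site.
import json
import requests
from lxml import html

def scrape_offers(url):
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    tree = html.fromstring(page.content)
    offers = []
    for node in tree.xpath('//div[@class="offer"]'):
        offers.append({
            'id': node.get('data-id'),
            'title': node.xpath('string(.//h2)').strip(),
            'price': node.xpath('string(.//span[@class="price"])').strip(),
        })
    return offers

# Step 1 of the pipeline described below: dump whatever we found to a JSON file.
with open('offers.json', 'w') as f:
    json.dump(scrape_offers('https://example.com/cars'), f)
```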
The problems started when the CEO and CTO proposed (mandated?) that we use something built by a guy who is supposed to be a web scraping expert in the same domain we work in (I don't doubt it, but still...). The software he gave us is written in Ruby (which no one here has ever seen a line of) on Rails, and works with recipes instead of the imperative code + XPath we used at the beginning. It works flawlessly until it doesn't, mainly because of a big logical error: if an offer disappears from the original website we should mark it as deleted in our DB, but the scraper keeps telling us it's still there. I don't have the time to learn Ruby, Rails and the whole system just to fix that, and the original dev isn't available anymore. So we're phasing it out and going back to our nice land of Python :)
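For the record, the logic we want back in the Python writer is conceptually simple: anything in our DB for a given source that did *not* show up in the latest scrape should get flagged as deleted. A hedged sketch, assuming psycopg2 and an invented offers(source, source_id, deleted) schema:

```python
# Sketch of the delete-detection the recipe system got wrong: flag every
# offer we have stored for this source that the latest scrape didn't return.
# Table and column names (offers, source, source_id, deleted) are invented.
import psycopg2

def mark_missing_as_deleted(conn, source, scraped_ids):
    if not scraped_ids:
        return  # an empty scrape is more likely a failure than a mass delisting
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE offers
               SET deleted = TRUE
             WHERE source = %s
               AND NOT deleted
               AND source_id != ALL(%s)
            """,
            (source, list(scraped_ids)),  # psycopg2 adapts the list to a Postgres array
        )
    conn.commit()
```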
Anyway, the process goes like this:
1. Scraper fetches the page, scrapes the data, and generates a JSON file with the info for all the offers it finds
2. Those JSON files are uploaded to S3
3. An S3 trigger calls the "writer" on an EC2 instance, which downloads the JSON file, unpacks the content and writes the data to a Postgres database (rough sketch after this list)
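Here's a minimal sketch of step 3, assuming boto3 and psycopg2; the bucket/key come from the S3 trigger, and the table and column names (plus the unique constraint on source_id that the upsert relies on) are placeholders:

```python
# Rough sketch of the "writer": pull the JSON the scraper uploaded and
# upsert each offer into Postgres. Assumes a UNIQUE constraint on source_id;
# all names here are illustrative, not our real schema.
import json
import boto3
import psycopg2

s3 = boto3.client('s3')

def write_offers(bucket, key, dsn):
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    offers = json.loads(body.decode('utf-8'))
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            for offer in offers:
                cur.execute(
                    """
                    INSERT INTO offers (source_id, title, price)
                    VALUES (%s, %s, %s)
                    ON CONFLICT (source_id)
                    DO UPDATE SET title = EXCLUDED.title,
                                  price = EXCLUDED.price
                    """,
                    (offer['id'], offer['title'], offer['price']),
                )
        conn.commit()
    finally:
        conn.close()
```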
Current problem: we scrape arbitrary strings representing car options and have to categorize them. We have something like 15,000 strings that each need to be put into a category. Manually.
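To give an idea of what the categorization amounts to, here's the shape of it: normalize the raw string, look it up in a hand-maintained mapping, and anything unknown falls out for a human to bucket. Every string and category below is invented:

```python
# Hand-maintained mapping from normalized option strings to categories.
# All entries are made up for illustration.
CATEGORY_BY_OPTION = {
    'abs': 'safety',
    'esp': 'safety',
    'air conditioning': 'comfort',
    'alloy wheels': 'exterior',
}

def categorize(raw):
    key = ' '.join(raw.lower().split())  # lowercase + collapse whitespace
    return CATEGORY_BY_OPTION.get(key)   # None -> still needs a human

raw_options = ['ABS', 'Air  Conditioning', 'Panoramic roof 18"']
uncategorized = [s for s in raw_options if categorize(s) is None]
print(uncategorized)  # ['Panoramic roof 18"'] -- the manual pile
```

Normalization shrinks the pile a bit, but anything genuinely new still ends up in front of a person.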