Ok, here it goes.
The scrapers are written in Python 2, because that's what the guys who started the project were familiar with.
Most of the scraping is done by hand with XPath queries on the pages we fetch, so no BeautifulSoup or anything like that. Again, I think that's mostly because those who started the project weren't that familiar with the scraping ecosystem. It's not even that bad when I need to modify something (because a page changes, etc.), as the code is very well written.
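To give an idea, here's a minimal sketch of what one of these scrapers looks like, assuming requests + lxml; the URL and every XPath selector below are made up:

```python
# Hand-rolled scraping: one XPath query per field, no BeautifulSoup.
# The URL and the selectors are placeholders, not a real target site.
import json
import requests
from lxml import html

def scrape_offers(url):
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    tree = html.fromstring(page.content)
    offers = []
    for node in tree.xpath('//div[@class="offer"]'):
        offers.append({
            'id': node.get('data-id'),
            'title': node.xpath('string(.//h2)').strip(),
            'price': node.xpath('string(.//span[@class="price"])').strip(),
        })
    return offers

# Step 1 of the pipeline described below: dump whatever we found to a JSON file.
with open('offers.json', 'w') as f:
    json.dump(scrape_offers('https://example.com/cars'), f)
```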
The problems started when the CEO and CTO proposed (mandated?) that we use something built by a guy who is supposed to be a web scraping expert in the same domain we work in (I don't doubt it, but still...). The software he gave us is written in Ruby (which no one here has ever seen a line of) on Rails, and works with recipes instead of the imperative code + XPath we used at the beginning. It works flawlessly until it doesn't, mainly because of a big logical error: if an offer disappears from the original website we should mark it as deleted in our DB, but the scraper keeps telling us it's still there. I don't have the time to learn Ruby, Rails and the whole system just to fix that, and the original dev isn't available anymore. So we're phasing it out and going back to our nice land of Python :)
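For the record, the logic we want back in the Python writer is conceptually simple: anything in our DB for a given source that did *not* show up in the latest scrape should get flagged as deleted. A hedged sketch, assuming psycopg2 and an invented offers(source, source_id, deleted) schema:

```python
# Sketch of the delete-detection the recipe system got wrong: flag every
# offer we have stored for this source that the latest scrape didn't return.
# Table and column names (offers, source, source_id, deleted) are invented.
import psycopg2

def mark_missing_as_deleted(conn, source, scraped_ids):
    if not scraped_ids:
        return  # an empty scrape is more likely a failure than a mass delisting
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE offers
               SET deleted = TRUE
             WHERE source = %s
               AND NOT deleted
               AND source_id != ALL(%s)
            """,
            (source, list(scraped_ids)),  # psycopg2 adapts the list to a Postgres array
        )
    conn.commit()
```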
Anyway, the process goes like this:
1. Scraper fetches the page, scrapes the data, and generates a JSON file with the info for all the offers it finds
2. Those JSON files are uploaded to S3
3. An S3 trigger calls the "writer" on an EC2 instance, which downloads the JSON file, unpacks the content and writes the data to a Postgres database (rough sketch after this list)
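Here's a minimal sketch of step 3, assuming boto3 and psycopg2; the bucket/key come from the S3 trigger, and the table and column names (plus the unique constraint on source_id that the upsert relies on) are placeholders:

```python
# Rough sketch of the "writer": pull the JSON the scraper uploaded and
# upsert each offer into Postgres. Assumes a UNIQUE constraint on source_id;
# all names here are illustrative, not our real schema.
import json
import boto3
import psycopg2

s3 = boto3.client('s3')

def write_offers(bucket, key, dsn):
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    offers = json.loads(body.decode('utf-8'))
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            for offer in offers:
                cur.execute(
                    """
                    INSERT INTO offers (source_id, title, price)
                    VALUES (%s, %s, %s)
                    ON CONFLICT (source_id)
                    DO UPDATE SET title = EXCLUDED.title,
                                  price = EXCLUDED.price
                    """,
                    (offer['id'], offer['title'], offer['price']),
                )
        conn.commit()
    finally:
        conn.close()
```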
Current problem: we scrape arbitrary strings representing car options and have to categorize them. We have something like 15,000 strings that each need to be put into a category. Manually.
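To give an idea of what the categorization amounts to, here's the shape of it: normalize the raw string, look it up in a hand-maintained mapping, and anything unknown falls out for a human to bucket. Every string and category below is invented:

```python
# Hand-maintained mapping from normalized option strings to categories.
# All entries are made up for illustration.
CATEGORY_BY_OPTION = {
    'abs': 'safety',
    'esp': 'safety',
    'air conditioning': 'comfort',
    'alloy wheels': 'exterior',
}

def categorize(raw):
    key = ' '.join(raw.lower().split())  # lowercase + collapse whitespace
    return CATEGORY_BY_OPTION.get(key)   # None -> still needs a human

raw_options = ['ABS', 'Air  Conditioning', 'Panoramic roof 18"']
uncategorized = [s for s in raw_options if categorize(s) is None]
print(uncategorized)  # ['Panoramic roof 18"'] -- the manual pile
```

Normalization shrinks the pile a bit, but anything genuinely new still ends up in front of a person.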