User-agent: *
Disallow: /buy/
Disallow: /checkout/
So, do you have permission to violate robots.txt, as I'm sure there is some automated interaction with checkout/purchasing pages? Or am I missing something about how TwoTap works? Scraping is one thing, but accessing pages when the management of the website prohibits it seems like a big no-no.

I'd mention more on the BD side but can't at this point for competitive reasons. The fact that we currently support sending orders through to 450 retailers does not mean we have deals in place with all of them, but that the infrastructure is built to allow this to happen -- if affiliates or publishers get approval from retailers or the affiliate networks that govern this. Perhaps we should make this clearer on the supported stores page.
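For what it's worth, checking a URL against the rules quoted at the top is mechanical; a minimal sketch using Python's stdlib parser (the bot name and host are made up):

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, fed to the parser directly (normally you'd
# call set_url("https://.../robots.txt") and read() instead).
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /buy/",
    "Disallow: /checkout/",
])

# Checkout pages are off limits to every agent; product pages are not.
print(parser.can_fetch("TwoTapBot", "https://retailer.example/checkout/cart"))  # False
print(parser.can_fetch("TwoTapBot", "https://retailer.example/products/123"))   # True
```

Of course, robots.txt is advisory: nothing technically stops a client from ignoring it, which is exactly the question here.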
All in due time. The industry as a whole is being pushed to decide which models they will embrace -- and as always some will be slower to adapt than others. The pressure comes from lost revenue on mobile which makes retailers a LOT more flexible now compared to even 6 months ago when talking about this.
With multiple screen formats and devices fragmenting retailers' distribution channels over the coming years, this is set to become an even bigger chapter down the line.
Does your crawler obey robots.txt rules?
User-Agent: established_company
Allow: /some-stuff
User-Agent: *
Disallow: /
# keeps out filthy peasants
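That whitelist pattern really does discriminate by name; a quick check with the same stdlib parser (site and agent names are hypothetical):

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: established_company",
    "Allow: /some-stuff",
    "",
    "User-agent: *",
    "Disallow: /",
])

# The named crawler gets the data; everyone else is shut out entirely.
print(parser.can_fetch("established_company", "https://site.example/some-stuff"))  # True
print(parser.can_fetch("random_startup", "https://site.example/some-stuff"))       # False
```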
And you're stuck either following them and not having data that would be offered up for free if you were someone else, or being a bad person and ignoring them. You don't really see the services that follow the rules.
Also, there's a good paper on how much being preferred in robots.txt helps, which makes you a better product, which makes you more preferred...
We don't spider retailer websites. That means we don't follow links or go hardcore on building a database of products.
We hit your website:
* if someone has asked us information about a product url
* when we place an order
* weekly for regression tests
Ping us on contact@ and we're more than happy to jump on a call and describe exactly what we're doing. Most of the time we're completely unnoticeable, except for the fact that you're getting more orders.
We know for sure nobody is spidering through us.
IANAL, but I think the best bet for staying technically legal is to use jurisdictional arbitrage and tit-for-tat to liberate the data. If someone in the US scrapes a US server and generates enough load to deprive the owner of use, then they are technically liable for damages under trespass to chattels. If they instead trade scraping labor with people in other jurisdictions, then that other entity would be liable. There might be some other legal defense/attack usable by the entity whose data is being liberated, but I reckon it would be tenuous at best.
Wikipedia has some insight into the legal issues with web scraping: https://en.wikipedia.org/wiki/Web_scraping#Legal_issues
At least to me it looks like one is better off adding technical countermeasures against scrapers than trying the legal route.
But wouldn't it be more beneficial to get websites to open up an API to you, encourage them to do so, or even offer consulting services to build an API?
I know that there are a few cart/store offerings out there. It seems to me that they would have an API.
Magento: http://www.magentocommerce.com/api/soap/checkout/checkout.ht...
OpenCart Proprietary API: http://opencart-api.com/
Prestashop API: http://doc.prestashop.com/display/PS14/Using+the+REST+webser...
There are companies trying to get retailers to implement APIs, but this leads to a fragmented ecosystem. In years past, payment processors that sold "pay/checkout with ..." buttons and wallets failed to achieve significant merchant adoption despite being fuelled with marketing spend in the billions.
The solution everyone embraces seems to lie in building an independent and neutral piece of infrastructure (an API) that any publisher can integrate and that plugs into every checkout out there. It's the missing pipes of ecommerce: anyone can use it and nothing really changes (we don't process payments, it's all automated, etc.) -- and conversions go UP.
I'm repeating some ideas in the post but on the publisher side it's worth noting NONE would entertain the idea of integrating multiple APIs -- one for each merchant. Did I also bring up the required combined efforts of all merchants to keep those APIs up & running? :)
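The fragmentation argument is easy to sketch: with per-merchant APIs, every publisher writes one client per platform, while a neutral layer is integrated exactly once. A toy illustration (all class names, hosts, and return formats here are hypothetical, not Two Tap's actual design):

```python
from abc import ABC, abstractmethod

class MerchantClient(ABC):
    """One adapter per merchant API -- the integration cost publishers refuse to pay."""
    @abstractmethod
    def place_order(self, product_url: str, qty: int) -> str: ...

class MagentoClient(MerchantClient):
    def place_order(self, product_url, qty):
        # Would call the platform's SOAP checkout API in reality.
        return f"magento-order:{product_url}x{qty}"

class PrestashopClient(MerchantClient):
    def place_order(self, product_url, qty):
        # Would call the platform's REST webservice in reality.
        return f"prestashop-order:{product_url}x{qty}"

class UnifiedCheckout:
    """A neutral layer that routes any product URL to the right backend,
    so publishers integrate once instead of N times."""
    def __init__(self):
        self.backends = {
            "magento.example": MagentoClient(),
            "prestashop.example": PrestashopClient(),
        }
    def place_order(self, product_url, qty=1):
        host = product_url.split("/")[2]
        return self.backends[host].place_order(product_url, qty)

api = UnifiedCheckout()
print(api.place_order("https://magento.example/sku-1", 2))
# -> magento-order:https://magento.example/sku-1x2
```

The point of the sketch is that the routing table grows on the infrastructure side, while the publisher-facing surface stays a single `place_order` call.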
So we're pro-scraping, because it's the only way to build adoption in ecommerce.
Also, if you are scraping a large retailer you are effectively required to be PCI DSS level 1 compliant, which takes a bit of extra effort.
And yes, PCI DSS compliance is also crucial to storing and handling credentials. We're going through the process again this year at Two Tap, but the effort is worth it.
I deal with a high return rate industry [specialty products many customers can't size correctly] and I only see return rates of 3-7% depending on the product. 40% seems very high.
(They also prioritized the feeds that were sent to them directly by retailers above the scraped items feeds - thus prioritizing paid listings, similar to the Google SERPs - so a different business model entirely.)
That being said, a very cool concept - and agreed that, given the relatively small number of ecommerce platforms out there, scraping them and then serving them up seems pretty scalable. Interested to see how it goes.
The downside to feeds is that they become obsolete very quickly, especially if the product is popular. Products sell out very quickly, retailers lose money on traffic they can't onboard and shoppers get frustrated.
Thanks for your thoughts!
Which of these "top 5 shopping search engines" have you worked with? You don't seem to mention any on your website.
> The downside to feeds is that they become obsolete very quickly, especially if the product is popular. Products sell out very quickly, retailers lose money on traffic they can't onboard and shoppers get frustrated.
Feeds are the only way to keep up with frequently changing listings from large retailers (apart from doing live API requests), since scraping is several orders of magnitude slower. Amazon gives selected partners incremental feeds; scraping their millions of products takes days.
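The orders-of-magnitude gap is easy to back-of-envelope (the catalogue size, request rate, and churn rate below are assumed round numbers, not Amazon's real figures):

```python
catalogue_size = 5_000_000   # assumed product count
polite_rate = 10             # requests/second a polite scraper might sustain

# Full crawl: one request per product.
full_scrape_days = catalogue_size / polite_rate / 86_400
print(f"full scrape: {full_scrape_days:.1f} days")        # ~5.8 days

# An incremental feed only ships what changed since the last delivery.
daily_churn = 0.02           # assume 2% of listings change per day
feed_rows = int(catalogue_size * daily_churn)
print(f"daily incremental feed: {feed_rows:,} rows")      # 100,000 rows
```

So even under generous assumptions, a crawl is stale by days while a feed delta is minutes of processing - which is the trade-off both comments above are circling.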
FYI, Lego showed me the French version of their website as that's where I live. You seem to only offer shipping in the US, though that's not clear from reading your website. Still very interesting.
Product URL: http://shop.lego.com/fr-FR/Le-ch%C3%A2teau-fort-70404?fromLi...
Screenshot: http://imgur.com/mlr8Q2e
Stay tuned though, we'll have news on this.
If you try this (same product, US version) it would work perfectly: http://shop.lego.com/en-US/King-s-Castle-70404?_requestid=25...
If I was using URLs gathered from a Commission Junction datafeed, is this basically a plug and play solution? Or do I need to process those URLs?
Do you have a backend stats dashboard? Or would I still rely on CJ for that data?
All the commissioning, connecting/talking to retailers, receiving the money, is directly between you and the affiliate network. We're plug and play :)
We do have a stats backend where you can see all the purchases that went through Two Tap. And you can also use CJ's dashboard just like you are probably doing right now.
We're fetching live data only for the products requested via a URL. Two Tap mimics a consumer visiting the retailer and getting that info for themselves, which also allows retailers to retain their analytics layer with no negative impact.
Our current supported stores span the top 500 as well as a number of specific integration requests.
Also, the full retailer inventory is available, unlike FB or other models that require the shop to upload a certain number of products.
We have our order-placing infrastructure on AWS, and a whole in-house cloud dedicated to product crawling built on top of Digital Ocean.
Retailers can extend their reach and make their inventory shoppable from anywhere with an internet connection and publishers can build ecommerce in their apps.