User-agent: *
Disallow: /buy/
Disallow: /checkout/
So, do you have permission to violate robots.txt, as I'm sure there is some automated interaction with checkout/purchasing pages? Or am I missing something about how TwoTap works? Scraping is one thing, but accessing pages when the management of the website prohibits it seems like a big no-no.

I'd mention more on the BD side but can't at this point for competitive reasons. The fact that we currently support sending orders through to 450 retailers does not mean we have deals in place with all of them, but that the infrastructure is built to allow this to happen -- if affiliates or publishers get approval from retailers or the affiliate networks that govern this. Perhaps we should make this clearer on the supported stores page.
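For what it's worth, checking a URL against the rules quoted at the top is mechanical; a minimal sketch using Python's stdlib parser (the bot name and host are made up):

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, fed to the parser directly (normally you'd
# call set_url("https://.../robots.txt") and read() instead).
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /buy/",
    "Disallow: /checkout/",
])

# Checkout pages are off limits to every agent; product pages are not.
print(parser.can_fetch("TwoTapBot", "https://retailer.example/checkout/cart"))  # False
print(parser.can_fetch("TwoTapBot", "https://retailer.example/products/123"))   # True
```

Of course, robots.txt is advisory: nothing technically stops a client from ignoring it, which is exactly the question here.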
All in due time. The industry as a whole is being pushed to decide which models they will embrace -- and as always some will be slower to adapt than others. The pressure comes from lost revenue on mobile which makes retailers a LOT more flexible now compared to even 6 months ago when talking about this.
With multiple screen formats and devices fragmenting retailers' distribution channels over the coming years, this is set to become an even bigger chapter down the line.
Does your crawler obey robots.txt rules?
User-Agent: established_company
Allow: /some-stuff
User-Agent: *
Disallow: /
# keeps out filthy peasants
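That whitelist pattern really does discriminate by name; a quick check with the same stdlib parser (site and agent names are hypothetical):

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: established_company",
    "Allow: /some-stuff",
    "",
    "User-agent: *",
    "Disallow: /",
])

# The named crawler gets the data; everyone else is shut out entirely.
print(parser.can_fetch("established_company", "https://site.example/some-stuff"))  # True
print(parser.can_fetch("random_startup", "https://site.example/some-stuff"))       # False
```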
And you're stuck either following them and not having data that would be offered up for free if you were someone else, or being a bad person and ignoring them. You don't really see the services that follow the rules.
Also, there's a good paper on how much being preferred in robots.txt helps, which makes you a better product, which makes you more preferred...
We don't spider retailer websites. That means we don't follow links or go hardcore on building a database of products.
We hit your website:
* if someone has asked us information about a product url
* when we place an order
* weekly for regression tests
Ping us on contact@ and we're more than happy to jump on a call and describe exactly what we're doing. Most of the time we're completely unnoticeable, except for the fact that you're getting more orders.
We know for sure nobody is spidering through us.
IANAL, but I think the best bet for staying technically legal is to use jurisdictional arbitrage and tit-for-tat to liberate the data. If someone in the US scrapes a US server and generates enough load to deprive the owner of use, then they are technically liable for damages under trespass to chattels. If they instead trade scraping labor with people in other jurisdictions, then that other entity would be liable. There might be some other legal defense/attack usable by the entity whose data is being liberated, but I reckon it would be tenuous at best.
Wikipedia has some insight into the legal issues with web scraping: https://en.wikipedia.org/wiki/Web_scraping#Legal_issues
At least to me it looks like one is better off adding technical countermeasures against scrapers than trying the legal route.
But wouldn't it be more beneficial to get websites to open up an API to you, encourage them to do so, or even offer consulting services to build an API?
I know that there are a few cart/store offerings out there. It seems to me that they would have an API.
Magento: http://www.magentocommerce.com/api/soap/checkout/checkout.ht...
OpenCart Proprietary API: http://opencart-api.com/
Prestashop API: http://doc.prestashop.com/display/PS14/Using+the+REST+webser...
There are companies trying to get retailers to implement APIs, but this leads to a fragmented ecosystem. In years past, payment processors that sold "pay/checkout with ..." buttons and wallets failed to achieve significant merchant adoption despite being fuelled with marketing spend in the billions.
The solution everyone embraces seems to lie in building an independent and neutral piece of infrastructure (an API) that any publisher can integrate and that plugs into every checkout out there. It's the missing pipes of ecommerce: anyone can use it and nothing really changes (we don't process payments, it's all automated, etc.) -- and conversions go UP.
I'm repeating some ideas in the post but on the publisher side it's worth noting NONE would entertain the idea of integrating multiple APIs -- one for each merchant. Did I also bring up the required combined efforts of all merchants to keep those APIs up & running? :)
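The fragmentation argument is easy to sketch: with per-merchant APIs, every publisher writes one client per platform, while a neutral layer is integrated exactly once. A toy illustration (all class names, hosts, and return formats here are hypothetical, not Two Tap's actual design):

```python
from abc import ABC, abstractmethod

class MerchantClient(ABC):
    """One adapter per merchant API -- the integration cost publishers refuse to pay."""
    @abstractmethod
    def place_order(self, product_url: str, qty: int) -> str: ...

class MagentoClient(MerchantClient):
    def place_order(self, product_url, qty):
        # Would call the platform's SOAP checkout API in reality.
        return f"magento-order:{product_url}x{qty}"

class PrestashopClient(MerchantClient):
    def place_order(self, product_url, qty):
        # Would call the platform's REST webservice in reality.
        return f"prestashop-order:{product_url}x{qty}"

class UnifiedCheckout:
    """A neutral layer that routes any product URL to the right backend,
    so publishers integrate once instead of N times."""
    def __init__(self):
        self.backends = {
            "magento.example": MagentoClient(),
            "prestashop.example": PrestashopClient(),
        }
    def place_order(self, product_url, qty=1):
        host = product_url.split("/")[2]
        return self.backends[host].place_order(product_url, qty)

api = UnifiedCheckout()
print(api.place_order("https://magento.example/sku-1", 2))
# -> magento-order:https://magento.example/sku-1x2
```

The point of the sketch is that the routing table grows on the infrastructure side, while the publisher-facing surface stays a single `place_order` call.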
So we're pro-scraping, because it's the only way to build adoption in ecommerce.
Also, if you are scraping a large retailer you are effectively required to be PCI DSS level 1 compliant, which takes a bit of extra effort.
And yes, PCI DSS compliance is also crucial to storing and handling credentials. We're going through the process again this year at Two Tap, but the effort is worth it.
I deal with a high return rate industry [specialty products many customers can't size correctly] and I only see return rates of 3-7% depending on the product. 40% seems very high.
(They also prioritized the feeds that were sent to them directly by retailers above the scraped items feeds - thus prioritizing paid listings, similar to the Google SERPs - so a different business model entirely.)
That being said, a very cool concept - and agreed that, given the relatively small number of ecommerce platforms out there, scraping them and then serving them up seems pretty scalable. Interested to see how it goes.
The downside to feeds is that they become obsolete very quickly, especially if the product is popular. Products sell out very quickly, retailers lose money on traffic they can't onboard and shoppers get frustrated.
Thanks for your thoughts!
Which of these "top 5 shopping search engines" have you worked with? You don't seem to mention any on your website.
> The downside to feeds is that they become obsolete very quickly, especially if the product is popular. Products sell out very quickly, retailers lose money on traffic they can't onboard and shoppers get frustrated.
Feeds are the only way to keep up with frequently changing listings from large retailers (apart from doing live API requests), since scraping is several orders of magnitude slower. Amazon gives selected partners incremental feeds; scraping their millions of products takes days.
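The orders-of-magnitude gap is easy to back-of-envelope (the catalogue size, request rate, and churn rate below are assumed round numbers, not Amazon's real figures):

```python
catalogue_size = 5_000_000   # assumed product count
polite_rate = 10             # requests/second a polite scraper might sustain

# Full crawl: one request per product.
full_scrape_days = catalogue_size / polite_rate / 86_400
print(f"full scrape: {full_scrape_days:.1f} days")        # ~5.8 days

# An incremental feed only ships what changed since the last delivery.
daily_churn = 0.02           # assume 2% of listings change per day
feed_rows = int(catalogue_size * daily_churn)
print(f"daily incremental feed: {feed_rows:,} rows")      # 100,000 rows
```

So even under generous assumptions, a crawl is stale by days while a feed delta is minutes of processing - which is the trade-off both comments above are circling.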
FYI, Lego showed me the French version of their website as that's where I live. You seem to only offer shipping in the US, though that's not clear from reading your website. Still very interesting.
Product URL: http://shop.lego.com/fr-FR/Le-ch%C3%A2teau-fort-70404?fromLi...
Screenshot: http://imgur.com/mlr8Q2e
Stay tuned though, we'll have news on this.
If you try this (same product, US version) it would work perfectly: http://shop.lego.com/en-US/King-s-Castle-70404?_requestid=25...
If I was using URLs gathered from a Commission Junction datafeed, is this basically a plug and play solution? Or do I need to process those URLs?
Do you have a backend stats dashboard? Or would I still rely on CJ for that data?
All the commissioning, connecting/talking to retailers, receiving the money, is directly between you and the affiliate network. We're plug and play :)
We do have a stats backend where you can see all the purchases that went through Two Tap. And you can also use CJ's dashboard just like you are probably doing right now.
We're fetching live data only for the products requested via a URL. Two Tap mimics a consumer visiting the retailer and getting that info for themselves, which also allows retailers to retain their analytics layer with no negative impact.
Our current supported stores span the top 500 as well as a number of specific integration requests.
Also, the full retailer inventory is available, unlike FB or other models that require the shop to upload a certain number of products.
We have our order-placing infrastructure on AWS, and a whole in-house cloud dedicated to product crawling built on top of Digital Ocean.
Retailers can extend their reach and make their inventory shoppable from anywhere with an internet connection and publishers can build ecommerce in their apps.