It started as a volunteer project, and some projections put the savings at around 10% of the total budget once it becomes mandatory in April.
It would make it more useful for flagging up potential stories, as well as researching stories journalists are already writing.
Disclosure: I work for a company that provides real-time data to journalists for story discovery, and I know we'd certainly be interested.
Or maybe that can be implemented in Manolo's GUI. It should not be difficult as it is based on Django.
Probably world-changing, considering that even semi-technical folks can cook up tools to dig into things like this.
I know this tool was built by a developer, but Scrapinghub has a web UI for making scrapers.
I have personally used Scrapy in the past, and I find it to be a great tool.
Congratulations on your work!
“You can’t visit 160,000 people,” she notes. “But
you can easily interrogate 160,000 records.”
http://foreignpolicy.com/2015/05/27/the-data-sleuths-of-san-...
Lobbyists have to follow registration procedures, and their official interactions and contributions are posted to an official database that can be downloaded as bulk XML:
http://www.senate.gov/legislative/lobbyingdisc.htm#lobbyingd...
Could they lie? Sure, but in the basic analysis that I've done, they generally don't feel the need to...or rather, things that I would have thought lobbyists/causes would hide, they don't. Perhaps the consequences of getting caught (e.g. in an investigation that discovers a coverup) far outweigh the annoyance of filing the proper paperwork...having it recorded in an XML database that few people take the time to parse is probably enough obscurity for most situations.
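To give a sense of how little tooling that kind of basic analysis takes: here's a minimal sketch of walking a lobbying-disclosure-style bulk XML file with nothing but the standard library. The element and attribute names below are invented for illustration; check the actual bulk-download schema before relying on any of them.

```python
# Hypothetical sketch: tallying reported amounts per registrant from a
# lobbying-disclosure-style XML dump, using only the standard library.
# Element/attribute names are illustrative, not the real schema.
import xml.etree.ElementTree as ET

SAMPLE = """
<Filings>
  <Filing ID="1" Year="2015" Amount="50000">
    <Registrant RegistrantName="Acme Lobbying LLC"/>
    <Client ClientName="Example Corp"/>
  </Filing>
  <Filing ID="2" Year="2015" Amount="120000">
    <Registrant RegistrantName="Acme Lobbying LLC"/>
    <Client ClientName="Another Inc"/>
  </Filing>
</Filings>
"""

def total_by_registrant(xml_text):
    """Sum reported amounts per registrant across all filings."""
    totals = {}
    for filing in ET.fromstring(xml_text).iter("Filing"):
        registrant = filing.find("Registrant").get("RegistrantName")
        amount = int(filing.get("Amount", "0"))
        totals[registrant] = totals.get(registrant, 0) + amount
    return totals

print(total_by_registrant(SAMPLE))
# {'Acme Lobbying LLC': 170000}
```

Point the same loop at the real bulk download and you already have the raw material for "who spent what" questions.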
There's also the White House visitor database, which does have some outright admissions, but still contains valuable information if you know how to filter the columns:
https://www.whitehouse.gov/briefing-room/disclosures/visitor...
But it's also a case (as it is with most data) where having some political knowledge is almost as important as being good at data-wrangling. For example, it's trivial to discover that Rahm Emanuel had few visitors despite his key role, so you'd have to notice that and then take the extra step to find out his workaround:
http://www.nytimes.com/2010/06/25/us/politics/25caribou.html
And then there are the many bespoke systems and logs you can find if you do a little research. The FDA, for example, has a calendar of FDA officials' contacts with outside people...again, it might not contain everything but it's difficult enough to parse that being able to mine it (and having some domain knowledge) will still yield interesting insights: http://www.fda.gov/NewsEvents/MeetingsConferencesWorkshops/P...
There's also OIRA, which I haven't ever looked at but seems to have the same potential of finding underreported links if you have the patience to parse and text mine it: https://www.whitehouse.gov/omb/oira_0910_meetings/
And of course, there's just the good ol' FEC contributions database, which at least shows you individuals (and who they work for): https://github.com/datahoarder/fec_individual_donors
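The individual-donor data is just flat tabular records, so even a few lines of stdlib Python gets you to "who gave how much, by employer." The column names below ("name", "employer", "amount") are hypothetical; the real FEC bulk files use their own header layout.

```python
# Illustrative sketch: aggregating individual contributions by employer
# from a CSV export. Column names are hypothetical, not the FEC's.
import csv
import io
from collections import defaultdict

SAMPLE_CSV = """name,employer,amount
"SMITH, JANE",Acme Corp,500
"DOE, JOHN",Acme Corp,1200
"ROE, RICHARD",Widget LLC,250
"""

def totals_by_employer(csv_text):
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["employer"]] += int(row["amount"])
    return dict(totals)

print(totals_by_employer(SAMPLE_CSV))
# {'Acme Corp': 1700, 'Widget LLC': 250}
```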
This is not to undermine what's described in the OP...but just to show how lucky you are if you're in the U.S. when it comes to dealing with official records. They don't contain everything perhaps but there's definitely enough (nevermind what you can obtain through FOIA by being the first person to ask for things) out there to explore influence and politics without as many technical hurdles.
Do you know what they are required to report? For example, if they have a 'social' dinner with a lobbyist, must that be reported? Are the requirements the same across the Executive Branch? All three branches?
Both the House and the Senate have gift travel databases (travel that's reimbursed by an outside group, such as a charter flight to visit an oil drilling rig) [2]
The branches differ in how such things are reported...this was pretty obvious recently when Justice Scalia died at a ranch and people started wondering who paid for the trip...take one look at how these forms are supplied and it should be pretty obvious why we don't normally hear about SCOTUS relationships until something really weird happens [3].
This NYT editorial "So Who's a Lobbyist?" has a nice rundown of the ways that people who would generally be considered a lobbyist can escape disclosure requirements: http://www.nytimes.com/2012/01/27/opinion/so-whos-a-lobbyist...
Still, it's useful to be able to parse the dataset in an attempt to find what's missing...something that is difficult to do conceptually unless you're dealing with the actual dataset on your own system.
[1] https://ethics.house.gov/gifts/house-gift-rule
I live in the US and am privileged by the level of transparency that exists, but it's still not necessarily enough. Similar issues are present with the clunky nature of government websites and databases, so I think we're in agreement that it's not even close to its potential.
Thanks for sharing all the links and information!
Did you mean omissions?
Web scraping is a really powerful tool for increasing transparency on the internet especially with how transient online data is.
My own project, Transparent[1], has similar goals.
For developers and managers out there, do you prefer to build your own in-house scrapers or use Scrapy or tools like Mozenda instead? What about import.io and kimono?
I'm asking because a lot of developers seem to be adamant about not using web scraping tools they didn't develop themselves, which seems counterproductive because you're taking on technical debt for an already-solved problem.
So developers, what is the perfect web scraping tool you envision?
And it's always a fine balance between people who want to scrape Linkedin to spam people, others looking to do good with the data they scrape, and website owners who get aggressive and threatening when they realize they are getting scraped.
It seems like web scraping is a really shitty business to be in and nobody really wants to pay for it.
Web scraping is everywhere, even if it's not necessarily spoken openly about or acknowledged. The publicized perception of web scraping is fairly negative, but doesn't take into account the benefits of data used in machine-learning or democratized data extraction (as in the case of this article or for building public service apps like transportation notifications), or the simple realities of competitive pricing and monitoring the activities of resellers.
Researchers, academics, data scientists, marketers, the list goes on for those who use web scraping daily.
Glad you enjoyed the article! I'm hoping that more examples of ethical data extraction will start to turn the tide of public perception.
I completely accept how important scraping is as a data source, but that doesn't make it any more legal. It's in a space right now where only big companies can take unmitigated advantage of the tool, because it'd cost millions of dollars to successfully defend a CFAA suit.
( The data that I'm mining is published here: http://www.bcra.gov.ar/Estadisticas/estprv010000.asp )
In this case, some scripts using Beautiful Soup were enough to get the job done, but I was completely unaware of Scrapy. It seems like a fantastic tool; if I had known about it, I probably would have used it.
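For anyone curious, "some scripts using Beautiful Soup" really can be this short: pull rows out of an HTML statistics table and convert the cells. The table markup below is invented for illustration; the real BCRA page is structured differently.

```python
# Minimal Beautiful Soup sketch: extract (date, rate) pairs from an
# HTML table. The sample markup is invented, not the real BCRA page.
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<table>
  <tr><td>02/01/2015</td><td>8.55</td></tr>
  <tr><td>05/01/2015</td><td>8.56</td></tr>
</table>
"""

def parse_rates(html):
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        date, rate = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append((date, float(rate)))
    return rows

print(parse_rates(SAMPLE_HTML))
# [('02/01/2015', 8.55), ('05/01/2015', 8.56)]
```

Where Scrapy earns its keep is everything around this step: scheduling requests, retries, throttling, and pipelines, which these one-off scripts leave to you.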
If you need to build a solid web scraping stack which is going to be maintained by many people and is critical to your business, you have two options… to use Scrapy or to build something yourself.
Scrapy has been tried and tested over 6-7 years of community development, and it serves as the base infrastructure for a number of >$1B businesses. Not only that, but there is a suite of tools which have been built around it – Portia for one, but also lots of other useful open source libraries: http://scrapinghub.com/opensource/
Right now most people still have to use XPath or CSS selectors to run their crawls and get at the data, but that won't be the case for much longer.
There's more and more ways of skipping this step and getting at data automatically: https://github.com/redapple/parslepy/wiki/Use-parslepy-with-... https://speakerdeck.com/amontalenti/web-crawling-and-metadat... https://github.com/scrapy/loginform https://github.com/TeamHG-Memex/Formasaurus https://github.com/scrapy/scrapely https://github.com/scrapinghub/webpager https://moz.com/devblog/benchmarking-python-content-extracti...
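For context, the selector step those tools aim to automate looks roughly like this. The sketch below uses only the standard library's limited XPath subset on well-formed XHTML (Scrapy's own selectors, built on lxml, handle real-world HTML and full XPath/CSS); the markup is invented for illustration.

```python
# XPath-style extraction with the standard library's ElementTree.
# Only a small XPath subset is supported, and input must be well-formed
# XML/XHTML; Scrapy's selectors are far more capable in practice.
import xml.etree.ElementTree as ET

SAMPLE_XHTML = """
<html>
  <body>
    <div class="item"><span>Widget A</span><em>9.99</em></div>
    <div class="item"><span>Widget B</span><em>4.50</em></div>
  </body>
</html>
"""

def extract_items(xhtml):
    root = ET.fromstring(xhtml)
    items = []
    # ElementTree supports predicates like .//div[@class='item']
    for div in root.findall(".//div[@class='item']"):
        items.append((div.find("span").text, float(div.find("em").text)))
    return items

print(extract_items(SAMPLE_XHTML))
# [('Widget A', 9.99), ('Widget B', 4.5)]
```

Writing and maintaining expressions like that for every target site is exactly the cost the automatic-extraction projects above are trying to eliminate.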
Scrapy (and lots of other Python tools, likely a majority of them created by people who use it and BeautifulSoup) have lowered the cost of building web data harvesting systems to the point where one person can build crawlers for an entire industry in a couple of months.
Outside of that, I did often find myself building my own tools with a combination of Ruby, Nokogiri and Mechanize. Partly out of a desire to learn something new, and partly because many of my use scenarios didn't require anything more complex than "go to these pages, get the data within these elements and throw a CSV file over there".
Otherwise, I'd recommend you check out Portia (open source). We're in the middle of releasing the beta 2.0 version.
Is this even a realistic business model? Seems like this is what Scrapy is doing and what Import.io is doing. Make the tool free in order to get free marketing and then charge people willing to pay money to extract data.
Meanwhile I see Mozenda charging like 5 cents for each page extracted, do you think this is a fair model or does it not matter?
It's a hard problem to generalize.
> balance between people who want to scrape Linkedin to spam people, others looking to do good with the data they scrape, and website owners who get aggressive and threatening when they realize they are getting scraped
Agreed. No one wants to be the bad guy and most clients looking to spam people are awful clients to have anyhow. Btw scraping LinkedIn is fairly difficult/expensive and they like to sue people.