It started as a volunteer project, and some projections put the savings at around 10% of the total budget once it becomes mandatory in April.
It would make it more useful for flagging up potential stories, as well as researching stories journalists are already writing.
Disclosure: I work for a company that provides real-time data to journalists for story discovery, and I know we'd certainly be interested.
Or maybe that can be implemented in Manolo's GUI. It should not be difficult as it is based on Django.
Probably world-changing, considering that even semi-technical folks can cook up tools to dig into things like this.
I know this tool was built by a developer, but Scrapinghub has a web UI for making scrapers.
I have personally used Scrapy in the past, and I find it to be a great tool.
Congratulations on your work!
“You can’t visit 160,000 people,” she notes. “But
you can easily interrogate 160,000 records.”
http://foreignpolicy.com/2015/05/27/the-data-sleuths-of-san-...
Lobbyists have to follow registration procedures, and their official interactions and contributions are posted to an official database that can be downloaded as bulk XML:
http://www.senate.gov/legislative/lobbyingdisc.htm#lobbyingd...
Could they lie? Sure, but in the basic analysis that I've done, they generally don't feel the need to...or rather, things that I would have thought lobbyists/causes would hide, they don't. Perhaps the consequences of getting caught (e.g. in an investigation that discovers a coverup) far outweigh the annoyance of filing the proper paperwork...having it recorded in an XML database that few people take the time to parse is probably enough obscurity for most situations.
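To give a sense of how little tooling that kind of basic analysis takes: here's a minimal sketch of walking a lobbying-disclosure-style bulk XML file with nothing but the standard library. The element and attribute names below are invented for illustration; check the actual bulk-download schema before relying on any of them.

```python
# Hypothetical sketch: tallying reported amounts per registrant from a
# lobbying-disclosure-style XML dump, using only the standard library.
# Element/attribute names are illustrative, not the real schema.
import xml.etree.ElementTree as ET

SAMPLE = """
<Filings>
  <Filing ID="1" Year="2015" Amount="50000">
    <Registrant RegistrantName="Acme Lobbying LLC"/>
    <Client ClientName="Example Corp"/>
  </Filing>
  <Filing ID="2" Year="2015" Amount="120000">
    <Registrant RegistrantName="Acme Lobbying LLC"/>
    <Client ClientName="Another Inc"/>
  </Filing>
</Filings>
"""

def total_by_registrant(xml_text):
    """Sum reported amounts per registrant across all filings."""
    totals = {}
    for filing in ET.fromstring(xml_text).iter("Filing"):
        registrant = filing.find("Registrant").get("RegistrantName")
        amount = int(filing.get("Amount", "0"))
        totals[registrant] = totals.get(registrant, 0) + amount
    return totals

print(total_by_registrant(SAMPLE))
# {'Acme Lobbying LLC': 170000}
```

Point the same loop at the real bulk download and you already have the raw material for "who spent what" questions.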
There's also the White House visitor database, which does have some outright admissions, but still contains valuable information if you know how to filter the columns:
https://www.whitehouse.gov/briefing-room/disclosures/visitor...
But it's also a case (as it is with most data) where having some political knowledge is almost as important as being good at data-wrangling. For example, it's trivial to discover that Rahm Emanuel had few visitors despite his key role, so you'd have to notice that and then take the extra step to find out his workaround:
http://www.nytimes.com/2010/06/25/us/politics/25caribou.html
And then there are the many bespoke systems and logs you can find if you do a little research. The FDA, for example, has a calendar of FDA officials' contacts with outside people...again, it might not contain everything but it's difficult enough to parse that being able to mine it (and having some domain knowledge) will still yield interesting insights: http://www.fda.gov/NewsEvents/MeetingsConferencesWorkshops/P...
There's also OIRA, which I haven't ever looked at but seems to have the same potential of finding underreported links if you have the patience to parse and text mine it: https://www.whitehouse.gov/omb/oira_0910_meetings/
And of course, there's just the good ol' FEC contributions database, which at least shows you individuals (and who they work for): https://github.com/datahoarder/fec_individual_donors
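The individual-donor data is just flat tabular records, so even a few lines of stdlib Python gets you to "who gave how much, by employer." The column names below ("name", "employer", "amount") are hypothetical; the real FEC bulk files use their own header layout.

```python
# Illustrative sketch: aggregating individual contributions by employer
# from a CSV export. Column names are hypothetical, not the FEC's.
import csv
import io
from collections import defaultdict

SAMPLE_CSV = """name,employer,amount
"SMITH, JANE",Acme Corp,500
"DOE, JOHN",Acme Corp,1200
"ROE, RICHARD",Widget LLC,250
"""

def totals_by_employer(csv_text):
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["employer"]] += int(row["amount"])
    return dict(totals)

print(totals_by_employer(SAMPLE_CSV))
# {'Acme Corp': 1700, 'Widget LLC': 250}
```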
This is not to undermine what's described in the OP...but just to show how lucky you are if you're in the U.S. when it comes to dealing with official records. They don't contain everything perhaps but there's definitely enough (nevermind what you can obtain through FOIA by being the first person to ask for things) out there to explore influence and politics without as many technical hurdles.
Do you know what they are required to report? For example, if they have a 'social' dinner with a lobbyist, must that be reported? Are the requirements the same across the Executive Branch? All three branches?
Both the House and the Senate have gift travel databases (travel that's reimbursed by an outside group, such as a charter flight to visit an oil drilling rig) [2]
The branches differ in how such things are reported...this was pretty obvious recently when Justice Scalia died at a ranch and people started wondering who paid for the trip...take one look at how these forms are supplied and it should be pretty obvious why we don't normally hear about SCOTUS relationships until something really weird happens [3].
This NYT editorial "So Who's a Lobbyist?" has a nice rundown of the ways that people who would generally be considered a lobbyist can escape disclosure requirements: http://www.nytimes.com/2012/01/27/opinion/so-whos-a-lobbyist...
Still, it's useful to be able to parse the dataset in an attempt to find what's missing...something that is difficult to do conceptually unless you're dealing with the actual dataset on your own system.
[1] https://ethics.house.gov/gifts/house-gift-rule
I live in the US and am privileged by the level of transparency that exists, but it's still not necessarily enough. Similar issues are present with the clunky nature of government websites and databases, so I think we're in agreement that it's not even close to its potential.
Thanks for sharing all the links and information!
Did you mean omissions?
Web scraping is a really powerful tool for increasing transparency on the internet especially with how transient online data is.
My own project, Transparent[1], has similar goals.
For developers and managers out there, do you prefer to build your own in-house scrapers or use Scrapy or tools like Mozenda instead? What about import.io and kimono?
I'm asking because a lot of developers seem to be adamant about not using web scraping tools they didn't develop themselves, which seems counterproductive because you're taking on technical debt for an already-solved problem.
So developers, what is the perfect web scraping tool you envision?
And it's always a fine balance between people who want to scrape Linkedin to spam people, others looking to do good with the data they scrape, and website owners who get aggressive and threatening when they realize they are getting scraped.
It seems like web scraping is a really shitty business to be in and nobody really wants to pay for it.
Web scraping is everywhere, even if it's not necessarily spoken openly about or acknowledged. The publicized perception of web scraping is fairly negative, but doesn't take into account the benefits of data used in machine-learning or democratized data extraction (as in the case of this article or for building public service apps like transportation notifications), or the simple realities of competitive pricing and monitoring the activities of resellers.
Researchers, academics, data scientists, marketers, the list goes on for those who use web scraping daily.
Glad you enjoyed the article! I'm hoping that more examples of ethical data extraction will start to turn the tide of public perception.
I completely accept how important scraping is as a data source, but that doesn't make it any more legal. It's in a space right now where only big companies can take unmitigated advantage of the tool, because it'd cost millions of dollars to successfully defend a CFAA suit.
( The data that I'm mining is published here: http://www.bcra.gov.ar/Estadisticas/estprv010000.asp )
In this case, some scripts using Beautiful Soup were enough to get the job done, but I was completely unaware of Scrapy. It seems like a fantastic tool; if I had known about it, I probably would have used it.
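For anyone curious, "some scripts using Beautiful Soup" really can be this short: pull rows out of an HTML statistics table and convert the cells. The table markup below is invented for illustration; the real BCRA page is structured differently.

```python
# Minimal Beautiful Soup sketch: extract (date, rate) pairs from an
# HTML table. The sample markup is invented, not the real BCRA page.
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<table>
  <tr><td>02/01/2015</td><td>8.55</td></tr>
  <tr><td>05/01/2015</td><td>8.56</td></tr>
</table>
"""

def parse_rates(html):
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        date, rate = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append((date, float(rate)))
    return rows

print(parse_rates(SAMPLE_HTML))
# [('02/01/2015', 8.55), ('05/01/2015', 8.56)]
```

Where Scrapy earns its keep is everything around this step: scheduling requests, retries, throttling, and pipelines, which these one-off scripts leave to you.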
If you need to build a solid web scraping stack which is going to be maintained by many people and is critical to your business, you have two options… to use Scrapy or to build something yourself.
Scrapy has been tried and tested over 6-7 years of community development, and it serves as the base infrastructure for a number of >$1B businesses. Not only that, but there is a suite of tools which have been built around it – Portia for one, but also lots of other useful open source libraries: http://scrapinghub.com/opensource/
Right now most people still have to use XPath or CSS selectors to run their crawls and get at the data, but that won't be the case for much longer.
There's more and more ways of skipping this step and getting at data automatically: https://github.com/redapple/parslepy/wiki/Use-parslepy-with-... https://speakerdeck.com/amontalenti/web-crawling-and-metadat... https://github.com/scrapy/loginform https://github.com/TeamHG-Memex/Formasaurus https://github.com/scrapy/scrapely https://github.com/scrapinghub/webpager https://moz.com/devblog/benchmarking-python-content-extracti...
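For context, the selector step those tools aim to automate looks roughly like this. The sketch below uses only the standard library's limited XPath subset on well-formed XHTML (Scrapy's own selectors, built on lxml, handle real-world HTML and full XPath/CSS); the markup is invented for illustration.

```python
# XPath-style extraction with the standard library's ElementTree.
# Only a small XPath subset is supported, and input must be well-formed
# XML/XHTML; Scrapy's selectors are far more capable in practice.
import xml.etree.ElementTree as ET

SAMPLE_XHTML = """
<html>
  <body>
    <div class="item"><span>Widget A</span><em>9.99</em></div>
    <div class="item"><span>Widget B</span><em>4.50</em></div>
  </body>
</html>
"""

def extract_items(xhtml):
    root = ET.fromstring(xhtml)
    items = []
    # ElementTree supports predicates like .//div[@class='item']
    for div in root.findall(".//div[@class='item']"):
        items.append((div.find("span").text, float(div.find("em").text)))
    return items

print(extract_items(SAMPLE_XHTML))
# [('Widget A', 9.99), ('Widget B', 4.5)]
```

Writing and maintaining expressions like that for every target site is exactly the cost the automatic-extraction projects above are trying to eliminate.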
Scrapy (and lots of other Python tools, likely a majority of them created by people who use it and BeautifulSoup) have lowered the cost of building web data harvesting systems to the point where one person can build crawlers for an entire industry in a couple of months.
Outside of that, I did often find myself building my own tools with a combination of Ruby, Nokogiri and Mechanize. Partly out of a desire to learn something new, and partly because many of my use scenarios didn't require anything more complex than "go to these pages, get the data within these elements and throw a CSV file over there".
Otherwise, I'd recommend you check out Portia (open source). We're in the middle of releasing the beta 2.0 version.
Is this even a realistic business model? Seems like this is what Scrapy is doing and what Import.io is doing. Make the tool free in order to get free marketing and then charge people willing to pay money to extract data.
Meanwhile I see Mozenda charging like 5 cents for each page extracted, do you think this is a fair model or does it not matter?
It's a hard problem to generalize.
> balance between people who want to scrape Linkedin to spam people, others looking to do good with the data they scrape, and website owners who get aggressive and threatening when they realize they are getting scraped
Agreed. No one wants to be the bad guy and most clients looking to spam people are awful clients to have anyhow. Btw scraping LinkedIn is fairly difficult/expensive and they like to sue people.