This is the same thing that's been bothering me with archive.org lately, by the way. I haven't found a good way to simply (for some reasonable definition of "simple") contribute 10 TiB or so of redundant storage on my (European) home server either. That kind of thing might (have to) serve to ensure tamper-resistance for that data, given the current political climate on both sides of the pond. Any pointers welcome.
Maybe this?
> In addition to the data collection, we are releasing open source software and documentation for replicating our work and creating similar repositories. With these tools, we aim not only to preserve knowledge ourselves but also to empower others to save and access the data that matters to them.
https://github.com/harvard-lil/data-vault
And since the data lives here: https://source.coop/repositories/harvard-lil/gov-data/descri...
Combined with this:
> To download an individual dataset by name you can construct its URL, such as:
> https://source.coop/harvard-lil/gov-data/collections/data_go...
> https://source.coop/harvard-lil/gov-data/metadata/data_gov/f...
> To download large numbers of files, we recommend the aws or rclone command line tools:
> aws s3 cp s3://us-west-2.opendata.source.coop/harvard-lil/gov-data/collections/data_gov/<name>/v1.zip --no-sign-request
So one could "easily" mirror the whole thing, making it distributed.
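For anyone wanting to script that, here is a minimal sketch of how the per-dataset S3 path from the quoted README could be turned into `aws s3 cp` invocations. The bucket prefix and the `--no-sign-request` flag come from the quote above; the function names, the dataset name and the `/srv/mirror` destination are made up for illustration, and the list of real dataset names would have to come from the project's metadata, which I haven't checked.

```python
import shlex
from pathlib import PurePosixPath

# Bucket and prefix copied from the quoted README line above.
BUCKET_PREFIX = (
    "s3://us-west-2.opendata.source.coop/harvard-lil/gov-data/collections/data_gov"
)


def dataset_uri(name: str) -> str:
    """Build the S3 URI for one dataset, following the <name>/v1.zip pattern."""
    return f"{BUCKET_PREFIX}/{name}/v1.zip"


def mirror_command(name: str, dest_dir: str) -> list[str]:
    """Return the aws CLI argument list that copies one dataset to dest_dir.

    dest_dir is a hypothetical local mirror root, not something the project defines.
    """
    dest = str(PurePosixPath(dest_dir) / name / "v1.zip")
    return ["aws", "s3", "cp", dataset_uri(name), dest, "--no-sign-request"]


if __name__ == "__main__":
    # "example-dataset" is a placeholder, not a real dataset name.
    for name in ["example-dataset"]:
        print(shlex.join(mirror_command(name, "/srv/mirror")))
```

This only prints the commands, so you can eyeball them before pulling terabytes. Each list could be handed to `subprocess.run` to actually run the copy.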
Sensitive information just can't be hosted on a centralized server anymore; it has to be distributed for the good of the project.
Is there any risk of the government ordering them to take it down? That seems unlikely to me. The US has strong free speech protection, stronger than European free speech protection.
>keeping this data online against the express will of the government is gonna cost (political) capital.
Costing them political capital (aka the government is unhappy) is different from the government ordering them to take it down. Also, when you say "express will", are you saying the government has explicitly publicly stated that they don't like that Harvard is hosting this data?
The US literally just told (e.g., [0]) all scientists working for it they are not permitted to publish papers or speak at conferences or travel. What the US "has" is irrelevant when laws are being ignored.
[0] https://www.science.org/content/blog-post/revised-and-extend...
In other words, you don't think we have to worry about Congress bringing in university presidents to grill them over political activities against the government's current policies occurring on their campuses? Certainly we wouldn't see any of those whose testimony the government didn't like end up being forced to resign in the aftermath; we're not in the dark ages anymore like we were in 2023...
Citation needed. "European free speech protection" doesn't exist; each country has its own rules and freedoms. Hungary is drastically less free than e.g. France... but overall, as a rule of thumb, unless for you freedom of speech includes freedom to be a Nazi, European countries are pretty free. And they don't take stances such as "spending money is free speech which is sacred so no campaign financing rules". You might get arrested for a Nazi salute though... which, if you think is a bad thing, you haven't been paying attention in history classes or modern news.
For this to be said in the current circumstances shows either a political viewpoint or a lack of knowledge.
Definitely a concern. If they want to harass Harvard and the other universities they could, but I don't think they'll bother. They know the data will be backed up; that's not the point.
Taking it off of data.gov accomplishes two things:
1) Makes it look like they're doing something, playing to the base. Easy to do.
2) Delegitimizes any insights the data might have. "Sure you have 'data', is it official data? I don't see it on data.gov. How do we know it's not fraudulent?" It makes it harder to use it to justify policy changes. It adds one more tool to the denial crowd.
Yup, I was about to ask whether Trump could still force them to delete what he doesn't like. Time will tell, I guess.
Muskolites are taking on the SSN system without any Congressional oversight as we speak. The President is attacking ius soli which is a Constitutional right. If they decide that sending their sleuths to Cambridge MA to physically destroy this data is in their best interest, they will do so and handle the courts later. Just stop pretending they will play by the book.
Harvard, like other liberal institutions, has little to no political capital in a Republican White House in the first place. Why would it cost them any to host data that is in the public domain?
This sort of silly overreaction is part of why people voted for Trump: it is genuinely funny to see people overreact to Trump because they've been told by the legacy media that he is a "threat to democracy" over and over.
Last time he was elected his government also removed a bunch of climate data from govt websites which was quickly mirrored by third parties. Nobody was taken away by the Gestapo then and there is no reason to think things will be any different this time.
I'm not sure pointing out 'last time' is such a great idea considering it ended with a violent insurrection attempt which I watched in real time. I don't need the news or anyone else to tell me that a threat to democracy is exactly what he is.
https://www.vanityfair.com/news/2020/07/trump-administration...
Someone forgot the "kids in cages" episodes in 2018 when Trump was only testing the waters...
EDIT: https://en.wikipedia.org/wiki/Trump_administration_family_se...
There's something about this that just rings Second Amendment. Personally I think the concept of civilians having weapons to be a check on a nation state is absurd, but in this case it feels pretty empowering.
This is a topic that came up at work today, as we rely on this data and are considering backing up most of the Lidar data from there ourselves (100s of TB probably).
EDIT: no, looks like it is only the footprints
I haven't had much time to look at this yet and see what all is there, but whether currently included or not, a couple of things I really hope get archived are the contents of the DTIC (Defense Technical Information Center) document repository (lots of really interesting older scientific publications) and the NASA TRS (Technical Report Server).
I'm working on my own archive of at least some portion of the DTIC stuff just to be on the safe side. So far everything I've tried to access is still there, but who knows how long that will last.
"where they burn books, they will ultimately burn people as well."
Those who delete research will ultimately delete people as well.