> Be aware that removing the lock file is not a solution and may break your system.
Guess what? My system is already broken! It’s refusing to do what I want it to do (install a package so that I can test it) in favor of doing what some remote agent (the update service) told it to do.
At this point I just immediately delete the .lock file when I see that message. It has never given me any problems. If it did, I’d just reimage the VM and probably be done sooner than if I had waited for that lock.
So whereas on Ubuntu, two competing apt commands cause one to fail, with no real recourse except a very nightmarishly long investigation resulting in the above article, in Gentoo, two competing commands … just … work.
I have a little app (written in golang) installed on each box that effectively is a task runner. Tasks can be written to do anything, including apt-get installing software.
If apt-get fails to run, the task fails (context.WithTimeout) and is run again at a later date. No random hacks needed. Everything is built to be idempotent, self-healing and eventually consistent.
It checks the dpkg lock, but I believe apt has its own lock at /var/lib/apt/lists/lock, and that option name strongly implies that it only checks the dpkg lock, still leaving you with a race condition on the apt one.
But one of my colleagues figured out that the flakiness was probably caused by cloud-init holding the apt lock, and fixed it by making Packer wait[1] for cloud-init to complete before running the installation scripts that call apt-get.
We too wished that there were more docs to help us, especially explaining how apt-get works.
[1] https://github.com/hashicorp/packer/issues/2639#issuecomment...
Please use Packer to build your images; don't do it on ASG instance deploy.
If somebody in the company tells you you're not allowed to build your own images, tell them to go fuck themselves, and write an e-mail to that guy's boss's boss explaining how much engineering time you're wasting (and how likely the products are to fail due to ASGs trying to bootstrap systems on the fly) because they won't let you cut an AMI.
So you start using Packer to build your images. Your Packer script does "apt-get install" and fails because something is holding the apt lock, and the author ends up writing more or less the same article.
Additionally, I work in Azure, and VM images in Azure are a world of absolute nightmarish pain: a normal user literally cannot make the API call to bring up a VM with a custom image if the image is not in the same tenant. There is a way to do it with a service principal (SP), but it is so completely and thoroughly undocumented as to be black magic. (Yes, if you Google, you'll be able to find Azure documentation on this exact subject. No, the instructions do not work.)
Yeah, I agree at the end of the day, it's the right way to do things, but it is an absolute, utter, PITA.
But then again, if you choose to not bring up a VM with a custom image, you get an unpinned image: "Ubuntu 20.04 LTS" is a moving target, and we once got one with a kernel that would BUG after ~5 minutes. Azure needed us to tell them what kernel we got from them.
Incidentally the author's suggestions are not that good, and not what I would suggest. He should be waiting for cloud-init to finish, not just apt:
cloud-init status --wait
Before I knew about that, I used the following, which is still better than what the author suggests:
while [ ! -f /var/lib/cloud/instance/boot-finished ]; do echo 'Waiting for cloud-init...'; sleep 1; done
Other stuff happens in cloud-init besides locking apt for a while, and these will wait for all of it to finish.