I built it up using nmap and then shelling into each individual machine and poking around to see what it did. This was back in the days before everything became virtualized, so each machine on the network was likely physical.
I added information by walking the aisles and copying down the rack location of every machine into another page on the spreadsheet. I eventually hooked up a terminal to them all and matched network addresses to physical machines.
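The scan-to-spreadsheet step can be scripted; a rough sketch that parses nmap's grepable output (from something like `nmap -oG scan.txt 10.0.0.0/24`) into CSV rows, with a blank rack column left to fill in by walking the aisles (the column names are invented):

```python
import csv
import io
import re

# Matches lines in nmap's -oG "grepable" output, e.g.:
# Host: 192.168.1.10 (db01)  Ports: 22/open/tcp//ssh///, 5432/open/tcp//postgresql///
HOST_RE = re.compile(r"Host: (\S+) \(([^)]*)\)\s+Ports: (.*)")

def parse_grepable(text):
    """Turn nmap -oG output into inventory rows, one per host."""
    rows = []
    for line in text.splitlines():
        m = HOST_RE.search(line)
        if not m:
            continue
        ip, name, ports = m.groups()
        open_ports = [p.split("/")[0] for p in ports.split(", ") if "/open/" in p]
        rows.append({"ip": ip, "hostname": name,
                     "open_ports": " ".join(open_ports),
                     "rack": ""})  # filled in by hand, aisle by aisle
    return rows

def to_csv(rows):
    buf = io.StringIO()
    w = csv.DictWriter(buf, fieldnames=["ip", "hostname", "open_ports", "rack"])
    w.writeheader()
    w.writerows(rows)
    return buf.getvalue()
```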
It only took a few weeks, and when I was done I knew things about the network that guys who had worked at the business for years didn't know.
There's no substitute for the good old-fashioned way.
I liked that job, it was fun.
So you have exactly the sorts of problems that an RDBMS is designed to solve. It therefore makes sense to move to a DCIM system with an RDBMS under the hood, one that allows concurrent edits and can also be accessed by automation (cron jobs, CI, etc.) via some sort of API (or direct DB read access).
This means you do not have two sources of truth to maintain (what is in the RDBMS, and how that relates to what is in the infrastructure code repository). The RDBMS system does not have to reinvent versioning, and you can see exactly how your infrastructure evolves. You can make atomic changes to both the infrastructure code and the infrastructure information the code relies on (obviously you need a modern version control system for this). And the infrastructure code can access the infrastructure information in a much more straightforward (and much easier to test) way.
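The "accessed by automation" part is the payoff; a toy sketch of infrastructure code reading from such a DB, in the shape of an Ansible-style dynamic inventory (SQLite stands in for the RDBMS here, and the table layout is invented):

```python
import json
import sqlite3

def build_inventory(conn):
    """Emit an Ansible dynamic-inventory style JSON document from the DB."""
    inv = {"all": {"hosts": []}, "_meta": {"hostvars": {}}}
    for name, ip, rack, role in conn.execute(
            "SELECT name, ip, rack, role FROM machines ORDER BY name"):
        inv["all"]["hosts"].append(name)
        inv["_meta"]["hostvars"][name] = {"ansible_host": ip,
                                          "rack": rack, "role": role}
    return inv

# In-memory stand-in for the real DCIM database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE machines (name TEXT, ip TEXT, rack TEXT, role TEXT)")
conn.execute("INSERT INTO machines VALUES ('db01', '10.0.0.5', 'R12-U04', 'postgres')")
print(json.dumps(build_inventory(conn), indent=2))
```

With something like this as an inventory script, the config management tool and any cron job see the same single source of truth.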
We connect Ansible and Collins through ansible-cmdb (https://github.com/fboender/ansible-cmdb), then tie the entire thing to our ticketing systems, ServiceNow (https://www.servicenow.com/) and Jira Service Desk (https://www.atlassian.com/software/jira/service-desk), and finally ensure we have history tracking with Slack (https://www.slack.com).
As a given, we yank test the entire world. If it doesn't pass a yank, it straight up doesn't exist.
Whether it's bare-metal, virtualized, para-virtualized, dockerized, mixed-mode, or cloud - we 100% do this all the time. There is not a single change across any environment that isn't fully tracked, fully reproducible, fully auditable, and fully automated.
That way we know our CMDB is accurate, along with our workflows, credentials, Ansible, Terraform, images, etc. Right down to the tickets.
It's how we manage all of our cloud customers.
- configure everything as code (we use Ansible for the infrastructure up to OS level, Kubernetes w/ Helm for applications), and have it read the values from the DCIM so that the DCIM remains the single source of truth (we still need to get better at this part...)
Links: https://github.com/digitalocean/netbox https://www.ansible.com https://www.kubernetes.io
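For NetBox specifically, the "read the values from the DCIM" glue is usually a small API read; a sketch of flattening a device list into per-host variables (the payload below is hand-written for the example, but the field names follow the shape of NetBox's /api/dcim/devices/ response):

```python
import json

# Hand-written sample of a NetBox device-list response (trimmed down).
sample = json.loads("""
{"count": 1, "results": [
  {"name": "db01",
   "site": {"slug": "fra1"},
   "primary_ip": {"address": "10.0.0.5/24"}}
]}
""")

def device_vars(payload):
    """Flatten a NetBox device list into name -> vars for config management."""
    out = {}
    for dev in payload["results"]:
        ip = dev["primary_ip"]["address"].split("/")[0]  # strip the CIDR suffix
        out[dev["name"]] = {"ansible_host": ip, "site": dev["site"]["slug"]}
    return out

print(device_vars(sample))
```

In real use the payload would come from an authenticated HTTP GET against the NetBox API rather than a literal string.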
That's at work. At home, I do much of the same, except that maintaining a DCIM is excessive for 2 VPS and a home network of 3 boxes.
For a relatively small setup I chose a combination of Ansible, Kubernetes and Dockerfiles, but probably any combination will do. All these files are stored in a git repo.
Even after months (or years) of neglect, I can easily see what I configured (and why!) and update it where needed with minor effort.
I agree the end-goal should be infrastructure as code, and everyone here has covered those tools well. You also want monitoring across your infrastructure. Prometheus is the new poster-boy here, but the Nagios family, and many other decent OSS solutions exist as well.
But you still need documentation. Your documentation should exist wherever you spend most of your time. Some examples:
* If you spend most of your time on a Windows Desktop, doing windows admin type things, then OneNote or some other GUI note-taking/document program makes sense.
* If you spend most of your time in Unix land (Linux, BSD, etc.), then plain text files on some shared disk somewhere, where everyone can get to them, make WAY more sense. Bonus if you put these files in a VCS and treat them like code, and super bonus if your documentation is just part of your infra-as-code repositories.
* If you spend your time in a web browser, then use a Wiki, like MediaWiki, wikiwiki, etc.
In other words, put your documentation tools right alongside your normal workflow, so you have a decent chance of actually using it, keeping it up to date, and having others on your team(s) also use it.
We put our docs in the repos right alongside the code that manages the infrastructure, in plain text. It's versioned. We don't publish it anywhere; it's just in the repo, but then we spend most of our time in editors messing around in that repo.
Instead of documenting all the commands involved in configuring a machine as service X (ssh in, run apt-get, paste this, etc.), I have documentation on how to work with the configuration management system (roles live in the roles/ directory, each node gets one role, commit to git, open a PR, etc.). That documentation is in .md files in the config management source repo.
Instead of documenting how to rack a server (print and attach label to front and back, plug power into separate PDUs, enter PDU ports into management database, etc.), I document Terraform conventions (use module foo, name it xxx-yyy, tag with zzz, etc.).
It ends up being less documentation, as the "code" serves to document the steps taken, so the documentation can be higher level. Or if it isn't less documentation, it is documentation that needs to be updated less often, so hopefully there will be less drift between docs and what actually exists.
You still need the high-level stuff: policies, security guides, etc. None of that has changed.
You also have to document your snowflakes, how you handle the wacky snowflakes, why they exist, etc.
Ideally your documentation should be such that it would pass the hit-by-a-bus test. I.e. if you or your entire team got hit by a bus, someone with a clue could come in, read your documentation and continue.
My docs are not at that stage, but every time I mess about with something I try to read through the docs attached, and verify and add to them, so that hopefully someday we will get there.
Source: 16 years in various ops roles
We have used it for years and it has worked great for us.
If you are all or mostly cloud, Terraform + config management with a CI pipeline takes care of a lot. Then a wiki that covers "Getting Started" and a few how-to articles.
For physical infra you need the setup for DHCP, updating DNS based on DHCP, PXE boot imaging, IPMI access and configuration, switch and router configuration, what servers are connected to which switch ports, PDU management and monitoring, and on and on and on.
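The "updating DNS based on DHCP" glue, for example, can be as small as a lease-file parse; a sketch for ISC dhcpd's leases format (the zone name is invented, and real setups would more likely use dynamic DNS updates):

```python
import re

# Pulls (ip, hostname) pairs out of ISC dhcpd.leases entries like:
#   lease 192.168.1.100 {
#     binding state active;
#     client-hostname "web01";
#   }
LEASE_RE = re.compile(r'lease (\S+) \{[^}]*?client-hostname "([^"]+)"')

def leases_to_records(leases_text, zone="example.com"):
    """Emit zone-file A records for every lease with a client hostname."""
    records = []
    for ip, host in LEASE_RE.findall(leases_text):
        records.append(f"{host}.{zone}. IN A {ip}")
    return records
```

A cron job feeding these records into the DNS server is one of the many small pieces of glue the comment above is describing.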
You end up with something like NetBox (https://github.com/digitalocean/netbox) or Collins (https://tumblr.github.io/collins/), plus a bunch of other stuff gluing things together.
In larger infrastructure setups (small service provider) we used a combination of netboot, SNMP-based monitoring with Observium, and Nagios for alerting. We were also a big VMware environment, so naturally we had a lot of inventory tracking available through vCenter as well. I found a lot of opposition to configuration management, given some sysadmins' (Windows admins') lack of comfort with programming, so that's something to keep in mind as well. I think mixed environments can also be challenging with infrastructure as code, but I'd be interested to see how others get through that.
My current thoughts are that an appropriate approach is for your systems to document themselves via the applications that they run - inside out.
Though I must admit I cannot fully subscribe to "infrastructure as code" anymore. It has proven to be just another shift, primarily in toolsets and in who (or what) gets say and sway over the capacity, capabilities and efficiencies of the thing you actually care about - the app stack and all of its assembled functionality.
In other words, most approaches are still "outside in" - one defines 'x' for deploy fitments, typically over and over again, and typically with a rigidity that can too easily override and overrule, effectively caging your application in scale and scope. With my current tack I am trying to provide for 'y' to "self identify" (via some/any form of config mgmt), from which point you can begin to effectively "deploy to any" by hooking the "application config as code" that, in turn, defines its infrastructure and deploys "outward". The "infrastructure as code" then becomes the servant, with its objects and platform definitions etc., and the "app config as code" becomes the master, where the latter defines its own scope and scale.
Infrastructures have a funny way of mutating into inefficient "definitions" of something that once made sense, on the first day, and forevermore complicating progress with capacity, rules and opinions.
But, generically, SNMP is still pretty cool for telling me what I need to know. Strap that into any engine and, boom, ask any question, request any inventory.
So.. I track apps, not systems. Systems are expendable, applications are not.
Legacy stuff is done the old-fashioned way - portscans and nmap. If it has an open port, it's presumed to be intentional. If not, it's a target. I've seen some success using tools like Pysa to "blueprint" existing systems into Puppet code. Tools like SystemImager help here, too - enabling P2V and the creation of "file-based images" compatible with version control and able to PXE boot new clones.
New stuff is from-scratch IaC all the way to the metal. Ansible and git submodules help me build "sandwiches".
Critical stuff blurs the lines. The machines, IP addresses, ports and living connectivity can be documented, and "captured" to a limited extent with the manual mapping and Rsync stuff in the Legacy category. Some of this critical stuff is also "new", and is deployed in that fashion.
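The legacy-side triage above (open port means intentional, otherwise a target) is small to automate; a sketch where the expected-ports allowlist is invented and would really come from your inventory:

```python
import socket

def scan(host, ports, timeout=0.5):
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    open_ports = set()
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:
                open_ports.add(port)
    return open_ports

def triage(open_ports, expected_ports):
    """Anything open but not expected is a target for investigation."""
    return sorted(open_ports - set(expected_ports))
```

Running `triage(scan(host, range(1, 1025)), expected)` per host gives a short list of things to go poke at.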
What about switchgear and Cisco configs? License strings, key management, site-specific patching - all can complicate things.
More important than any of these is the ability for you and those around you to see and manage the systems as they are launched and terminated.
In the old days, I used to use a shell script on a newly-provisioned host to dump all its details - dmidecode, environment stuff and so on. Those details were pushed back to a common source and were a real benefit in the days before real config management came on the scene. CFEngine was way too complicated and nebulous at the time.
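A rough modern-day equivalent of that dump script, using only what Python's standard library can see (dmidecode itself needs root and a shell-out, so it's left out here):

```python
import getpass
import json
import platform
import socket
import sys

def host_facts():
    """Collect a small self-description of this host, suitable for pushing
    back to a common store at provisioning time."""
    return {
        "hostname": socket.gethostname(),
        "fqdn": socket.getfqdn(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "user": getpass.getuser(),
    }

print(json.dumps(host_facts(), indent=2))
```

In practice this is exactly what `ansible -m setup` or Facter does with far more coverage; the point is only how little code the idea needs.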
There are a couple of exceptions, but those are actively being brought under the above model (mostly because they are effectively invisible, and the existing documentation for them is... incomplete).
Any documentation outside of that is stale in a few hours, and obsolete in a week.
http://terraform.io/ http://datadoghq.com/ https://aws.amazon.com/cloudwatch/
https://www.usenix.org/short-topics/documentation-writing-sy...
Also, this talk was very good:
https://www.usenix.org/legacy/event/lisa08/tech/gelb_talk.pd...
If you want something more clever, say keeping track of asset values etc., you'll want a CMDB. Google around and you should find something that fits your needs. We used ServiceNow in a previous life.
We're on AWS so we use cloudformation for provisioning and saltstack (https://saltstack.com/) for configuration management. Cloudformation templates are written using stacker (http://stacker.readthedocs.io/en/stable/). All AWS resources are built by running "stacker build" so nothing is done by hand. We have legacy resources that we're slowly moving over to Cloudformation, but more than 90% of our infrastructure is in code.
On top of cloudformation and salt we built jenkins (CI and docker image creations), spinnaker (deployment pipeline), and kubernetes (deployment target). The jenkins and spinnaker pipelines are also codified in their own respective git repos.
All the repos here have sphinx setup for documentation purposes and the repos tend to crosslink for references.
So, why do you structure your lambda jobs accessing CloudWatch Logs that way as opposed to the other way? If you didn’t know that one way works and the other doesn’t, you wouldn’t be able to understand that question. And that might have domino effects on other parts of your system.
I haven’t found a good solution to documenting the high level strategic “why” questions, other than to just write down the questions and the answer, with reasoning, in some form of associated documentation — maybe in a wiki or something. But, of course, the underlying issues may change in the near future and invalidate the reasons for your decision. And the high level documentation doesn’t have any way to be compiled directly into the lower level implementation, so of course there is always the risk of drift.
I’m still looking for good solutions in this space.
Come up with a key/value strategy that covers your need to track things like app name, app category, environment (test, dev, load testing, prod, prod/dmz, etc), and it becomes actually usable and up to date versus an always out-of-date CMDB. And it's compatible with cloud resource tagging.
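A sketch of what that key/value strategy can look like when enforced in code before anything is provisioned (the keys and allowed values here are invented; yours would differ):

```python
# The tag schema: None means free-form but required, a set means an enum.
ALLOWED = {
    "app": None,
    "category": {"web", "db", "batch", "infra"},
    "environment": {"test", "dev", "loadtest", "prod", "prod-dmz"},
}

def validate_tags(tags):
    """Return a list of problems; an empty list means the tag set is usable."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key not in tags:
            problems.append(f"missing required tag: {key}")
        elif allowed is not None and tags[key] not in allowed:
            problems.append(f"bad value for {key}: {tags[key]!r}")
    return problems
```

Wiring a check like this into CI is what keeps the tags usable instead of becoming another out-of-date CMDB.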
Sometimes, less is more.
But I also use the o365 Suite.
MediaWiki is also good, but it can be a bore to run another service just for that.
But in the end a textfile via notepad/nano is all you need, really.
"Documentation", in terms of where stuff is deployed and what is deployed is not really necessary. We save this data to a DynamoDB table, query-able by AWS Lambda functions, so other automation can pick it up and devops can query data.
Documentation on how things work comes from dev teams, on how things are deployed indeed comes from us, just simple wiki pages.
Services run in Kubernetes, with K8s worker instances in auto-scaling groups. If a node dies it is terminated and a replacement is brought up, and K8s reschedules the pods. Same for the pods themselves.
Monitoring through Nagios (finally getting phased out), New Relic and Prometheus. Basic ELK stack for centralized logs.
Thinking about rolling out Vault for credential management. Chatops on the pipeline (getting pieces in place first, like the db mentioned earlier)
I'm trying to get the company on board on immutable infrastructure, but it is proving difficult.
But I feel like it's lacking. After a while you have so many Ansible playbooks and roles that they can't give you a birds-eye view anymore.
I think I would MUCH prefer to have some sort of HTML representation, where adding an instance/service starts by adding to that representation, and you could click on every link or node to show its golden image setup, ansible configuration, etc.
THAT, I could show to a newcomer and he'd get it.
I'm not sure if it also shows some infrastructure graphs, but I'm talking about knowing if links are up, how they are firewalled, where the config for each thing is, etc.
When you host tens of services on hundreds of machines, this information is hard to get a grasp on, no matter what you do or how well you documented everything, because it takes a while to read through it.
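A toy version of that birds-eye page, generated straight from inventory data so it can't go stale independently (the data shape and link targets are invented):

```python
import html

# Would really be read from the inventory/DCIM, not hard-coded.
services = [
    {"name": "api", "host": "web01", "role_url": "roles/api/"},
    {"name": "postgres", "host": "db01", "role_url": "roles/postgres/"},
]

def overview_page(services):
    """Render one table row per service, linking to its config in the repo."""
    rows = "\n".join(
        f"<tr><td>{html.escape(s['name'])}</td>"
        f"<td>{html.escape(s['host'])}</td>"
        f"<td><a href='{html.escape(s['role_url'])}'>config</a></td></tr>"
        for s in services)
    return ("<table>\n<tr><th>service</th><th>host</th><th>config</th></tr>\n"
            f"{rows}\n</table>")

print(overview_page(services))
```

Regenerating the page on every commit is what makes it something you could actually hand to a newcomer.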
- https://www.bookstackapp.com/ for portable (Markdown), searchable (SQL), manageable (Users) documentation.
- Ansible for automation and deployment.
- Prometheus for monitoring all the Proxmox nodes and containers.
GitLab for repositories, adhoc documentation via gists and CI/CD.
Nagios for monitoring.
Open to trying other things out if they make sense.
(like, I don't want to have to configure 1000 moving parts)
typically this seems to fall into either the 'roll your own' category or the 'giant lumbering enterprise behemoth that does 10 other things' category. I'm looking for the sweet spot.
That being said route53 has a reasonable management API.
(e.g. self hosted, but without needing 5 different polyglot microservices and a service managment layer and 32GB of ram just to keep the whole mess running)
I see 0% need for this complexity in many cases on the presentation side - and if faster response is required internally, the same API interface can be used for service discovery, or side-chain announcements etc. can be bolted on on a per-application basis if desired.
I also see 0% need for this to be a cloud exclusive domain - e.g. hybrid scope/location deployments, etc.
should probably look more closely at powerdns..
however this doesn't solve the dhcp side of things.
specifically not looking at SaaS since I want consistent deployment flexibility and potential for mixed scale/scope/environment deployments (devbox, lan, wan, mixture, yadda)