We store the source code for all services in subfolders of the same monorepo (one repo <-> one app). Whenever a change in any service is merged to master, the CI rebuilds _all_ the services and pushes new Docker images to our Docker registry. Thanks to Docker layers, if the source code for a service hasn't changed, the build for that service is super-quick: it just adds a new Docker tag to the _existing_ Docker image.
Then we use the Git commit hash to deploy _all_ services to the desired environment. Again, thanks to Docker layers, containers that haven't changed from the previous tag are recreated instantly because they are cached.
From the CI you can check the latest commit hash that was deployed to any environment, and you can use that commit hash to reproduce that environment locally.
Things that I like:
- the Git commit hash is the single thing you need to know to describe a deployment, and it maps nicely to the state of the codebase at that Git commit.
Things that do not always work:
- if you don't write the Dockerfile in the right way, you end up rebuilding services that haven't changed --> build time increases (see the sketch after this list)
- containers for services that haven't changed get stopped and recreated --> short unnecessary downtime, unless you do blue-green
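To make the Dockerfile point concrete, here's roughly the kind of layer ordering that makes the caching work. This is a sketch rather than our actual files: the base image, paths and registry name are placeholders.

```sh
# Hypothetical sketch: build service_a from its own subfolder so changes to
# other services never invalidate its layers, and order the Dockerfile so the
# expensive dependency install is cached independently of source changes.
cat > services/service_a/Dockerfile <<'EOF'
FROM node:18-alpine
WORKDIR /app
# Dependency manifests first: this layer only changes when dependencies change
COPY package.json yarn.lock ./
RUN yarn install --frozen-lockfile
# Source last: only this layer (and later ones) rebuilds on a code change
COPY src ./src
CMD ["node", "src/index.js"]
EOF

# Tag with the Git commit hash; an unchanged service is a pure cache hit
docker build -t registry.example.com/service_a:$(git rev-parse --short HEAD) \
  services/service_a
```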
To avoid rebuilding all services on every commit, we use Bazel to help determine which services need to be rebuilt. Note that we don't use Bazel as a build system, just as a tool to see which services have changed -- essentially we only use the `filegroup` Bazel rule. After a push to the Git repo, we basically (1) run `git diff --name-only <before> <after>` to get the changed files, (2) run `bazel query 'rdeps(..., set(list of changed files))'` at both the `<before>` and `<after>` commits, and (3) combine the results of `bazel query` and look for the affected services.
Once we know which services need to be rebuilt, we trigger the Jenkins jobs for those services. Each service has its own Jenkins job and Jenkinsfile (we use Pipeline). Here we also package the application as a Docker image and push it to the internal registry.
We keep track of what is released using a "production" branch for each service. Once we have a build to release, we (1) create a "release candidate" branch from the commit of the build, (2) update the k8s config file, (3) apply the k8s config, and (4) merge this branch into the service's production branch if everything is OK. Then we merge the production branch back into master.
A couple of things we do differently, since we are building and then deploying to AWS:
- Build only on dedicated deployment branches (beta, qa, preview, prod)
- Build all functions (transpile, yarn, lint, etc.) on every merge into the branch, but only deploy functions with different checksums, which saves on API calls to AWS (see the sketch after this list)
- We cache node_modules, but otherwise don't have any special build requirements, and Babel takes care of targeting Node 6.10 for Lambda
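A minimal sketch of the checksum idea from the list above (the paths, the `.checksums` cache and the function naming are just for illustration):

```sh
# Hypothetical sketch: only push Lambda functions whose built bundle changed.
STAGE=${STAGE:-qa}      # deployment branch/environment
mkdir -p .checksums

for FN_DIR in build/functions/*/; do
  FN=$(basename "$FN_DIR")
  NEW_SUM=$(sha256sum "$FN_DIR/bundle.zip" | cut -d' ' -f1)
  OLD_SUM=$(cat ".checksums/$FN" 2>/dev/null || true)

  if [ "$NEW_SUM" != "$OLD_SUM" ]; then
    # update-function-code is a real AWS CLI call; the function name
    # convention here is made up for the example
    aws lambda update-function-code \
      --function-name "myapp-$STAGE-$FN" \
      --zip-file "fileb://$FN_DIR/bundle.zip"
    echo "$NEW_SUM" > ".checksums/$FN"
  fi
done
```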
Total build time is between 8 and 13 minutes. There are some things we could do to speed up the install that we haven't done yet, because it's not an issue yet, but here's a short list of things worth noting:
- Each function has its own package.json for its own packages. We maintain a list of npm packages that we download into a single folder first (which doesn't get deployed) so that yarn can use those files from its cache. We will eventually switch to an offline install for each function, which essentially just copies the package folder and sets up anything it needs.
- We have a tarball package that includes all of our shared code / config files. Yarn seems to always want to download this file, regardless of whether we pre-download it.
- We deploy a single API endpoint for all of our microservices through API Gateway, which cuts down on deploy time since API Gateway has a pretty hard throttle. This means we create a deployment on API Gateway every merge. We have one API Gateway per environment.
Looks like a pretty solid build process. Thanks for the insight!
Why are you rebuilding _all_ the services, wouldn't it make sense to just rebuild the ones that have changes? You're now rebuilding perfectly working services without any new changes just because some other service changed, or am I misunderstanding something here?
For example you might have a Git history like this:
* 89abcde Fix bug in service_b
* 1234567 Initial commit including service_a and service_b
When 89abcde is pushed, the CI rebuilds both service_a and service_b, so we can simply "deploy 89abcde". You always have a single hash for all services, which is also, conveniently, the hash of the corresponding Git commit.
The trick to avoid rebuilding perfectly working services is to use Docker layer caching so that when you build service_a (that hasn't changed) Docker skips all steps and simply adds the new tag to the _existing_ Docker image. The second build for service_a should take about 1 second.
In our Docker registry we end up with:
service_a:1234567
service_a:89abcde
service_b:1234567
service_b:89abcde
But the two service_a Docker images are _the same image_, with two different tags.
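You can verify this by looking at the image IDs behind the two tags, e.g. (the output shown is illustrative):

```sh
# Both tags resolve to the same image ID if the layer cache was hit
docker images --format '{{.Repository}}:{{.Tag}} {{.ID}}' | grep '^service_a:'
# service_a:1234567 3f1b2c4d5e6a
# service_a:89abcde 3f1b2c4d5e6a
```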
So I'm curious, does each service instance have its own server, or do you have multiple services on one server instance?
I have some experience working with microservices. I saw the clear business benefits of being able to map design domain boundaries to repos and specific teams, and to let those teams be able to control their deployments while minimizing external dependencies.
But we seemed to be paying a lot in network chattiness, slow site response times, and networking costs. I'm wondering if we could have minimized those costs by sticking some of those microservices on the same server instance. Not really change service boundaries or interfaces, but change the methods that the microservice interfaces use to communicate.
First - If your change to the container is near the end of the build process (see the earlier comment about smart container design), then the rebuild will only change the final few layers, and Docker is smart enough not to rebuild the earlier ones.
Second - Hashes are global, so if you have multiple containers that start from the same base (say, Alpine Linux + Python + NPM + etc.), Docker will share the existing hashed layers. This means a much smaller distribution payload.
To (what I think is) your original question - you can tag the 'final' container itself. Tagging it with the Git hash is one way to get exactly what you're talking about.
The builds for all services happen in parallel, so the longest one determines the total time. Big Scala services take much longer than small React frontends. We cache both Maven and NPM modules from previous builds.
Ideally, if the pull request only modified a React component and didn't touch any Scala file, no Scala build is triggered, because Docker finds a cached layer and skips the "sbt compile" step. To be honest, we are still working to make sure this always happens; we still trigger unnecessary sbt compiles because the Docker cache is not used correctly.
It takes a build from your build system (typically TeamCity, but not exclusively), deploys it, and records the deployment.
You can then check later what's currently deployed, or what was deployed at some point in time in order to match it with logs etc.
Not sure how useable it would be outside of our company though.
Independent deployments are one of the key advantages of microservices. If you don't use that feature, why use microservices at all? Just for scalability? Or because it was the default choice?
You can deploy the whole platform and/or refactor to a monolith, and maintain one change log, which is simple.
That, however, has its own downsides, so you should find a balance. If you're having trouble keeping track, perhaps re-organize. I read in one HN article that Amazon had 7k employees before they adopted microservices. The benefits have to outweigh the costs. Sometimes the solution to the problem is taking a step back. Without more details it's hard to say.
So basically one option is to refactor [to a monolith] and re-evaluate the split such that you no longer have this problem. Just throw each repo in a sub-folder, make that your new mono-repo, and go from there. It's worth an exploratory refactoring, but not a silver bullet.
Sounds like the services were no longer 'micro' :)
Every component comes with a major/minor release number, which tells you about the nature of the change that has gone in. For example, the major release is incremented for a change that introduces a new feature or interface; minor release numbers are reserved for bug fixes and optimizations that are more internal to the component.
The build manager can go through the list of all the delivered fixes and cherry pick the few which can go to the final build.
We have 200 services, counting beta and live test variants. Most of the difficulties vanished once we had declarative versioned control of our service config in the ‘headquarters’ repository.
Not aware of anyone else using this approach.
https://github.com/tim-group/orc
Basically, there's a Git repo with files in that specify the desired versions and states of your apps in each environment (the "configuration management database").
The tool has a loop which converges an environment on what is written in the file. It thinks of an app instance as being on a particular version (old or new), started or stopped (up or down), and in or out of the load balancer pool, and it knows which transitions are allowed, e.g.:
(old, up, in) -> (old, up, out) - ok
(old, up, out) -> (old, up, in) - no! don't put the old version in the pool!
(old, up, out) -> (old, down, out) - ok
(old, up, in) -> (old, down, in) - no! don't kill an app that's in the pool!
(old, down, out) -> (new, down, out) - ok
(old, up, out) -> (new, up, out) - no! don't upgrade an app while it's running!
Based on those rules, it plans a series of transitions from the current state to the desired state. You can model the state space as a cube, where the three axes correspond to the three aspects of the state, vertices are states, and edges are transitions, some allowed, some not. Planning the transitions is then route-finding across the cube. When I realised this, I made a little origami cube to illustrate it and started waving it at everyone. My colleagues thought I'd gone mad.

You need one non-cubic rule: there must be at least one instance in the load balancer at any time. In practice, you can just run the loop against each instance serially, so that you only ever bring down one at a time.
This process is safe, because if the tool dies, it can just start the loop again, look at the current state, and plan again. It's also safe to run at any time - if the environment is in the desired state, it's a no-op, and if it isn't, it gets repaired.
To upgrade an environment, you just change what's in the file, and run the loop.
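In shell-ish pseudocode, the per-instance part of the loop amounts to something like the sketch below. The helper commands are hypothetical stand-ins for whatever talks to your hosts and load balancer; the point is just the order of the allowed transitions.

```sh
# Hypothetical sketch of the convergence loop, one instance at a time so at
# least one instance stays in the load balancer pool throughout.
# $APP and $DESIRED_VERSION come from the versions file in the config repo.
for INSTANCE in $(list_instances "$APP"); do            # stand-in helper
  if [ "$(current_version "$INSTANCE")" = "$DESIRED_VERSION" ]; then
    continue                                            # already converged: no-op
  fi
  # (old, up, in)    -> (old, up, out)
  remove_from_pool "$INSTANCE"
  # (old, up, out)   -> (old, down, out)
  stop_app "$INSTANCE"
  # (old, down, out) -> (new, down, out)
  install_version "$INSTANCE" "$DESIRED_VERSION"
  # (new, down, out) -> (new, up, out)
  start_app "$INSTANCE"
  # (new, up, out)   -> (new, up, in), only once it's healthy
  wait_for_healthy "$INSTANCE"
  add_to_pool "$INSTANCE"
done
```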
Full disclosure: I'm on the Spinnaker team
A Slack notification could do it. Or do you want to correlate deployments with other metrics?
In this case we instrument our deployments into our monitoring stack (InfluxDB/Grafana) and use them as annotations for the rest of our monitoring.
We can also graph the number of releases per project on different aggregates.
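For the curious, the write itself can be as simple as one point in InfluxDB's line protocol, which Grafana can then use as an annotation source (the database name, tags and values here are made up):

```sh
# Hypothetical sketch: record a deployment as a point in InfluxDB 1.x,
# which a Grafana annotation query can then overlay on dashboards.
curl -i -XPOST 'http://influxdb:8086/write?db=deployments' --data-binary \
  "deployment,service=service_a,environment=production version=\"89abcde\" $(date +%s)000000000"
```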
Then there is the issue of linking the Git release/tag with the corresponding changes, say from a ticketing system such as Jira. That can be helpful to communicate changes to other people within the organization and to users.
How do you define dependencies when releasing new versions of a service? That's likely to come up at some point when you have non-trivial changes to services.
Completely agree, that's why we instrument our releases so we can easily see what's deployed by service and environment.
> Then there is the issue of linking the Git release/tag with the corresponding changes, say from a ticketing system such as Jira. That can be helpful to communicate changes to other people within the organization and to users.
Each commit is related to a ticket, which helps generate a changelog. We enforce a lot of things in each of our releases. We have an internal release tool heavily inspired by Shipit from Shopify. We have the concept of soft/hard checkers to make sure a release won't break things, or that you're aware of what could break with the current diff.
> How do you define dependencies when releasing new versions of a service? That's likely to come up at some point when you have non-trivial changes to services.
As I said, we instrument our releases and can easily track how changes affect our performance/bugs.
We also try hard not to release non-trivial changes in one big release, by doing things like releasing part of the changes behind a feature flipper first, or routing only a part of the traffic to the new code path, ...
Then again, we don't have dozens of different services deployed and we're still a relatively small team (~20), so I'm pretty sure I don't have the full picture just yet :)
We also store stats in the service discovery app so versions can be promoted to "production" for a customer once the account management team has reviewed and updated their internal training.
For anyone that has begun the microservice journey, kubernetes can be intimidating but way worth it. Our original microservice infrastructure was rolled way before k8s and it's just night and day to work with now, the kubernetes team has thought of just about every edge case.
I could probably snapshot the Kubernetes state to have a trail I can use to roll back to a point in time. Alternatively, I've thought about having CI update manifests in an integration repo and deploy from there, so that every change to the cluster is reflected by a commit in that repository.
- unit tests for each service
- all services fan-in to a job that builds a giant tar file of source/code artefacts. This includes a metadata file that lists service versions or commit hashes
- this "candidate release" is deployed to a staging environment for automated system/acceptance testing
- it is then optionally deployed to prod once the acceptance tests have passed
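A sketch of what that fan-in job might produce (the layout, file names and VERSION convention are illustrative):

```sh
# Hypothetical sketch of the fan-in job that assembles a candidate release.
RELEASE_DIR=candidate-release
mkdir -p "$RELEASE_DIR"

# Collect each service's build artefact and record its version/commit hash
# (assumes each upstream job drops a VERSION file next to its artefact)
for SERVICE in service_a service_b service_c; do
  cp "artifacts/$SERVICE/build.tar.gz" "$RELEASE_DIR/$SERVICE.tar.gz"
  echo "$SERVICE=$(cat "artifacts/$SERVICE/VERSION")" >> "$RELEASE_DIR/metadata"
done

# One tarball = one candidate release, identified by its metadata file
tar -czf "candidate-release-$(date +%Y%m%d%H%M).tar.gz" "$RELEASE_DIR"
```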
We use Escape to version and deploy our microservices across environments, and even relate it to the underlying infrastructure code so we can deploy our whole platform as a single unit if need be.