Migrating from bare-metal to Kubernetes (GKE) with 0 downtime
Feb 20, 2024
8 min read
Development
A discussion of the wonderful and worry-free experience of a two-year, complete hosting-infrastructure forklift 🦄 🌈 .
Today VOXO runs happily ever after (🔨 🪵 ) on a multi-region, completely (OK, 99%) automated orchestration environment called Kubernetes (specifically Google Kubernetes Engine). I'm not just talking about our web APIs, front-end client-facing experiences, and stateless services. I'm referring to everything under the 🌞. Many people say that very stateful (real-time VoIP) workloads can't and/or shouldn't be run in a Kubernetes environment, but I see Kubernetes (along with other orchestration platforms) as the future of all workloads, stateful or stateless.
Reasons to migrate
You love pain
You want to future proof your telecom infrastructure
You don't want to become a DevOps/Platform Engineering heavy organization (most companies just need to sell a product)
You want to take advantage of the security benefits of Kubernetes
You want to use Kubernetes for what it's meant to be used for (keeping services online, and responding to changes in traffic volume)
You want to spend more on compute during the day and little to nothing at night (based on traffic changes)
You need multi-region/global capabilities
You could benefit from some type of service mesh and/or discovery
You want to recover rapidly from a complete multi-region outage
Reasons NOT to migrate
You value quality of life
You can't containerize any part of your environment
You don't have control of EVERY component in your environment down to the network layer
You don't want to upset any customers
You don't have the luxury of using a cloud-based Kubernetes environment
There are things about your environment (like public IP assignment/allocation, firewall rules, fault/failure tolerance, SSL) that you can't automate 💯
Astricon 2023 - Scaling & Managing Real-time Communication Workloads in Kubernetes
Slides
If you don't want to read the rest of this article, just watch this low-quality Astricon 2023 video ⬆️
Choosing a cloud
This was an easy one. GCP (in my opinion) had the best Kubernetes implementation, along with a global multi-cluster ingress load balancer and simple-to-follow documentation for deploying all of the necessary components (even outside of Kubernetes) via yaml. In addition, they had sane defaults, which helped me understand Kubernetes better, and they handle control plane (master node) upgrades for you. Lastly, they expose just enough to give you the power you need, but not so much that it's completely overwhelming (like the other big players).
Check out the script that we use to stand up new clusters and even bootstrap a brand new environment from scratch.
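To give a feel for what that script does, here's a minimal sketch of standing up a regional GKE cluster with gcloud. The cluster name, region, project, and machine type are made up for illustration, and the real bootstrap script does quite a bit more (networking, node pools, addons, and so on).

```bash
#!/usr/bin/env bash
# Minimal sketch: create a regional GKE cluster and grab credentials.
# Names, region, and sizing below are hypothetical.
set -euo pipefail

PROJECT="my-gcp-project"   # hypothetical
CLUSTER="voice-us-east"    # hypothetical
REGION="us-east1"          # hypothetical

gcloud container clusters create "$CLUSTER" \
  --project "$PROJECT" \
  --region "$REGION" \
  --release-channel regular \
  --num-nodes 1 \
  --machine-type e2-standard-8 \
  --enable-ip-alias

# Point kubectl at the new cluster
gcloud container clusters get-credentials "$CLUSTER" \
  --project "$PROJECT" \
  --region "$REGION"
```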
First things first 🐳
I spent a good chunk of time just learning and creating Dockerfiles for all of our components ⬇️
Visual Component Diagram
Asterisk
Asterisk sidecar (NodeJS)
Jobs (NodeJS)
Kamailio
Kamailio sidecar (NodeJS)
Grafana agent
MySQL
MySQL sidecar (consul-template)
NATS (Messaging pub/sub)
NATS sidecar (consul-template)
API (NodeJS)
Vitess orc
ProxySQL
ProxySQL sidecar (consul-template)
Rtpengine
Rtpengine sidecar (NodeJS)
The biggest challenge here was slimming down the container sizes. Where possible, I tried to use Alpine Linux along with multi-stage builds. The only component where this wasn't possible was Asterisk, because of some codec translations, but the Asterisk container still came in at only 342 MB. Also, each of these components lives in a separate repo, so figuring out the GitHub CI workflows so that all PRs and releases build properly was a bit of a challenge.
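If you're chasing image sizes the same way, here's a quick way to eyeball the result of a multi-stage build (the image name and tag below are hypothetical):

```bash
# Build the image and check its final size (name/tag are hypothetical)
docker build -t voxo/asterisk:dev .
docker images voxo/asterisk:dev --format 'table {{.Repository}}:{{.Tag}}\t{{.Size}}'
```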
Deploying the first cluster
Here's a quick snapshot of where we were at this point. We had a colo bare-metal environment in Atlanta and Chicago, with two different colo providers, running on two SuperMicro SYS-5039MD8-H8TNR chassis in each datacenter. Most of these nodes ran CentOS 6, or 7 if we were lucky, and were definitely treated like pets. Each node had its own name, and we tried to spread the various services out across the nodes as much as possible, but everything was manually deployed, which made that difficult. We didn't use any Puppet- or Ansible-type tools for deployments, so as you can imagine, new version rollouts were a pain.
When it came time to roll out new versions of services (about twice per week), we called it "rituals". There was a mental checklist of back-flips to go through and servers to ssh into in order to get everything rolled out. It was painful enough repeating the same steps over and over, but imagine having to manually dry up calls on Kamailio/Asterisk and then wait for that to happen before upgrading components. This also wasn't something that could be done during the day when traffic volumes were high, because we didn't keep a huge amount of spare compute capacity on standby.
When the first Kubernetes cluster was stood up in GCP, we partied like it was 1999 🎉 .
However, none of the services/apps were actually talking to each other properly, so nothing really worked 👎 . The biggest issue was the networking differences between Kubernetes and a bare-metal environment. In our bare-metal environment, all of the services used either DNS records or static IPs to communicate with one another. I know, I know... part of the reason for moving to a Kubernetes environment was to be able to take advantage of actual service discovery and a service mesh.
Back to the drawing board
It became obvious that some low-level design changes needed to be made in order to have a successful deployment in a single region, let alone across multiple clusters in geographically diverse regions. This is when we were introduced to NATS. What a product... similar to Kafka or Redis pub/sub, but more modern and easier to operate.
At this point I was faced with a decision between three possible paths forward:
1. Fork/branch our codebase and begin building a Kubernetes-specific implementation of our environment using NATS, leaving the existing (legacy) codebase working in the bare-metal environment until we fully migrated to Kubernetes
2. Refactor our legacy bare-metal environment to use NATS for pub/sub and service discovery, and commit to making the necessary changes in the same codebase deployed in both bare metal and Kubernetes (no matter what else we might come across in the future)
3. Just quit, because this was starting to look like a never-ending project
#3 was looking like a pretty good choice but #2 came out on top for a few reasons:
I knew the bare metal environment had a shelf-life
I had no real idea when we would actually be able to migrate fully, which would have left a huge gap between the two codebases
It was still yet to be determined what other core components needed to be re-designed
I wanted to make sure that the same (or as close as possible) codebase and plumbing was being used in both bare metal and Kubernetes
I knew that eventually we would have to wire up networking between the bare-metal and Kubernetes environments in order to migrate safely, and I knew that wouldn't have been possible with two different codebase implementations
So we spent a good bit of time refactoring various services to make use of NATS and Consul in our environment. This caused us to look at other design flaws, which led to more refactoring as we learned more about Kubernetes in general. These changes prepared our codebase for a more frictionless move to Kubernetes.
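If you've never touched NATS, the subject-based pub/sub pattern it gives you looks roughly like this with the NATS CLI. The server address and subject names are hypothetical, not our actual layout.

```bash
# Terminal 1: a service subscribes to all call events for its region
nats --server nats://nats.example.internal:4222 sub 'calls.us-east.>'

# Terminal 2: another component publishes an event on that subject tree
nats --server nats://nats.example.internal:4222 pub calls.us-east.started '{"callId":"abc123"}'
```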
Deploying the first working cluster
After making it through the refactor war zone and out the other side, it was time for another Kubernetes deploy. This time around felt much more like 🦄 and 🌈 . Most web, voice, video, and messaging features worked as expected. There were, however, a few big items that still needed to be addressed:
Failure recovery
Graceful shutdowns (see the sketch after this list)
Centralized log handling
Horizontal Pod Autoscaler
Config CPU/Memory threshold tweaking
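For the graceful shutdown piece, here's a rough sketch of the kind of drain-aware container entrypoint that plays nicely with Kubernetes. Kubernetes sends SIGTERM when a pod is terminated and waits up to terminationGracePeriodSeconds before sending SIGKILL, so the trap below has a bounded window to drain. The launcher path and the "stop taking new calls" signal are hypothetical; for Asterisk/Kamailio it would be whatever refuses new calls and lets active ones finish.

```bash
#!/usr/bin/env bash
# Sketch of a drain-aware entrypoint (paths and signals are hypothetical).

drain_and_exit() {
  echo "SIGTERM received: refusing new calls and draining active ones..."
  kill -USR1 "$APP_PID" 2>/dev/null || true   # hypothetical "stop accepting new work" signal
  wait "$APP_PID" || true                     # let in-flight calls finish
  exit 0
}

trap drain_and_exit TERM

/usr/local/bin/run-app &   # hypothetical app launcher (Asterisk, Kamailio, etc.)
APP_PID=$!
wait "$APP_PID"
```

Pair something like this with a terminationGracePeriodSeconds on the pod spec that's long enough for calls to actually drain.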
Early on, I committed that if we were going to migrate to Kubernetes, then we had to let Kubernetes do what it does instead of trying to make it operate a certain way. What I mean is, it would have been easier to migrate to Kubernetes sooner if we had treated it as just another hosting environment and turned off all of its nice features and automation. But what's the point??
Kubernetes is really great at state management but that means it also has to make a lot of decisions when certain conditions are met or not met.
At this point our codebase was running on both Kubernetes and bare metal. We had to conditionally make decisions about certain things, like "when to run graceful shutdown procedures" and "whether or not to enable service discovery", which wasn't too bad to deal with.
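As an example of that kind of conditional, Kubernetes injects KUBERNETES_SERVICE_HOST into every pod, so a startup script or sidecar can branch on it. The flags exported below are hypothetical, just to show the shape of it.

```bash
# Branch on whether we're running inside a Kubernetes pod
if [ -n "${KUBERNETES_SERVICE_HOST:-}" ]; then
  export ENABLE_SERVICE_DISCOVERY=true     # hypothetical flags consumed by the app
  export GRACEFUL_SHUTDOWN_MODE=k8s
else
  export ENABLE_SERVICE_DISCOVERY=false    # keep the legacy bare-metal behavior
  export GRACEFUL_SHUTDOWN_MODE=legacy
fi
```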
Wiring up both environments
In order to keep the databases in sync (using async MySQL replication) and pub/sub working between bare metal and Kubernetes, we had to connect the two networks. At first we looked at the GCP VPN service, but our back-end VoIP services really needed direct IP access (for media processing and SIP signaling). Avoiding network changes to our existing bare-metal environment meant using something like WireGuard. Here are a few options we considered:
Tailscale (WireGuard with some nice benefits)
WireGuard
ZeroTier
Tailscale FTW, because it allowed us to set up a subnet gateway in each datacenter with advertised routes and the ability to disable IP masquerading. This gave us what we needed in order to get direct IP access (using IP routes on the bare-metal side and a few VMs on the GCP side).
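The subnet-router setup on the bare-metal side boiled down to something like this (the CIDR is hypothetical, and advertised routes still have to be approved in the Tailscale admin console):

```bash
# Allow the box to forward packets for the advertised subnet
sudo sysctl -w net.ipv4.ip_forward=1

# Advertise the datacenter subnet and disable source NAT (masquerading)
# so original source IPs are preserved for SIP/RTP
sudo tailscale up \
  --advertise-routes=10.10.0.0/24 \
  --snat-subnet-routes=false

# On the GCP-side VMs, accept the advertised routes
sudo tailscale up --accept-routes
```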
Production deployments
After successfully connecting both environments and proving that all features/functionality worked in each environment (and between them), we re-deployed fresh clusters in GKE to make sure that the install script worked without issues. After the new clusters were deployed in us-south, us-east, and us-central, we began migrating certain customer activity. The flow of a customer's call traffic looked like this:
Handling the SIP INVITE in bare metal
Load balancing the traffic to a specific GKE region of Asterisk instances
After we got more comfortable with GKE handling most of the initial customer traffic, we moved over more customer traffic, to the point where our Horizontal Pod Autoscalers started responding to increases in load within the Kubernetes environment (bringing up new instances and tearing old ones down). This behavior revealed more configuration gaps that needed to be sorted out in order for more graceful startups and shutdowns to take place.
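A CPU-based HPA is a one-liner (the deployment name and thresholds below are hypothetical); memory-based targets and scale-down tuning need an autoscaling/v2 manifest instead.

```bash
# Create a CPU-based Horizontal Pod Autoscaler (hypothetical name/thresholds)
kubectl autoscale deployment asterisk --cpu-percent=60 --min=3 --max=20

# Watch current utilization vs. target while traffic ramps up and down
kubectl get hpa asterisk --watch
```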
Again, we had to think about everything much differently than before. We had to test and plan for ALL component failures at any time, as well as combinations of different components failing and scaling at the same time. This is where we made good use of having sidecar apps attached to all major components within the system.
The final countdown 💣
After successfully migrating most customers and leaving them on GKE for ~4 months, we felt comfortable enough to finalize the migration by swapping DNS records. Thankfully there was just one record for Web/API and one for SIP (along with the respective SRV records).
We migrated the Web DNS to our global LB first because it wasn't hugely business-impactful; if something went wrong, calls could still take place.
After successfully migrating the Web DNS, we migrated the SIP DNS. In GKE you can do 1:1 NAT, but you can't and shouldn't use network LBs for SIP proxying; they are not SIP-aware (although Envoy proxy is currently working on this). Swapping the SIP DNS records had a fairly small impact, and we could watch in Grafana as registered endpoints drained from the bare-metal environment and got added to the GKE environment. Also, now that we were taking advantage of Google Cloud DNS geo-aware records, endpoint traffic was being distributed fairly evenly across the datacenters (based on where our customers are spread out over the US). This was such a delight. 🎵
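For reference, a geo-routed A record in Cloud DNS looks roughly like this. The zone, hostname, and IPs are hypothetical, and the flag syntax is worth double-checking against gcloud dns record-sets create --help for your gcloud version.

```bash
gcloud dns record-sets create sip.example.com. \
  --zone=example-zone \
  --type=A --ttl=300 \
  --routing-policy-type=GEO \
  --routing-policy-data="us-east1=203.0.113.10;us-central1=203.0.113.20;us-south1=203.0.113.30"
```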
Closing
With a few successful months behind us, we decommissioned the old bare-metal servers and unwired the Tailscale networking from both sides. This was a bit nerve-racking; it felt like taking the training wheels off. There was a sense of relief and celebration, but also the loss of something familiar. We are now a year and a half into a successful Kubernetes implementation, and I love sharing and talking about our struggles with anyone interested in building in Kubernetes.
Here's a 45-minute talk with Luca discussing our stack if you're interested in learning more.