On Thursday evening I got an alert that outbound network traffic on one of our servers had exceeded its normal level by about 100-fold.
Why do we have these alerts? For security. I want to know if inbound or outbound traffic is too high. For inbound traffic, this could be the sign of a DoS attack. For outbound traffic this could indicate that one of our servers has been compromised and is now being used to send spam, distribute malware, or exfiltrate data.
I won’t get into the investigation here (the server wasn’t compromised). Instead I want to talk about how we’ve reduced the risk of a compromised server by harnessing the resiliency we designed into our operations. We can just blow everything away and start from clean images. This is a really big hammer, but because of the built-in resiliency, it’s non-disruptive.
Much of our service runs in a Kubernetes cluster. Using Kubernetes and containers is the first step towards making this “big hammer” work. Because everything runs in a container, from container images, we know that we can replace those running containers with new ones from clean images from our repository. Kubernetes makes this easy because it manages rolling updates (inherently with Deployments, but also managed via Pod Disruption Budgets).
This can be done simply with kubectl drain, which evicts the pods from a node. If the pods are running under a Deployment, Job, or similar controller, the controller will recreate them and the scheduler will place them on another node, using the pristine container image. You can then rebuild the node from a known-good image or delete it. In our case, we use autoscaling on our cluster, which allocates new nodes as we remove the old ones.
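As a rough sketch of the drain-and-replace step described above, the sequence might look like the following. The function name `replace_node` and the `KUBECTL` override are hypothetical helpers added for illustration, not part of our tooling, and the sketch assumes an autoscaler or node pool will supply the replacement node. On older kubectl versions the `--delete-emptydir-data` flag is spelled `--delete-local-data`.

```shell
# Hypothetical helper: evict everything from one node, then remove the node.
# KUBECTL defaults to the real CLI; set KUBECTL="echo kubectl" to preview
# the commands without touching a cluster.
KUBECTL="${KUBECTL:-kubectl}"

replace_node() {
  local node="$1"
  # Mark the node unschedulable so no new pods land on it.
  $KUBECTL cordon "$node" || return 1
  # Evict the pods. Evictions honor Pod Disruption Budgets; DaemonSet pods
  # are skipped, and emptyDir data on the node is discarded.
  $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
  # Remove the node object (assumes something else provisions a replacement).
  $KUBECTL delete node "$node"
}
```

Calling `replace_node NODE_NAME` with a name taken from `kubectl get nodes` would run the three steps in order and stop at the first failure.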
With Google Kubernetes Engine, if there’s a Kubernetes update available, another way to accomplish this is to tell GKE to update the master and then node pools to the latest Kubernetes version. This ends up replacing every pod and resetting every node.
    gcloud container clusters upgrade [CLUSTER_NAME] --master
    gcloud container clusters upgrade [CLUSTER_NAME] --node-pool=[NODE-POOL-NAME]
We’ve been building resiliency into our system for years, particularly around batch data loading. Our own monitoring and orchestration ensure that data processing recovers as soon as possible. Under Kubernetes we adapted our operational architecture for resilience as well. A few things that we’ve done:
- All our services (Deployments in Kubernetes terms) run at least 2 pods so that they can be rolled over to new images, by setting replicas in the Deployment spec:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: app
    spec:
      replicas: 5
      ...
- We attach a Pod Disruption Budget to each Deployment to control how many of its pods must stay up while it is being rolled over:
    apiVersion: policy/v1beta1
    kind: PodDisruptionBudget
    metadata:
      name: app-pdb
    spec:
      minAvailable: 2
      selector:
        matchLabels:
          purpose: app
      ...
- We tuned terminationGracePeriodSeconds to allow things to wrap up gracefully:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: worker
    spec:
      ...
      template:
        metadata:
          labels:
            ...
        spec:
          terminationGracePeriodSeconds: 600
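Before draining a node, it can be worth checking that a Pod Disruption Budget like the one above currently leaves room for an eviction. The following is a minimal sketch under that assumption: the function name `pdb_allows_eviction` and the `KUBECTL` override are hypothetical, and it reads the `disruptionsAllowed` field from the PDB status.

```shell
# Hypothetical helper: succeed only when the named PDB currently permits
# at least one voluntary eviction.
KUBECTL="${KUBECTL:-kubectl}"

pdb_allows_eviction() {
  local pdb="$1" allowed
  # status.disruptionsAllowed is the number of pods that may be evicted now.
  allowed="$($KUBECTL get pdb "$pdb" -o jsonpath='{.status.disruptionsAllowed}')" || return 1
  [ "${allowed:-0}" -gt 0 ]
}
```

With the example budget above, `pdb_allows_eviction app-pdb` would succeed as long as more than two matching pods are healthy. When the budget is exhausted, kubectl drain waits and retries the eviction rather than taking the service below its minimum.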
In addition to the resilience, containers and Google Kubernetes Engine give us some tools to increase security and protect the systems from compromise:
- We use Google’s Container-Optimized OS (COS) on nodes, which has an immutable root filesystem, does not allow kernel extensions, and does not execute non-container software (see more security details)
- Our containers store nothing locally and have no local storage (so the running code can’t overwrite the container)
- We set limits on memory and CPU for all containers, enforced by Kubernetes: a container that exceeds its memory limit is killed and restarted right away, and one that hits its CPU limit is throttled
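As an illustration of the resource limits mentioned in the list above, limits can be applied to an existing Deployment from the command line. This is only a sketch: the function name `set_limits`, the `KUBECTL` override, and the values in the usage note are hypothetical and not our actual settings.

```shell
# Hypothetical helper: set CPU and memory limits on every container
# in a Deployment.
KUBECTL="${KUBECTL:-kubectl}"

set_limits() {
  local deployment="$1" cpu="$2" memory="$3"
  # Changing the pod template triggers a rolling update of the Deployment.
  $KUBECTL set resources deployment "$deployment" --limits="cpu=${cpu},memory=${memory}"
}
```

For example, `set_limits app 500m 512Mi` would cap each container in the app Deployment at half a CPU core and 512 MiB of memory. In practice the same values would more likely live under resources.limits in the Deployment manifest, so that they are versioned with the rest of the configuration.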
We implemented resilience as a practice in our operations to improve our customer experience, but it has improved our security as well: we can replace anything immediately, and we constrain our running systems within limits that allow outliers to be terminated and replaced.