We’ve been running a bit of a bake-off of observability tools for the past few years. Over the last couple of months we integrated Google Cloud Monitoring (formerly Stackdriver Metrics) to track custom application metrics. The impetus for this was twofold:
Reduced cognitive load: it’s baked into Google Cloud so it’s theoretically one less tool to have (although GCP Console is so vast I think of it as many tools)
Reduced operational costs
The upshot is that it is inordinately complicated to get working (both the engineering and the visualizations), and I don’t think the effort to implement it is worth any savings in ongoing costs. Even though Stackdriver was founded in 2012, the tooling feels really basic compared to competitors (likely because of the long integration time into Google Cloud). The off-the-shelf reports are very high-level, and the interface is buggy and slow.
On Thursday evening I got an alert that one of our servers’ outbound network traffic exceeded normal levels. By about 100-fold.
Why do we have these alerts? For security. I want to know if inbound or outbound traffic is too high. For inbound traffic, this could be the sign of a DoS attack. For outbound traffic this could indicate that one of our servers has been compromised and is now being used to send spam, distribute malware, or exfiltrate data.
I won’t get into the investigation here (the server wasn’t compromised). Instead I want to talk about how we’ve reduced the risk of a compromised server by harnessing the resiliency we designed into our operations. We can just blow everything away and start from clean images. This is a really big hammer, but because of the built-in resiliency, it’s non-disruptive.
We run a Rails application as Docker containers in Kubernetes. Our application and services have a fair number of scheduled tasks, and before we moved to containers these ran via cron on the server’s VM. When we first moved to Docker, we migrated by deploying a container that just ran our cron jobs. We’ve since migrated those to native Kubernetes CronJobs, which vastly improved the resiliency of our system.
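For illustration, a minimal CronJob manifest looks something like the sketch below. The names, schedule, image, and rake task are all hypothetical, not our actual configuration (and note that on older clusters the `apiVersion` was `batch/v1beta1` rather than `batch/v1`):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report            # hypothetical name
spec:
  schedule: "0 2 * * *"           # standard cron syntax: 02:00 every day
  concurrencyPolicy: Forbid       # don't start a new run if the last is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: gcr.io/example/app:latest   # hypothetical image
              command: ["bundle", "exec", "rake", "reports:nightly"]
```

Each scheduled run is its own pod, so a failed or stuck task doesn’t take down a long-lived cron container, and Kubernetes handles retries and scheduling for you.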
We recently moved several of our projects to the new Google Cloud Build for building container images and pushing them to the repository. It’s a pretty simple system (not a full CI), but it does the job well, and I liked having the “build” part separate from the “run tests” part of the toolchain. That said, I feel like this is among the many tools that leave me writing bash scripts in YAML.
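For a sense of scale, a basic `cloudbuild.yaml` for building and pushing an image is only a few lines (the image name here is a placeholder; `$PROJECT_ID` and `$COMMIT_SHA` are substitutions Cloud Build provides):

```yaml
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/app:$COMMIT_SHA', '.']
images:
  - 'gcr.io/$PROJECT_ID/app:$COMMIT_SHA'
```

Anything beyond that, though, tends to turn into shell one-liners packed into `args` arrays, which is where the “bash scripts in YAML” feeling comes from.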
We recently upgraded many of our services to Rails 5.2, which installs bootsnap to make start-up faster. However, bootsnap depends on caching data to the local file system, and our production containers run with read-only file systems for security. So I decided to remove bootsnap in production:
group :development, :test do
  gem 'bootsnap', '~> 1.3'
end
require "bundler/setup" # Set up gems listed in the Gemfile.

# bootsnap is an optional dependency, so if we don't have it, that's fine.
# Do not load in production because the file system (where the cache would be written) is read-only.
begin
  require "bootsnap/setup" # Speed up boot time by caching expensive operations.
rescue LoadError
end
Last year we changed our EC2 setup from long-running instances to spot instances launched on demand. This reduced our EC2 bill by 98%. It also ensured that every instance was built with the latest image and security patches and ran only as long as needed.
We run lots of background jobs and background workers. Some of these have a pretty consistent load and some vary greatly. In particular, we have a background process that can consume 30GB or more of memory and run for over a day. For smaller workloads it can complete in 15 minutes (consuming far less memory). At other times this queue can be empty for days.
Traditionally, Resque is run with one or more workers monitoring a queue and picking up jobs as they show up (and as a worker is available). To support jobs that could scale to 30GB, that meant allocating 30GB per worker in our cluster. We didn’t want to allocate lots of VMs to run workers that might be idle much of the time. In fact, we didn’t really want to allocate any VMs to run any workers when there were no jobs in the queue. So we came up with a solution that uses Kubernetes Jobs to run Resque jobs and scale from zero.
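The core of the idea can be sketched in a few lines of Ruby: when work shows up, build a Kubernetes Job manifest whose container runs a Resque worker with `INTERVAL=0`, which makes the worker drain the queue and exit so the Job completes and its resources are freed. This is a simplified sketch, not our production code; the image path, Job naming, and memory figures are hypothetical:

```ruby
require "json"

# Build a Kubernetes Job manifest that runs a single Resque worker.
# With INTERVAL=0, `rake resque:work` exits as soon as the queue is
# empty, so the Job completes and the cluster can scale back to zero.
def resque_job_manifest(queue:, memory:, image: "gcr.io/example/app:latest")
  {
    "apiVersion" => "batch/v1",
    "kind"       => "Job",
    "metadata"   => { "generateName" => "resque-#{queue}-" },
    "spec" => {
      "backoffLimit" => 0,
      "template" => {
        "spec" => {
          "restartPolicy" => "Never",
          "containers" => [{
            "name"    => "worker",
            "image"   => image,
            "command" => ["bundle", "exec", "rake", "resque:work"],
            "env" => [
              { "name" => "QUEUE",    "value" => queue },
              { "name" => "INTERVAL", "value" => "0" }
            ],
            # Request the memory this queue's jobs actually need, instead of
            # reserving worst-case memory for always-on workers.
            "resources" => { "requests" => { "memory" => memory } }
          }]
        }
      }
    }
  }
end

puts JSON.pretty_generate(resque_job_manifest(queue: "reports", memory: "30Gi"))
```

The generated manifest can be submitted with `kubectl apply -f -` or via the Kubernetes API; the important design choice is that each burst of queue activity becomes a short-lived Job sized for that workload, rather than a permanently allocated worker.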
We determined that those whole-web or industry-wide CTR-by-rank charts that many marketers use to predict performance have little bearing on their specific site or topic.
Bottom line? We found that averages, even when segmented by query type, didn’t provide much actionable data for a specific site. When we compared averages to site-specific data, we found little resemblance.
However, we did find that average click-through rates within a site tended to hold fairly steady, so using your own site’s actual averaged click-through rates can be very useful for things like calculating the market opportunity of new content investments, estimating the impact of ranking changes, and pinpointing issues with the site’s listings on the search results page.
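The market-opportunity calculation above is simple multiplication once you have your own CTR-by-rank numbers. A small sketch (all figures here are hypothetical, standing in for the averages you’d compute from your own Search Console data):

```ruby
# Hypothetical per-site averages; in practice these come from your own
# query data (impressions and clicks by position), not an industry chart.
SITE_CTR_BY_RANK = { 1 => 0.22, 2 => 0.12, 3 => 0.08, 4 => 0.05, 5 => 0.03 }

# Estimate monthly clicks given a query's monthly search volume and the
# rank you expect to achieve. Ranks we have no data for fall back to 1%.
def estimated_clicks(monthly_searches, expected_rank)
  (monthly_searches * SITE_CTR_BY_RANK.fetch(expected_rank, 0.01)).round
end

# Opportunity of moving from rank 5 to rank 2 on a 10,000-search query:
current  = estimated_clicks(10_000, 5)  # 300
improved = estimated_clicks(10_000, 2)  # 1200
puts "Opportunity: +#{improved - current} clicks/month"
```

The same function works for the other uses mentioned: compare estimated to actual clicks at a known rank to spot listing problems, or re-run it with a changed rank to estimate the impact of a ranking shift.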
Over at Keylime Toolbox, we have a feature that lets you test filter patterns against your query data. To make it “fast” we limit the data to the most recent day, but that can still be 50,000 or more queries. Rendering all of those into a list (with some styling, of course) would make the browser unresponsive for a time and sometimes crash it.
After hours of debugging and investigating options, I finally fixed this by limiting the number of items we render initially and then adding “infinite scroll” to the lists to load more items as you scroll.
We were talking the other day at Ada Developers Academy about whether StackOverflow has an increasing barrier to entry. In order to play you have to answer a question. This was easier when it first started (and when it first started, I had lots of experience). In the last few years, many, many of the beginner questions have already been answered, and for beginners it’s even harder to find questions you can answer. The students I was talking to were frustrated by this.
At the same time, I was lamenting the fact that the few questions I’ve asked don’t get answered. Maybe because I only ask questions I can’t find the answer to and that are really hard. Or maybe they are uninteresting.
So I checked and discovered that this morning StackOverflow lists 1,687,405 unanswered questions of 7,016,543 total. That’s nearly 25% unanswered! This really surprised me.
So, given the sheer volume of questions, I’m sure there are some that beginning programmers can answer. Finding them may be the challenge.
There’s an interesting post on meta.stackoverflow.com where the top responders seem to think that the quality of questions is getting lower and more repetitive. One responder to this meta question decried the questions on SO as having been “answered 100 times” or being “do my work for me” questions. So from their perspective, a quarter of questions go unanswered because they have been answered before or because the asker hasn’t done sufficient pre-work to warrant asking.