job1 retro

looking back on what I did for the past 15 months

SEAN K.H. LIAO

retrospective

Looking back, (re)evaluating the choices made, and plotting the path forward.

Impact

In the final days, I counted my code contributions as: +193k lines, -278k lines over 590 Pull Requests. Does it mean anything? Probably not. I consider GitOps and the supporting automation to be the most interesting feature work I completed, and the work most likely to have a lasting impact, while the cleanup of technical debt and the related standardization work was the equivalent in the more mundane department.

Infrastructure

The platform on which all the business logic runs.

IaC Terraform

Terraform is halfway between code and data. Not powerful enough to stuff in all the logic you would want, yet not declarative enough to be processed by external tooling efficiently. Our tool of choice was Atlantis, run from GitHub Pull Requests, but this would occasionally result in partial applies and entangled messes of locks. And while the infrastructure description as resources was more or less declarative, the apply process was imperative and run on demand only, resulting in state drift that would need to be reconciled manually.
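For illustration, drift like that can at least be surfaced automatically by running a read-only plan on a schedule. A rough Go sketch (the module path is made up), leaning on terraform plan's -detailed-exitcode, where exit code 2 means there are pending changes:

```go
// driftcheck runs "terraform plan -detailed-exitcode" in a given directory
// and reports whether the live infrastructure has drifted from the config.
// Exit code 0 = no changes, 1 = error, 2 = pending changes.
package main

import (
	"errors"
	"fmt"
	"log"
	"os/exec"
)

func main() {
	dir := "./environments/prod" // hypothetical module path

	cmd := exec.Command("terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false")
	cmd.Dir = dir
	out, err := cmd.CombinedOutput()

	var exitErr *exec.ExitError
	switch {
	case err == nil:
		fmt.Println("in sync: no drift detected")
	case errors.As(err, &exitErr) && exitErr.ExitCode() == 2:
		fmt.Println("drift detected, plan output:")
		fmt.Println(string(out))
	default:
		log.Fatalf("terraform plan failed: %v\n%s", err, out)
	}
}
```

Run as a cron job or scheduled CI job, this only tells you that drift exists; deciding whether to apply or to import/adjust the config is still a human problem.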

Your experience with terraform depends a lot on the maturity of the actual providers: some, like cloudflare and google, are more or less stable, while others, like github, confluent, and opsgenie, will leave you wondering why you bothered at all. It's also important to understand terraform's dependency chains, lest you end up with what appears to be an elegant setup that only works because you imported existing resources, and can't actually be applied from scratch.

GCP's Config Connector would have been an interesting thing to look at, solving config drift and data vs code. If only other providers could work as Kubernetes controllers.

IaC Kubernetes

Kubernetes was our deployment platform of choice, specifically Google Kubernetes Engine. I think I did a lot of the grunt work of migrating from a weird Ansible setup that only updated manifests when deployed manually (but not through CI), to one that used Helm and Makefiles for standardization, and eventually to having the rendered charts continuously reconciled by ArgoCD.

A story I used a lot in interviews was how annoyed I was when I learned that I had to update a lot of components dating from 2+ years ago and do the deploys at night (where was my work life balance??). Anyway, the solution was to do the things that hurt more often: continuous dependency updates by renovate and continuous deploys by ArgoCD.

Even after writing the standard helm chart within the company (made possible by the applications following 12 Factor App principles), I still think Helm is a horrible technology. The Kubernetes Resource Model is one of structured data, yet Helm treats it as unstructured text, with no attempt to validate even the output yaml for syntax correctness. The Declarative application management design proposal details other issues.
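For illustration, the missing syntax check is small enough to bolt on yourself. A rough Go sketch (the chart path and release name are made up) that pipes helm template output through a YAML decoder and fails on the first document that doesn't parse:

```go
// rendercheck renders a chart with "helm template" and checks that every
// document in the output is at least syntactically valid YAML, the step
// Helm itself skips because it treats manifests as plain text.
package main

import (
	"bytes"
	"errors"
	"io"
	"log"
	"os/exec"

	"gopkg.in/yaml.v3"
)

func main() {
	chart := "./charts/standard-app" // hypothetical chart path

	out, err := exec.Command("helm", "template", "release-name", chart).Output()
	if err != nil {
		log.Fatalf("helm template: %v", err)
	}

	dec := yaml.NewDecoder(bytes.NewReader(out))
	for i := 0; ; i++ {
		var doc any
		err := dec.Decode(&doc)
		if errors.Is(err, io.EOF) {
			break
		}
		if err != nil {
			log.Fatalf("document %d is not valid YAML: %v", i, err)
		}
	}
	log.Println("all rendered documents parse as YAML")
}
```

In practice you'd want schema validation against the API server (or a dry-run apply) on top of this, but even a bare parse step catches the templating accidents Helm happily ships.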

I still consider the branching strategy we used to be weird, but I was never motivated enough to push for any change. Had we done the whole thing a year later, I might have opted for GCP Anthos Config Management / Config Sync.

IaC Self Service

We tried to make the IaC setup self service for the development teams, but this would regularly run into problems: finding the correct config to add/modify, a lack of documentation on best practices, and uncertainty about config correctness without applying it. For established workflows, pointing to an existing example to copy/modify was often enough, but newer things required handholding. With documentation in 4 potential locations (next to code, doc site, confluence, upstream), finding the correct information was never easy.

I really wanted to follow the documentation system: tutorials, how-tos, explanations, reference. But doing so approached a full time job of keeping things up to date... (maybe we managed too much).

Continuous Integration / Continuous Deployment

With CI somewhere in our squad name, this more or less expectedly took up a large part of my time.

Jenkins

Jenkins is best described as a generic task runner; with enough plugins you can make it do anything, including being a CI/CD system. When I joined there was an initiative to replace it with something else, so I more or less avoided touching it for a few months. After understanding how it was used, my opinion was that it was too entrenched to be replaced at that point in time, and that it was better to break it up into more manageable pieces, eg handing off CD to ArgoCD.

I then spent quite some time refactoring the existing Jenkins setup: seeing as we were stuck with it, we might as well make it nicer, eg by standardizing jobs, making the setup declarative, and making the inline scripts testable. Maybe it ended up a bit too nice and now it's even more entrenched...

I think Tekton came the closest in terms of pure capability, but the management layer on top wasn't polished enough, and devs like their UIs (also why we went with ArgoCD over Flux).

Observability

This honestly hurt the most, both in day to day pain points and in areas where we made little progress.

Logging

Fluentd feeding into Elasticsearch. Both unstable resource hogs.

Who had the bright idea of adding plugins to do log enrichment in fluentd? It should have remained simple and just been responsible for collect, parse/transform, send. That way there would have been more options to replace it with something lighter weight, eg fluentbit or OpenTelemetry Collector.

Despite running Elastic Cloud's managed Elasticsearch, it barely counted as managed and would regularly fail. I just wanted to store and search logs, there were so many better options...

Metrics

Prometheus was mostly stable, except when it would occasionally fail and lose data, oops. The frustrating part was the inconsistency in the metrics generated by applications, which made using them for dashboarding or alerting a pain.
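For illustration, the kind of thing that would have helped is a tiny shared instrumentation package, so every service exposes the same metric names, labels, and units. A hedged sketch with prometheus/client_golang (the namespace and label set are made up, not what we actually ran):

```go
// Package metrics sketches a shared convention for HTTP server metrics so
// that every service exposes the same names, labels, and units, making
// dashboards and alerts reusable across teams.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
)

// RequestDuration is one histogram, in seconds, with a fixed label set
// shared by all services instead of per-team ad hoc metrics.
var RequestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "app", // hypothetical company-wide prefix
		Name:      "http_request_duration_seconds",
		Help:      "HTTP request latency by handler, method, and status code.",
		Buckets:   prometheus.DefBuckets,
	},
	[]string{"handler", "method", "code"},
)

func init() {
	prometheus.MustRegister(RequestDuration)
}

// Observe records a single request; services call this from middleware
// rather than defining their own metrics with slightly different names.
func Observe(handler, method, code string, seconds float64) {
	RequestDuration.WithLabelValues(handler, method, code).Observe(seconds)
}
```

With a convention like this, one dashboard and one alert rule cover every service; the hard part is social, getting teams to adopt it, not technical.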

Tracing

For the short while that my PoC ran, I actually had visibility into the service architecture, longstanding performance chokepoints, and what amounted to very fine grained metrics. Sadly, not enough buy-in for it to continue.
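For reference, the per-service cost of tracing is small once a tracer provider is wired up. A minimal OpenTelemetry Go sketch (the exporter and names are placeholders; the PoC itself used a different backend):

```go
// A minimal tracing setup: a tracer provider exporting to stdout, and one
// span wrapped around a unit of work. Swapping the exporter (eg for an
// OTLP one pointing at a collector) is the main difference between a PoC
// and a real deployment.
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	exp, err := stdouttrace.New()
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	tracer := otel.Tracer("example-service") // hypothetical instrumentation name
	ctx, span := tracer.Start(ctx, "handle-request")
	defer span.End()

	_ = ctx // downstream calls receive ctx so their spans join the same trace
}
```

The real value shows up when the context crosses service boundaries: the propagated trace is what gave visibility into the service architecture and the chokepoints.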

Alerting

From what I could see, every time something went wrong, someone added an alert. What do you actually do when an alert fires? Who knows. The runbooks were often empty, or contained non-actionable items. If I have 30 alerts going off every time a component goes down, which instruction set do I follow? Do I even have a way to diagnose which part broke? Also, it never felt like there was a go-to place for communicating the progress/status of ongoing incidents.

People and Process

Tickets and Pull Requests

I got feedback of "create more tickets", "fill out ticket sections", "tickets are too prescriptive", and "tickets aren't prescriptive enough". I probably could have just copied the PR descriptions I wrote into a ticket; maybe it would have helped my team members track their progress. I myself just had a Github project board for it all...

Communication

Just before I left, Microsoft Teams added a compact layout, making it slightly more usable for actually seeing information in a chat. It also made scroll lag worse, search was still mostly useless, and why did it need 2 different tabs?

As for email, we were often CC'd on things via an alias, but this meant only the current team members had a copy of the email. I would have preferred the emails to go to an archived mailing list, which I believe we had access to via Google Groups; that way information wouldn't be lost as team membership changed.

Helping Other People

I would argue that I often already knew the places people would have problems with; it was just a matter of whether I could be motivated to preemptively do anything about it.

I mostly stuck to no hello: if it wasn't important/urgent enough to state the issue upfront, it could wait.

I treated emails as a slower universal chat: my emails were also mostly short, and I'm still trying to find the balance of when I can drop the opening "hi" (after a few back and forths it becomes tedious and more like a slower chat service anyway...). As for signing, I just stuck with my name to indicate the end of a message, or the longer versions for the people I was less familiar with:

- sean

- sean / $squad / company

- sean / $title / company

Organization Chart

I think I complained about this before, but I never got a clear view of who's where in the organization and responsible for what. So the first few months were a lot of: we need the owner of x to do something, who do I ask??

Remote Work

Over the course of 17 months, I went into the office a grand total of 27 times, with the first time meeting my team in person at 5 months in. Productivity wise, remote didn't feel too bad: there was time to focus and ignore people, and I could wake up just minutes before morning standup. But it did mean building interpersonal relations was much harder, and you lose the more informal knowledge sharing, both as the person sharing knowledge and as the one picking things up.

Closing thoughts

I certainly didn't cover everything I did here (cloudflare, security, etc.), and there were also things I really wanted but didn't get to do (zero trust, SLO, etc.). It was a fun, relaxed year and a half, and I'd like to think I left the place better than I found it. Ultimately, I left when the things I thought within reach were done, with the remainder being more rooted in organizational issues than I was prepared to solve.