So this quarter's big goal was finops, and someone came up with the excellent :nerdsnipe: of: "our s3 bucket we use to serve helm charts costs us $20k / month".
First off: wtf.
Second: this looks like an easy win.
We publish helm charts using helm-s3 to an S3 bucket, and serve them locally in-cluster with chartmuseum. So what could be wrong?
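For context, the publishing side looks roughly like this; a sketch using the helm-s3 plugin, with a made-up bucket and repo name:

# one-time: install the plugin and initialize the repo in the bucket
helm plugin install https://github.com/hypnoglow/helm-s3.git
helm s3 init s3://example-charts-bucket/charts

# add the bucket-backed repo, then package and push a chart
helm repo add our-charts s3://example-charts-bucket/charts
helm package ./my-service
helm s3 push ./my-service-1.2.3.tgz our-charts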
Turns out that chartmuseum dynamically generates its index.yaml on every request, we'd disabled writing generated indexes back to storage, and we'd set a very low cache time, so every time, chartmuseum downloads the entire bucket to generate a fresh index.yaml.
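In chartmuseum terms, the setup amounted to something like the invocation below (flag names as I understand them from the chartmuseum docs; the bucket, prefix, and values are illustrative, not our real config):

# --disable-statefiles: don't write the generated index (index-cache.yaml) back to storage
# --cache-interval: how often chartmuseum re-checks storage before regenerating the index
chartmuseum \
  --port=8080 \
  --storage=amazon \
  --storage-amazon-bucket=example-charts-bucket \
  --storage-amazon-prefix=charts \
  --storage-amazon-region=us-east-1 \
  --disable-statefiles \
  --cache-interval=1m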
This also meant that our runbooks for yanking a published chart version, and our CI process of running helm s3 reindex, were wrong, since they modified a stored index.yaml that wasn't used.
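For reference, the yank-and-reindex flow in those runbooks is the usual helm-s3 one (hypothetical chart, version, and repo names), and the index.yaml it maintains in the bucket was exactly the file chartmuseum was ignoring:

# remove a published chart version from the bucket and from the stored index.yaml
helm s3 delete my-service --version 1.2.3 our-charts

# rebuild the stored index.yaml from the chart objects in the bucket
helm s3 reindex our-charts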
Thinking some more, I realized we didn't need dynamic index generation, and we weren't using any of chartmuseum's features besides turning http calls into S3 API calls. I could easily replace this with something much simpler. And so I wrote up an envoy config (see below).
The next day, as I prepared to roll out my envoy config, I noticed that our request volume to the bucket had dropped off a cliff since the previous afternoon, but there was still a high baseline. As far as I knew, nothing had changed: no chartmuseum config changes in the last month, ArgoCD was last changed a week ago, and deployment script changes were just adding logging. We have a mystery on our hands.
Thankfully, another engineer had looked into our bucket costs a month ago and turned on S3 Server Access Logs, so I went ahead and dumped 3 days of logs to compare (this took forever...).
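A rough sketch of the kind of digging this involved, with placeholder log bucket and prefix names:

# grab the access logs for the window of interest
aws s3 sync s3://example-access-logs/charts-bucket/ ./logs/

# count requests per aws-sdk-go version seen in the User-Agent field
cat ./logs/* | grep -o 'aws-sdk-go/[0-9][0-9.]*' | sort | uniq -c | sort -rn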
First, I verified who was making the calls: it was the IAM User for chartmuseum. Second, a peek at their User-Agent: it went from aws-sdk-go v1.37.28 to aws-sdk-go v1.44.288, which was a version upgrade we did, but one that should have rolled out a month ago. This was suspicious.
Looking at one of our dev clusters, nothing seemed amiss: the upgrade happened a month ago as expected.
Looking at one of our prod clusters, I noticed the rollout happened when our requests dropped.
I looked into our CI pipeline and saw that our upgrade had rolled out to dev, but prod deployments required a manual approval, and nobody clicked approve until recently (a month after the dev rollout...).
So that was one mystery solved.
But we still had aws-sdk-go v1.37.28 requests even after everything was supposedly upgraded.
The next field I had was the source IP addresses in the access logs.
Poking around a bit, I realized it was the NAT gateways for one of our dev clusters.
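Matching the log's source IPs against NAT gateway public IPs is a quick check against the EC2 API; something like:

# list NAT gateways with their public IPs, then compare against the source IPs in the logs
aws ec2 describe-nat-gateways \
  --query 'NatGateways[].[NatGatewayId, NatGatewayAddresses[].PublicIp]' \
  --output text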
Did I have phantom pods running somewhere?
I looked everywhere but didn't see anything.
Running out of ideas, I logged into the AWS Console, opened the EC2 dashboard,
and was greeted with a suspiciously high number of instances.
Looking at the instance groups,
I realized we had a second cluster running,
part of a failed experiment with blue-green cluster upgrades.
Oh.
That's where the requests were coming from.
I quizzed the team on their plans for the cluster (kill it off later),
and upgraded chartmuseum in there.
So, that envoy config. It works with envoy versions 1.26 - 1.28, the key pieces being the AwsRequestSigning filter and the CacheFilter. RFCF (Response From Cache Filter) in the access log's response flags indicates a cache hit.
admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8081

static_resources:
  listeners:
  - address:
      socket_address:
        address: 0.0.0.0
        port_value: 8080
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: charts
          access_log:
          - name: envoy.access_loggers.stdout
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
          http_filters:
          - name: envoy.filters.http.cache
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.cache.v3.CacheConfig
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.http.cache.file_system_http_cache.v3.FileSystemHttpCacheConfig
                manager_config:
                  thread_pool:
                    thread_count: 1
                cache_path: /tmp/envoy-cache
                max_cache_size_bytes: 2147483648 # 2 GiB
          - name: envoy.filters.http.aws_request_signing
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.aws_request_signing.v3.AwsRequestSigning
              service_name: s3
              region: SOME_AWS_REGION
              use_unsigned_payload: true
              host_rewrite: SOME_BUCKET_NAME.s3.SOME_AWS_REGION.amazonaws.com
              match_excluded_headers:
              - prefix: x-envoy
              - prefix: x-forwarded
              - exact: x-amzn-trace-id
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            name: s3
            virtual_hosts:
            - name: all
              domains:
              - "*"
              routes:
              # don't cache index
              - match:
                  path: /index.yaml
                route:
                  cluster: s3_clusters
              # don't cache dev charts
              - match:
                  safe_regex:
                    google_re2: {}
                    regex: ".*-dev-.*"
                route:
                  cluster: s3_clusters
              # cache everything else
              - match:
                  prefix: /
                route:
                  cluster: s3_clusters
                response_headers_to_add:
                - header:
                    key: Cache-Control
                    value: max-age=86400 # 1 day
                  append_action: OVERWRITE_IF_EXISTS_OR_ADD

  clusters:
  - name: s3_clusters
    type: LOGICAL_DNS
    connect_timeout: 5s
    dns_lookup_family: V4_ONLY
    load_assignment:
      cluster_name: s3_clusters
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: SOME_BUCKET_NAME.s3.SOME_AWS_REGION.amazonaws.com
                port_value: 443
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
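Clients just treat the proxy as a plain HTTP chart repo. A quick smoke test looks something like this (the in-cluster service name here is made up):

# point helm at the proxy instead of chartmuseum
helm repo add our-charts http://charts-proxy.infra.svc.cluster.local:8080
helm repo update

# pull the same chart twice: the second request should be served from envoy's
# local cache, showing RFCF in the access log's response flags
helm pull our-charts/my-service --version 1.2.3
helm pull our-charts/my-service --version 1.2.3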