So this quarter's big goal was finops, and someone came up with the excellent :nerdsnipe: of: "our s3 bucket we use to serve helm charts costs us $20k / month".
First off: wtf.
Second: this looks like an easy win.
We publish helm charts using helm-s3 to an S3 bucket, and serve them locally in-cluster with chartmuseum. So what could be wrong?
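For context, the publishing side looks roughly like this; a sketch using the helm-s3 plugin, with a made-up bucket and repo name:

# one-time: install the plugin and initialize the repo in the bucket
helm plugin install https://github.com/hypnoglow/helm-s3.git
helm s3 init s3://example-charts-bucket/charts

# add the bucket-backed repo, then package and push a chart
helm repo add our-charts s3://example-charts-bucket/charts
helm package ./my-service
helm s3 push ./my-service-1.2.3.tgz our-charts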
Turns out that chartmuseum dynamically generates its index.yaml on every request, we'd disabled writing generated indexes back to storage, and we'd set a very low cache time, so every time, chartmuseum downloads the entire bucket to generate a fresh index.yaml.
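In chartmuseum terms, the setup amounted to something like the invocation below (flag names as I understand them from the chartmuseum docs; the bucket, prefix, and values are illustrative, not our real config):

# --disable-statefiles: don't write the generated index (index-cache.yaml) back to storage
# --cache-interval: how often chartmuseum re-checks storage before regenerating the index
chartmuseum \
  --port=8080 \
  --storage=amazon \
  --storage-amazon-bucket=example-charts-bucket \
  --storage-amazon-prefix=charts \
  --storage-amazon-region=us-east-1 \
  --disable-statefiles \
  --cache-interval=1m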
This also meant that our runbooks for yanking a published chart version, and our CI process of running helm s3 reindex, were wrong, since they modified a stored index.yaml that wasn't used.
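For reference, the yank-and-reindex flow in those runbooks is the usual helm-s3 one (hypothetical chart, version, and repo names), and the index.yaml it maintains in the bucket was exactly the file chartmuseum was ignoring:

# remove a published chart version from the bucket and from the stored index.yaml
helm s3 delete my-service --version 1.2.3 our-charts

# rebuild the stored index.yaml from the chart objects in the bucket
helm s3 reindex our-charts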
Thinking some more, I realized we didn't need dynamic index generation, and we weren't using any of chartmuseum's features besides turning http calls into S3 API calls. I could easily replace this with something much simpler. And so I wrote up an envoy config (see below).
The next day, as I prepared to roll out my envoy config, I noticed that our request volume to the bucket had dropped off a cliff since the previous afternoon, but there was still a high baseline. As far as I knew, nothing had changed: no chartmuseum config changes in the last month, ArgoCD was last changed a week ago, and deployment script changes were just adding logging. We have a mystery on our hands.
Thankfully, another engineer had looked into our bucket costs a month ago and turned on S3 Server Access Logs, so I went ahead and dumped 3 days of logs to compare (this took forever...).
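A rough sketch of the kind of digging this involved, with placeholder log bucket and prefix names:

# grab the access logs for the window of interest
aws s3 sync s3://example-access-logs/charts-bucket/ ./logs/

# count requests per aws-sdk-go version seen in the User-Agent field
cat ./logs/* | grep -o 'aws-sdk-go/[0-9][0-9.]*' | sort | uniq -c | sort -rn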
First, I verified who was making the calls: it was the IAM User for chartmuseum. Second, a peek at their User-Agent: it went from aws-sdk-go v1.37.28 to aws-sdk-go v1.44.288, which was a version upgrade we did, but one that should have rolled out a month ago. This was suspicious.
Looking at one of our dev clusters, nothing seemed amiss: the upgrade happened a month ago as expected.
Looking at one of our prod clusters, I noticed the rollout happened when our requests dropped.
I looked into our CI pipeline and saw that our upgrade had rolled out to dev, but prod deployments required a manual approval, and nobody clicked approve until recently (a month after the dev rollout...).
So that was one mystery solved.
But we still had aws-sdk-go v1.37.28 requests even after everything was supposedly upgraded.
The next field I had was the source IP addresses in the access logs.
Poking around a bit, I realized it was the NAT gateways for one of our dev clusters.
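Matching the log's source IPs against NAT gateway public IPs is a quick check against the EC2 API; something like:

# list NAT gateways with their public IPs, then compare against the source IPs in the logs
aws ec2 describe-nat-gateways \
  --query 'NatGateways[].[NatGatewayId, NatGatewayAddresses[].PublicIp]' \
  --output text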
Did I have phantom pods running somewhere?
I looked everywhere but didn't see anything.
Running out of ideas, I logged into the AWS Console, opened the EC2 dashboard,
and was greeted with a suspiciously high number of instances.
Looking at the instance groups,
I realized we had a second cluster running,
part of a failed experiment with blue-green cluster upgrades.
Oh.
That's where the requests were coming from.
I quizzed the team on their plans for the cluster (kill it off later),
and upgraded chartmuseum in there.
So, that envoy config. It works with envoy versions 1.26 - 1.28, the key pieces being the AwsRequestSigning filter and the CacheFilter. RFCF (Response From Cache Filter) in the access log's response flags indicates a cache hit.
admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8081

static_resources:
  listeners:
  - address:
      socket_address:
        address: 0.0.0.0
        port_value: 8080
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: charts
          access_log:
          - name: envoy.access_loggers.stdout
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
          http_filters:
          - name: envoy.filters.http.cache
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.cache.v3.CacheConfig
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.http.cache.file_system_http_cache.v3.FileSystemHttpCacheConfig
                manager_config:
                  thread_pool:
                    thread_count: 1
                cache_path: /tmp/envoy-cache
                max_cache_size_bytes: 2147483648 # 2 GiB
          - name: envoy.filters.http.aws_request_signing
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.aws_request_signing.v3.AwsRequestSigning
              service_name: s3
              region: SOME_AWS_REGION
              use_unsigned_payload: true
              host_rewrite: SOME_BUCKET_NAME.s3.SOME_AWS_REGION.amazonaws.com
              match_excluded_headers:
              - prefix: x-envoy
              - prefix: x-forwarded
              - exact: x-amzn-trace-id
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            name: s3
            virtual_hosts:
            - name: all
              domains:
              - "*"
              routes:
              # don't cache index
              - match:
                  path: /index.yaml
                route:
                  cluster: s3_clusters
              # don't cache dev charts
              - match:
                  safe_regex:
                    google_re2: {}
                    regex: ".*-dev-.*"
                route:
                  cluster: s3_clusters
              # cache everything else
              - match:
                  prefix: /
                route:
                  cluster: s3_clusters
                response_headers_to_add:
                - header:
                    key: Cache-Control
                    value: max-age=86400 # 1 day
                  append_action: OVERWRITE_IF_EXISTS_OR_ADD

  clusters:
  - name: s3_clusters
    type: LOGICAL_DNS
    connect_timeout: 5s
    dns_lookup_family: V4_ONLY
    load_assignment:
      cluster_name: s3_clusters
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: SOME_BUCKET_NAME.s3.SOME_AWS_REGION.amazonaws.com
                port_value: 443
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
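Clients just treat the proxy as a plain HTTP chart repo. A quick smoke test looks something like this (the in-cluster service name here is made up):

# point helm at the proxy instead of chartmuseum
helm repo add our-charts http://charts-proxy.infra.svc.cluster.local:8080
helm repo update

# pull the same chart twice: the second request should be served from envoy's
# local cache, showing RFCF in the access log's response flags
helm pull our-charts/my-service --version 1.2.3
helm pull our-charts/my-service --version 1.2.3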