messing up tls certs

doing things the hard way and making mistakes

SEAN K.H. LIAO

tls certs the hard way

For $reasons, I had been requesting and updating the TLS certs used in my Kubernetes cluster by hand. Specifically, I used acme.sh to request wildcard certs from Let's Encrypt via a DNS challenge in GCP Cloud DNS.

renewing certs by hand

One uneventful (so far) Saturday, I saw that my certs had a month and a half left, and decided to renew them. Having not written the process down, I searched backwards in shell history (zsh-substring-search is great) for the right command and got some new certs. At the same time, I thought: why not also use Google Public CA, to reduce the monoculture around Let's Encrypt?

$ gcloud alpha publicca external-account-keys create
$ acme.sh --register-account --email $EMAIL --server google --eab-kid $PUBLICCA_KID --eab-hmac-key $PUBLICCA_HMAC
$ acme.sh --server google --ecc --renew --force --dns dns_gcloud --domain '*.liao.dev' --domain '*.ihwa.liao.dev'

This gave me the usual directory of:

.acme.sh/
  *.liao.dev/
    *.liao.dev.cer
    *.liao.dev.conf
    *.liao.dev.csr
    *.liao.dev.key
    ca.cer
    fullchain.cer

using new certs

Not remembering what I used last time, I used the cert *.liao.dev.cer and key *.liao.dev.key as the TLS key pair in a server (Envoy Gateway), and it worked, sort of. Chrome happily connected and verified the cert, but cli tools like curl, openssl, and step-cli failed to verify it:

$ curl https://ihwa.liao.dev
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

$ openssl s_client -connect ihwa.liao.dev:443 </dev/null
CONNECTED(00000003)
depth=0 CN = *.liao.dev
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = *.liao.dev
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 CN = *.liao.dev
verify return:1
---
Certificate chain
 0 s:CN = *.liao.dev
   i:C = US, O = Google Trust Services LLC, CN = GTS CA 1P5
   a:PKEY: id-ecPublicKey, 256 (bit); sigalg: RSA-SHA256
   v:NotBefore: Nov  4 09:32:23 2023 GMT; NotAfter: Feb  2 09:32:22 2024 GMT
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIEqDCCA5CgAwIBAgIRANZ6hF26ru42Dk1AEW+t7eUwDQYJKoZIhvcNAQELBQAw
RjELMAkGA1UEBhMCVVMxIjAgBgNVBAoTGUdvb2dsZSBUcnVzdCBTZXJ2aWNlcyBM
TEMxEzARBgNVBAMTCkdUUyBDQSAxUDUwHhcNMjMxMTA0MDkzMjIzWhcNMjQwMjAy
MDkzMjIyWjAVMRMwEQYDVQQDDAoqLmxpYW8uZGV2MFkwEwYHKoZIzj0CAQYIKoZI
zj0DAQcDQgAED+loglA3i/62NqohbPruCDQnjbtNiffzdMipYWrSBqzdgVE60aNn
zbsI8PFDhGI/lSHNxu6GXpY0XUu4GKdSm6OCAoswggKHMA4GA1UdDwEB/wQEAwIH
gDAdBgNVHSUEFjAUBggrBgEFBQcDAQYIKwYBBQUHAwIwDAYDVR0TAQH/BAIwADAd
BgNVHQ4EFgQUFxukPjhCo6SqgU941B8UOgwJx9MwHwYDVR0jBBgwFoAU1fyeDd8e
yt0Il5duK8VfxSv17LgweAYIKwYBBQUHAQEEbDBqMDUGCCsGAQUFBzABhilodHRw
Oi8vb2NzcC5wa2kuZ29vZy9zL2d0czFwNS9Zcm9wWXhkZnlmNDAxBggrBgEFBQcw
AoYlaHR0cDovL3BraS5nb29nL3JlcG8vY2VydHMvZ3RzMXA1LmRlcjAmBgNVHREE
HzAdggoqLmxpYW8uZGV2gg8qLmlod2EubGlhby5kZXYwIQYDVR0gBBowGDAIBgZn
gQwBAgEwDAYKKwYBBAHWeQIFAzA8BgNVHR8ENTAzMDGgL6AthitodHRwOi8vY3Js
cy5wa2kuZ29vZy9ndHMxcDUvazRiRnFycUNBVkkuY3JsMIIBAwYKKwYBBAHWeQIE
AgSB9ASB8QDvAHYASLDja9qmRzQP5WoC+p0w6xxSActW3SyB2bu/qznYhHMAAAGL
meQV3QAABAMARzBFAiEAzJ7lwFWIIjzDNGMkPjryL3MWd2V1jkp2YYbFNsyOAI4C
IHjJ6a5gvz1p770j/+gB6PB9Qmd30922a2ylz2ZEGh6iAHUA7s3QZNXbGs7FXLed
tM0TojKHRny87N7DUUhZRnEftZsAAAGLmeQVuwAABAMARjBEAiBrJBSC0vkCyKhs
YZQnAFPvf5/W6i8PhjjF9yxVGXBdogIgJY0tSHO5j6qmgK8PtfdDJBw0tFSXuYJn
qv43QUazYEcwDQYJKoZIhvcNAQELBQADggEBAK6lM60o3cP6U7ahR+cbZE07JO/b
8dtrau0d89x8j8+d7/FIhmERzEgLlNGJzMliGxUuXu4RbBbV5U9DRkr2GnC+Pzyk
1qnpEOKdVQ7o7BzJ3AH/jtJMdJQ1dvaF8Z1NJZb0sj0lvUMoQt5DpSFFRzUO9U7l
Km72HxJFPG5JTjr6aYW5WDee/bHbL72hIgLCiUtub5iVPX7mZ2UCEeXU6wdZrK8v
ULpu/+vdY2yeHRdakC0DRY0qBSF+7zC9CWt4P8XRIXYj7c4zLdo9b2XXVod/Js8i
TII8ZJTUFedv0MOeHGN8ltE7gGjk4auwpFQ17a+CiuNrml8lVsUp9TRKJ5k=
-----END CERTIFICATE-----
subject=CN = *.liao.dev
issuer=C = US, O = Google Trust Services LLC, CN = GTS CA 1P5
---
No client certificate CA names sent
Peer signing digest: SHA256
Peer signature type: ECDSA
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 1551 bytes and written 379 bytes
Verification error: unable to verify the first certificate
---
New, TLSv1.3, Cipher is TLS_AES_128_GCM_SHA256
Server public key is 256 bit
This TLS version forbids renegotiation.
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 21 (unable to verify the first certificate)
---
---
Post-Handshake New Session Ticket arrived:
SSL-Session:
    Protocol  : TLSv1.3
    Cipher    : TLS_AES_128_GCM_SHA256
    Session-ID: B655E45C351E5376E391FAAE8D33FDE522A049AFCDA9BEA16AA12646312385B6
    Session-ID-ctx:
    Resumption PSK: 9E5B00208476BC0CB3B19AB46DCD12266A25378AD3CD385D9243E5FA0B0541BD
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    TLS session ticket lifetime hint: 604800 (seconds)
    TLS session ticket:
    0000 - 0d 5a 82 69 f6 2c 8d 43-62 4f ee 99 4b 01 16 51   .Z.i.,.CbO..K..Q
    0010 - 4f 3c 1d b0 ef d0 cb c4-26 1a 75 15 c5 10 58 84   O<......&.u...X.
    0020 - 9e a0 8b 97 b6 93 ba c1-30 c9 2e 45 24 95 f3 2a   ........0..E$..*
    0030 - 11 71 e4 70 19 c6 20 23-8e 5b 5a 60 fa fa 01 be   .q.p.. #.[Z`....
    0040 - e6 cc 7b 0e 73 92 62 cd-e8 2f f4 08 7e e5 0b d3   ..{.s.b../..~...
    0050 - 8b a0 10 8a 3d 76 dd 81-96 da af 06 54 60 7c 7e   ....=v......T`|~
    0060 - 59 3d ce 31 bf ce a0 53-b5                        Y=.1...S.

    Start Time: 1699218296
    Timeout   : 7200 (sec)
    Verify return code: 21 (unable to verify the first certificate)
    Extended master secret: no
    Max Early Data: 0
---
read R BLOCK
DONE

$ step-cli certificate verify https://ihwa.liao.dev
failed to connect: tls: failed to verify certificate: x509: certificate signed by unknown authority

Now this was confusing, since I was pretty sure I was using the right certs. I tested the certs locally with a simple Go HTTPS server, which logged the following. That was even more confusing, since "bad record MAC" is an internal error.

2023/11/05 21:03:31 http: TLS handshake error from 127.0.0.1:38672: local error: tls: bad record MAC
2023/11/05 21:05:57 http: TLS handshake error from 127.0.0.1:40788: remote error: tls: bad certificate

Stepping back a bit, I tried to verify the certs directly, which wasn't much more successful:

$ openssl verify '*.liao.dev.cer'
CN = *.liao.dev
error 20 at 0 depth lookup: unable to get local issuer certificate
error ./tls.crt: verification failed

$ step-cli certificate verify '*.liao.dev.cer'
failed to verify certificate: x509: certificate signed by unknown authority

Then I thought, maybe I need to pass the CA file:

$ openssl verify -CAfile ca.cer '*.liao.dev.cer'
tls.crt: OK

$ step-cli certificate verify --roots ca.cer '*.liao.dev.cer'

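That's when it finally clicked: I needed to use the fullchain cert (fullchain.cer) instead of just the leaf cert. Chrome had been fine because browsers can usually fill in a missing intermediate on their own, while curl, openssl, and step-cli only work with what the server sends.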
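A quick way to tell a leaf-only file from a full chain is to count the PEM blocks in it. This is a minimal sketch, assuming the files are PEM encoded; count_certs is a helper name made up for illustration, not something acme.sh provides.

```shell
# count_certs prints how many PEM certificate blocks a file contains.
# A leaf-only file prints 1; a full chain prints 2 or more.
count_certs() {
  grep -c -- '-----BEGIN CERTIFICATE-----' "$1"
}
```

Run against the directory above, I would expect count_certs '*.liao.dev.cer' to print 1 and count_certs fullchain.cer to print a larger number.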

The actual process consisted of more mistakes, and my mind wandering to: are the root certs on my machine broken/out of date, did acme.sh mess up a cert somehow, and other weird ideas I don't remember.

exposing and revoking

Now that I finally had working certs, it was time to save them. I run my cluster via GitOps with the OSS version of Config Sync. For secrets, I use isindir/sops-secrets-operator. The workflow consists of creating a SopsSecret custom resource, then encrypting it with sops (sops -e -i file.yaml), in conjunction with the .sops.yaml config I have to specify keys.

apiVersion: isindir.github.com/v1alpha3
kind: SopsSecret
metadata:
  name: wildcard-google
  namespace: envoy-gateway-system
spec:
  secretTemplates:
    - name: wildcard-google
      type: kubernetes.io/tls
      stringData:
        tls.crt: |
          ...
        tls.key: |
          ...
        ca.crt: |
          ...

The .sops.yaml is set up to only encrypt the data parts, with 2 age keys: a local admin key, and a remote server key.

creation_rules:
  - encrypted_regex: "^(data|stringData)"
    key_groups:
      - age:
          - age14mg08panez45c6lj2cut2l8nqja0k5vm2vxmv5zvc4ufqgptgy2qcjfmuu
          - age19q63k49upkgc03e8rsvm5c04x09vqvp2g5u2x6fjjap5awvq0u6q25z8xp

I had 2 pairs of cert/keys: from Let's Encrypt and from Google Public CA, which I pushed into git.

I noticed that the sops operator failed to decode the secret, and upon looking into why, I realized it wasn't encrypted. It wouldn't have been so bad if I didn't have a public mirror of my repo.
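A cheap guard against this is to check that a file looks like sops output before committing it. The sketch below is a heuristic, not a proof that every field is encrypted: it relies on sops appending a top-level sops: metadata block and writing encrypted values as ENC[AES256_GCM,...]. The function name is_sops_encrypted is made up for illustration.

```shell
# is_sops_encrypted succeeds only if the file looks like sops output:
# a top-level "sops:" metadata block and at least one ENC[...] value.
is_sops_encrypted() {
  grep -q '^sops:' "$1" && grep -q 'ENC\[AES256_GCM,' "$1"
}
```

Called from a pre-commit hook over the secret files, a non-zero exit would have stopped the plaintext keys before they reached the public mirror.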

So now I have the fun task of revoking the exposed secrets. I had issued certs from Google Public CA first, then overwrote the data in acme.sh's config with a second set of certs from Let's Encrypt (since I was testing if it was just Google Trust Services certs that wouldn't verify earlier).

This meant acme.sh --revoke didn't want to work. So I go about downloading certbot, which has the option to revoke using a private key / cert pair:

$ sudo certbot revoke --cert-path tls.crt --key-path tls.key --reason keyCompromise --server https://dv.acme-v02.api.pki.goog/directory

Later I realized that because I had issued my second set of certs via acme.sh --renew --force, it kept the same private key. So my "unexposed" cert/key were actually exposed. This time I could use acme.sh:

$ acme.sh --revoke --ecc -d '*.liao.dev'
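Whether two certs share a private key can be checked without touching the key itself, by comparing the public keys embedded in the certs. A minimal sketch, assuming openssl is installed and the certs are PEM files; pubkey_fingerprint and same_key are names made up for illustration.

```shell
# pubkey_fingerprint prints a SHA-256 digest of the public key
# in the first certificate of a PEM file.
pubkey_fingerprint() {
  key=$(openssl x509 -in "$1" -noout -pubkey) || return 1
  printf '%s\n' "$key" | openssl dgst -sha256 -r | cut -d' ' -f1
}

# same_key succeeds if both certs carry the same public key,
# which means they were issued for the same private key.
# It returns 2 if either cert cannot be read.
same_key() {
  a=$(pubkey_fingerprint "$1") || return 2
  b=$(pubkey_fingerprint "$2") || return 2
  [ "$a" = "$b" ]
}
```

Comparing the exposed cert with the "unexposed" one this way would have shown right away that both needed revoking.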

cert-manager

Now I could start from scratch, and just remember to actually encrypt secrets. But I thought I might as well go through with the automation and set up cert-manager in my cluster. I had initially resisted because the last time I ran it was during its graduation to 1.0.0, when there were deprecations to work around, but now it's a much more stable project.

Again, I wanted certs from both Let's Encrypt and Google Public CA, and this time, I would test with staging certs first. Let's Encrypt was straightforward to set up, while GCP had a surprise hiding in a footnote: the staging EAB secret needs to be generated separately, by switching the API endpoint in config rather than with flags (per the instructions).

$ gcloud config set api_endpoint_overrides/publicca https://preprod-publicca.googleapis.com/
$ gcloud publicca external-account-keys create
$ gcloud config unset api_endpoint_overrides/publicca
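Since the override lives in gcloud's config, forgetting the last step would leave later publicca commands pointed at the staging endpoint. One way to make that harder is to wrap the three commands so the unset always runs. This is only a sketch around the same commands; create_staging_eab is a made-up name.

```shell
# create_staging_eab requests an EAB key from the Public CA staging endpoint
# and always removes the endpoint override afterwards, even if the request fails.
create_staging_eab() {
  gcloud config set api_endpoint_overrides/publicca https://preprod-publicca.googleapis.com/ || return 1
  status=0
  gcloud publicca external-account-keys create || status=$?
  gcloud config unset api_endpoint_overrides/publicca
  return "$status"
}
```

The function returns the exit status of the key request, so a failed request is still visible to the caller after the cleanup.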

With this I could finally have my issuers set up:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    email: acme+letsencrypt@liao.dev
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-staging-account
    solvers:
      - dns01:
          cloudDNS:
            project: ...
            serviceAccountSecretRef:
              name: gcp-cert-manager-sa
              key: key.json
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: google-staging
spec:
  acme:
    email: acme+google@liao.dev
    server: https://dv.acme-v02.test-api.pki.goog/directory
    privateKeySecretRef:
      name: google-staging-account
    externalAccountBinding:
      keyID: ...
      keySecretRef:
        name: gcp-publicca-staging
        key: b64MacKey
    solvers:
      - dns01:
          cloudDNS:
            project: ...
            serviceAccountSecretRef:
              name: gcp-cert-manager-sa
              key: key.json

And certs:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: google-staging
  namespace: envoy-gateway-system
spec:
  secretName: google-staging-tls
  duration: 720h # 30d
  renewBefore: 360h # 15d
  revisionHistoryLimit: 1
  subject:
    organizations:
      - seankhliao
  privateKey:
    rotationPolicy: Always
    algorithm: ECDSA
    size: 256
  dnsNames:
    - "*.liao.dev"
    - "*.ihwa.liao.dev"
  issuerRef:
    name: google-staging
    kind: ClusterIssuer

Repeat for production with the prod endpoints, and I was finally done for the day.