From eb19c72b90a5f5cd1f53d45d78dba8603193bffe Mon Sep 17 00:00:00 2001 From: Martin Montes Date: Thu, 3 Aug 2023 21:25:32 +0200 Subject: [PATCH] Add troubleshooting section in GALERA.md --- .github/ISSUE_TEMPLATE/bug.md | 4 + docs/GALERA.md | 190 +++++++++++++++++++++++++++++++++- 2 files changed, 189 insertions(+), 5 deletions(-) diff --git a/.github/ISSUE_TEMPLATE/bug.md b/.github/ISSUE_TEMPLATE/bug.md index 53579684..633fa4bc 100644 --- a/.github/ISSUE_TEMPLATE/bug.md +++ b/.github/ISSUE_TEMPLATE/bug.md @@ -22,6 +22,10 @@ Tip: you can use for code blocks of your kubectl output or YAML files. --> +**Is the bug related to Galera?** +- [ ] I have checked the [Galera documentation](../../docs/GALERA.md) +- [ ] I have checked the [Galera troubleshooting guide](../../docs/GALERA.md#troubleshooting) + **Expected behaviour** diff --git a/docs/GALERA.md b/docs/GALERA.md index a571f857..54a45d55 100644 --- a/docs/GALERA.md +++ b/docs/GALERA.md @@ -1,4 +1,4 @@ -## ✨ High availability via Galera +# ✨ High availability via Galera The `mariadb-operator` provides cloud native support for provisioning and operating multi-master MariaDB clusters using Galera. This setup enables the ability to perform both read and write operations on all nodes, enhancing availability and allowing scalability across multiple nodes. @@ -6,13 +6,13 @@ In certain circumstances, it could be the case that all the nodes of your cluste To accomplish this, after the MariaDB cluster has been provisioned, `mariadb-operator` will regularly monitor the cluster's status to make sure it is healthy. If any issues are detected, the operator will initiate the [recovery process](https://galeracluster.com/library/documentation/crash-recovery.html) to restore the cluster to a healthy state. During this process, the operator will set status conditions in the `MariaDB` and emit `Events` so you have a better understanding of the recovery progress and the underlying activities being performed. For example, you may want to know which `Pods` were out of sync to further investigate infrastructure-related issues (i.e. networking, storage...) on the nodes where these `Pods` were scheduled. -### Components +## Components To be able to effectively provision and recover MariaDB Galera clusters, the following components were introduced to co-operate with `mariadb-operator`: - **[🍼 init](https://github.com/mariadb-operator/init)**: Init container that dynamically provisions the Galera configuration file before the MariaDB container starts. Guarantees ordered deployment of `Pods` even if `spec.podManagementPolicy = Parallel` is set on the MariaDB `StatefulSet`, something crucial for performing the Galera recovery, as the operator needs to restart `Pods` independently. - **[🤖 agent](https://github.com/mariadb-operator/agent)**: Sidecar agent that exposes the Galera state ([`grastate.dat`](https://galeracluster.com/2016/11/introducing-the-safe-to-bootstrap-feature-in-galera-cluster/)) via HTTP and allows one to remotely bootstrap and recover the Galera cluster. For security reasons, it has authentication based on Kubernetes service accounts; this way only the `mariadb-operator` is able to call the agent. -### Configuration +## Configuration The easiest way to get a MariaDB Galera cluster up and running is setting `spec.galera.enabled = true`, like in this [example](../examples/manifests/mariadb_v1alpha1_mariadb_galera_minimal.yaml): @@ -74,7 +74,7 @@ spec: Refer to the [API Reference](#api-reference) below to better understand the purpose of each field. -### API Reference +## API Reference - [Go API pkg](https://pkg.go.dev/github.com/mariadb-operator/mariadb-operator@v0.0.16/api/v1alpha1#Galera) - [Code](../api/v1alpha1/mariadb_galera_types.go) - **`kubectl explain`** @@ -126,7 +126,7 @@ FIELDS: Once this timeout is reached, the Pod is restarted. ``` -### Quickstart +## Quickstart Let's see how `mariadb-operator`🦭 and Galera✨ play together! First of all, install the following configuration manifests that will be referenced by the CRDs further: ```bash @@ -212,3 +212,183 @@ mariadb-galera True Running All 82m ``` To conclude, it's important to note that the Galera functionallity is 100% compatible with the rest of `mariadb-operator` constructs: `Backup`, `Restore`, `Connection`... refer to the [main quickstart guide](../README.md#quickstart) for more detail. + +## Troubleshooting + +The aim of this section is showing you how to diagnose your Galera cluster when something goes wrong. In this situations, observability is a key factor to understand the problem, so we recommend following these steps before jumping into debugging the problem. + +- Make sure network connectivity is fine by checking that you have an `Endpoint` per `Pod` in your Galera cluster. +```bash +kubectl get endpoints mariadb-galera-internal -o yaml +apiVersion: v1 +kind: Endpoints +metadata: + name: mariadb-internal +subsets: +- addresses: + - hostname: mariadb-1 + ip: 10.255.140.181 + nodeName: k8s-worker-1 + targetRef: + kind: Pod + name: mariadb-1 + namespace: mariadb + - hostname: mariadb-2 + ip: 10.255.20.156 + nodeName: k8s-worker-2 + targetRef: + kind: Pod + name: mariadb-2 + namespace: mariadb + - hostname: mariadb-0 + ip: 10.255.214.164 + nodeName: k8s-worker-0 + targetRef: + kind: Pod + name: mariadb-0 + namespace: mariadb + ports: + - name: sst + port: 4568 + protocol: TCP + - name: ist + port: 4567 + protocol: TCP + - name: mariadb + port: 3306 + protocol: TCP + - name: agent + port: 5555 + protocol: TCP + - name: cluster + port: 4444 + protocol: TCP + +``` +- Check the events associated with the `MariaDB` object, as they provide significant insights for diagnosis, particularly within the context of cluster recovery. +```bash +kubectl get events --field-selector involvedObject.name=mariadb-galera --sort-by='.lastTimestamp' +LAST SEEN TYPE REASON OBJECT MESSAGE +... +16m Warning GaleraClusterNotHealthy mariadb/mariadb-galera Galera cluster is not healthy +16m Normal GaleraPodStateFetched mariadb/mariadb-galera Galera state fetched in Pod 'mariadb-galera-2' +16m Normal GaleraPodStateFetched mariadb/mariadb-galera Galera state fetched in Pod 'mariadb-galera-1' +16m Normal GaleraPodStateFetched mariadb/mariadb-galera Galera state fetched in Pod 'mariadb-galera-0' +16m Normal GaleraPodRecovered mariadb/mariadb-galera Recovered Galera sequence in Pod 'mariadb-galera-1' +16m Normal GaleraPodRecovered mariadb/mariadb-galera Recovered Galera sequence in Pod 'mariadb-galera-2' +17m Normal GaleraPodRecovered mariadb/mariadb-galera Recovered Galera sequence in Pod 'mariadb-galera-0' +17m Normal GaleraClusterBootstrap mariadb/mariadb-galera Bootstrapping Galera cluster in Pod 'mariadb-galera-2' +20m Normal GaleraClusterHealthy mariadb/mariadb-galera Galera cluster is healthy +``` + +- Enable `debug` logs in `mariadb-operator`. + +```bash +helm upgrade --install mariadb-operator mariadb-operator/mariadb-operator --set logLevel=debug +kubectl logs mariadb-operator-546c78f4f5-gq44k +{"level":"info","ts":1691090524.4911606,"logger":"galera.health","msg":"Checking Galera cluster health","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet","statefulSet":{"name":"mariadb-galera","namespace":"default"},"namespace":"default","name":"mariadb-galera","reconcileID":"098620db-4486-45cc-966a-9f3fec0d165e"} +{"level":"debug","ts":1691090524.4911761,"logger":"galera.health","msg":"StatefulSet ready replicas","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet","statefulSet":{"name":"mariadb-galera","namespace":"default"},"namespace":"default","name":"mariadb-galera","reconcileID":"098620db-4486-45cc-966a-9f3fec0d165e","replicas":1} +``` + +- Get the logs of all the `MariaDB` `Pod` containers, not only of the main `mariadb` container but also the `agent` and `init` ones. + +```bash +kubectl logs mariadb-galera-0 -c init +{"level":"info","ts":1691090778.5239124,"msg":"Starting init"} +{"level":"info","ts":1691090778.5305626,"msg":"Configuring Galera"} +{"level":"info","ts":1691090778.5307593,"msg":"Already initialized. Init done"} + +kubectl logs mariadb-galera-0 -c agent +{"level":"info","ts":1691090779.3193653,"logger":"server","msg":"server listening","addr":":5555"} +2023/08/03 19:26:28 "POST http://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local:5555/api/recovery HTTP/1.1" from 10.244.4.2:39162 - 200 58B in 4.112086ms +2023/08/03 19:26:28 "DELETE http://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local:5555/api/recovery HTTP/1.1" from 10.244.4.2:39162 - 200 0B in 883.544µs + +kubectl logs mariadb-galera-0 -c mariadb +2023-08-03 19:27:10 0 [Note] WSREP: Member 2.0 (mariadb-galera-0) synced with group. +2023-08-03 19:27:10 0 [Note] WSREP: Processing event queue:...100.0% (1/1 events) complete. +2023-08-03 19:27:10 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 6) +2023-08-03 19:27:10 2 [Note] WSREP: Server mariadb-galera-0 synced with group +2023-08-03 19:27:10 2 [Note] WSREP: Server status change joined -> synced +2023-08-03 19:27:10 2 [Note] WSREP: Synchronized with group, ready for connections +``` + +Once you are done with these steps, you will have the context required to jump ahead to the [Common errors](#common-errors) section to see if any of them matches your case. If they don't, feel free to open an issue or even a PR updating this document if you managed to resolve it. + +### Common errors + +#### Permission denied writing Galera configuration + +```bash + Error writing Galera config: open /etc/mysql/mariadb.conf.d/0-galera.cnf: permission denied +``` +This error is returned by the `init` container when it is unable to write the configuration file in the filesystem backed by the PVC. In particular, this has been raised by users using longhorn and rook as a storage provider, which in some cases rely on root privileges for writing in the PVC: +- https://github.com/longhorn/longhorn/issues/3549 + +The remediation is running as root or match the user expected by the storage provider to be able to write in the PVC: + +```yaml +apiVersion: mariadb.mmontes.io/v1alpha1 +kind: MariaDB +metadata: + name: mariadb-galera +spec: + podSecurityContext: + runAsUser: 0 +... +``` + +#### Unauthorized error disabling bootstrap + +```bash +Error reconciling Galera: error disabling bootstrap in Pod 0: unauthorized +``` +This situation occurs when the `mariadb-operator` credentials passed to the `agent` as authentication are either invalid or the `agent` is unable to verify them. To confirm this, ensure that both the `mariadb-operator` and the `MariaDB` `ServiceAccounts` are able to create `TokenReview` objects: + +```bash +kubectl auth can-i --list --as=system:serviceaccount:default:mariadb-operator | grep tokenreview +tokenreviews.authentication.k8s.io [] [] [create] + +kubectl auth can-i --list --as=system:serviceaccount:default:mariadb-galera | grep tokenreview +tokenreviews.authentication.k8s.io [] [] [create] +``` + +If that's not the case, check that the following `ClusterRole` and `ClusterRoleBindings` are available in your cluster: +```bash +kubectl get clusterrole system:auth-delegator +NAME CREATED AT +system:auth-delegator 2023-08-03T19:12:37Z + +kubectl get clusterrolebinding | grep mariadb | grep auth-delegator +mariadb-galera-auth:auth-delegator ClusterRole/system:auth-delegator 108m +mariadb-operator:auth-delegator ClusterRole/system:auth-delegator 112m +``` +`mariadb-operator:auth-delegator` is the `ClusterRoleBinding` bound to the `mariadb-operator` `ServiceAccount` which is created by the helm chart, so you can re-install the helm release in order to recreate it: + +```bash + helm upgrade --install mariadb-operator mariadb-operator/mariadb-operator +``` + +`mariadb-galera-auth:auth-delegator` is the `ClusterRoleBinding` bound to the `mariadb-galera` `ServiceAccount` which is created on the flight by the operator as part of the reconciliation logic. You may check the `mariadb-operator` logs to see if there are any issues reconciling it. + +#### Timeout waiting for Pod to be Synced + +```bash +Timeout waiting for Pod 'mariadb-galera-2' to be Synced +``` +This error appears in the `mariadb-operator` logs when a `Pod` is in non synced state for a duration exceeding the `spec.galera.recovery.podRecoveryTimeout`. Just after, the operator will restart the `Pod`. + +Increase this timeout if you consider that your `Pod` may take longer to recover. + +#### Galera cluster bootstrap timed out + +```bash +Galera cluster bootstrap timed out. Resetting recovery status +``` +This is error is returned by the `mariadb-operator` after exceeding the `spec.galera.recovery.clusterBootstrapTimeout` when recovering the cluster. At this point, the operator will reset the recovered sequence numbers and start again from a clean state. + +Increase this timeout if you consider that your Galera cluster may take longer to recover. + +### GitHub Issues + +Here it is a list of GitHub issues reported by `mariadb-operator` users which might shed some light in your investigation: +- https://github.com/mariadb-operator/mariadb-operator/issues?q=is%3Aclosed+label%3Agalera-troubleshoot+ \ No newline at end of file