Add troubleshooting section in GALERA.md

2025-07-24 12:20:20 +00:00 · 2023-08-03 21:25:32 +02:00
parent c6544d0583
commit eb19c72b90
2 changed files with 189 additions and 5 deletions
--- a/.github/ISSUE_TEMPLATE/bug.md
+++ b/.github/ISSUE_TEMPLATE/bug.md
@ -22,6 +22,10 @@ Tip: you can use
 for code blocks of your kubectl output or YAML files.
 -->

+**Is the bug related to Galera?**
+- [ ] I have checked the [Galera documentation](../../docs/GALERA.md)
+- [ ] I have checked the [Galera troubleshooting guide](../../docs/GALERA.md#troubleshooting)
+
 **Expected behaviour**
 <!--A concise description of what you expected to happen.-->

--- a/docs/GALERA.md
+++ b/docs/GALERA.md
@ -1,4 +1,4 @@
-## ✨ High availability via Galera
+# ✨ High availability via Galera

 The `mariadb-operator` provides cloud native support for provisioning and operating multi-master MariaDB clusters using Galera. This setup enables the ability to perform both read and write operations on all nodes, enhancing availability and allowing scalability across multiple nodes.

@ -6,13 +6,13 @@ In certain circumstances, it could be the case that all the nodes of your cluste

 To accomplish this, after the MariaDB cluster has been provisioned, `mariadb-operator` will regularly monitor the cluster's status to make sure it is healthy. If any issues are detected, the operator will initiate the [recovery process](https://galeracluster.com/library/documentation/crash-recovery.html) to restore the cluster to a healthy state. During this process, the operator will set status conditions in the `MariaDB` and emit `Events` so you have a better understanding of the recovery progress and the underlying activities being performed. For example, you may want to know which `Pods` were out of sync to further investigate infrastructure-related issues (i.e. networking, storage...) on the nodes where these `Pods` were scheduled.

-### Components
+## Components

 To be able to effectively provision and recover MariaDB Galera clusters, the following components were introduced to co-operate with `mariadb-operator`:
 - **[🍼 init](https://github.com/mariadb-operator/init)**: Init container that dynamically provisions the Galera configuration file before the MariaDB container starts. Guarantees ordered deployment of `Pods` even if `spec.podManagementPolicy = Parallel` is set on the MariaDB `StatefulSet`, something crucial for performing the Galera recovery, as the operator needs to restart `Pods` independently.
 - **[🤖 agent](https://github.com/mariadb-operator/agent)**: Sidecar agent that exposes the Galera state ([`grastate.dat`](https://galeracluster.com/2016/11/introducing-the-safe-to-bootstrap-feature-in-galera-cluster/)) via HTTP and allows one to remotely bootstrap and recover the Galera cluster. For security reasons, it has authentication based on Kubernetes service accounts; this way only the `mariadb-operator` is able to call the agent.

-### Configuration
+## Configuration

 The easiest way to get a MariaDB Galera cluster up and running is setting `spec.galera.enabled = true`, like in this [example](../examples/manifests/mariadb_v1alpha1_mariadb_galera_minimal.yaml):

@ -74,7 +74,7 @@ spec:

 Refer to the [API Reference](#api-reference) below to better understand the purpose of each field.

-### API Reference
+## API Reference
 - [Go API pkg](https://pkg.go.dev/github.com/mariadb-operator/mariadb-operator@v0.0.16/api/v1alpha1#Galera)
 - [Code](../api/v1alpha1/mariadb_galera_types.go)
 - **`kubectl explain`**
@ -126,7 +126,7 @@ FIELDS:
    Once this timeout is reached, the Pod is restarted.
 ```

-### Quickstart
+## Quickstart

 Let's see how `mariadb-operator`🦭 and Galera✨ play together! First of all, install the following configuration manifests that will be referenced by the CRDs further:
 ```bash
@ -212,3 +212,183 @@ mariadb-galera   True    Running   All           82m
 ```

 To conclude, it's important to note that the Galera functionallity is 100% compatible with the rest of `mariadb-operator` constructs: `Backup`, `Restore`, `Connection`... refer to the [main quickstart guide](../README.md#quickstart) for more detail.
+
+## Troubleshooting
+
+The aim of this section is showing you how to diagnose your Galera cluster when something goes wrong. In this situations, observability is a key factor to understand the problem, so we recommend following these steps before jumping into debugging the problem.
+
+- Make sure network connectivity is fine by checking that you have an `Endpoint` per `Pod` in your Galera cluster.
+```bash
+kubectl get endpoints mariadb-galera-internal -o yaml
+apiVersion: v1
+kind: Endpoints
+metadata:
+  name: mariadb-internal
+subsets:
+- addresses:
+  - hostname: mariadb-1
+    ip: 10.255.140.181
+    nodeName: k8s-worker-1
+    targetRef:
+      kind: Pod
+      name: mariadb-1
+      namespace: mariadb
+  - hostname: mariadb-2
+    ip: 10.255.20.156
+    nodeName: k8s-worker-2
+    targetRef:
+      kind: Pod
+      name: mariadb-2
+      namespace: mariadb
+  - hostname: mariadb-0
+    ip: 10.255.214.164
+    nodeName: k8s-worker-0
+    targetRef:
+      kind: Pod
+      name: mariadb-0
+      namespace: mariadb
+  ports:
+  - name: sst
+    port: 4568
+    protocol: TCP
+  - name: ist
+    port: 4567
+    protocol: TCP
+  - name: mariadb
+    port: 3306
+    protocol: TCP
+  - name: agent
+    port: 5555
+    protocol: TCP
+  - name: cluster
+    port: 4444
+    protocol: TCP
+
+```
+- Check the events associated with the `MariaDB` object, as they provide significant insights for diagnosis, particularly within the context of cluster recovery.
+```bash
+kubectl get events --field-selector involvedObject.name=mariadb-galera --sort-by='.lastTimestamp'
+LAST SEEN   TYPE      REASON                    OBJECT                       MESSAGE
+...
+16m         Warning   GaleraClusterNotHealthy   mariadb/mariadb-galera       Galera cluster is not healthy
+16m         Normal    GaleraPodStateFetched     mariadb/mariadb-galera       Galera state fetched in Pod 'mariadb-galera-2'
+16m         Normal    GaleraPodStateFetched     mariadb/mariadb-galera       Galera state fetched in Pod 'mariadb-galera-1'
+16m         Normal    GaleraPodStateFetched     mariadb/mariadb-galera       Galera state fetched in Pod 'mariadb-galera-0'
+16m         Normal    GaleraPodRecovered        mariadb/mariadb-galera       Recovered Galera sequence in Pod 'mariadb-galera-1'
+16m         Normal    GaleraPodRecovered        mariadb/mariadb-galera       Recovered Galera sequence in Pod 'mariadb-galera-2'
+17m         Normal    GaleraPodRecovered        mariadb/mariadb-galera       Recovered Galera sequence in Pod 'mariadb-galera-0'
+17m         Normal    GaleraClusterBootstrap    mariadb/mariadb-galera       Bootstrapping Galera cluster in Pod 'mariadb-galera-2'
+20m         Normal    GaleraClusterHealthy      mariadb/mariadb-galera       Galera cluster is healthy
+```
+
+- Enable `debug` logs in `mariadb-operator`.
+
+```bash
+helm upgrade --install mariadb-operator mariadb-operator/mariadb-operator --set logLevel=debug
+kubectl logs mariadb-operator-546c78f4f5-gq44k
+{"level":"info","ts":1691090524.4911606,"logger":"galera.health","msg":"Checking Galera cluster health","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet","statefulSet":{"name":"mariadb-galera","namespace":"default"},"namespace":"default","name":"mariadb-galera","reconcileID":"098620db-4486-45cc-966a-9f3fec0d165e"}
+{"level":"debug","ts":1691090524.4911761,"logger":"galera.health","msg":"StatefulSet ready replicas","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet","statefulSet":{"name":"mariadb-galera","namespace":"default"},"namespace":"default","name":"mariadb-galera","reconcileID":"098620db-4486-45cc-966a-9f3fec0d165e","replicas":1}
+```
+
+- Get the logs of all the `MariaDB` `Pod` containers, not only of the main `mariadb` container but also the `agent` and `init` ones.
+  
+```bash
+kubectl logs mariadb-galera-0 -c init
+{"level":"info","ts":1691090778.5239124,"msg":"Starting init"}
+{"level":"info","ts":1691090778.5305626,"msg":"Configuring Galera"}
+{"level":"info","ts":1691090778.5307593,"msg":"Already initialized. Init done"}
+
+kubectl logs mariadb-galera-0 -c agent
+{"level":"info","ts":1691090779.3193653,"logger":"server","msg":"server listening","addr":":5555"}
+2023/08/03 19:26:28 "POST http://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local:5555/api/recovery HTTP/1.1" from 10.244.4.2:39162 - 200 58B in 4.112086ms
+2023/08/03 19:26:28 "DELETE http://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local:5555/api/recovery HTTP/1.1" from 10.244.4.2:39162 - 200 0B in 883.544µs
+
+kubectl logs mariadb-galera-0 -c mariadb
+2023-08-03 19:27:10 0 [Note] WSREP: Member 2.0 (mariadb-galera-0) synced with group.
+2023-08-03 19:27:10 0 [Note] WSREP: Processing event queue:...100.0% (1/1 events) complete.
+2023-08-03 19:27:10 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 6)
+2023-08-03 19:27:10 2 [Note] WSREP: Server mariadb-galera-0 synced with group
+2023-08-03 19:27:10 2 [Note] WSREP: Server status change joined -> synced
+2023-08-03 19:27:10 2 [Note] WSREP: Synchronized with group, ready for connections
+```
+
+Once you are done with these steps, you will have the context required to jump ahead to the [Common errors](#common-errors) section to see if any of them matches your case.  If they don't, feel free to open an issue or even a PR updating this document if you managed to resolve it.
+
+### Common errors
+
+#### Permission denied writing Galera configuration
+
+```bash
+ Error writing Galera config: open /etc/mysql/mariadb.conf.d/0-galera.cnf: permission denied
+```
+This error is returned by the `init` container when it is unable to write the configuration file in the filesystem backed by the PVC. In particular, this has been raised by users using longhorn and rook as a storage provider, which in some cases rely on root privileges for writing in the PVC:
+- https://github.com/longhorn/longhorn/issues/3549
+
+The remediation is running as root or match the user expected by the storage provider to be able to write in the PVC:
+
+```yaml
+apiVersion: mariadb.mmontes.io/v1alpha1
+kind: MariaDB
+metadata:
+  name: mariadb-galera
+spec:
+  podSecurityContext:
+    runAsUser: 0
+...
+```
+
+#### Unauthorized error disabling bootstrap
+
+```bash
+Error reconciling Galera: error disabling bootstrap in Pod 0: unauthorized
+```
+This situation occurs when the `mariadb-operator` credentials passed to the `agent` as authentication are either invalid or the `agent` is unable to verify them. To confirm this, ensure that both the `mariadb-operator` and the `MariaDB` `ServiceAccounts` are able to create `TokenReview` objects:
+
+```bash
+kubectl auth can-i --list --as=system:serviceaccount:default:mariadb-operator | grep tokenreview
+tokenreviews.authentication.k8s.io              []                                    []               [create]
+
+kubectl auth can-i --list --as=system:serviceaccount:default:mariadb-galera | grep tokenreview
+tokenreviews.authentication.k8s.io              []                                    []               [create]
+```
+
+If that's not the case, check that the following `ClusterRole` and `ClusterRoleBindings` are available in your cluster:
+```bash
+kubectl get clusterrole system:auth-delegator
+NAME                    CREATED AT
+system:auth-delegator   2023-08-03T19:12:37Z
+
+kubectl get clusterrolebinding | grep mariadb | grep auth-delegator
+mariadb-galera-auth:auth-delegator                     ClusterRole/system:auth-delegator                                                  108m
+mariadb-operator:auth-delegator                        ClusterRole/system:auth-delegator                                                  112m
+```
+`mariadb-operator:auth-delegator` is the `ClusterRoleBinding` bound to the `mariadb-operator` `ServiceAccount` which is created by the helm chart, so you can re-install the helm release in order to recreate it:
+
+```bash
+ helm upgrade --install mariadb-operator mariadb-operator/mariadb-operator
+```
+
+`mariadb-galera-auth:auth-delegator` is the `ClusterRoleBinding` bound to the `mariadb-galera` `ServiceAccount` which is created on the flight by the operator as part of the reconciliation logic. You may check the `mariadb-operator` logs to see if there are any issues reconciling it.
+
+#### Timeout waiting for Pod to be Synced
+
+```bash
+Timeout waiting for Pod 'mariadb-galera-2' to be Synced
+```
+This error appears in the `mariadb-operator` logs when a `Pod` is in non synced state for a duration exceeding the `spec.galera.recovery.podRecoveryTimeout`. Just after, the operator will restart the `Pod`.
+
+Increase this timeout if you consider that your `Pod` may take longer to recover.
+
+#### Galera cluster bootstrap timed out
+
+```bash
+Galera cluster bootstrap timed out. Resetting recovery status
+```
+This is error is returned by the `mariadb-operator` after exceeding the `spec.galera.recovery.clusterBootstrapTimeout` when recovering the cluster. At this point, the operator will reset the recovered sequence numbers and start again from a clean state.
+
+Increase this timeout if you consider that your Galera cluster may take longer to recover.
+
+### GitHub Issues
+
+Here it is a list of GitHub issues reported by `mariadb-operator` users which might shed some light in your investigation:
+- https://github.com/mariadb-operator/mariadb-operator/issues?q=is%3Aclosed+label%3Agalera-troubleshoot+