Add troubleshooting section in GALERA.md

This commit is contained in:
Martin Montes
2023-08-03 21:25:32 +02:00
committed by Martín Montes
parent c6544d0583
commit eb19c72b90
2 changed files with 189 additions and 5 deletions

View File

@ -22,6 +22,10 @@ Tip: you can use
for code blocks of your kubectl output or YAML files.
-->
**Is the bug related to Galera?**
- [ ] I have checked the [Galera documentation](../../docs/GALERA.md)
- [ ] I have checked the [Galera troubleshooting guide](../../docs/GALERA.md#troubleshooting)
**Expected behaviour**
<!--A concise description of what you expected to happen.-->

View File

@ -1,4 +1,4 @@
## ✨ High availability via Galera
# ✨ High availability via Galera
The `mariadb-operator` provides cloud native support for provisioning and operating multi-master MariaDB clusters using Galera. This setup enables the ability to perform both read and write operations on all nodes, enhancing availability and allowing scalability across multiple nodes.
@ -6,13 +6,13 @@ In certain circumstances, it could be the case that all the nodes of your cluste
To accomplish this, after the MariaDB cluster has been provisioned, `mariadb-operator` will regularly monitor the cluster's status to make sure it is healthy. If any issues are detected, the operator will initiate the [recovery process](https://galeracluster.com/library/documentation/crash-recovery.html) to restore the cluster to a healthy state. During this process, the operator will set status conditions in the `MariaDB` and emit `Events` so you have a better understanding of the recovery progress and the underlying activities being performed. For example, you may want to know which `Pods` were out of sync to further investigate infrastructure-related issues (i.e. networking, storage...) on the nodes where these `Pods` were scheduled.
### Components
## Components
To be able to effectively provision and recover MariaDB Galera clusters, the following components were introduced to co-operate with `mariadb-operator`:
- **[🍼 init](https://github.com/mariadb-operator/init)**: Init container that dynamically provisions the Galera configuration file before the MariaDB container starts. Guarantees ordered deployment of `Pods` even if `spec.podManagementPolicy = Parallel` is set on the MariaDB `StatefulSet`, something crucial for performing the Galera recovery, as the operator needs to restart `Pods` independently.
- **[🤖 agent](https://github.com/mariadb-operator/agent)**: Sidecar agent that exposes the Galera state ([`grastate.dat`](https://galeracluster.com/2016/11/introducing-the-safe-to-bootstrap-feature-in-galera-cluster/)) via HTTP and allows one to remotely bootstrap and recover the Galera cluster. For security reasons, it has authentication based on Kubernetes service accounts; this way only the `mariadb-operator` is able to call the agent.
### Configuration
## Configuration
The easiest way to get a MariaDB Galera cluster up and running is setting `spec.galera.enabled = true`, like in this [example](../examples/manifests/mariadb_v1alpha1_mariadb_galera_minimal.yaml):
@ -74,7 +74,7 @@ spec:
Refer to the [API Reference](#api-reference) below to better understand the purpose of each field.
### API Reference
## API Reference
- [Go API pkg](https://pkg.go.dev/github.com/mariadb-operator/mariadb-operator@v0.0.16/api/v1alpha1#Galera)
- [Code](../api/v1alpha1/mariadb_galera_types.go)
- **`kubectl explain`**
@ -126,7 +126,7 @@ FIELDS:
Once this timeout is reached, the Pod is restarted.
```
### Quickstart
## Quickstart
Let's see how `mariadb-operator`🦭 and Galera✨ play together! First of all, install the following configuration manifests that will be referenced by the CRDs further:
```bash
@ -212,3 +212,183 @@ mariadb-galera True Running All 82m
```
To conclude, it's important to note that the Galera functionallity is 100% compatible with the rest of `mariadb-operator` constructs: `Backup`, `Restore`, `Connection`... refer to the [main quickstart guide](../README.md#quickstart) for more detail.
## Troubleshooting
The aim of this section is showing you how to diagnose your Galera cluster when something goes wrong. In this situations, observability is a key factor to understand the problem, so we recommend following these steps before jumping into debugging the problem.
- Make sure network connectivity is fine by checking that you have an `Endpoint` per `Pod` in your Galera cluster.
```bash
kubectl get endpoints mariadb-galera-internal -o yaml
apiVersion: v1
kind: Endpoints
metadata:
name: mariadb-internal
subsets:
- addresses:
- hostname: mariadb-1
ip: 10.255.140.181
nodeName: k8s-worker-1
targetRef:
kind: Pod
name: mariadb-1
namespace: mariadb
- hostname: mariadb-2
ip: 10.255.20.156
nodeName: k8s-worker-2
targetRef:
kind: Pod
name: mariadb-2
namespace: mariadb
- hostname: mariadb-0
ip: 10.255.214.164
nodeName: k8s-worker-0
targetRef:
kind: Pod
name: mariadb-0
namespace: mariadb
ports:
- name: sst
port: 4568
protocol: TCP
- name: ist
port: 4567
protocol: TCP
- name: mariadb
port: 3306
protocol: TCP
- name: agent
port: 5555
protocol: TCP
- name: cluster
port: 4444
protocol: TCP
```
- Check the events associated with the `MariaDB` object, as they provide significant insights for diagnosis, particularly within the context of cluster recovery.
```bash
kubectl get events --field-selector involvedObject.name=mariadb-galera --sort-by='.lastTimestamp'
LAST SEEN TYPE REASON OBJECT MESSAGE
...
16m Warning GaleraClusterNotHealthy mariadb/mariadb-galera Galera cluster is not healthy
16m Normal GaleraPodStateFetched mariadb/mariadb-galera Galera state fetched in Pod 'mariadb-galera-2'
16m Normal GaleraPodStateFetched mariadb/mariadb-galera Galera state fetched in Pod 'mariadb-galera-1'
16m Normal GaleraPodStateFetched mariadb/mariadb-galera Galera state fetched in Pod 'mariadb-galera-0'
16m Normal GaleraPodRecovered mariadb/mariadb-galera Recovered Galera sequence in Pod 'mariadb-galera-1'
16m Normal GaleraPodRecovered mariadb/mariadb-galera Recovered Galera sequence in Pod 'mariadb-galera-2'
17m Normal GaleraPodRecovered mariadb/mariadb-galera Recovered Galera sequence in Pod 'mariadb-galera-0'
17m Normal GaleraClusterBootstrap mariadb/mariadb-galera Bootstrapping Galera cluster in Pod 'mariadb-galera-2'
20m Normal GaleraClusterHealthy mariadb/mariadb-galera Galera cluster is healthy
```
- Enable `debug` logs in `mariadb-operator`.
```bash
helm upgrade --install mariadb-operator mariadb-operator/mariadb-operator --set logLevel=debug
kubectl logs mariadb-operator-546c78f4f5-gq44k
{"level":"info","ts":1691090524.4911606,"logger":"galera.health","msg":"Checking Galera cluster health","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet","statefulSet":{"name":"mariadb-galera","namespace":"default"},"namespace":"default","name":"mariadb-galera","reconcileID":"098620db-4486-45cc-966a-9f3fec0d165e"}
{"level":"debug","ts":1691090524.4911761,"logger":"galera.health","msg":"StatefulSet ready replicas","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet","statefulSet":{"name":"mariadb-galera","namespace":"default"},"namespace":"default","name":"mariadb-galera","reconcileID":"098620db-4486-45cc-966a-9f3fec0d165e","replicas":1}
```
- Get the logs of all the `MariaDB` `Pod` containers, not only of the main `mariadb` container but also the `agent` and `init` ones.
```bash
kubectl logs mariadb-galera-0 -c init
{"level":"info","ts":1691090778.5239124,"msg":"Starting init"}
{"level":"info","ts":1691090778.5305626,"msg":"Configuring Galera"}
{"level":"info","ts":1691090778.5307593,"msg":"Already initialized. Init done"}
kubectl logs mariadb-galera-0 -c agent
{"level":"info","ts":1691090779.3193653,"logger":"server","msg":"server listening","addr":":5555"}
2023/08/03 19:26:28 "POST http://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local:5555/api/recovery HTTP/1.1" from 10.244.4.2:39162 - 200 58B in 4.112086ms
2023/08/03 19:26:28 "DELETE http://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local:5555/api/recovery HTTP/1.1" from 10.244.4.2:39162 - 200 0B in 883.544µs
kubectl logs mariadb-galera-0 -c mariadb
2023-08-03 19:27:10 0 [Note] WSREP: Member 2.0 (mariadb-galera-0) synced with group.
2023-08-03 19:27:10 0 [Note] WSREP: Processing event queue:...100.0% (1/1 events) complete.
2023-08-03 19:27:10 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 6)
2023-08-03 19:27:10 2 [Note] WSREP: Server mariadb-galera-0 synced with group
2023-08-03 19:27:10 2 [Note] WSREP: Server status change joined -> synced
2023-08-03 19:27:10 2 [Note] WSREP: Synchronized with group, ready for connections
```
Once you are done with these steps, you will have the context required to jump ahead to the [Common errors](#common-errors) section to see if any of them matches your case. If they don't, feel free to open an issue or even a PR updating this document if you managed to resolve it.
### Common errors
#### Permission denied writing Galera configuration
```bash
Error writing Galera config: open /etc/mysql/mariadb.conf.d/0-galera.cnf: permission denied
```
This error is returned by the `init` container when it is unable to write the configuration file in the filesystem backed by the PVC. In particular, this has been raised by users using longhorn and rook as a storage provider, which in some cases rely on root privileges for writing in the PVC:
- https://github.com/longhorn/longhorn/issues/3549
The remediation is running as root or match the user expected by the storage provider to be able to write in the PVC:
```yaml
apiVersion: mariadb.mmontes.io/v1alpha1
kind: MariaDB
metadata:
name: mariadb-galera
spec:
podSecurityContext:
runAsUser: 0
...
```
#### Unauthorized error disabling bootstrap
```bash
Error reconciling Galera: error disabling bootstrap in Pod 0: unauthorized
```
This situation occurs when the `mariadb-operator` credentials passed to the `agent` as authentication are either invalid or the `agent` is unable to verify them. To confirm this, ensure that both the `mariadb-operator` and the `MariaDB` `ServiceAccounts` are able to create `TokenReview` objects:
```bash
kubectl auth can-i --list --as=system:serviceaccount:default:mariadb-operator | grep tokenreview
tokenreviews.authentication.k8s.io [] [] [create]
kubectl auth can-i --list --as=system:serviceaccount:default:mariadb-galera | grep tokenreview
tokenreviews.authentication.k8s.io [] [] [create]
```
If that's not the case, check that the following `ClusterRole` and `ClusterRoleBindings` are available in your cluster:
```bash
kubectl get clusterrole system:auth-delegator
NAME CREATED AT
system:auth-delegator 2023-08-03T19:12:37Z
kubectl get clusterrolebinding | grep mariadb | grep auth-delegator
mariadb-galera-auth:auth-delegator ClusterRole/system:auth-delegator 108m
mariadb-operator:auth-delegator ClusterRole/system:auth-delegator 112m
```
`mariadb-operator:auth-delegator` is the `ClusterRoleBinding` bound to the `mariadb-operator` `ServiceAccount` which is created by the helm chart, so you can re-install the helm release in order to recreate it:
```bash
helm upgrade --install mariadb-operator mariadb-operator/mariadb-operator
```
`mariadb-galera-auth:auth-delegator` is the `ClusterRoleBinding` bound to the `mariadb-galera` `ServiceAccount` which is created on the flight by the operator as part of the reconciliation logic. You may check the `mariadb-operator` logs to see if there are any issues reconciling it.
#### Timeout waiting for Pod to be Synced
```bash
Timeout waiting for Pod 'mariadb-galera-2' to be Synced
```
This error appears in the `mariadb-operator` logs when a `Pod` is in non synced state for a duration exceeding the `spec.galera.recovery.podRecoveryTimeout`. Just after, the operator will restart the `Pod`.
Increase this timeout if you consider that your `Pod` may take longer to recover.
#### Galera cluster bootstrap timed out
```bash
Galera cluster bootstrap timed out. Resetting recovery status
```
This is error is returned by the `mariadb-operator` after exceeding the `spec.galera.recovery.clusterBootstrapTimeout` when recovering the cluster. At this point, the operator will reset the recovered sequence numbers and start again from a clean state.
Increase this timeout if you consider that your Galera cluster may take longer to recover.
### GitHub Issues
Here it is a list of GitHub issues reported by `mariadb-operator` users which might shed some light in your investigation:
- https://github.com/mariadb-operator/mariadb-operator/issues?q=is%3Aclosed+label%3Agalera-troubleshoot+