mirror of
https://github.com/mariadb-operator/mariadb-operator.git
synced 2025-07-22 18:27:44 +00:00
Add troubleshooting section in GALERA.md
committed by
Martín Montes

parent
c6544d0583
commit
eb19c72b90
4
.github/ISSUE_TEMPLATE/bug.md
vendored
@@ -22,6 +22,10 @@ Tip: you can use
for code blocks of your kubectl output or YAML files.
-->

**Is the bug related to Galera?**
- [ ] I have checked the [Galera documentation](../../docs/GALERA.md)
- [ ] I have checked the [Galera troubleshooting guide](../../docs/GALERA.md#troubleshooting)

**Expected behaviour**
<!--A concise description of what you expected to happen.-->

190
docs/GALERA.md
@@ -1,4 +1,4 @@
## ✨ High availability via Galera
# ✨ High availability via Galera

The `mariadb-operator` provides cloud native support for provisioning and operating multi-master MariaDB clusters using Galera. This setup allows both read and write operations on all nodes, enhancing availability and allowing scalability across multiple nodes.

@@ -6,13 +6,13 @@ In certain circumstances, it could be the case that all the nodes of your cluste

To accomplish this, after the MariaDB cluster has been provisioned, `mariadb-operator` will regularly monitor the cluster's status to make sure it is healthy. If any issues are detected, the operator will initiate the [recovery process](https://galeracluster.com/library/documentation/crash-recovery.html) to restore the cluster to a healthy state. During this process, the operator will set status conditions in the `MariaDB` and emit `Events` so you have a better understanding of the recovery progress and the underlying activities being performed. For example, you may want to know which `Pods` were out of sync to further investigate infrastructure-related issues (e.g. networking, storage...) on the nodes where these `Pods` were scheduled.

### Components
## Components

To be able to effectively provision and recover MariaDB Galera clusters, the following components were introduced to co-operate with `mariadb-operator`:
- **[🍼 init](https://github.com/mariadb-operator/init)**: Init container that dynamically provisions the Galera configuration file before the MariaDB container starts. Guarantees ordered deployment of `Pods` even if `spec.podManagementPolicy = Parallel` is set on the MariaDB `StatefulSet`, something crucial for performing the Galera recovery, as the operator needs to restart `Pods` independently.
- **[🤖 agent](https://github.com/mariadb-operator/agent)**: Sidecar agent that exposes the Galera state ([`grastate.dat`](https://galeracluster.com/2016/11/introducing-the-safe-to-bootstrap-feature-in-galera-cluster/)) via HTTP and allows one to remotely bootstrap and recover the Galera cluster. For security reasons, it has authentication based on Kubernetes service accounts; this way only the `mariadb-operator` is able to call the agent.

### Configuration
## Configuration

The easiest way to get a MariaDB Galera cluster up and running is setting `spec.galera.enabled = true`, like in this [example](../examples/manifests/mariadb_v1alpha1_mariadb_galera_minimal.yaml):

@@ -74,7 +74,7 @@ spec:

Refer to the [API Reference](#api-reference) below to better understand the purpose of each field.

### API Reference
## API Reference
- [Go API pkg](https://pkg.go.dev/github.com/mariadb-operator/mariadb-operator@v0.0.16/api/v1alpha1#Galera)
- [Code](../api/v1alpha1/mariadb_galera_types.go)
- **`kubectl explain`**
@@ -126,7 +126,7 @@ FIELDS:
Once this timeout is reached, the Pod is restarted.
```

### Quickstart
## Quickstart

Let's see how `mariadb-operator`🦭 and Galera✨ play together! First of all, install the following configuration manifests that will be referenced by the CRDs later on:
```bash
@@ -212,3 +212,183 @@ mariadb-galera True Running All 82m
```

To conclude, it's important to note that the Galera functionality is 100% compatible with the rest of the `mariadb-operator` constructs: `Backup`, `Restore`, `Connection`... Refer to the [main quickstart guide](../README.md#quickstart) for more detail.

## Troubleshooting

This section shows how to diagnose your Galera cluster when something goes wrong. In these situations, observability is key to understanding the problem, so we recommend following these steps before you start debugging.

- Make sure network connectivity is fine by checking that the `Endpoints` object of your Galera cluster lists one address per `Pod`.
```bash
kubectl get endpoints mariadb-galera-internal -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: mariadb-internal
subsets:
- addresses:
  - hostname: mariadb-1
    ip: 10.255.140.181
    nodeName: k8s-worker-1
    targetRef:
      kind: Pod
      name: mariadb-1
      namespace: mariadb
  - hostname: mariadb-2
    ip: 10.255.20.156
    nodeName: k8s-worker-2
    targetRef:
      kind: Pod
      name: mariadb-2
      namespace: mariadb
  - hostname: mariadb-0
    ip: 10.255.214.164
    nodeName: k8s-worker-0
    targetRef:
      kind: Pod
      name: mariadb-0
      namespace: mariadb
  ports:
  - name: sst
    port: 4568
    protocol: TCP
  - name: ist
    port: 4567
    protocol: TCP
  - name: mariadb
    port: 3306
    protocol: TCP
  - name: agent
    port: 5555
    protocol: TCP
  - name: cluster
    port: 4444
    protocol: TCP

```
- Check the events associated with the `MariaDB` object. They provide valuable insight for diagnosis, particularly during cluster recovery.
```bash
kubectl get events --field-selector involvedObject.name=mariadb-galera --sort-by='.lastTimestamp'
LAST SEEN   TYPE      REASON                    OBJECT                   MESSAGE
...
16m         Warning   GaleraClusterNotHealthy   mariadb/mariadb-galera   Galera cluster is not healthy
16m         Normal    GaleraPodStateFetched     mariadb/mariadb-galera   Galera state fetched in Pod 'mariadb-galera-2'
16m         Normal    GaleraPodStateFetched     mariadb/mariadb-galera   Galera state fetched in Pod 'mariadb-galera-1'
16m         Normal    GaleraPodStateFetched     mariadb/mariadb-galera   Galera state fetched in Pod 'mariadb-galera-0'
16m         Normal    GaleraPodRecovered        mariadb/mariadb-galera   Recovered Galera sequence in Pod 'mariadb-galera-1'
16m         Normal    GaleraPodRecovered        mariadb/mariadb-galera   Recovered Galera sequence in Pod 'mariadb-galera-2'
17m         Normal    GaleraPodRecovered        mariadb/mariadb-galera   Recovered Galera sequence in Pod 'mariadb-galera-0'
17m         Normal    GaleraClusterBootstrap    mariadb/mariadb-galera   Bootstrapping Galera cluster in Pod 'mariadb-galera-2'
20m         Normal    GaleraClusterHealthy      mariadb/mariadb-galera   Galera cluster is healthy
```

- Enable `debug` logs in `mariadb-operator`.

```bash
helm upgrade --install mariadb-operator mariadb-operator/mariadb-operator --set logLevel=debug
kubectl logs mariadb-operator-546c78f4f5-gq44k
{"level":"info","ts":1691090524.4911606,"logger":"galera.health","msg":"Checking Galera cluster health","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet","statefulSet":{"name":"mariadb-galera","namespace":"default"},"namespace":"default","name":"mariadb-galera","reconcileID":"098620db-4486-45cc-966a-9f3fec0d165e"}
{"level":"debug","ts":1691090524.4911761,"logger":"galera.health","msg":"StatefulSet ready replicas","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet","statefulSet":{"name":"mariadb-galera","namespace":"default"},"namespace":"default","name":"mariadb-galera","reconcileID":"098620db-4486-45cc-966a-9f3fec0d165e","replicas":1}
```

- Get the logs of all the containers in the `MariaDB` `Pods`: not only the main `mariadb` container, but also the `agent` and `init` ones.

```bash
kubectl logs mariadb-galera-0 -c init
{"level":"info","ts":1691090778.5239124,"msg":"Starting init"}
{"level":"info","ts":1691090778.5305626,"msg":"Configuring Galera"}
{"level":"info","ts":1691090778.5307593,"msg":"Already initialized. Init done"}

kubectl logs mariadb-galera-0 -c agent
{"level":"info","ts":1691090779.3193653,"logger":"server","msg":"server listening","addr":":5555"}
2023/08/03 19:26:28 "POST http://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local:5555/api/recovery HTTP/1.1" from 10.244.4.2:39162 - 200 58B in 4.112086ms
2023/08/03 19:26:28 "DELETE http://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local:5555/api/recovery HTTP/1.1" from 10.244.4.2:39162 - 200 0B in 883.544µs

kubectl logs mariadb-galera-0 -c mariadb
2023-08-03 19:27:10 0 [Note] WSREP: Member 2.0 (mariadb-galera-0) synced with group.
2023-08-03 19:27:10 0 [Note] WSREP: Processing event queue:...100.0% (1/1 events) complete.
2023-08-03 19:27:10 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 6)
2023-08-03 19:27:10 2 [Note] WSREP: Server mariadb-galera-0 synced with group
2023-08-03 19:27:10 2 [Note] WSREP: Server status change joined -> synced
2023-08-03 19:27:10 2 [Note] WSREP: Synchronized with group, ready for connections
```
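
The diagnostic steps above can be gathered in a single pass. The following is only a minimal sketch: it assumes a `MariaDB` named `mariadb-galera` with 3 replicas in the current namespace, and `same_count`, `collect_diagnostics`, `NAME` and `REPLICAS` are hypothetical names introduced for this illustration, not part of `mariadb-operator`.

```bash
# Hypothetical helper: succeeds when the number of non-empty lines read from
# stdin equals the expected count passed as the first argument.
same_count() {
  same_count_actual="$(grep -c . || true)"
  [ "$same_count_actual" -eq "$1" ]
}

# Assumed values: adjust them to your own MariaDB resource.
NAME="${NAME:-mariadb-galera}"
REPLICAS="${REPLICAS:-3}"

collect_diagnostics() {
  # One Endpoints address per Pod is expected.
  kubectl get endpoints "${NAME}-internal" \
    -o jsonpath='{range .subsets[*].addresses[*]}{.hostname}{"\n"}{end}' \
    | same_count "$REPLICAS" \
    || echo "unexpected number of endpoint addresses" >&2

  # Events of the MariaDB object, sorted by timestamp.
  kubectl get events --field-selector "involvedObject.name=${NAME}" \
    --sort-by='.lastTimestamp'

  # Logs of every container in every Pod.
  pod_index=0
  while [ "$pod_index" -lt "$REPLICAS" ]; do
    for container in init agent mariadb; do
      echo "--- ${NAME}-${pod_index} / ${container}"
      kubectl logs "${NAME}-${pod_index}" -c "$container"
    done
    pod_index=$((pod_index + 1))
  done
}
```

Calling `collect_diagnostics > diagnostics.txt` would store the output in a single file that can be attached to an issue.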

Once you are done with these steps, you will have the context required to jump ahead to the [Common errors](#common-errors) section and see if any of them matches your case. If none does, feel free to open an issue, or even a PR updating this document if you manage to resolve it.

### Common errors

#### Permission denied writing Galera configuration

```bash
Error writing Galera config: open /etc/mysql/mariadb.conf.d/0-galera.cnf: permission denied
```
This error is returned by the `init` container when it is unable to write the configuration file to the filesystem backed by the PVC. In particular, it has been reported by users of Longhorn and Rook as storage providers, which in some cases rely on root privileges for writing to the PVC:
- https://github.com/longhorn/longhorn/issues/3549

The remediation is to run as root, or as the user expected by the storage provider, to be able to write to the PVC:

```yaml
apiVersion: mariadb.mmontes.io/v1alpha1
kind: MariaDB
metadata:
  name: mariadb-galera
spec:
  podSecurityContext:
    runAsUser: 0
  ...
```

#### Unauthorized error disabling bootstrap

```bash
Error reconciling Galera: error disabling bootstrap in Pod 0: unauthorized
```
This error occurs when the credentials that `mariadb-operator` passes to the `agent` for authentication are invalid, or when the `agent` is unable to verify them. To confirm this, ensure that both the `mariadb-operator` and the `MariaDB` `ServiceAccounts` are able to create `TokenReview` objects:

```bash
kubectl auth can-i --list --as=system:serviceaccount:default:mariadb-operator | grep tokenreview
tokenreviews.authentication.k8s.io   []   []   [create]

kubectl auth can-i --list --as=system:serviceaccount:default:mariadb-galera | grep tokenreview
tokenreviews.authentication.k8s.io   []   []   [create]
```

If that's not the case, check that the following `ClusterRole` and `ClusterRoleBindings` are available in your cluster:
```bash
kubectl get clusterrole system:auth-delegator
NAME                    CREATED AT
system:auth-delegator   2023-08-03T19:12:37Z

kubectl get clusterrolebinding | grep mariadb | grep auth-delegator
mariadb-galera-auth:auth-delegator   ClusterRole/system:auth-delegator   108m
mariadb-operator:auth-delegator      ClusterRole/system:auth-delegator   112m
```
`mariadb-operator:auth-delegator` is the `ClusterRoleBinding` bound to the `mariadb-operator` `ServiceAccount`. It is created by the Helm chart, so you can re-install the Helm release in order to recreate it:

```bash
helm upgrade --install mariadb-operator mariadb-operator/mariadb-operator
```

`mariadb-galera-auth:auth-delegator` is the `ClusterRoleBinding` bound to the `mariadb-galera` `ServiceAccount`. It is created on the fly by the operator as part of the reconciliation logic. You may check the `mariadb-operator` logs to see if there are any issues reconciling it.
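
The `TokenReview` permission check can also be scripted. This is a minimal sketch, assuming the `default` namespace and the `ServiceAccount` names used in the examples of this section; `can_create_tokenreviews` and `check_service_accounts` are hypothetical helpers that only parse the output of `kubectl auth can-i --list`.

```bash
# Hypothetical helper: reads `kubectl auth can-i --list` output from stdin and
# succeeds only if a tokenreviews row grants the `create` verb.
can_create_tokenreviews() {
  grep 'tokenreviews\.authentication\.k8s\.io' | grep -q 'create'
}

# Assumed namespace (default) and ServiceAccount names; adjust as needed.
check_service_accounts() {
  for sa in mariadb-operator mariadb-galera; do
    if kubectl auth can-i --list --as="system:serviceaccount:default:${sa}" \
      | can_create_tokenreviews; then
      echo "${sa}: can create TokenReviews"
    else
      echo "${sa}: cannot create TokenReviews" >&2
    fi
  done
}
```

Note that this sketch looks for an explicit `tokenreviews` rule only, so permissions granted through wildcard rules would not be detected.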

#### Timeout waiting for Pod to be Synced

```bash
Timeout waiting for Pod 'mariadb-galera-2' to be Synced
```
This error appears in the `mariadb-operator` logs when a `Pod` remains in a non-synced state for longer than `spec.galera.recovery.podRecoveryTimeout`. Right after, the operator restarts the `Pod`.

Increase this timeout if you consider that your `Pod` may take longer to recover.

#### Galera cluster bootstrap timed out

```bash
Galera cluster bootstrap timed out. Resetting recovery status
```
This error is returned by the `mariadb-operator` when `spec.galera.recovery.clusterBootstrapTimeout` is exceeded while recovering the cluster. At this point, the operator will reset the recovered sequence numbers and start again from a clean state.

Increase this timeout if you consider that your Galera cluster may take longer to recover.
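
One possible way to raise a recovery timeout without editing the full manifest is a JSON merge patch. This is a sketch under assumptions: the resource is named `mariadb-galera`, `kubectl` resolves the `mariadb` resource name for the `MariaDB` CRD, and the duration passed in is an arbitrary example value. `recovery_timeout_patch` and `set_recovery_timeout` are hypothetical helpers.

```bash
# Hypothetical helper: prints a JSON merge patch that sets one field under
# spec.galera.recovery (e.g. podRecoveryTimeout or clusterBootstrapTimeout).
recovery_timeout_patch() {
  printf '{"spec":{"galera":{"recovery":{"%s":"%s"}}}}\n' "$1" "$2"
}

# Assumed resource name (mariadb-galera); the field and value are arguments.
set_recovery_timeout() {
  kubectl patch mariadb mariadb-galera --type merge \
    -p "$(recovery_timeout_patch "$1" "$2")"
}
```

For example, `set_recovery_timeout clusterBootstrapTimeout 10m` would apply the patch printed by `recovery_timeout_patch clusterBootstrapTimeout 10m`; choose a duration that fits the size of your cluster.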

### GitHub Issues

Here is a list of GitHub issues reported by `mariadb-operator` users which might shed some light on your investigation:
- https://github.com/mariadb-operator/mariadb-operator/issues?q=is%3Aclosed+label%3Agalera-troubleshoot+