SLI/SLO Monitoring and Dashboards with Prometheus

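What are SLI and SLO

SLI stands for Service Level Indicator: a metric used to measure the stability of a system.

SLO stands for Service Level Objective: the stability target we set for ourselves, such as "four nines" (99.99%) or "five nines" (99.999%).

SREs usually use the two together to assess system stability. The main idea is to judge the SLO through the SLIs: a set of indicators tells us whether the target of "so many nines" has actually been met.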
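To make the "nines" concrete, here is a minimal sketch that converts a number of nines into an availability target and the downtime that target tolerates. The 30-day window and the function names are assumptions chosen for illustration; the article does not prescribe them.

```python
def availability_target(nines: int) -> float:
    """Availability ratio for a given number of nines, e.g. 4 -> 0.9999."""
    return 1 - 10 ** (-nines)


def allowed_downtime_minutes(nines: int, window_days: int = 30) -> float:
    """Minutes of full outage the target tolerates over the window.

    The 30-day default is an assumption, not something fixed by the SLO itself.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * 10 ** (-nines)


for n in (3, 4, 5):
    print(
        f"{n} nines: target={availability_target(n):.5f}, "
        f"downtime budget={allowed_downtime_minutes(n):.2f} min per 30 days"
    )
```

Under these assumptions, four nines leaves roughly 4.3 minutes of downtime per 30 days, which is why each extra nine is so much harder to reach.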

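How to choose SLIs

A system exposes many kinds of metrics, for example:

With so many metrics, how do you choose? Following two principles is enough:

In most cases you can directly apply Google's VALET method (Volume, Availability, Latency, Errors, Tickets).

[Figure: the sample Google gives for the VALET method]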

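The above is only a brief introduction to SLI/SLO. To learn more, read 《SRE：Google运维解密》 (Site Reliability Engineering: How Google Runs Production Systems) and Zhao Cheng's Geek Time course 《SRE实践手册》. The rest of this article briefly shows how to monitor SLIs/SLOs with Prometheus.

Service Level Operator measures the service level of applications running in Kubernetes through SLI/SLO metrics, and the results can be displayed in Grafana.

The operator reads SLO definitions and generates new metrics from them. For example: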

apiVersion: monitoring.spotahome.com/v1alpha1
kind: ServiceLevel
metadata:
  name: awesome-service
spec:
  serviceLevelObjectives:
    - name: "9999_http_request_lt_500"
      description: 99.99% of requests must be served with <500 status code.
      disable: false
      availabilityObjectivePercent: 99.99
      serviceLevelIndicator:
        prometheus:
          address: http://myprometheus:9090
          totalQuery: sum(increase(http_request_total{host="awesome_service_io"}[2m]))
          errorQuery: sum(increase(http_request_total{host="awesome_service_io", code=~"5.."}[2m]))
      output:
        prometheus:
          labels:
            team: a-team
            iteration: "3"

From totalQuery and errorQuery the operator can compute the SLO metrics.
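The arithmetic behind that step can be sketched as follows. This is an illustration of the idea, not the operator's actual implementation; the function names and the sample numbers are hypothetical and stand in for the values the two Prometheus queries would return.

```python
def error_ratio(total: float, errors: float) -> float:
    """Fraction of failed requests; an empty window is treated as error-free."""
    if total <= 0:
        return 0.0
    return errors / total


def meets_objective(total: float, errors: float, objective_percent: float) -> bool:
    """True when measured availability reaches availabilityObjectivePercent."""
    availability_percent = (1 - error_ratio(total, errors)) * 100
    return availability_percent >= objective_percent


# Hypothetical results of totalQuery / errorQuery for one evaluation window.
print(meets_objective(total=200_000, errors=10, objective_percent=99.99))  # True
print(meets_objective(total=200_000, errors=50, objective_percent=99.99))  # False
```

With 200,000 requests, 10 errors give 99.995% availability and pass a 99.99% objective, while 50 errors give 99.975% and fail it.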

(1) First, create the RBAC resources

(2) Then create the Deployment

(3) Create the Service

(4) Create the Prometheus ServiceMonitor

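At this point the Service Level Operator is deployed, and the corresponding target is visible in Prometheus:

[Screenshot: the Service Level Operator target in Prometheus]

Next, create the ServiceLevel resource for the service you want to measure. Here is an example: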

apiVersion: monitoring.spotahome.com/v1alpha1
kind: ServiceLevel
metadata:
  name: prometheus-grafana-service
  namespace: monitoring
spec:
  serviceLevelObjectives:
    - name: "9999_http_request_lt_500"
      description: 99.99% of requests must be served with <500 status code.
      disable: false
      availabilityObjectivePercent: 99.99
      serviceLevelIndicator:
        prometheus:
          address: http://prometheus-k8s.monitoring.svc:9090
          totalQuery: sum(increase(http_request_total{service="grafana"}[2m]))
          errorQuery: sum(increase(http_request_total{service="grafana", code=~"5.."}[2m]))
      output:
        prometheus:
          labels:
            team: prometheus-grafana
            iteration: "3"

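The manifest above defines a "four nines" (99.99%) SLO for the grafana application.

The resulting metrics can then be seen in Prometheus:

[Screenshot: the service level metrics in Prometheus]

Next, import the dashboard with ID 8793 into Grafana to generate the following charts:

[Screenshot: the Grafana dashboard]

The upper part shows the SLI; the lower part shows the total error budget and how much of it has been consumed.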
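The two error-budget quantities on the dashboard can be approximated by hand. The sketch below is an assumption about how such numbers are typically derived, not the dashboard's own queries; the function names and request counts are hypothetical.

```python
def error_budget(total_requests: int, objective_percent: float) -> float:
    """Number of requests that may fail in the period without breaking the SLO."""
    return total_requests * (1 - objective_percent / 100)


def budget_consumed(errors: int, total_requests: int, objective_percent: float) -> float:
    """Fraction of the error budget already used (1.0 means fully spent)."""
    budget = error_budget(total_requests, objective_percent)
    if budget <= 0:
        raise ValueError("objective leaves no error budget for this period")
    return errors / budget


# Hypothetical month: 10 million requests against a 99.99% objective.
print(round(error_budget(10_000_000, 99.99)))             # about 1000 failed requests allowed
print(round(budget_consumed(250, 10_000_000, 99.99), 2))  # about a quarter of the budget used
```

Expressing consumption as a fraction of the budget, rather than as a raw error count, makes services with very different traffic volumes comparable.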

You can then define alerting rules so that you are notified as soon as the SLO starts to degrade, for example:

groups:
  - name: slo.rules
    rules:
      - alert: SLOErrorRateTooFast1h
        expr: |
          (
            increase(service_level_sli_result_error_ratio_total[1h])
            /
            increase(service_level_sli_result_count_total[1h])
          ) > (1 - service_level_slo_objective_ratio) * 14.6
        labels:
          severity: critical
          team: a-team
        annotations:
          summary: The monthly SLO error budget consumed for 1h is greater than 2%
          description: The error rate for 1h in the {{$labels.service_level}}/{{$labels.slo}} SLO error budget is being consumed too fast, is greater than 2% monthly budget.
      - alert: SLOErrorRateTooFast6h
        expr: |
          (
            increase(service_level_sli_result_error_ratio_total[6h])
            /
            increase(service_level_sli_result_count_total[6h])
          ) > (1 - service_level_slo_objective_ratio) * 6
        labels:
          severity: critical
          team: a-team
        annotations:
          summary: The monthly SLO error budget consumed for 6h is greater than 5%
          description: The error rate for 6h in the {{$labels.service_level}}/{{$labels.slo}} SLO error budget is being consumed too fast, is greater than 5% monthly budget.
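The multipliers 14.6 and 6 in these rules are burn rates: how many times faster than "exactly on budget" errors may arrive before the alert fires. The following sketch shows where such factors come from. The 730-hour month is an assumption that reproduces the 14.6 factor, and the function names are hypothetical.

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 730.0) -> float:
    """Burn rate at which `budget_fraction` of the period's error budget
    is consumed within `window_hours` (period length is an assumption)."""
    return budget_fraction * period_hours / window_hours


def is_burning_too_fast(error_ratio: float, objective_ratio: float,
                        factor: float) -> bool:
    """Same comparison as the alert expression:
    error ratio > (1 - objective) * factor."""
    return error_ratio > (1 - objective_ratio) * factor


print(round(burn_rate_threshold(0.02, 1), 2))  # 2% of the budget in 1h -> 14.6
print(round(burn_rate_threshold(0.05, 6), 2))  # 5% of the budget in 6h -> 6.08

# With a 99.99% objective the 1h alert fires above a 0.146% error ratio.
print(is_burning_too_fast(0.002, 0.9999, 14.6))  # True
print(is_burning_too_fast(0.001, 0.9999, 14.6))  # False
```

With a 720-hour (30-day) month the 6h factor comes out to exactly 6, so the rounded value used in the rule is close under either assumption.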