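Chapter 1: The Monitoring Technology Stack
System monitoring is a key means of keeping services running reliably. It spans several dimensions, including metrics monitoring, log monitoring, and distributed tracing.
🎯 Learning Objectives
Understand the design principles and implementation of a modern monitoring system, and learn to build comprehensive monitoring for your services.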
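Why Monitoring Matters
Fault Detection
- Real-time alerting - detect system anomalies quickly
- Performance monitoring - identify performance bottlenecks
- Capacity planning - forecast resource needs
Operational Efficiency
- Automated operations - reduce manual intervention
- Root cause analysis - locate problems quickly
- Trend analysis - optimize system performance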
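📊 Prometheus Monitoring System
Metrics Collection
# prometheus.yml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often alerting rules are evaluated
rule_files:
  - "alert_rules.yml"
# Needed for alerts to reach Alertmanager; assumes it runs locally on its default port 9093
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'application'
    metrics_path: '/actuator/prometheus'  # job-level setting; the default is /metrics
    static_configs:
      - targets: ['localhost:8080']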
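The `application` job scrapes `/actuator/prometheus`, which in a Spring Boot service is normally served by the framework's metrics integration. To make concrete what Prometheus reads from such an endpoint, the following is a minimal sketch that renders a counter in the Prometheus text exposition format using only the JDK. The class `MetricsExposition`, its methods, and the HELP text are hypothetical names chosen for illustration; a real service would rely on a metrics library rather than hand-rolled rendering.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper for illustration; not part of Spring Boot or any metrics library.
final class MetricsExposition {
    // One monotonically increasing counter per HTTP status label value.
    private final Map<String, LongAdder> requestsByStatus = new ConcurrentHashMap<>();

    // Increments the counter for the given HTTP status code.
    void recordRequest(int status) {
        requestsByStatus
                .computeIfAbsent(Integer.toString(status), k -> new LongAdder())
                .increment();
    }

    // Renders all counters in the Prometheus text exposition format.
    String render() {
        StringBuilder out = new StringBuilder();
        out.append("# HELP http_requests_total Total HTTP requests handled.\n");
        out.append("# TYPE http_requests_total counter\n");
        // Copy into a TreeMap so the series appear in a deterministic order.
        Map<String, LongAdder> sorted = new TreeMap<String, LongAdder>(requestsByStatus);
        for (Map.Entry<String, LongAdder> e : sorted.entrySet()) {
            out.append("http_requests_total{status=\"")
               .append(e.getKey())
               .append("\"} ")
               .append(e.getValue().sum())
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        MetricsExposition metrics = new MetricsExposition();
        metrics.recordRequest(200);
        metrics.recordRequest(200);
        metrics.recordRequest(500);
        System.out.print(metrics.render());
    }
}
```

Each scrape reads the current counter values; Prometheus derives per-second rates from the difference between successive scrapes, which is why the counters only ever increase.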
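PromQL Queries
# CPU utilization (%) per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# HTTP request success rate (%); sum() on both sides so 2xx requests are divided by all requests instead of being matched label-for-label
sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# 95th-percentile response time; buckets are aggregated by le before the quantile is computed
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))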
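To show what `histogram_quantile` does with the `_bucket` series, here is a simplified sketch of the linear interpolation Prometheus applies to classic (cumulative) histograms. The class `HistogramQuantile` and the sample bucket values are assumptions made for illustration, and the sketch omits parts of the real function (for example, repairing non-monotonic buckets).

```java
// Hypothetical illustration of histogram_quantile on cumulative buckets.
final class HistogramQuantile {
    // upperBounds: ascending `le` values; the last one must be +Inf.
    // cumulativeCounts: number (or rate) of observations <= the matching bound.
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2 || n != cumulativeCounts.length
                || upperBounds[n - 1] != Double.POSITIVE_INFINITY) {
            return Double.NaN;
        }
        double total = cumulativeCounts[n - 1];
        if (total == 0 || Double.isNaN(q)) return Double.NaN;
        if (q < 0) return Double.NEGATIVE_INFINITY;
        if (q > 1) return Double.POSITIVE_INFINITY;

        // Rank of the requested observation among all observations.
        double rank = q * total;
        // Find the first bucket whose cumulative count reaches the rank.
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) b++;
        // The +Inf bucket has no finite width, so fall back to the highest finite bound.
        if (b == n - 1) return upperBounds[n - 2];

        double start = 0; // the first bucket is assumed to start at 0
        double countInBucket = cumulativeCounts[b];
        if (b > 0) {
            start = upperBounds[b - 1];
            countInBucket -= cumulativeCounts[b - 1];
            rank -= cumulativeCounts[b - 1];
        } else if (upperBounds[0] <= 0) {
            return upperBounds[0];
        }
        if (countInBucket == 0) return start;
        // Assume observations are spread evenly inside the bucket.
        return start + (upperBounds[b] - start) * (rank / countInBucket);
    }

    public static void main(String[] args) {
        double[] le = {0.1, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 90, 98, 100};
        System.out.println(quantile(0.95, le, counts));
    }
}
```

With these sample buckets the 95th observation falls in the 0.5 to 1.0 bucket, 5 of its 8 observations in, so the estimate is 0.5 + 0.5 × 5/8 = 0.8125 seconds. The result is only as precise as the bucket boundaries, which is why bucket layout should match the latencies you care about.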
🔔 Alerting Configuration
Alertmanager Configuration
# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@example.com'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@example.com'
        # email_configs has no subject/body fields: the subject is set through headers, the body through text (or html)
        headers:
          Subject: '[ALERT] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Details: {{ .Annotations.description }}
          Started at: {{ .StartsAt }}
          {{ end }}
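The `group_by: ['alertname']` setting means alerts that share the same `alertname` are batched into one notification. The sketch below illustrates that grouping idea with plain Java collections. `AlertGrouping`, its methods, and the instance names are hypothetical; this is not Alertmanager's actual implementation, which also applies the `group_wait`, `group_interval`, and `repeat_interval` timers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of group_by; not Alertmanager's actual implementation.
final class AlertGrouping {
    // Groups alerts by the values of the group_by labels; a missing label counts as "".
    static Map<List<String>, List<Map<String, String>>> group(
            List<Map<String, String>> alerts, List<String> groupBy) {
        // LinkedHashMap keeps groups in the order their first alert arrived.
        Map<List<String>, List<Map<String, String>>> groups = new LinkedHashMap<>();
        for (Map<String, String> labels : alerts) {
            List<String> key = new ArrayList<>();
            for (String name : groupBy) {
                key.add(labels.getOrDefault(name, ""));
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(labels);
        }
        return groups;
    }

    // Builds the label set of one alert.
    static Map<String, String> alert(String alertname, String instance) {
        Map<String, String> labels = new HashMap<>();
        labels.put("alertname", alertname);
        labels.put("instance", instance);
        return labels;
    }

    public static void main(String[] args) {
        List<Map<String, String>> alerts = new ArrayList<>();
        alerts.add(alert("HighCPUUsage", "node-1:9100"));
        alerts.add(alert("HighCPUUsage", "node-2:9100"));
        alerts.add(alert("HighMemoryUsage", "node-1:9100"));
        List<String> groupBy = new ArrayList<>();
        groupBy.add("alertname");
        group(alerts, groupBy).forEach((key, members) ->
                System.out.println(key + " -> " + members.size() + " alert(s)"));
    }
}
```

Here the two `HighCPUUsage` alerts from different instances land in one group and would be sent as a single email. Adding `instance` to `group_by` would split them into separate notifications.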
Alert Rules
# alert_rules.yml
groups:
  - name: system.rules
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Instance {{ $labels.instance }} CPU usage is above 80%, current value: {{ $value }}%"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage"
          description: "Instance {{ $labels.instance }} memory usage is above 85%, current value: {{ $value }}%"
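The `for: 2m` clause means a rule does not fire the moment its expression becomes true: the alert first enters a pending state and fires only once the condition has held continuously for the full duration. The sketch below models that behavior as a small state machine. `PendingAlert` and its method names are hypothetical and only illustrate the semantics; they are not Prometheus code.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of the `for:` clause; not Prometheus's implementation.
final class PendingAlert {
    enum State { INACTIVE, PENDING, FIRING }

    private final Duration holdFor; // corresponds to `for:` in the rule
    private Instant activeSince;    // null while the condition is false

    PendingAlert(Duration holdFor) {
        this.holdFor = holdFor;
    }

    // Called once per evaluation interval with the outcome of the rule expression.
    State evaluate(Instant now, boolean conditionTrue) {
        if (!conditionTrue) {
            // Any false evaluation resets the timer.
            activeSince = null;
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now;
        }
        Duration held = Duration.between(activeSince, now);
        return held.compareTo(holdFor) >= 0 ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        PendingAlert alert = new PendingAlert(Duration.ofMinutes(2));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        // Evaluate every 15 seconds, matching evaluation_interval, with the condition always true.
        for (int i = 0; i <= 8; i++) {
            Instant now = t0.plusSeconds(15L * i);
            System.out.println(now + " " + alert.evaluate(now, true));
        }
    }
}
```

With a 15-second evaluation interval the alert stays pending for the first eight evaluations and fires on the ninth, two minutes after the condition first became true. A single evaluation below the threshold resets the timer, which is what filters out short spikes.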
📈 Grafana Visualization
Dashboard Configuration
{
  "dashboard": {
    "title": "System Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ],
        "yAxes": [
          {
            "max": 100,
            "min": 0,
            "unit": "percent"
          }
        ]
      }
    ]
  }
}
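🔍 Distributed Tracing
Jaeger Configuration
# jaeger-docker-compose.yml
version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "14268:14268"  # collector, jaeger.thrift over HTTP
      - "4317:4317"    # OTLP over gRPC, enabled by COLLECTOR_OTLP_ENABLED
      - "4318:4318"    # OTLP over HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true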
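Application Integration
// Spring Boot integration. Assumes the injected tracer is Brave's brave.Tracer
// (for example via Spring Cloud Sleuth); User and UserService are application classes.
import brave.Span;
import brave.Tracer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class UserController {
    @Autowired
    private Tracer tracer;

    // Used below but not declared in the original snippet.
    @Autowired
    private UserService userService;

    @GetMapping("/users/{id}")
    public User getUser(@PathVariable String id) {
        // Start a span under the current trace context and tag it with the user id.
        Span span = tracer.nextSpan()
                .name("get-user")
                .tag("user.id", id)
                .start();
        // Put the span in scope so downstream calls join the same trace.
        try (Tracer.SpanInScope ws = tracer.withSpanInScope(span)) {
            return userService.findById(id);
        } finally {
            // Brave completes a span with finish(); end() belongs to other tracer APIs.
            span.finish();
        }
    }
}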
📋 Log Monitoring
ELK Stack Configuration
# docker-compose.yml
version: '3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"
  logstash:
    image: docker.elastic.co/logstash/logstash:7.15.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - "5044:5044"
  kibana:
    image: docker.elastic.co/kibana/kibana:7.15.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
Next steps: explore APM (application performance monitoring), custom metric development, and monitoring automation in more depth.