Chapter 1: Monitoring Technology Stack

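System monitoring is a key means of keeping services running reliably. It spans several dimensions, including metrics, logs, and distributed tracing.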

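🎯 Learning Objectives

Understand the design principles and implementation of a modern monitoring system, and learn to build comprehensive monitoring for your services.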

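Why Monitoring Matters

Fault Detection

  • Real-time alerting - detect system anomalies quickly
  • Performance monitoring - identify performance bottlenecks
  • Capacity planning - forecast resource demand

Operational Efficiency

  • Automated operations - reduce manual intervention
  • Root cause analysis - locate problems quickly
  • Trend analysis - optimize system performance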

📊 Prometheus Monitoring

Metrics Collection

# prometheus.yml

global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often alerting rules are evaluated

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'application'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['localhost:8080']
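
Each scrape target above answers an HTTP GET on its metrics path with plain text in the Prometheus exposition format. The following is a minimal sketch of what such a response looks like, written in plain Java. `RequestCounter` is a hypothetical class made up for illustration; a Spring Boot application would normally let Micrometer's Prometheus registry generate this output instead of building it by hand.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical in-memory counter rendered in the Prometheus text exposition format. */
final class RequestCounter {
    // Keyed by HTTP status code, e.g. "200" or "500".
    private final Map<String, LongAdder> countsByStatus = new ConcurrentHashMap<>();

    /** Records one handled request with the given status code. */
    void increment(String status) {
        countsByStatus.computeIfAbsent(status, s -> new LongAdder()).increment();
    }

    /** Renders the text a scrape of the metrics endpoint would receive. */
    String render() {
        StringBuilder out = new StringBuilder();
        out.append("# HELP http_requests_total Total HTTP requests handled.\n");
        out.append("# TYPE http_requests_total counter\n");
        // Copy into a sorted map so the output order is deterministic.
        Map<String, LongAdder> sorted = new TreeMap<>(countsByStatus);
        for (Map.Entry<String, LongAdder> entry : sorted.entrySet()) {
            // Status codes need no escaping; arbitrary label values would
            // require escaping of backslashes, double quotes, and newlines.
            out.append("http_requests_total{status=\"")
               .append(entry.getKey())
               .append("\"} ")
               .append(entry.getValue().sum())
               .append('\n');
        }
        return out.toString();
    }
}
```

Because the value is a counter that only ever increases, queries usually wrap it in `rate(...)` rather than reading the raw number.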

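PromQL Queries

# CPU usage (%) per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# HTTP request success rate (%); both sides are aggregated with sum() so their label sets match
sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# 95th-percentile response time (seconds), computed per series;
# wrap the rate in sum by (le) (...) to aggregate across instances
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))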
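
The percentile query above does not read exact latencies: it estimates the quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below illustrates that idea in Java under simplifying assumptions (classic buckets with positive upper bounds, the last one being +Inf). `HistogramQuantile` and its parameter names are hypothetical, and this is not Prometheus's own implementation.

```java
/** Sketch of the linear interpolation behind a bucket-based quantile estimate. */
final class HistogramQuantile {
    /**
     * @param q           quantile in [0, 1]
     * @param upperBounds bucket upper bounds (the "le" label), ascending, last is +Infinity
     * @param cumulative  cumulative observation counts (or rates), one per bucket
     * @return the estimated value, or NaN if there is too little data
     */
    static double estimate(double q, double[] upperBounds, double[] cumulative) {
        int last = upperBounds.length - 1;
        // At least one finite bucket plus the +Inf bucket is needed.
        if (last < 1) return Double.NaN;
        double total = cumulative[last];
        if (total == 0) return Double.NaN;

        // Rank of the observation we are looking for.
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < last && cumulative[i] < rank) i++;

        // The rank falls into the +Inf bucket: the best available answer
        // is the highest finite upper bound.
        if (i == last) return upperBounds[last - 1];

        double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
        double below = (i == 0) ? 0.0 : cumulative[i - 1];
        double inBucket = cumulative[i] - below;
        if (inBucket == 0) return lower;

        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
    }
}
```

With bounds 0.1, 0.5, 1.0, +Inf and cumulative counts 50, 90, 100, 100, the rank for q = 0.95 is 95, which lies halfway through the 0.5 to 1.0 bucket, so the estimate is 0.75 seconds. The accuracy of the result therefore depends on how well the bucket boundaries match the real latency distribution.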

🔔 Alerting Configuration

Alertmanager Configuration

# alertmanager.yml

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@example.com'
        # email_configs has no subject/body keys: the subject is set
        # through headers and the plain-text body through text.
        headers:
          Subject: '[ALERT] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Summary: {{ .Annotations.summary }}
          Details: {{ .Annotations.description }}
          Started at: {{ .StartsAt }}
          {{ end }}
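
The `group_by: ['alertname']` setting in the route means that alerts sharing the same `alertname` are bundled into one notification. As a rough illustration of that grouping step only (it ignores `group_wait`, `group_interval`, and `repeat_interval`), here is a small Java sketch. `AlertGrouping` is a hypothetical helper, and representing each alert as a plain label map is an assumption made for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that buckets alerts by the value of a single label. */
final class AlertGrouping {
    /**
     * @param label  the label to group by, e.g. "alertname"
     * @param alerts alerts represented as label maps
     * @return groups in first-seen order, keyed by the label value
     */
    static Map<String, List<Map<String, String>>> groupBy(
            String label, List<Map<String, String>> alerts) {
        Map<String, List<Map<String, String>>> groups = new LinkedHashMap<>();
        for (Map<String, String> labels : alerts) {
            // Alerts that lack the label share one group with an empty key.
            String key = labels.getOrDefault(label, "");
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(labels);
        }
        return groups;
    }
}
```

Two `HighCPUUsage` alerts from different instances would land in the same group and be delivered in a single email, which is why the message template iterates over the group's alerts.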

Alerting Rules

# alert_rules.yml

groups:
  - name: system.rules
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Instance {{ $labels.instance }} CPU usage is above 80%, current value: {{ $value }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage"
          description: "Instance {{ $labels.instance }} memory usage is above 85%, current value: {{ $value }}%"
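
The `for: 2m` clause in these rules means an alert does not fire the moment its expression becomes true. It first becomes pending and only fires once the condition has held on every evaluation for the full duration. The Java sketch below models that behavior for a single series; `ForClause` and its `State` enum are hypothetical names used for illustration, not part of Prometheus.

```java
import java.time.Duration;
import java.time.Instant;

/** Sketch of the pending/firing transition implied by a rule's "for" clause. */
final class ForClause {
    enum State { INACTIVE, PENDING, FIRING }

    private final double threshold;
    private final Duration holdFor;
    private Instant activeSince; // null while the condition is false

    ForClause(double threshold, Duration holdFor) {
        this.threshold = threshold;
        this.holdFor = holdFor;
    }

    /** Called once per rule evaluation with the current value of the expression. */
    State evaluate(Instant now, double value) {
        if (!(value > threshold)) {
            // Condition is false (or the value is NaN): reset the timer.
            activeSince = null;
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now; // first evaluation at which the condition holds
        }
        boolean heldLongEnough =
                Duration.between(activeSince, now).compareTo(holdFor) >= 0;
        return heldLongEnough ? State.FIRING : State.PENDING;
    }
}
```

A short spike that clears before two minutes have passed therefore never reaches the firing state, which keeps brief fluctuations from generating notifications.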

📈 Grafana Visualization

Dashboard Configuration

{
  "dashboard": {
    "title": "System Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        }
      }
    ]
  }
}

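🔍 Distributed Tracing

Jaeger Configuration

# jaeger-docker-compose.yml

version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # web UI
      - "14268:14268"  # collector HTTP endpoint (jaeger.thrift)
      - "4317:4317"    # OTLP over gRPC
      - "4318:4318"    # OTLP over HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true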

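Application Integration

// Spring Boot integration (a sketch assuming the Micrometer Tracing API;
// User and UserService are application classes not shown here)
import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class UserController {

    private final Tracer tracer;
    private final UserService userService;

    // Constructor injection: Spring supplies both beans.
    public UserController(Tracer tracer, UserService userService) {
        this.tracer = tracer;
        this.userService = userService;
    }

    @GetMapping("/users/{id}")
    public User getUser(@PathVariable String id) {
        Span span = tracer.nextSpan()
                .name("get-user")
                .tag("user.id", id)
                .start();

        // Put the span in scope so downstream calls join the same trace.
        try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
            return userService.findById(id);
        } finally {
            span.end(); // always close the span, even if the lookup throws
        }
    }
}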

📋 Log Monitoring

ELK Stack Configuration

# docker-compose.yml

version: '3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:7.15.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - "5044:5044"

  kibana:
    image: docker.elastic.co/kibana/kibana:7.15.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200

Next steps: dig deeper into APM (application performance monitoring), custom metric development, and monitoring automation.