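Monitoring Technology Stack
System monitoring is essential to keeping services running reliably. It spans several dimensions, including metrics monitoring, log monitoring, and distributed tracing.
🎯 Learning Objectives
Understand the design principles behind a modern monitoring stack and how to implement them, and learn to build end-to-end monitoring for your services.
Why Monitoring Matters
Fault Detection
- Real-time alerting - detect system anomalies quickly
- Performance monitoring - identify performance bottlenecks
- Capacity planning - forecast resource demand
Operational Efficiency
- Automated operations - reduce manual intervention
- Root cause analysis - locate problems quickly
- Trend analysis - optimize system performance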
📊 Prometheus Monitoring Stack
Metrics Collection
# prometheus.yml configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alert_rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'application'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/actuator/prometheus'
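The `application` job above scrapes `/actuator/prometheus`, the path where Spring Boot Actuator serves metrics in the Prometheus text exposition format (assuming Micrometer's Prometheus registry is on the classpath). To give a rough idea of what a scrape returns, the stdlib-only sketch below renders a labeled counter in that format. `CounterExposition` and its methods are hypothetical names for illustration, not part of Micrometer or the Prometheus client library.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, stdlib-only counter that renders itself in the Prometheus text format. */
public class CounterExposition {
    private final String name;
    private final String help;
    // One monotonically increasing value per "status" label value, sorted for stable output.
    private final Map<String, AtomicLong> byStatus = new TreeMap<>();

    public CounterExposition(String name, String help) {
        this.name = name;
        this.help = help;
    }

    /** Increment the series for the given HTTP status code. */
    public synchronized void inc(String status) {
        byStatus.computeIfAbsent(status, s -> new AtomicLong()).incrementAndGet();
    }

    /** Render HELP/TYPE metadata followed by one sample line per label value. */
    public synchronized String render() {
        StringBuilder sb = new StringBuilder();
        sb.append("# HELP ").append(name).append(' ').append(help).append('\n');
        sb.append("# TYPE ").append(name).append(" counter\n");
        for (Map.Entry<String, AtomicLong> e : byStatus.entrySet()) {
            sb.append(name).append("{status=\"").append(e.getKey()).append("\"} ")
              .append(e.getValue().get()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CounterExposition c = new CounterExposition("http_requests_total", "Total HTTP requests.");
        c.inc("200");
        c.inc("200");
        c.inc("500");
        System.out.print(c.render());
    }
}
```

In a real Spring Boot service this rendering is handled by the metrics library; the sketch only shows the shape of the lines Prometheus parses on each scrape.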
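PromQL Queries
# CPU utilization (%)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# HTTP request success rate (%): aggregate both sides so their label sets match
sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# 95th-percentile response time (seconds): aggregate bucket rates by le first
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))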
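`histogram_quantile` estimates a quantile from cumulative bucket counts by locating the bucket that contains the target rank and interpolating linearly inside it. The sketch below is a simplified re-implementation of that idea for classic histograms, written to make the arithmetic visible; it is not Prometheus source code, and the class name and the sample bucket values are made up for illustration. It assumes at least two buckets, the last of which is the `+Inf` bucket, and `0 < q <= 1`.

```java
/** Simplified illustration of linear interpolation over cumulative histogram buckets. */
public class HistogramQuantile {
    /**
     * @param q           target quantile, 0 < q <= 1
     * @param upperBounds ascending bucket upper bounds ("le"); the last one is +Infinity
     * @param cumulative  cumulative counts (or per-second rates) for each bucket
     */
    public static double quantile(double q, double[] upperBounds, double[] cumulative) {
        int n = upperBounds.length;
        if (n < 2 || cumulative.length != n || q <= 0 || q > 1) {
            throw new IllegalArgumentException("need >= 2 buckets and 0 < q <= 1");
        }
        double total = cumulative[n - 1];
        if (total == 0) {
            return Double.NaN; // no observations in the window
        }
        double rank = q * total;
        // Find the first bucket whose cumulative count reaches the rank.
        int b = 0;
        while (b < n - 1 && cumulative[b] < rank) {
            b++;
        }
        if (b == n - 1) {
            // Rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[n - 2];
        }
        double start = (b == 0) ? 0.0 : upperBounds[b - 1];
        double below = (b == 0) ? 0.0 : cumulative[b - 1];
        double inBucket = cumulative[b] - below;
        // Assume observations are spread evenly within the bucket.
        return start + (upperBounds[b] - start) * (rank - below) / inBucket;
    }

    public static void main(String[] args) {
        double[] le = {0.1, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 90, 100, 100};
        System.out.println(quantile(0.95, le, counts));
    }
}
```

With these sample buckets the 95th-percentile rank lands halfway through the 0.5s to 1.0s bucket, so the estimate sits in the middle of that bucket. This also shows why bucket boundaries matter: the result can never be more precise than the bucket it falls into.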
🔔 Alerting Configuration
Alertmanager Configuration
# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@example.com'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  email_configs:
  - to: 'admin@example.com'
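    # email_configs has no subject/body fields: the subject is set via headers, the plain-text body via text
    headers:
      Subject: '[Alert] {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Details: {{ .Annotations.description }}
      Started at: {{ .StartsAt }}
      {{ end }}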
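In the `route` block, `group_by: ['alertname']` makes Alertmanager batch alerts that share the same `alertname` into one notification. `group_wait` is how long it waits before sending the first notification for a new group, `group_interval` is the wait before notifying about new alerts added to an existing group, and `repeat_interval` is the wait before re-sending a notification that is still firing. The sketch below illustrates only the grouping step; `GroupBySketch` is a hypothetical helper, not Alertmanager code, and the instance labels are made-up sample values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of Alertmanager-style grouping by label values. */
public class GroupBySketch {
    /** Groups alerts (represented as label maps) by the values of the groupBy labels. */
    public static Map<List<String>, List<Map<String, String>>> group(
            List<Map<String, String>> alerts, List<String> groupBy) {
        Map<List<String>, List<Map<String, String>>> groups = new LinkedHashMap<>();
        for (Map<String, String> labels : alerts) {
            // The group key is the tuple of values for the group_by labels.
            List<String> key = new ArrayList<>();
            for (String name : groupBy) {
                key.add(labels.getOrDefault(name, ""));
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(labels);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map<String, String>> alerts = List.of(
            Map.of("alertname", "HighCPUUsage", "instance", "a:9100"),
            Map.of("alertname", "HighCPUUsage", "instance", "b:9100"),
            Map.of("alertname", "HighMemoryUsage", "instance", "a:9100"));
        group(alerts, List.of("alertname")).forEach((key, members) ->
            System.out.println(key + " -> " + members.size() + " alert(s)"));
    }
}
```

Grouping only by `alertname` means CPU alerts from every instance arrive in a single email; adding `instance` to `group_by` would produce one notification per host instead.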
Alert Rules
# alert_rules.yml
groups:
- name: system.rules
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage"
      description: "CPU usage on instance {{ $labels.instance }} is above 80% (current value: {{ $value }}%)"
  - alert: HighMemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage"
      description: "Memory usage on instance {{ $labels.instance }} is above 85% (current value: {{ $value }}%)"
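The `for: 2m` clause in both rules means the expression must stay true across consecutive evaluations for two minutes before the alert fires; until then the alert is pending, and a single evaluation below the threshold resets it. A minimal sketch of that state machine follows, assuming a simple "value above threshold" condition. `ForClauseSketch` is a hypothetical class for illustration, not Prometheus code.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical model of how a "for" clause moves an alert from pending to firing. */
public class ForClauseSketch {
    public enum State { INACTIVE, PENDING, FIRING }

    private final double threshold;
    private final Duration holdFor;
    private Instant activeSince; // null while the condition is false

    public ForClauseSketch(double threshold, Duration holdFor) {
        this.threshold = threshold;
        this.holdFor = holdFor;
    }

    /** Called once per evaluation_interval with the current value of the expression. */
    public State evaluate(Instant now, double value) {
        if (value <= threshold) {
            activeSince = null; // condition cleared: the hold timer resets
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now; // first evaluation where the condition holds
        }
        Duration active = Duration.between(activeSince, now);
        return active.compareTo(holdFor) >= 0 ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        ForClauseSketch rule = new ForClauseSketch(80.0, Duration.ofMinutes(2));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(rule.evaluate(t0, 85.0));
        System.out.println(rule.evaluate(t0.plusSeconds(120), 88.0));
    }
}
```

The hold period filters out short spikes at the cost of delaying notification by at least that long, on top of Alertmanager's `group_wait`.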
📈 Grafana Visualization
Dashboard Configuration
{
  "dashboard": {
    "title": "System Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ],
        "yAxes": [
          {
            "max": 100,
            "min": 0,
            "unit": "percent"
          }
        ]
      }
    ]
  }
}
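The top-level `dashboard` wrapper in the JSON above matches the payload shape accepted by Grafana's dashboard HTTP API (`POST /api/dashboards/db`). As a hedged sketch of how such a file could be uploaded from code, the example below builds the request with the JDK's `java.net.http` client. The base URL `http://localhost:3000`, the token value, and the `DashboardUpload` class are assumptions for illustration; the source does not say how the dashboard is provisioned.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that prepares a dashboard upload request for Grafana. */
public class DashboardUpload {
    /**
     * @param grafanaUrl    base URL of the Grafana server (assumed, e.g. http://localhost:3000)
     * @param apiToken      a Grafana API or service account token (assumed to exist)
     * @param dashboardJson JSON document with a top-level "dashboard" object
     */
    public static HttpRequest buildRequest(String grafanaUrl, String apiToken, String dashboardJson) {
        return HttpRequest.newBuilder()
            .uri(URI.create(grafanaUrl + "/api/dashboards/db"))
            .header("Content-Type", "application/json")
            .header("Authorization", "Bearer " + apiToken)
            .POST(HttpRequest.BodyPublishers.ofString(dashboardJson))
            .build();
    }

    public static void main(String[] args) {
        String json = "{\"dashboard\": {\"title\": \"System Monitoring\", \"panels\": []}}";
        HttpRequest request = buildRequest("http://localhost:3000", "example-token", json);
        // Only prints the target; sending is left to the caller.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send it, pass the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and check the response status before assuming the dashboard was saved.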
🔍 Distributed Tracing
Jaeger Configuration
# jaeger-docker-compose.yml
version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
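    ports:
      - "16686:16686"   # Jaeger UI
      - "14268:14268"   # collector HTTP endpoint for Jaeger Thrift spans
      - "4317:4317"     # OTLP over gRPC (used when COLLECTOR_OTLP_ENABLED=true)
      - "4318:4318"     # OTLP over HTTP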
    environment:
      - COLLECTOR_OTLP_ENABLED=true
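Application Integration
// Spring Boot integration (sketch assuming the Brave Tracer API; UserService and User are application types)
import brave.Span;
import brave.Tracer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;
@RestController
public class UserController {
    @Autowired
    private Tracer tracer;
    @Autowired
    private UserService userService;
    @GetMapping("/users/{id}")
    public User getUser(@PathVariable String id) {
        // Start a span under the current trace context and tag it with the user ID
        Span span = tracer.nextSpan()
            .name("get-user")
            .tag("user.id", id)
            .start();
        // Put the span in scope so that downstream calls attach to it
        try (Tracer.SpanInScope ws = tracer.withSpanInScope(span)) {
            return userService.findById(id);
        } catch (RuntimeException e) {
            span.error(e); // record the failure on the span
            throw e;
        } finally {
            span.finish(); // Brave spans are completed with finish(), not end()
        }
    }
}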
📋 Log Monitoring
ELK Stack Configuration
# docker-compose.yml
version: '3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"
  logstash:
    image: docker.elastic.co/logstash/logstash:7.15.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - "5044:5044"
  kibana:
    image: docker.elastic.co/kibana/kibana:7.15.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
Next steps: dig deeper into APM (application performance monitoring), custom metric development, and monitoring automation.