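Monitoring Technology Stack

System monitoring is a key means of keeping services running reliably. It spans several dimensions, including metrics monitoring, log monitoring, and distributed tracing.

🎯 Learning Objectives

Understand the design principles and implementation methods of a modern monitoring system, and learn to build comprehensive monitoring for your services.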

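Why Monitoring Matters

Fault Detection

Real-time alerting - detect system anomalies quickly
Performance monitoring - identify performance bottlenecks
Capacity planning - forecast resource requirements

Operational Efficiency

Automated operations - reduce manual intervention
Root cause analysis - locate problems quickly
Trend analysis - optimize system performance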

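📊 Prometheus Monitoring System

Metrics Collection

# prometheus.yml

global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often alerting rules are evaluated

rule_files:
  - "alert_rules.yml"

# Assumption: Alertmanager runs locally on its default port 9093.
# Without an alerting block, firing alerts are never sent to Alertmanager.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/actuator/prometheus'  # Spring Boot Actuator endpoint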
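
Each scrape target in the configuration answers an HTTP GET on its metrics path with plain text in the Prometheus exposition format. As a rough illustration of what Prometheus parses on every scrape, the following Java sketch renders a single counter in that format. The class `ExpositionFormat`, its method names, and the sample metric are hypothetical; a Spring Boot application would normally let Micrometer produce this output rather than build it by hand.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only: renders one counter in the
// Prometheus text exposition format.
final class ExpositionFormat {

    /** Renders one sample line: name{label="value",...} value */
    static String sample(String name, Map<String, String> labels, long value) {
        StringJoiner joined = new StringJoiner(",", "{", "}");
        joined.setEmptyValue(""); // a metric without labels has no braces
        for (Map.Entry<String, String> e : labels.entrySet()) {
            joined.add(e.getKey() + "=\"" + escape(e.getValue()) + "\"");
        }
        return name + joined + " " + value;
    }

    /** Escapes backslash, double quote, and newline inside a label value. */
    static String escape(String v) {
        return v.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /**
     * Renders the HELP and TYPE comment lines followed by the sample.
     * The help text is assumed to contain no backslashes or newlines.
     */
    static String counter(String name, String help, Map<String, String> labels, long value) {
        return "# HELP " + name + " " + help + "\n"
             + "# TYPE " + name + " counter\n"
             + sample(name, labels, value) + "\n";
    }
}
```

Calling `ExpositionFormat.counter(...)` with the labels `method=GET` and `status=200` yields two comment lines and one sample line, which is the shape Prometheus expects to find at `/actuator/prometheus` or `/metrics`.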

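PromQL Queries

# CPU usage (%)

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# HTTP request success rate (%); both sides are summed so the ratio covers all status codes

sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# 95th percentile response time (seconds), aggregated across series per bucket

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))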
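
The 95th percentile query is an estimate, not an exact value: Prometheus only knows how many observations fell at or below each bucket boundary and interpolates linearly inside the bucket that contains the requested rank. The Java sketch below mirrors that idea under simplifying assumptions (non-negative observations, at least one finite bucket plus `+Inf`, a quantile greater than 0). The class name `HistogramQuantile` and the sample bucket values are hypothetical.

```java
// A minimal sketch of bucket-based quantile estimation, for illustration only.
final class HistogramQuantile {

    /**
     * Estimates the q-quantile from cumulative histogram buckets.
     *
     * @param q                quantile, 0 < q <= 1
     * @param upperBounds      ascending bucket upper bounds ("le"); the last is +Infinity
     * @param cumulativeCounts number of observations <= the matching upper bound
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int last = upperBounds.length - 1;
        if (last < 1 || cumulativeCounts[last] == 0) {
            return Double.NaN; // too few buckets or no observations
        }
        double rank = q * cumulativeCounts[last];

        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < last && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == last) {
            // The rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[last - 1];
        }

        // The first bucket is assumed to start at 0.
        double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
        double below = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - below;

        // Linear interpolation inside the selected bucket.
        return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
    }
}
```

With bounds `0.1, 0.5, 1.0, +Inf` and cumulative counts `50, 90, 100, 100`, the rank for `q = 0.95` is 95, which lies halfway through the `(0.5, 1.0]` bucket, so the estimate is 0.75 seconds. The accuracy of the result therefore depends on how well the bucket boundaries match the real latency distribution.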

🔔 Alerting Configuration

Alertmanager Configuration

# alertmanager.yml

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
- name: 'web.hook'
  email_configs:
  - to: 'admin@example.com'
    # The subject is set through headers; text holds the plain-text body.
    headers:
      Subject: '[Alert] {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Details: {{ .Annotations.description }}
      Started at: {{ .StartsAt }}
      {{ end }}

Alerting Rules

# alert_rules.yml

groups:
- name: system.rules
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage"
      description: "Instance {{ $labels.instance }} CPU usage is above 80%, current value: {{ $value }}%"

  - alert: HighMemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage"
      description: "Instance {{ $labels.instance }} memory usage is above 85%, current value: {{ $value }}%"
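
The `for: 2m` clause in both rules means an alert does not fire the moment its expression becomes true. It first enters a pending state and only fires once the expression has stayed true across evaluations for the full duration; a single false evaluation resets it. The following Java sketch models that state machine for one alert instance. The class `PendingAlert` and its method names are hypothetical and simplify what Prometheus does internally.

```java
import java.time.Duration;
import java.time.Instant;

// A minimal sketch of the pending/firing behaviour of a rule's "for" clause.
final class PendingAlert {

    enum State { INACTIVE, PENDING, FIRING }

    private final double threshold;
    private final Duration holdFor;
    private Instant activeSince; // null while the condition is false

    PendingAlert(double threshold, Duration holdFor) {
        this.threshold = threshold;
        this.holdFor = holdFor;
    }

    /** Evaluates the rule once, as happens every evaluation_interval. */
    State evaluate(Instant now, double value) {
        if (value <= threshold) {
            activeSince = null; // condition no longer holds: reset the timer
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now; // first evaluation at which the condition holds
        }
        Duration active = Duration.between(activeSince, now);
        return active.compareTo(holdFor) >= 0 ? State.FIRING : State.PENDING;
    }
}
```

For a threshold of 80 and a hold duration of two minutes, a CPU value of 90 is reported as `PENDING` at the first evaluation and as `FIRING` two minutes later, provided every evaluation in between was also above the threshold. This is why short spikes do not page anyone, at the cost of a delay before sustained problems are reported.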

📈 Grafana Visualization

Dashboard Configuration

{
  "dashboard": {
    "title": "System Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ],
        "yAxes": [
          {
            "max": 100,
            "min": 0,
            "unit": "percent"
          }
        ]
      }
    ]
  }
}

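🔍 Distributed Tracing

Jaeger Configuration

# jaeger-docker-compose.yml

version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "14268:14268"  # collector HTTP endpoint (jaeger.thrift)
      - "4317:4317"    # OTLP over gRPC, enabled by COLLECTOR_OTLP_ENABLED
      - "4318:4318"    # OTLP over HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true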

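Application Integration

// Spring Boot integration. Assumes the Micrometer Tracing API
// (io.micrometer.tracing); UserService and User are application classes.
import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class UserController {

    @Autowired
    private Tracer tracer;

    @Autowired
    private UserService userService;

    @GetMapping("/users/{id}")
    public User getUser(@PathVariable String id) {
        // Start a new span as a child of the current trace context.
        Span span = tracer.nextSpan()
            .name("get-user")
            .tag("user.id", id)
            .start();

        // Put the span in scope so work done by the service joins this trace.
        try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
            return userService.findById(id);
        } finally {
            span.end(); // always end the span so it gets reported
        }
    }
}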

📋 Log Monitoring

ELK Stack Configuration

# docker-compose.yml

version: '3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:7.15.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - "5044:5044"

  kibana:
    image: docker.elastic.co/kibana/kibana:7.15.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200

Next steps: dive deeper into APM (application performance monitoring), custom metric development, and monitoring automation.
