cloudflare-log-collector

Architecture

Data flow from Cloudflare's GraphQL API through the collector into the observability stack

flowchart TD
    CF["Cloudflare GraphQL API"]
    POLL["Poll Scheduler"]
    FW["Firewall Collector"]
    HTTP["HTTP Collector"]
    METRICS["Metrics Exporter"]
    LOKIC["Loki Client"]
    TRACE["Trace Context"]
    SLOG["Structured Logger"]
    LOKI["Loki"]
    PROM["Prometheus"]
    TEMPO["Tempo"]
    GRAFANA["Grafana"]

    CF -->|"GraphQL queries"| POLL
    POLL --> FW
    POLL --> HTTP
    FW -->|"JSON log lines"| LOKIC
    HTTP -->|"JSON log lines"| LOKIC
    HTTP -->|"gauge updates"| METRICS
    FW -->|"event counters"| METRICS
    LOKIC -->|"POST /loki/api/v1/push"| LOKI
    METRICS -->|"/metrics"| PROM
    TRACE -->|"OTLP gRPC"| TEMPO
    SLOG -->|"trace_id injection"| TRACE

    LOKI --> GRAFANA
    PROM --> GRAFANA
    TEMPO --> GRAFANA

    classDef source fill:#0c2d48,stroke:#38bdf8,color:#e0f2fe
    classDef collector fill:#1e293b,stroke:#334155,color:#e2e8f0
    classDef sink fill:#132a1f,stroke:#22c55e,color:#dcfce7
    classDef viz fill:#2d2513,stroke:#f97316,color:#fef3c7

    class CF source
    class POLL,FW,HTTP,METRICS,LOKIC,TRACE,SLOG collector
    class LOKI,PROM,TEMPO sink
    class GRAFANA viz

Data Flow

Poll Cycle

  1. The poll scheduler triggers on a configurable interval (default 5 minutes)
  2. Two collectors run in parallel within each cycle:
    • Firewall collector queries firewallEventsAdaptive for individual WAF events
    • HTTP collector queries httpRequestsAdaptiveGroups for aggregated traffic stats
  3. Each collector is wrapped in an OpenTelemetry span for end-to-end trace visibility

Firewall Events

  • Each event becomes a JSON log line pushed to Loki under {job="cloudflare", type="firewall"}
  • Event counts are tracked as Prometheus counters broken down by action type (block, challenge, allow)
  • Fields captured: action, client IP, host, method, path, query, ray name, rule ID, source, user agent, country
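
As a rough sketch of how one event could become a Loki entry: the body of POST /loki/api/v1/push is a list of streams, each with a label set and [timestamp, line] pairs, where the timestamp is Unix nanoseconds encoded as a string. The firewallEvent struct, its JSON keys, and newFirewallPush below are assumptions for illustration and cover only a subset of the captured fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"time"
)

// firewallEvent holds a subset of the captured fields. The JSON keys are
// assumptions, not the collector's actual schema.
type firewallEvent struct {
	Action       string `json:"action"`
	ClientIP     string `json:"client_ip"`
	Host         string `json:"host"`
	Method       string `json:"method"`
	Path         string `json:"path"`
	RayName      string `json:"ray_name"`
	RuleID       string `json:"rule_id"`
	Country      string `json:"country"`
	TimeUnixNano int64  `json:"-"` // event time, used as the Loki entry timestamp
}

// lokiStream and lokiPush mirror the JSON body of POST /loki/api/v1/push.
type lokiStream struct {
	Stream map[string]string `json:"stream"`
	Values [][2]string       `json:"values"` // [unix nanoseconds, log line]
}

type lokiPush struct {
	Streams []lokiStream `json:"streams"`
}

// newFirewallPush encodes each event as one JSON log line in a single
// stream labeled {job="cloudflare", type="firewall"}.
func newFirewallPush(events []firewallEvent) (lokiPush, error) {
	values := make([][2]string, 0, len(events))
	for _, ev := range events {
		line, err := json.Marshal(ev)
		if err != nil {
			return lokiPush{}, err
		}
		ts := strconv.FormatInt(ev.TimeUnixNano, 10)
		values = append(values, [2]string{ts, string(line)})
	}
	return lokiPush{Streams: []lokiStream{{
		Stream: map[string]string{"job": "cloudflare", "type": "firewall"},
		Values: values,
	}}}, nil
}

func main() {
	push, err := newFirewallPush([]firewallEvent{{
		Action:       "block",
		ClientIP:     "203.0.113.7",
		Host:         "example.com",
		Method:       "GET",
		Path:         "/wp-login.php",
		Country:      "US",
		TimeUnixNano: time.Now().UnixNano(),
	}})
	if err != nil {
		panic(err)
	}
	body, _ := json.MarshalIndent(push, "", "  ")
	fmt.Println(string(body))
}
```

Keeping high-cardinality fields such as client IP and path inside the log line, rather than in the label set, keeps the number of Loki streams small.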

HTTP Traffic

  • Aggregated groups are pushed to Loki under {job="cloudflare", type="http_traffic"} as JSON
  • Request counts are exposed as Prometheus gauges labeled by method, status code, and country
  • Edge response bytes are tracked as a separate gauge
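
The label model above can be illustrated with a small stand-in. A real collector would most likely use a labeled gauge from the Prometheus Go client; the sketch below deliberately avoids that dependency and hand-rolls a map keyed by (method, status, country), rendered in the Prometheus text exposition format. The type names and the metric name cloudflare_http_requests are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// requestKey is the label tuple: method, status code, and country.
type requestKey struct {
	Method  string
	Status  int
	Country string
}

// requestGauge is a stdlib-only stand-in for a labeled gauge.
type requestGauge struct {
	mu     sync.Mutex
	values map[requestKey]float64
}

func newRequestGauge() *requestGauge {
	return &requestGauge{values: make(map[requestKey]float64)}
}

// Set overwrites the value for one label combination. A gauge reports the
// latest observation rather than a running total.
func (g *requestGauge) Set(k requestKey, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[k] = v
}

// Render returns the series in text exposition format, sorted for stable
// output. The %q quoting is adequate for simple ASCII label values only.
func (g *requestGauge) Render(name string) string {
	g.mu.Lock()
	defer g.mu.Unlock()
	lines := make([]string, 0, len(g.values))
	for k, v := range g.values {
		lines = append(lines, fmt.Sprintf("%s{method=%q,status=\"%d\",country=%q} %g",
			name, k.Method, k.Status, k.Country, v))
	}
	sort.Strings(lines)
	return strings.Join(lines, "\n") + "\n"
}

func main() {
	g := newRequestGauge()
	g.Set(requestKey{"GET", 200, "US"}, 1280)
	g.Set(requestKey{"POST", 403, "DE"}, 12)
	fmt.Print(g.Render("cloudflare_http_requests"))
}
```

Because each poll overwrites the previous value for a label combination, the exposed number reflects the most recent aggregation window, not a cumulative count.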

Observability

  • Prometheus: 9 metric families covering poll health, firewall events, HTTP traffic, Loki push status, and build info
  • Loki: Two structured log streams with distinct label sets for filtering
  • Tempo: Full trace per poll cycle with child spans for each API call and Loki push
  • Log-trace correlation: A custom slog handler injects trace_id and span_id into every log line, enabling one-click navigation between logs and traces in Grafana
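
The log-trace correlation idea can be sketched as a wrapping slog.Handler (Go 1.21 or later). To stay dependency-free, this example reads the identifiers from a hypothetical context key (spanKey, spanIDs, withSpan); a real handler would presumably read them from the OpenTelemetry span context carried in ctx. The names traceHandler and logLine are likewise illustrative.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// spanIDs is a stdlib-only stand-in for an OpenTelemetry span context.
type spanIDs struct{ TraceID, SpanID string }

type spanKey struct{}

func withSpan(ctx context.Context, ids spanIDs) context.Context {
	return context.WithValue(ctx, spanKey{}, ids)
}

// traceHandler wraps another slog.Handler and adds trace_id and span_id
// to each record whose context carries span identifiers.
type traceHandler struct{ slog.Handler }

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if ids, ok := ctx.Value(spanKey{}).(spanIDs); ok {
		r = r.Clone() // avoid mutating state shared with other copies
		r.AddAttrs(
			slog.String("trace_id", ids.TraceID),
			slog.String("span_id", ids.SpanID),
		)
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the derived handler so the injection
// survives logger.With and logger.WithGroup.
func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{h.Handler.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{h.Handler.WithGroup(name)}
}

// logLine logs one message through the wrapped JSON handler and returns
// the decoded record, which makes the injected fields easy to inspect.
func logLine(ids spanIDs, msg string) (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(traceHandler{slog.NewJSONHandler(&buf, nil)})
	logger.InfoContext(withSpan(context.Background(), ids), msg)
	var rec map[string]any
	err := json.Unmarshal(buf.Bytes(), &rec)
	return rec, err
}

func main() {
	rec, err := logLine(spanIDs{"4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"}, "poll cycle complete")
	if err != nil {
		panic(err)
	}
	fmt.Println(rec["msg"], rec["trace_id"], rec["span_id"])
}
```

The injection only happens when the logging call passes a context, so call sites need to use the context-aware methods such as InfoContext for the correlation to work.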

Resilience

  • Both the Cloudflare and Loki clients retry transient failures (HTTP 429, 502, 503, 504) with exponential backoff, for up to 3 attempts in total
  • Retry-After headers are honored when present
  • On startup, the collector backfills up to the configured window (default 1 hour) to capture events that occurred while it was down
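
The retry policy above might look roughly like the following. This is a sketch under stated assumptions: the 500 ms base delay is invented for illustration (the document specifies only the attempt limit and status codes), and isTransient, retryDelay, and doWithRetry are hypothetical names. The send callback stands in for one HTTP round trip and returns the status code plus the raw Retry-After header value.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

const (
	maxAttempts = 3                      // from the policy above
	baseDelay   = 500 * time.Millisecond // assumed; not specified in the source
)

// isTransient reports whether a status code is worth retrying.
func isTransient(status int) bool {
	switch status {
	case http.StatusTooManyRequests, http.StatusBadGateway,
		http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return true
	}
	return false
}

// retryDelay honors Retry-After when present (delta-seconds or HTTP-date)
// and otherwise doubles baseDelay for each attempt already made.
func retryDelay(attempt int, retryAfter string) time.Duration {
	if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	if t, err := http.ParseTime(retryAfter); err == nil {
		if d := time.Until(t); d > 0 {
			return d
		}
	}
	return baseDelay << uint(attempt-1) // attempt 1: 500ms, attempt 2: 1s
}

// doWithRetry calls send up to maxAttempts times, sleeping between
// attempts while the response status is transient.
func doWithRetry(send func() (status int, retryAfter string, err error)) (int, error) {
	var status int
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		s, ra, err := send()
		if err != nil {
			return 0, err
		}
		status = s
		if !isTransient(status) {
			return status, nil
		}
		if attempt < maxAttempts {
			time.Sleep(retryDelay(attempt, ra))
		}
	}
	return status, fmt.Errorf("giving up after %d attempts, last status %d", maxAttempts, status)
}

func main() {
	calls := 0
	status, err := doWithRetry(func() (int, string, error) {
		calls++
		if calls < 3 {
			// Retry-After: 0 keeps the demo fast.
			return http.StatusServiceUnavailable, "0", nil
		}
		return http.StatusOK, "", nil
	})
	fmt.Println(status, err, calls)
}
```

Non-transient responses such as 400 or 404 return immediately, so a malformed query is not retried.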