The Sprint NOC API — Production Deploy
Before a single line compiles, project layout communicates intent. A well-structured Rust service separates concerns at the module boundary — database logic never bleeds into route handlers, and authentication never tangles with business rules. Here is the complete file tree for the Sprint NOC API:
sprint-noc-api/
├── Cargo.toml                package manifest — all deps declared here
├── Cargo.lock                committed to git — reproducible builds
├── .env.example              template — never commit .env itself
├── Dockerfile                multi-stage: builder → runtime
├── docker-compose.yml        postgres + api + migrations
├── sprint-noc-api.service    systemd unit for bare-metal deploy
│
├── migrations/               sqlx migrate run applies these in order
│   ├── 001_create_alerts.sql
│   └── 002_create_users.sql
│
└── src/
    ├── main.rs               startup: config, db pool, router, listener
    ├── config.rs             typed env-var extraction via envy
    ├── error.rs              ApiError → HTTP response mapping
    ├── state.rs              AppState struct shared across all handlers
    │
    ├── models/
    │   ├── mod.rs
    │   ├── alert.rs          Alert, CreateAlert, UpdateAlert structs
    │   └── user.rs           User, Claims, LoginRequest structs
    │
    ├── db/
    │   ├── mod.rs
    │   ├── alerts.rs         list_alerts, create_alert, resolve_alert
    │   └── users.rs          find_by_email, create_user
    │
    ├── auth/
    │   ├── mod.rs
    │   ├── jwt.rs            encode_token, decode_token
    │   └── extractors.rs     AuthUser, AdminUser FromRequestParts
    │
    ├── routes/
    │   ├── mod.rs            build_router() — composes all routes
    │   ├── alerts.rs         GET/POST /api/v1/alerts, PATCH /:id/resolve
    │   ├── auth.rs           POST /api/v1/auth/login
    │   ├── health.rs         GET /health — liveness + readiness
    │   ├── internal.rs       POST /internal/alerts — Zabbix webhook
    │   └── ws.rs             GET /ws/alerts — WebSocket upgrade
    │
    └── zabbix/
        ├── mod.rs
        └── webhook.rs        ZabbixPayload → AlertEvent translation
Hard-coded connection strings and secrets are among the most common causes of production incidents. The service reads all configuration from environment variables at startup, failing fast with a clear error if anything is missing. The envy crate deserialises environment variables directly into a typed struct — no std::env::var calls scattered through the codebase.
# Database — PostgreSQL on stz-srv-01
DATABASE_URL=postgres://noc_user:CHANGE_ME@localhost:5432/sprint_noc

# JWT — generate with: openssl rand -hex 64
JWT_SECRET=replace_with_64_hex_chars

# Zabbix internal webhook secret (Zabbix → /internal/alerts)
INTERNAL_TOKEN=replace_with_strong_secret

# Server
BIND_ADDR=0.0.0.0:8080
RUST_LOG=sprint_noc_api=info,tower_http=info
use serde::Deserialize;

#[derive(Debug, Deserialize, Clone)]
pub struct Config {
    pub database_url: String,
    pub jwt_secret: String,
    pub internal_token: String,
    #[serde(default = "default_bind")]
    pub bind_addr: String,
}

fn default_bind() -> String {
    "0.0.0.0:8080".to_string()
}

impl Config {
    pub fn from_env() -> Result<Self, envy::Error> {
        envy::from_env::<Config>()
    }
}
The main.rs file is the service's checklist. It runs in order: parse config, initialise tracing, connect to the database, run pending migrations, build shared state, construct the router, bind the listener, and serve. If any step fails the process exits immediately with a clear error message. On an oil field, you do not guess — you verify at each checkpoint before proceeding.
use std::sync::Arc;

use sqlx::postgres::PgPoolOptions;
use tokio::net::TcpListener;
use tokio::sync::broadcast;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

mod auth;
mod config;
mod db;
mod error;
mod models;
mod routes;
mod state;
mod zabbix;

use state::AppState;
use crate::routes::ws::AlertEvent;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // ── 1. Config ──────────────────────────────────────────────
    dotenvy::dotenv().ok(); // load .env if present (dev only)
    let cfg = config::Config::from_env()
        .expect("Missing required environment variables");

    // ── 2. Tracing ─────────────────────────────────────────────
    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .with(tracing_subscriber::EnvFilter::from_default_env())
        .init();
    tracing::info!("Sprint NOC API starting — Kilimanjaro node");

    // ── 3. Database ────────────────────────────────────────────
    let pool = PgPoolOptions::new()
        .max_connections(20)
        .connect(&cfg.database_url)
        .await
        .expect("Failed to connect to PostgreSQL");

    // ── 4. Migrations ──────────────────────────────────────────
    sqlx::migrate!("./migrations")
        .run(&pool)
        .await
        .expect("Failed to run database migrations");
    tracing::info!("Database migrations applied");

    // ── 5. Shared state ────────────────────────────────────────
    let (alert_tx, _) = broadcast::channel::<AlertEvent>(256);
    let state = Arc::new(AppState {
        db: pool,
        alert_tx: alert_tx.clone(),
        config: cfg.clone(),
    });

    // ── 6. Router ──────────────────────────────────────────────
    let app = routes::build_router(state);

    // ── 7. Listen ──────────────────────────────────────────────
    let listener = TcpListener::bind(&cfg.bind_addr).await?;
    tracing::info!("Listening on {}", cfg.bind_addr);
    axum::serve(listener, app).await?;

    Ok(())
}
Any load balancer, Docker health check, or Kubernetes probe needs a fast endpoint that signals service health. We distinguish two states: liveness (the process is running and the event loop is responsive) and readiness (the database pool can serve a query). The health route checks both, returning structured JSON. Isaac's dashboard can poll this endpoint; if it returns anything other than 200, alert the on-call engineer.
use axum::{extract::State, http::StatusCode, response::Json};
use serde::Serialize;
use std::sync::Arc;

use crate::state::AppState;

#[derive(Serialize)]
pub struct HealthResponse {
    status: String,
    database: String,
    version: String,
}

pub async fn health_handler(
    State(state): State<Arc<AppState>>,
) -> (StatusCode, Json<HealthResponse>) {
    // Probe the DB pool with a trivial query
    let db_ok = sqlx::query("SELECT 1")
        .execute(&state.db)
        .await
        .is_ok();

    let (status, db_str) = if db_ok {
        (StatusCode::OK, "ok".to_string())
    } else {
        (StatusCode::SERVICE_UNAVAILABLE, "unavailable".to_string())
    };

    (status, Json(HealthResponse {
        status: if db_ok { "ok" } else { "degraded" }.to_string(),
        database: db_str,
        version: env!("CARGO_PKG_VERSION").to_string(),
    }))
}
Zabbix triggers a media type action: an HTTP webhook fires at POST /internal/alerts whenever a host goes down or recovers. The endpoint is protected by a static bearer token — a long secret configured in both Zabbix and the service's environment. This is not user authentication (that is JWT); it is service-to-service authentication using a shared secret, simpler and appropriate for a single trusted caller.
use serde::{Deserialize, Serialize};

/// Shape of the JSON Zabbix sends — configure in Media type → Message
#[derive(Debug, Deserialize)]
pub struct ZabbixPayload {
    pub trigger_name: String, // e.g. "Host unreachable"
    pub host_name: String,    // e.g. "stz-sw-kilimanjaro-01"
    pub severity: String,     // "High", "Disaster", etc.
    pub status: String,       // "PROBLEM" | "RESOLVED"
    pub event_id: String,
    pub site: String,         // custom macro {$SPRINT_SITE}
}

/// Map Zabbix severity → our internal severity
pub fn map_severity(zabbix: &str) -> &'static str {
    match zabbix {
        "Disaster" | "High" => "critical",
        "Average" => "warning",
        "Warning" | "Info" => "info",
        _ => "info",
    }
}
use axum::{
    extract::State,
    http::{HeaderMap, StatusCode},
    response::Json,
};
use std::sync::Arc;

use crate::{state::AppState, zabbix::webhook::{ZabbixPayload, map_severity}};
use crate::routes::ws::AlertEvent;
use crate::models::alert::CreateAlert;

pub async fn zabbix_webhook(
    headers: HeaderMap,
    State(state): State<Arc<AppState>>,
    Json(payload): Json<ZabbixPayload>,
) -> StatusCode {
    // ── Authenticate Zabbix ────────────────────────────────
    let token = headers
        .get("Authorization")
        .and_then(|v| v.to_str().ok())
        .and_then(|v| v.strip_prefix("Bearer "))
        .unwrap_or("");
    if token != state.config.internal_token {
        return StatusCode::UNAUTHORIZED;
    }

    // ── Determine event type ───────────────────────────────
    let event_type = if payload.status == "RESOLVED" {
        "alert.resolved"
    } else {
        "alert.created"
    };
    let severity = map_severity(&payload.severity).to_string();

    // ── Persist to database ────────────────────────────────
    let alert = crate::db::alerts::create_alert(
        &state.db,
        CreateAlert {
            site: payload.site.clone(),
            severity: severity.clone(),
            message: payload.trigger_name.clone(),
        },
    ).await;

    let alert_id = match alert {
        Ok(a) => a.id.to_string(),
        Err(_) => "unknown".to_string(),
    };

    // ── Broadcast to NOC screens ───────────────────────────
    let _ = state.alert_tx.send(AlertEvent {
        id: alert_id,
        site: payload.site,
        severity,
        message: payload.trigger_name,
        event: event_type.to_string(),
    });

    StatusCode::OK
}
Media type → Webhook setup
In Zabbix UI: Alerts → Media types → Create media type. Type: Webhook. URL: http://stz-srv-01:8080/internal/alerts. Method: POST. Headers: Authorization: Bearer {your_INTERNAL_TOKEN}. Message body: a JSON template with macros like {TRIGGER.NAME}, {HOST.NAME}, {EVENT.SEVERITY}, {$SPRINT_SITE} (a host-level macro you define per host group). Assign the media type to a notification user, reference that user in a trigger action, and the webhook fires on every problem and every recovery.
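As a starting point, the message body below is one possible template that lines up field for field with the ZabbixPayload struct. The macro names are standard Zabbix macros; {$SPRINT_SITE} is the host-level user macro described above, and depending on your Zabbix version these values may be passed as webhook parameters rather than a raw message body. The handler compares status against the literal string RESOLVED, so confirm what {EVENT.STATUS} expands to on your installation before going live.

{
  "trigger_name": "{TRIGGER.NAME}",
  "host_name": "{HOST.NAME}",
  "severity": "{EVENT.SEVERITY}",
  "status": "{EVENT.STATUS}",
  "event_id": "{EVENT.ID}",
  "site": "{$SPRINT_SITE}"
}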
use axum::{
    routing::{get, post, patch},
    Router,
};
use std::sync::Arc;
use tower_http::{
    cors::{CorsLayer, Any},
    compression::CompressionLayer,
    trace::TraceLayer,
};

use crate::state::AppState;

pub mod alerts;
pub mod auth;
pub mod health;
pub mod internal;
pub mod ws;

pub fn build_router(state: Arc<AppState>) -> Router {
    let api = Router::new()
        .route("/alerts", get(alerts::list_alerts).post(alerts::create_alert))
        .route("/alerts/:id/resolve", patch(alerts::resolve_alert))
        .route("/auth/login", post(auth::login));

    Router::new()
        // Public health check — no auth
        .route("/health", get(health::health_handler))
        // Internal webhook — token auth only
        .route("/internal/alerts", post(internal::zabbix_webhook))
        // WebSocket — authenticated via AuthUser extractor inside handler
        .route("/ws/alerts", get(ws::ws_handler))
        // REST API — JWT auth enforced per-handler via extractors
        .nest("/api/v1", api)
        // Middleware stack applied to everything
        .layer(TraceLayer::new_for_http())
        .layer(CompressionLayer::new())
        .layer(CorsLayer::new().allow_origin(Any))
        .with_state(state)
}
The multi-stage Dockerfile solves a problem specific to compiled languages: the build tools (rustup, 1.4GB of LLVM, the entire crates registry) are needed to compile but must never appear in the production image. Stage one — the builder — installs all tooling and compiles an optimised release binary. Stage two — the runtime — is a minimal Debian image that receives only the binary and the migrations directory. The final image is under 80 MB.
# ── Stage 1: Builder ───────────────────────────────────────────
FROM rust:1.78-slim-bookworm AS builder

# Build deps — pkg-config and OpenSSL headers for crates that link system TLS
RUN apt-get update && apt-get install -y \
    pkg-config libssl-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Cache dependencies before copying source
# Copy manifests first — Docker layer cache reuses this unless Cargo.toml changes
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo 'fn main(){}' > src/main.rs
RUN cargo build --release
RUN rm -f target/release/deps/sprint_noc_api*

# Now copy real source and build
COPY src/ src/
COPY migrations/ migrations/
RUN cargo build --release

# ── Stage 2: Runtime ───────────────────────────────────────────
FROM debian:bookworm-slim AS runtime

# Runtime deps only: ca-certificates for TLS to PostgreSQL,
# curl for the container health check in docker-compose.yml
RUN apt-get update && apt-get install -y \
    ca-certificates curl \
    && rm -rf /var/lib/apt/lists/*

# Non-root user — never run services as root
RUN useradd -ms /bin/bash sprint
USER sprint
WORKDIR /app

# Copy the compiled binary and migrations from builder
COPY --from=builder /app/target/release/sprint-noc-api ./sprint-noc-api
COPY --from=builder /app/migrations ./migrations

EXPOSE 8080
CMD ["./sprint-noc-api"]
version: '3.9'

services:
  db:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      POSTGRES_DB: sprint_noc
      POSTGRES_USER: noc_user
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pg_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U noc_user"]
      interval: 5s
      retries: 5

  api:
    build: .
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://noc_user:${DB_PASSWORD}@db:5432/sprint_noc
      JWT_SECRET: ${JWT_SECRET}
      INTERNAL_TOKEN: ${INTERNAL_TOKEN}
      RUST_LOG: sprint_noc_api=info,tower_http=info
    healthcheck:
      test: ["CMD-SHELL", "curl -sf http://localhost:8080/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 3

volumes:
  pg_data:
If you are deploying directly on stz-srv-01 without Docker — which is entirely reasonable for a single-server NOC API — systemd is the production process manager. It handles restarts on crash, enforces resource limits, isolates the process from the rest of the system, and logs to journald. The unit file is a specification: start this binary, with these environment variables, restart on failure, never run as root.
[Unit]
Description=Sprint NOC API — SprintTZ Kilimanjaro
After=network.target postgresql.service
Requires=postgresql.service

[Service]
Type=simple
User=sprint
Group=sprint
WorkingDirectory=/opt/sprint-noc-api
ExecStart=/opt/sprint-noc-api/sprint-noc-api

# Load secrets from a file not tracked in git
EnvironmentFile=/etc/sprint-noc-api/env

# Restart policy
Restart=on-failure
RestartSec=5s

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/sprint-noc-api/logs

# Resource limits
LimitNOFILE=65536

# Logging — view with: journalctl -u sprint-noc-api -f
StandardOutput=journal
StandardError=journal
SyslogIdentifier=sprint-noc-api

[Install]
WantedBy=multi-user.target
BARE-METAL DEPLOY SEQUENCE — stz-srv-01
────────────────────────────────────────────────────────────

# Build on your dev machine (target: Ubuntu 24.04 x86_64)
cargo build --release
scp target/release/sprint-noc-api sprint@stz-srv-01:/opt/sprint-noc-api/

# First deploy: set up env file and migrate
ssh sprint@stz-srv-01
sudo mkdir -p /etc/sprint-noc-api
sudo nano /etc/sprint-noc-api/env      # paste secrets, chmod 600

# Migrations are applied automatically at startup (main.rs step 4);
# to run them ahead of time, use sqlx-cli:
sqlx migrate run                       # idempotent — applies only pending migrations

# Install and start service
sudo cp sprint-noc-api.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now sprint-noc-api

● sprint-noc-api.service - Sprint NOC API — SprintTZ Kilimanjaro
     Loaded: loaded (/etc/systemd/system/sprint-noc-api.service)
     Active: active (running) since Fri 2026-05-01 09:31:00 EAT

# Rolling update
scp target/release/sprint-noc-api sprint@stz-srv-01:/opt/sprint-noc-api/sprint-noc-api.new
ssh sprint@stz-srv-01 "mv /opt/sprint-noc-api/sprint-noc-api{.new,} \
    && sudo systemctl restart sprint-noc-api"
# systemctl restart briefly stops the old process before the new one binds —
# schedule updates in a quiet window
[package]
name = "sprint-noc-api"
version = "0.1.0"
edition = "2021"

[dependencies]
# Web framework
axum = { version = "0.7", features = ["ws"] }
tower = "0.4"
tower-http = { version = "0.5", features = ["cors", "compression-gzip", "trace"] }

# Async runtime
tokio = { version = "1", features = ["full"] }

# Database
sqlx = { version = "0.7", features = ["postgres", "runtime-tokio-rustls", "uuid", "chrono", "macros"] }

# Auth
jsonwebtoken = "9"
bcrypt = "0.15"

# Serialisation
serde = { version = "1", features = ["derive"] }
serde_json = "1"

# Config
envy = "0.4"
dotenvy = "0.15"   # .env loading in dev

# Error handling
anyhow = "1"
thiserror = "1"

# IDs and timestamps
uuid = { version = "1", features = ["v4", "serde"] }
chrono = { version = "0.4", features = ["serde"] }

# Logging
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }

[profile.release]
opt-level = "z"     # size optimisation — smaller binary for scp
lto = "thin"
codegen-units = 1
strip = "symbols"
Before declaring the system live, walk through the complete alert path manually. This is the equivalent of a pilot's post-start checklist — you confirm every system is functional before you leave the ground.
END-TO-END VERIFICATION SEQUENCE
────────────────────────────────────────────────────────────

## 1. Health
curl http://stz-srv-01:8080/health
{"status":"ok","database":"ok","version":"0.1.0"}

## 2. Login — get a JWT
curl -X POST http://stz-srv-01:8080/api/v1/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"email":"[email protected]","password":"..."}'
{"token":"eyJ0eXAiOiJKV1Q..."}

## 3. Open a WebSocket in another terminal
wscat -c ws://stz-srv-01:8080/ws/alerts
Connected (press CTRL+C to quit)

## 4. Simulate a Zabbix alert
curl -X POST http://stz-srv-01:8080/internal/alerts \
  -H 'Authorization: Bearer your_INTERNAL_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{
    "trigger_name": "Host unreachable: stz-sw-serengeti-01",
    "host_name": "stz-sw-serengeti-01",
    "severity": "High",
    "status": "PROBLEM",
    "event_id": "99001",
    "site": "Serengeti"
  }'

## 5. Watch the WebSocket terminal — event arrives within milliseconds
< {"id":"a3f9...","site":"Serengeti","severity":"critical",
   "message":"Host unreachable: stz-sw-serengeti-01",
   "event":"alert.created"}

## 6. Confirm persistence
curl http://stz-srv-01:8080/api/v1/alerts \
  -H 'Authorization: Bearer eyJ0eXAi...'
[{"id":"a3f9...","site":"Serengeti","severity":"critical",...}]

All six steps passing = system is live. ✓
Rate Limiting the Webhook
A misconfigured Zabbix action can fire thousands of alerts per minute, hammering the database. Add rate limiting to the /internal/alerts route using Tower's ServiceBuilder with a rate_limit layer, capping at 60 requests per minute. When the limit is exceeded, return 429 Too Many Requests with a Retry-After header set to the number of seconds until the window resets. Log every rate-limit rejection with the source IP address.
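Note that Tower's built-in rate_limit layer applies backpressure rather than emitting an HTTP status, so meeting the 429-with-Retry-After requirement usually means writing a small middleware of your own. The sketch below is one minimal fixed-window approach, not a reference solution: the WebhookLimiter type, the 60-per-minute constants, and the wiring are all illustrative, and the ConnectInfo extractor only works when the app is served with into_make_service_with_connect_info.

use std::net::SocketAddr;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

use axum::{
    extract::{ConnectInfo, Request, State},
    http::{header, StatusCode},
    middleware::Next,
    response::{IntoResponse, Response},
};

/// Fixed-window counter shared with the router (illustrative type).
pub struct WebhookLimiter {
    window: Mutex<(Instant, u32)>,
}

const WINDOW: Duration = Duration::from_secs(60);
const MAX_PER_WINDOW: u32 = 60;

pub async fn limit_webhook(
    State(limiter): State<Arc<WebhookLimiter>>,
    ConnectInfo(addr): ConnectInfo<SocketAddr>,
    req: Request,
    next: Next,
) -> Response {
    // Decide inside the lock whether this request fits in the current window
    let retry_after = {
        let mut guard = limiter.window.lock().unwrap();
        let (start, count) = &mut *guard;
        if start.elapsed() >= WINDOW {
            // New window — reset the counter
            *start = Instant::now();
            *count = 0;
        }
        if *count < MAX_PER_WINDOW {
            *count += 1;
            None
        } else {
            // Seconds until the window resets, at least 1
            Some(WINDOW.saturating_sub(start.elapsed()).as_secs().max(1))
        }
    };

    match retry_after {
        None => next.run(req).await,
        Some(secs) => {
            tracing::warn!(%addr, "rate limit exceeded on /internal/alerts");
            (
                StatusCode::TOO_MANY_REQUESTS,
                [(header::RETRY_AFTER, secs.to_string())],
            )
                .into_response()
        }
    }
}

Attach it to the webhook route only — for example .route_layer(middleware::from_fn_with_state(limiter, limit_webhook)) on the /internal/alerts route — so the rest of the API is unaffected.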
Alert Acknowledgement
NOC engineers need to acknowledge alerts — marking "I am aware and investigating" without resolving them. Add a new PATCH /api/v1/alerts/:id/acknowledge endpoint. Extend the alerts table with an acknowledged_by column (nullable UUID, foreign key to users), an acknowledged_at timestamp, and an acknowledged boolean. The endpoint should require authentication, record the acknowledging user, and broadcast an alert.acknowledged event to WebSocket clients so the NOC screens can update the alert's visual state immediately.
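One possible shape for the migration, assuming the alerts and users tables use UUID primary keys as the models in src/models/ suggest (the file name and the DEFAULT are illustrative):

-- migrations/003_add_acknowledgement.sql
ALTER TABLE alerts
    ADD COLUMN acknowledged    BOOLEAN     NOT NULL DEFAULT FALSE,
    ADD COLUMN acknowledged_by UUID        REFERENCES users(id),
    ADD COLUMN acknowledged_at TIMESTAMPTZ;

The acknowledged flag is technically redundant with acknowledged_at IS NOT NULL, but it keeps the WebSocket payload and the dashboard logic simple.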
The Four-Site Dashboard
Build the NOC wall screen that Isaac sees when he walks into the Kilimanjaro operations room. It is a single HTML file served statically by the API (add a ServeDir layer). The screen connects to /ws/alerts on load and maintains a live table showing alerts grouped by site: Kilimanjaro, Serengeti, Drakensberg, Rwenzori. Each site card shows the count of active alerts and the highest severity. When a new alert.created event arrives, the relevant site card flashes and the alert appears. When alert.resolved arrives, the row disappears with a brief fade. On disconnect, show a reconnecting indicator and retry with exponential backoff. No frameworks — plain HTML, CSS, and the WebSocket API are sufficient and produce a faster, more maintainable result.
From the Oil Field to the Operations Room
You started this book without knowing what a borrow checker was. You now own the mental model of memory safety from first principles, you have wired real hardware, you have spoken I2C to a servo controller and decoded the quadrature output of an encoder shaft, and you have shipped a WebSocket API that can fan a single database write out to forty screens in the time it takes light to cross a fibre strand.
The thread running through every chapter is the same discipline you learned in the field: verify before you proceed, make failure visible, let the type system hold the invariants you cannot afford to check at runtime. Rust's ownership model is not a language quirk — it is that discipline encoded in syntax.
The SprintTZ NOC team in Dar es Salaam now has infrastructure worth understanding. The system you have built runs on real fibre crossing real borders, surfaces real alerts from real Zabbix hosts, and pushes them to real screens in real operations rooms. That is not a tutorial project. That is the craft.
Keep building. Keep reading datasheets. Keep listening to the hardware.