
OMS Migration Plan: AWS EC2 → Azure AKS

Draft — not a finalized plan

This page captures the considerations, risks, and a rough phasing for moving OMS off AWS EC2 and onto Azure Kubernetes Service. Owners, dates, and per-phase commitments are TBD pending review. See Open questions at the bottom.

Motivation

Consolidate to a single cloud. The rest of the integrations stack — Camel / cm-* services, BatchStation, Service Bus, ADX — already lives in Azure. Keeping OMS on AWS means dual-cloud networking, dual billing, and dual ops surface for one of the busiest services in the pipeline.

The win is operational, not feature-driven: unified secrets, unified observability, fewer egress charges, and a deploy story that looks like the rest of the integrations org instead of a Capistrano one-off.

Current state

  • App: Rails 8.1.3 on Ruby 3.4.8
  • Hosts: 4× EC2 (popsockets-large-4..7)
  • Deploy: Capistrano + Passenger
  • DB: MySQL — location TBD (confirm before sequencing)
  • Cache / queue backing: Redis (Sidekiq)
  • Logs: semantic_logger to disk, DD agent on host
  • Secrets / certs: Rails encrypted credentials; NAV cert at /home/ubuntu/nav_cert/star.popsockets.com.ca-bundle.pem
  • External integrations: SFCC OCAPI/SCAPI · NAV SOAP · Azure Service Bus · BatchStation/PrintStation · Cirro (via Camel)

See OMS for the full breakdown.

Target state

  • Compute: Containerized Rails web + Sidekiq workers on AKS
  • Image registry: Azure Container Registry (ACR)
  • CI: GitHub Actions builds + tags images, deploys via kubectl / Helm
  • Secrets: Azure Key Vault, mounted via Secrets Store CSI driver
  • Logs: stdout/stderr, scraped by DD daemonset
  • Scaling: HPA on web (CPU / RPS) + a separate scaler on Sidekiq by queue depth, e.g. KEDA (sketched below)
  • DB / Redis: Azure-resident — Azure Database for MySQL Flexible Server (or Azure SQL) + Azure Cache for Redis
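
A minimal sketch of the queue-depth scaler flagged above, using KEDA's Redis scaler. Deployment name, Redis address, and thresholds are assumptions; Sidekiq keeps each queue as a Redis list named queue:<name>.

    # Sketch only — names, address, and thresholds are placeholders.
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: oms-sidekiq
    spec:
      scaleTargetRef:
        name: oms-sidekiq              # the Sidekiq worker Deployment
      minReplicaCount: 1
      maxReplicaCount: 6
      triggers:
        - type: redis
          metadata:
            address: <redis-host>:6380       # Azure Cache for Redis, TLS port
            enableTLS: "true"
            passwordFromEnv: REDIS_PASSWORD
            listName: "queue:default"        # Sidekiq stores each queue as a Redis list
            listLength: "100"                # target queue depth per replica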

Decisions on DB flavor, AKS cluster sharing, and registry layout still need to be made — flagged below.

Key considerations

Database locality

Single biggest correctness/perf risk. Every query made by Rails is round-tripped to MySQL — if the DB stays in AWS while compute moves to Azure, every request and every Sidekiq job pays cross-cloud latency plus egress. The DB move has to be sequenced into this project, not deferred to a later phase. Treat it as a blocker for prod cutover.

Until the DB location is confirmed (RDS? self-managed EC2? somewhere else?), we can't plan the migration path or estimate downtime.

Filesystem dependencies

OMS leans on a few on-disk artifacts that don't exist in a stateless container:

Today → Target:

  • NAV cert at /home/ubuntu/nav_cert/star.popsockets.com.ca-bundle.pem → Key Vault secret, mounted via CSI
  • Capistrano shared/ (uploaded CSVs, packing slips, EDI staging artifacts) → Azure Blob Storage with SAS-scoped access
  • Nav::Api writing to Rails.root/log/{env}.log via ActiveSupport::Logger → stdout (DD daemonset picks it up)
  • Rails master key on host → Key Vault

Inventory pass needed before Phase 1 — anything writing to local disk has to be re-routed before containerizing or it'll either crash or silently lose data.

Sidekiq draining

K8s default terminationGracePeriodSeconds is 30s. NAV SOAP jobs and large bulk actions routinely exceed that. Pods need a longer grace period plus a preStop hook that calls Sidekiq's quiet → drain sequence and only exits once the in-flight set is empty. Without this, rolling deploys will half-process jobs.
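
A minimal sketch of what that looks like on the worker Deployment, assuming Sidekiq runs as PID 1 in the container (TSTP is Sidekiq's standard "quiet" signal; the fixed sleep is a simplification — a tighter version would poll the Sidekiq API until the busy set is empty):

    # Sketch — grace period and sleep are placeholders sized to the slowest NAV SOAP job.
    spec:
      terminationGracePeriodSeconds: 600
      containers:
        - name: sidekiq
          image: <acr>.azurecr.io/oms:<tag>
          command: ["bundle", "exec", "sidekiq"]
          lifecycle:
            preStop:
              exec:
                # TSTP = quiet: stop picking up new jobs, keep finishing in-flight
                # ones. The sleep leaves time to drain before kubelet sends SIGTERM;
                # the remaining grace period covers Sidekiq's own shutdown after that.
                command: ["/bin/sh", "-c", "kill -TSTP 1 && sleep 540"]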

Recurring jobs

Schedule is DB-driven (recurring_jobs table) and read by Sidekiq-Scheduler. In a multi-replica deployment that read happens N times — if each replica registers its own schedule, jobs fire N× per cron tick.

Two paths:

  • Leader election — only one replica owns the schedule (e.g. via a Kubernetes lease, or a scheduler gem such as sidekiq-cron that coordinates through Redis so each job is enqueued once per tick).
  • Single-replica scheduler deployment — split scheduling out of the worker pool entirely: one dedicated pod, separate from the autoscaled worker pool (sketched below).

Decide before Phase 2 — the wrong choice silently double-fires SyncOrdersJob and friends.
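
A minimal sketch of the second option, assuming the app only registers the recurring_jobs schedule when an explicit flag is set. OMS_ENABLE_SCHEDULER is hypothetical — the autoscaled worker pods would simply run without it.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: oms-sidekiq-scheduler
    spec:
      replicas: 1                    # exactly one schedule owner
      strategy:
        type: Recreate               # never two schedulers alive during a rollout
      selector:
        matchLabels:
          app: oms-sidekiq-scheduler
      template:
        metadata:
          labels:
            app: oms-sidekiq-scheduler
        spec:
          containers:
            - name: sidekiq
              image: <acr>.azurecr.io/oms:<tag>
              command: ["bundle", "exec", "sidekiq"]
              env:
                - name: OMS_ENABLE_SCHEDULER   # hypothetical flag gating schedule registration
                  value: "true"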

Egress IP whitelisting

SFCC OCAPI/SCAPI and NAV are IP-allowlisted at the source. AKS pod egress isn't a stable IP unless we configure it (NAT gateway with a static public IP, or a User Defined Route through a fixed firewall). The new egress IPs need to land on SFCC and NAV allowlists before cutover, not during, since the SFCC/NAV admin turnaround is measured in days.
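
For the NAT-gateway route, the setup is roughly the following az CLI sequence (resource names are placeholders; verify flags against current docs before running):

    # Static egress IP for the AKS node subnet via a NAT gateway.
    az network public-ip create -g <rg> -n oms-egress-pip --sku Standard --allocation-method Static
    az network nat gateway create -g <rg> -n oms-egress-nat --public-ip-addresses oms-egress-pip
    az network vnet subnet update -g <rg> --vnet-name <aks-vnet> -n <node-subnet> --nat-gateway oms-egress-nat
    # The address of oms-egress-pip is what goes on the SFCC / NAV allowlists.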

Cron / out-of-band schedules

Whatever is scheduled outside Sidekiq today — whenever, raw crontab, systemd timers — has to be inventoried. Targets in the new world: k8s CronJob for ops-side things (sketch below), sidekiq-cron for app-side things. The inventory pass is part of Phase 0 discovery.
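
A representative shape for the ops-side bucket (job name, schedule, and rake task are placeholders for whatever the inventory turns up):

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: oms-nightly-cleanup          # placeholder
    spec:
      schedule: "0 6 * * *"
      concurrencyPolicy: Forbid          # don't stack runs if one overruns
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: task
                  image: <acr>.azurecr.io/oms:<tag>
                  command: ["bundle", "exec", "rake", "oms:nightly_cleanup"]   # placeholder task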

Service Bus consumer cutover

The most fragile cutover step. If both the old AWS-side Sidekiq and the new AKS-side Sidekiq are consuming Service Bus topics at once, fulfillment messages will be processed twice (Cirro acceptance, NAV release callbacks, batch updates). Two viable approaches:

  • Coordinated handoff — scale AWS workers to 0, wait for in-flight to drain, scale AKS workers up. Brief lag in message processing.
  • Topic pause — pause publishers (or hold the consumer's session lock) during the swap. More moving parts, but no message-processing gap.

Either way, this happens once at cutover and needs a runbook.
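
A rough shape of the coordinated-handoff runbook (how the AWS-side workers are stopped depends on how Sidekiq is supervised on those hosts today; the namespace, deployment name, and replica count are placeholders):

    # 1. Stop Sidekiq on the EC2 hosts (Capistrano task or the host's service manager)
    #    and wait for in-flight jobs to drain — busy count in the Sidekiq UI hits 0.
    # 2. Confirm the Service Bus subscriptions show no locked / in-flight messages.
    # 3. Bring up the AKS-side consumers.
    kubectl -n oms scale deployment oms-sidekiq --replicas=3
    # 4. Watch the subscription backlog drain and spot-check a handful of orders end-to-end.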

Cost shape change

Today: 4 fixed EC2 boxes. Tomorrow: pod requests/limits + two autoscalers (web by CPU/RPS, Sidekiq by queue depth). Savings come from off-hours scale-down on the worker pool; surprises come from misconfigured limits causing OOMKills or throttling. Plan a soak with realistic peak load before declaring a steady-state cost.

Build & deploy

Multi-stage Dockerfile: a build stage that runs bundle install + asset precompile, a runtime stage that's slim (no build toolchain, no node). Migrations run as a pre-deploy k8s Job so they finish before new pods serve traffic, not as a Capistrano hook. CI pipeline: GitHub Actions → ACR push → kubectl set image (or Helm upgrade).
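
The pre-deploy migration Job is the piece that replaces the Capistrano hook; a minimal sketch (image tag and secret name are placeholders):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: oms-migrate-<git-sha>
    spec:
      backoffLimit: 0                    # fail loudly rather than retry a half-applied migration
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: migrate
              image: <acr>.azurecr.io/oms:<git-sha>
              command: ["bundle", "exec", "rails", "db:migrate"]
              envFrom:
                - secretRef:
                    name: oms-env        # DATABASE_URL, RAILS_MASTER_KEY, etc.

CI waits for this Job to complete before rolling the web and worker Deployments to the new image.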

Observability

DD's k8s integration handles autodiscovery, kube-state-metrics, and container metrics out of the box, but tag continuity matters: keep service:ruby so existing dashboards and monitors don't go dark on the cutover. Same with the per-order log streams — preserve the log format so search queries still match.
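
Concretely, the tag continuity can ride on the autodiscovery annotations the DD agent reads from the pod template (container name is a placeholder):

    metadata:
      annotations:
        # Keep service:ruby and the ruby log source so existing monitors,
        # dashboards, and saved searches keep matching after the cutover.
        ad.datadoghq.com/oms.logs: '[{"source": "ruby", "service": "ruby"}]'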

Cert / secret rotation

NAV cert lives in Key Vault, with a documented rotation procedure (who pulls the new cert, who pushes it to Key Vault, what the redeploy looks like). Same shape for SFCC client secrets and Service Bus SAS keys. The current "edit credentials.yml.enc and redeploy" flow becomes "update Key Vault and roll the pods."
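
A minimal sketch of the Key Vault wiring for the NAV cert (vault, tenant, and object names are placeholders):

    apiVersion: secrets-store.csi.x-k8s.io/v1
    kind: SecretProviderClass
    metadata:
      name: oms-nav-cert
    spec:
      provider: azure
      parameters:
        keyvaultName: "<key-vault-name>"
        tenantId: "<tenant-id>"
        objects: |
          array:
            - |
              objectName: nav-cert-bundle    # the PEM pushed to Key Vault
              objectType: secret

The pod mounts this through a csi volume (driver: secrets-store.csi.k8s.io) pointing at the class, so rotation becomes: push the new PEM to Key Vault, then roll the pods.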

MySQL migration: AWS → Azure

Target is Azure Database for MySQL — Flexible Server (managed; not running in pods). SKU sized to match the current AWS instance class.

Approach: native MySQL replication. Dump/restore alone takes hours of downtime on a real-sized OMS DB. Azure DMS is no longer Microsoft's recommended path for online MySQL — they've shifted toward native replication. App-level dual-write is too invasive for a Rails app this size. So replication it is.

Steps

  1. Provision the Flex Server. Match AWS source on major version (5.7 vs 8.0), charset, collation, sql_mode, and timezone (UTC).
  2. Seed with mysqldump --single-transaction --master-data=2 (add --set-gtid-purged=ON if the replication will be GTID-based) from AWS, restore into Azure. The --master-data=2 annotation captures the binlog coordinates replication starts from (see the sketch after this list).
  3. Wire the network path so Azure can pull binlogs from AWS. Two options:
    • AWS RDS public endpoint + IP allowlist + TLS — simpler, fine for a 1–2 week migration window.
    • VPC ↔ VNet peering — more secure, more setup.
  4. Start replication on Azure at the captured binlog position. Watch Seconds_Behind_Master.
  5. Pre-cutover prep: bump AWS RDS binlog retention to ~7 days, confirm ROW format, smoke-test reads against Azure.
  6. Cutover (low-traffic window, Sunday AM):
    1. Pause Sidekiq queues, put Rails in read-only/maintenance.
    2. Wait for replication lag → 0.
    3. Stop replication on Azure — it's now primary.
    4. Flip DATABASE_URL → Azure.
    5. Restart app pods.
    6. Smoke test, then re-enable Sidekiq.
  7. Hold AWS read-only for 48h as a rollback option.
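
A sketch of steps 2 and 4 for the binlog-position path. Endpoints, credentials, and the database name are placeholders; the az_replication_* calls are Azure's data-in replication stored procedures — confirm exact arguments against current docs.

    # If the source is RDS: extend binlog retention before dumping (see Gotchas), e.g.
    #   CALL mysql.rds_set_configuration('binlog retention hours', 168);

    # Step 2 — consistent seed that records the binlog coordinates.
    mysqldump --single-transaction --master-data=2 \
      -h <aws-endpoint> -u <user> -p oms_production > oms_seed.sql
    mysql -h <flex-server>.mysql.database.azure.com -u <admin> -p oms_production < oms_seed.sql

    # Step 4 — point Azure at the AWS source and start pulling binlogs.
    mysql -h <flex-server>.mysql.database.azure.com -u <admin> -p -e "
      CALL mysql.az_replication_change_master('<aws-endpoint>', '<repl-user>', '<repl-password>', 3306, '<binlog-file>', <binlog-pos>, '<ca-cert-pem>');
      CALL mysql.az_replication_start;
      SHOW SLAVE STATUS\G"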

Gotchas

  • AWS auto-purges binlogs. Bump retention before starting or replication will fall off the back of the log.
  • TLS is enforced on Azure MySQL Flex. Make sure Rails' database.yml enforces it on the mysql2 side (ssl_mode: required, plus the CA bundle if needed) before cutover.
  • Connection limits are tier-dependent. Count Rails web pool size × pod count + Sidekiq concurrency (its pool size) × pod count, then pick a tier that fits with headroom (rough arithmetic below). (Rough today: web ≈ 2 pods × pool; sidekiq-default ≈ 1–3 pods × concurrency 5; sidekiq-camel ≈ 1 pod × 1.)
  • sql_mode defaults differ between AWS and Azure. Align them up front or expect surprise validation errors after cutover.
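
For a rough feel, assuming the Rails default pool of 5 and the counts above: web 2 × 5 = 10, sidekiq-default 3 × 5 = 15, sidekiq-camel 1 × 1 = 1, so roughly 26 connections at peak — before counting the migration Job, consoles, and the window during a rolling deploy when old and new pods overlap. Size the tier against that overlap case, not the steady state.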

Open questions before scheduling

  • Is the current MySQL on AWS RDS, or self-hosted on the VPS?
  • MySQL version — 5.7 or 8.0?
  • Rough data volume and writes/day (sizes the SKU and sets cutover-window expectations)?
  • Any cross-account / IAM-auth dependencies that don't survive the move?

Phased rollout

Rough cut. Expect every phase to bend on contact with reality.

Phase 0 — Discovery

  • Confirm MySQL location (RDS? something else?)
  • Audit filesystem dependencies (every path written to under Rails.root or /home)
  • List external IP allowlists (SFCC, NAV, anything else)
  • Inventory non-Sidekiq scheduled work
  • Pick AKS cluster (new vs share ps-usw-aks-01)

Phase 1 — Containerize

  • Multi-stage Dockerfile that boots Rails locally
  • docker-compose covering app + MySQL + Redis for dev parity
  • GitHub Actions builds and pushes images to ACR

Phase 2 — AKS dev environment

  • Helm chart or kustomize overlay for OMS
  • Dev/sandbox cluster runs against existing dev integrations
  • Smoke test: place an order through the test pipeline end-to-end

Phase 3 — Data layer migration

  • Stand up Azure-resident MySQL + Redis
  • Cutover plan for prod DB (replication, freeze, swap, verify)
  • Decide DB flavor — see Open questions

Phase 4 — Staging cutover + soak

  • Run staging on AKS for long enough to surface scaling, scheduling, and log-shape issues
  • Validate alerts, dashboards, on-call runbooks against the new shape

Phase 5 — Production cutover

  • Egress IPs added to SFCC / NAV allowlists ahead of time
  • Service Bus consumer handoff per Service Bus consumer cutover
  • Decommission EC2 hosts

Open questions

For Kevin to sort:

  • Timeline. Is this a Q3 plan, an end-of-year plan, or further out?
  • DB destination. Azure Database for MySQL Flexible Server, Azure SQL (with adapter changes), or self-hosted in Azure?
  • AKS cluster. New cluster, or share ps-usw-aks-01 with the existing integrations services?
  • Prod replica count. What's the pod count target — match the 4 EC2 boxes, or sized differently for the new shape?
  • Cert migration owner. Who pulls the NAV cert, pushes it to Key Vault, and owns rotation going forward?
  • Whitelist lead time. Can we get the IP allowlist asks queued with SFCC and NAV early, before Phase 5?

Related pages

  • OMS — current architecture and ownership
  • Camel Topology — the existing Azure side of the integrations stack