Decisions Needed Before Cutover
Questions that the Rails team can't answer alone. Each one blocks or shapes a chunk of work. Surface to leadership / DevOps lead before scoping the next sprint.
1. Cloud target: Azure or AWS?
The original Senith email mentioned Azure. Some current infra is on AWS (S3, possibly more). What's the actual target?
- Azure (AKS) — implies Azure Database for MySQL, Azure Cache for Redis, Azure Key Vault, Azure Container Registry, optional move from S3 to Azure Blob.
- AWS (EKS) — implies RDS MySQL, ElastiCache, AWS Secrets Manager, ECR, stay on S3.
This decision drives every subsequent infra choice.
2. Object storage: stay on S3 or migrate to Azure Blob?
If the cluster runs on Azure but storage stays on S3, we get cross-cloud egress costs but zero application changes. If we move to Blob:
- Add the `azure-storage-blob` gem and change the ActiveStorage service to `:azure_storage`.
- Run a one-time copy of all existing S3 objects to Blob (size and count TBD — pull from S3 inventory).
- The raw `aws-sdk` usage in workers (PDF, packing slips, shipping labels, TIFF, PNG) also needs to switch to the Azure SDK. Non-trivial.
- ActiveStorage URLs in the DB encode the service — the `service_name` column will need to be flipped. There is a Rails-supported migration pattern for this; not free.
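For reference, the `config/storage.yml` shape under a Blob move might look like the following — a sketch only; bucket, account, container, and env var names are placeholders, not the real OMS config:

```yaml
# config/storage.yml — sketch; names and env vars are placeholders
amazon:
  service: S3
  bucket: <%= ENV["AWS_BUCKET"] %>
  region: us-east-1

azure:
  service: AzureStorage
  storage_account_name: <%= ENV["AZURE_STORAGE_ACCOUNT"] %>
  storage_access_key: <%= ENV["AZURE_STORAGE_KEY"] %>
  container: <%= ENV["AZURE_STORAGE_CONTAINER"] %>
```

Per-environment selection happens via `config.active_storage.service`; the `service_name` column flip for existing blobs is a separate step on top of this.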
Recommendation: stay on S3 for the initial lift-and-shift, revisit blob migration as a separate project.
3. MySQL 5.7 → 8.0 timing
Production version: TBD (need to confirm — dev compose uses 5.7). MySQL 5.7 is EOL. The new managed DB should be 8.0.
- Run the test suite against 8.0 in CI before cutover.
- Plan a `mysqldump` + restore against an 8.0 staging DB and run a smoke test pass.
- Likely a brief downtime window during cutover.
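The dump-and-restore rehearsal can be a straightforward pair of commands — a command sketch only; hostnames, user, and database names here are placeholders:

```shell
# Sketch — hosts, user, and db names are placeholders.
# Dump from the 5.7 source; --single-transaction avoids locking InnoDB tables.
mysqldump --single-transaction --routines --triggers \
  -h old-db.internal -u oms -p oms_production > oms_production.sql

# Restore into the 8.0 staging instance, then run the smoke test pass.
mysql -h new-db-80.staging -u oms -p oms_staging < oms_production.sql
```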
4. Redis: managed or in-cluster?
- Managed (Azure Cache / ElastiCache) — operationally simple, predictable cost.
- In-cluster (Bitnami Redis Helm chart) — cheaper at small scale, another thing to operate.
OMS uses Redis for Sidekiq queues + (probably) Rails cache. Recommendation: managed. Sidekiq + Redis is a hot path; outsource the ops.
5. Secret store strategy
How does `RAILS_MASTER_KEY` (and other secrets) reach the pods?
- Plain k8s Secrets committed via SealedSecrets — simple, but secrets live in Git encrypted.
- External Secrets Operator + Azure Key Vault / AWS Secrets Manager — secrets stay in the canonical store, ESO syncs them into k8s Secrets. More moving parts but proper separation.
Recommendation: ESO + Key Vault / Secrets Manager.
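Under the ESO option, each secret is declared as an `ExternalSecret` that points at the canonical store — a sketch assuming a `ClusterSecretStore` named `azure-kv` is already configured; all names here are placeholders:

```yaml
# Sketch — store name, namespace, and key names are placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: oms-rails
  namespace: oms
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: azure-kv
  target:
    name: oms-rails            # the k8s Secret that ESO creates and syncs
  data:
    - secretKey: RAILS_MASTER_KEY
      remoteRef:
        key: oms-rails-master-key
```

The app Deployment then consumes `oms-rails` as an ordinary `envFrom`/`secretKeyRef`; nothing secret ever lives in Git.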
6. CI/CD pipeline
Today: Capistrano from a developer machine. Going to k8s, we need a build-and-push pipeline + a rollout mechanism.
- Build: GitHub Actions builds the image, pushes to ACR / ECR.
- Rollout:
  - Argo CD (GitOps) — manifests in a Git repo, Argo reconciles. Best practice for production k8s.
  - `kubectl set image` from CI — simpler, no Argo to operate.
  - Helm + `helm upgrade` from CI — middle ground.
Recommendation: Argo CD if infra has appetite for it; Helm-from-CI if not.
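The build half might look like this — a sketch assuming ACR; the registry hostname, image name, and repo secret names are placeholders:

```yaml
# .github/workflows/build.yml — sketch; registry and secret names are placeholders
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: omsprod.azurecr.io
          username: ${{ secrets.ACR_USERNAME }}
          password: ${{ secrets.ACR_PASSWORD }}
      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: omsprod.azurecr.io/oms:${{ github.sha }}
```

Under the Argo CD option, a follow-up step (or a human) bumps the image tag in the manifest repo and Argo does the rollout; under the Helm-from-CI option, a `helm upgrade` step follows the push.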
7. Filesystem write call sites — refactor before cutover, or emptyDir patch?
Four call sites write to local disk (see known-issues.md).
Refactoring all four to stream to S3 is real work. The cheap path is an
`emptyDir` mount, which is only safe if a single job execution owns the path
(no cross-pod coordination needed).
Action item for Rails team: audit each call site and confirm whether the path needs to be visible to other workers or only consumed within the same job. Refactor anything that crosses pod boundaries.
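Where a call site can be kept job-local, the safe shape looks like this — a sketch; `generate_and_upload` and its stand-in uploader are illustrative, not existing OMS code:

```ruby
require "tmpdir"

# Sketch: the job owns its scratch directory for exactly one execution,
# which is the property that makes an emptyDir mount safe. The `upload`
# lambda stands in for the real S3 (or Blob) client call.
def generate_and_upload(key, upload:)
  Dir.mktmpdir("oms-job") do |dir|     # lands on the pod's emptyDir in k8s
    path = File.join(dir, "label.pdf")
    File.write(path, "%PDF-1.4 fake")  # stand-in for real PDF generation
    upload.call(key, File.read(path))  # ship it before the dir is removed
  end                                  # scratch dir deleted here — nothing
end                                    # outside this job ever sees the path

uploaded = {}
generate_and_upload("labels/123.pdf", upload: ->(k, body) { uploaded[k] = body })
```

Any call site that can't be rewritten into this shape (another worker reads the file later) crosses pod boundaries and needs the real refactor.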
8. Datadog Agent topology
Two patterns:
- DaemonSet (one agent per node) — standard Datadog k8s recommendation. App pods talk to `status.hostIP:8125`.
- Sidecar (one agent per app pod) — higher resource cost, simpler network model. App pods talk to `localhost:8125`.
Recommendation: DaemonSet.
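With the DaemonSet pattern, each app pod discovers its node's agent through the downward API — a sketch of the container `env` block only (the surrounding pod spec is omitted; variable names follow the Datadog client convention):

```yaml
# Container env sketch — Deployment/pod boilerplate omitted.
env:
  - name: DD_AGENT_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP   # node IP where the DaemonSet agent listens
  - name: DD_DOGSTATSD_PORT
    value: "8125"
```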
9. Sidekiq capsule deployment topology — RESOLVED
`config/sidekiq.yml` defines three capsules in one YAML. Today, a single Sidekiq process runs all of them. For k8s, we wanted three separate Deployments (different scaling, resource, autoscale rules).
Resolved: the Rails team has split the config into four per-Deployment files. Each Sidekiq Deployment runs one capsule:
- `config/sidekiq.default.yml` → `sidekiq-default` Deployment
- `config/sidekiq.limited.yml` → `sidekiq-limited` Deployment
- `config/sidekiq.single.yml` → `sidekiq-single` Deployment
- `config/sidekiq.scheduler.yml` → `sidekiq-scheduler` Deployment
The original config/sidekiq.yml is unchanged so the current Capistrano deploy keeps working.
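As a sanity check on the split, the per-Deployment files can be derived mechanically from the combined file — a sketch with illustrative capsule names and queues, not the real OMS config:

```ruby
require "yaml"

# Illustrative combined config — not the real OMS sidekiq.yml.
combined = YAML.safe_load(<<~YML)
  concurrency: 10
  capsules:
    default:
      queues: [default, mailers]
    limited:
      concurrency: 2
      queues: [exports]
    single:
      concurrency: 1
      queues: [locks]
YML

# One file per capsule: top-level defaults, overridden by capsule settings.
files = combined["capsules"].to_h do |name, capsule|
  ["config/sidekiq.#{name}.yml", combined.slice("concurrency").merge(capsule)]
end

files.each { |path, cfg| puts "#{path}: #{cfg.inspect}" }
```

Diffing output like this against the four committed files catches a queue that silently fell out of (or into) the wrong Deployment.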
10. Static egress IP
Several integration partners (carriers, 3PL) may require source-IP allowlisting. The k8s cluster needs a NAT gateway with a stable public IP.
Action item for infra: confirm the NAT plan and the public IP. Action item for Rails team: produce the partner list that needs the IP.
11. Domain & TLS
- Hostnames per environment (production, staging, sandbox, sandbox2)?
- Migration from `*.popsockets.com` to new hostnames, or keep the existing?
- TLS issuer: cert-manager + Let's Encrypt, or Azure-managed certs?
12. Cutover plan
- Run k8s and Capistrano deploys in parallel for some period?
- Single big-bang cutover with downtime window?
- Read-only canary (route a small % of traffic via the k8s ingress)?
This is a leadership call based on risk tolerance and downtime budget.
Suggested order of decisions
- Cloud target (Azure vs. AWS) — blocks everything else.
- Storage path (stay on S3 vs. migrate to Blob) — biggest scope swing.
- MySQL 8.0 timing — independently can start now.
- Static egress IP plan — partners need lead time.
- Datadog topology, secret store, CI/CD — can finalize during build-out.
- ~~Capsule topology — Rails team owns this; can land before infra is ready.~~ Resolved (see #9).
- Cutover plan — last, once everything else is concrete.