
Workloads

The OMS app is not a single process. The container image is the same for every workload below — only the command differs. Each row should map to a distinct k8s Deployment (or Job / CronJob) so they can be scaled, restarted, and resourced independently.

Deployments

| Name | Command | Replicas | Notes |
| --- | --- | --- | --- |
| web | `bundle exec puma -C config/puma.rb` | 2+ (HPA on CPU) | HTTP on 3000. Probes hit `/up`. Tune via `WEB_CONCURRENCY` (workers) and `RAILS_MAX_THREADS` (threads). Pool size in `config/database.yml` is currently 30; make sure DB max-connections covers replicas × workers × threads. |
| sidekiq-default | `bundle exec sidekiq -C config/sidekiq.default.yml` | 2+ | Default capsule, concurrency 5, queues `urgent`, `high`, `default`, `low`. `SIDEKIQ_SCHEDULER_ENABLED=false`. |
| sidekiq-limited | `bundle exec sidekiq -C config/sidekiq.limited.yml` | 1–2 | Concurrency 4, queues `nav`, `urgent`, `high`, `default`, `low`. The `nav` queue is exclusive to this Deployment. |
| sidekiq-single | `bundle exec sidekiq -C config/sidekiq.single.yml` | 1 (do not autoscale) | Concurrency 1, queues `camel`, `urgent`, `high`, `default`, `engine`. The `camel` queue handles Cirro 3PL exports and PrintStation order creation; these jobs are not safe to run concurrently. `replicas: 1` enforces cluster-wide single-thread processing. |
| sidekiq-scheduler | `bundle exec sidekiq -C config/sidekiq.scheduler.yml` | 1 (singleton, `strategy: Recreate`) | `SIDEKIQ_SCHEDULER_ENABLED=true`. Hosts the sidekiq-scheduler thread. If two run, every scheduled job fires twice. Schedules are stored in the DB-backed `RecurringJob` model; see scheduled-jobs.md. |
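The singleton rows translate into manifests along these lines. This is a sketch only: the image reference, labels, and resource names are assumptions for illustration, not the real chart values.

```yaml
# Sketch: sidekiq-scheduler as a 1-replica singleton Deployment.
# Image name and label scheme are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sidekiq-scheduler
spec:
  replicas: 1
  strategy:
    type: Recreate            # never two schedulers alive at once, even mid-rollout
  selector:
    matchLabels: { app: oms, workload: sidekiq-scheduler }
  template:
    metadata:
      labels: { app: oms, workload: sidekiq-scheduler }
    spec:
      containers:
        - name: sidekiq
          image: oms:latest   # assumption: the shared app image from the table above
          command: ["bundle", "exec", "sidekiq", "-C", "config/sidekiq.scheduler.yml"]
          env:
            - name: SIDEKIQ_SCHEDULER_ENABLED
              value: "true"
```

`strategy: Recreate` matters here: the default `RollingUpdate` briefly runs old and new pods side by side, which is exactly the double-firing scenario the table warns about.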

About per-capsule config files: the original config/sidekiq.yml defines all three capsules together (used by the current Capistrano deploy). For k8s, the Rails team has split it into four per-Deployment files (sidekiq.default.yml, sidekiq.limited.yml, sidekiq.single.yml, sidekiq.scheduler.yml) so each Deployment runs exactly one capsule and can be scaled and resourced independently. The original sidekiq.yml is unchanged so the Capistrano deploy is unaffected.
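For orientation, a per-capsule file such as `sidekiq.single.yml` would look roughly like the following. The queue list and the `:timeout: 25` value come from this document; any other options are assumptions, and the actual file in the repo is authoritative.

```yaml
# config/sidekiq.single.yml (sketch; real repo file is authoritative)
:concurrency: 1   # single-threaded worker; combined with replicas: 1 above
:timeout: 25      # must stay below the pod's terminationGracePeriodSeconds
:queues:
  - camel         # Cirro 3PL exports, PrintStation order creation
  - urgent
  - high
  - default
  - engine
```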

Jobs

| Name | When | Command | Notes |
| --- | --- | --- | --- |
| db-migrate | Pre-deploy hook | `bundle exec rails db:migrate` | Runs against the target DB before web/worker pods roll. Use a Helm pre-upgrade hook or Argo Sync hook. Must complete before new pods start serving. |
| assets-precompile | CI build-time | `bundle exec rails assets:precompile` | Runs in the CI image-build pipeline (GitHub Actions), not in this repo's Dockerfile. The CI step passes the production `RAILS_MASTER_KEY` via BuildKit secret (`--mount=type=secret`) so the key never lands in an image layer. See the comment in docker/production/Dockerfile for the exact pattern. |
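If we go the Helm route, the db-migrate hook could be sketched like this (hook annotations are standard Helm; the image reference is an assumption):

```yaml
# Sketch: run migrations as a Helm pre-upgrade hook Job.
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation  # clean up the old Job before each run
spec:
  backoffLimit: 0             # fail the release rather than retry a half-applied migration
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: oms:latest   # assumption: shared app image
          command: ["bundle", "exec", "rails", "db:migrate"]
```

With `backoffLimit: 0`, a failed migration fails the Helm upgrade outright, so the old pods keep serving against the old schema.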

CronJobs (optional alternative to sidekiq-scheduler)

The current design uses sidekiq-scheduler with schedules in the DB. We can keep that (singleton scheduler pod) or convert recurring jobs to k8s CronJobs. Tradeoffs:

  • Keep sidekiq-scheduler: schedules live in the DB, business users edit via UI, no k8s changes per schedule change. Single point of failure (the scheduler pod), but the queue still drains if it briefly restarts.
  • Move to k8s CronJobs: visibility in kubectl get cronjob, native retries/concurrency policy, but every schedule change is a k8s manifest change. Loses dynamic-via-DB editing.

Recommendation: keep sidekiq-scheduler for now (no business workflow change), revisit later.
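Should we revisit, a converted schedule would look roughly like this. The job name, schedule, and runner invocation are made up for illustration; only the `concurrencyPolicy` mechanism is the point.

```yaml
# Sketch: one recurring job as a native k8s CronJob (hypothetical job name/schedule).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report            # hypothetical
spec:
  schedule: "0 2 * * *"           # hypothetical: 02:00 daily
  concurrencyPolicy: Forbid       # native guard against overlapping runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: oms:latest   # assumption: shared app image
              command: ["bundle", "exec", "rails", "runner", "NightlyReportJob.perform_now"]  # hypothetical class
```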

Resource profile (TBD — needs production data)

We need to pull real numbers from Datadog before sizing. Placeholder targets based on typical Rails/Sidekiq:

| Workload | Request CPU | Limit CPU | Request RAM | Limit RAM |
| --- | --- | --- | --- | --- |
| web (per pod) | 500m | 1500m | 768Mi | 1.5Gi |
| sidekiq-default | 300m | 1000m | 1Gi | 2Gi |
| sidekiq-limited | 250m | 800m | 768Mi | 1.5Gi |
| sidekiq-single | 200m | 600m | 512Mi | 1Gi |
| sidekiq-scheduler | 100m | 300m | 384Mi | 768Mi |
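Once real Datadog numbers replace the placeholders, each row maps onto its Deployment's container spec like so (web shown):

```yaml
# Sketch: web row from the table above as a container resources block.
resources:
  requests:
    cpu: 500m
    memory: 768Mi
  limits:
    cpu: 1500m
    memory: 1536Mi   # 1.5Gi
```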

Action item for Rails team: pull P95 CPU/RAM per process from the last 30 days of Datadog and replace the table above before final rollout.

Termination grace periods

  • web: terminationGracePeriodSeconds: 30 — Puma drains in-flight requests on SIGTERM.
  • sidekiq-*: terminationGracePeriodSeconds: 60 — must be ≥ Sidekiq's :timeout: 25 (set in every per-capsule config file) plus margin. If we discover jobs that take >25s, raise :timeout AND the grace period together.
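In the pod template spec, the Sidekiq pairing reads as a one-liner whose comment encodes the invariant above:

```yaml
# Sketch: sidekiq-* pod template spec fragment.
spec:
  terminationGracePeriodSeconds: 60  # keep well above :timeout: 25 in the capsule config
```

On SIGTERM, Sidekiq stops fetching and gives in-flight jobs `:timeout:` seconds to finish; if the kubelet's grace period is shorter, pods get SIGKILLed mid-job.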

Pod disruption budgets

  • web: minAvailable: 1
  • sidekiq-default / sidekiq-limited: minAvailable: 1
  • sidekiq-single / sidekiq-scheduler: do not set a PDB requiring minAvailable > 0 — they are 1-replica singletons, and such a PDB blocks node drains.
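Concretely, the web PDB could be sketched as follows (the label selector is an assumption; it must match whatever labels the web Deployment actually uses):

```yaml
# Sketch: PDB keeping at least one web pod up through voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 1
  selector:
    matchLabels: { app: oms, workload: web }  # assumption: label scheme
```

The same shape applies to sidekiq-default and sidekiq-limited; the singletons deliberately get no PDB, since `minAvailable: 1` on a 1-replica workload would make every node drain hang.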