Known limitations ¶
Overall, scaling paths are available for both the CATS application and the IndeVets data warehouse to any anticipated business volume.
Performance ¶
The CATS application and IndeVets data warehouse have a very high ratio of business volume to system load. Usage grows very gradually and predictably under a closed ecosystem of users that includes only IndeVets management staff, IndeVets-employed doctors, and scheduling staff at onboarded client hospitals.
BI data analysis is isolated from CATS application usage by connecting Metabase and Microsoft PowerBI Gateway to a read-only PostgreSQL replica. With this, demanding analysis workloads can only impact other analysis workloads and never application users.
With NewRelic, we monitor how long user requests take over time and how wait times break down between the web application and the database. Both the web application and database components have clear paths to adding capacity as increasing load brings performance close to our thresholds.
Database ¶
The primary database VM and read-only BI replica database VM are likely to become CPU- or memory-bound long before becoming storage-bound. Possible remedies to database performance bottlenecks include:
- Re-distributing heavy read workloads from the BI replica or primary database to additional read-only replicas
- Resizing either database VM to an instance size with more CPU/memory/storage
- Migrating either database to a physical machine or DBaaS
- Optimizing common queries
Web application ¶
The CATS web application is deployed to a Kubernetes cluster as a single Docker container that bundles an nginx frontend web server, a PHP-FPM worker pool, a PHP queue worker daemon, and a scheduled job runner. This is ideal for development and management simplicity and economical in production, as IndeVets’ normal business operations generate very modest web application load relative to public websites built on similar technology. The production instance of CATS runs on a node pool shared with no other applications or environments. The production node pool contains two identically-sized nodes while the CATS application runs a single replica, leaving the second node idle and fully available for failover; a minimal sketch of this layout follows the list below. Without any refactoring, possible remedies to web application performance bottlenecks include:
- Increasing VM instance size used in production node pool
- Tweaking PHP-FPM worker pool to maximize use of available CPU cores on the single production node
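For orientation, the following is a minimal sketch of the single-replica production layout described above, assuming a plain Kubernetes Deployment; the name, image reference, node pool label, and resource figures are illustrative placeholders rather than the actual manifest.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cats                          # hypothetical name
spec:
  replicas: 1                         # single replica; the second node stays idle for failover
  selector:
    matchLabels:
      app: cats
  template:
    metadata:
      labels:
        app: cats
    spec:
      nodeSelector:
        pool: cats-production         # hypothetical label for the dedicated production node pool
      containers:
        - name: cats
          # single container bundling nginx, PHP-FPM, the queue worker, and the scheduled job runner
          image: registry.example.com/indevets/cats:latest   # placeholder image reference
          resources:
            requests:
              cpu: "2"                # illustrative sizing; tune to the node's capacity
              memory: 4Gi
```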
When it becomes necessary, the CATS production deployment can be refactored to distribute load across more than one node:
- Session storage would need to be migrated from Laravel’s `file` driver to one of its other plug-and-play session drivers, most likely the existing CATS PostgreSQL database, a PostgreSQL database dedicated to sessions, a Redis instance, or encrypted cookie-based session storage
- Scheduled jobs would need to be migrated directly to Kubernetes CronJobs where possible, and a minutely Kubernetes CronJob would need to be created to distribute invocations of CATS internal scheduled jobs across all running replicas of the CATS application, ensuring that matching jobs run only once across the whole production replica set per matching minute (a minimal sketch of such a CronJob follows this list)
- The queue worker already supports running any number of parallel instances. It could remain embedded in the CATS application container so that each replica of the application contributes one queue worker, or be extracted to an independent deployment with its own replica count that could even run on a different set of nodes, isolating queue jobs from competing with web request workers for CPU and memory.
- The nginx frontend could be left embedded within each CATS application replica, so that the cluster load balancer would distribute traffic among many nginx frontend application servers, or a single nginx frontend application server could be extracted to distribute traffic directly to many PHP-FPM worker pools. Operationally, it will likely be simpler to keep an nginx frontend application server embedded in each application replica.
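As a rough sketch of the minutely CronJob approach mentioned above, the example below simply invokes Laravel’s standard scheduler entry point (`php artisan schedule:run`) once per minute in its own pod; the name, image reference, and namespace details are assumptions, and any fan-out of individual jobs across application replicas would layer on top of this.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cats-scheduler                # hypothetical name
spec:
  schedule: "* * * * *"               # every minute, matching Laravel's scheduler cadence
  concurrencyPolicy: Forbid           # skip a run if the previous one is still in flight
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: schedule-run
              image: registry.example.com/indevets/cats:latest   # placeholder image reference
              command: ["php", "artisan", "schedule:run"]         # Laravel's scheduler entry point
```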
Metabase ¶
The Jarvus team has successfully scaled Metabase for high usage simply by growing the replica count in Kubernetes. Without any refactoring, possible remedies to Metabase performance bottlenecks include (see the sketch after this list):
- Increasing replica count
- Moving Metabase to a dedicated node pool
- Increasing VM instance size used in the Metabase node pool
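A minimal sketch of the first two remedies, assuming Metabase runs as an ordinary Kubernetes Deployment; the replica count and node pool label are illustrative assumptions rather than the current configuration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metabase
spec:
  replicas: 3                      # raise to add capacity for concurrent analysis workloads
  selector:
    matchLabels:
      app: metabase
  template:
    metadata:
      labels:
        app: metabase
    spec:
      nodeSelector:
        pool: metabase             # hypothetical label for a dedicated Metabase node pool
      containers:
        - name: metabase
          image: metabase/metabase:latest
          ports:
            - containerPort: 3000  # Metabase's default HTTP port
```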
Storage ¶
Database storage is currently limited to 60GB before an instance resize is required, as we are using the VMs’ local disk storage instead of expandable managed volume storage for performance, management simplicity, and reliability. At the time of this writing, the CATS application database plus all IndeVets data warehouse tables total less than 2GB. Likely upgrade paths beyond 60GB include larger VM instance sizes or DBaaS solutions.