bug - daily email interruptions ¶
CATS-710
background ¶
DailyOpenShiftEmail messages are sent out to doctors at midnight.
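For background, a nightly send like this is typically driven by Laravel's scheduler. The sketch below shows the general shape only; the artisan command name and Kernel wiring are assumptions, not the actual code in this repo.

```php
<?php
// app/Console/Kernel.php (sketch; command name is an assumption)

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule)
    {
        // Queue up the DailyOpenShiftEmail messages for doctors at midnight.
        $schedule->command('emails:daily-open-shifts')->dailyAt('00:00');
    }
}
```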
problem ¶
The emails stopped being delivered to all recipients around the same time as the Laravel 7 and 8 upgrades. The pod holding the prod container was being killed with exit code 137 (SIGKILL) and restarted at random points during the queuing process, effectively stopping the run partway through.
There was no specific error message logged, and testing out some code changes didn’t have any effect on the terminations.
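Exit code 137 corresponds to SIGKILL (128 + 9); on Kubernetes that is commonly an out-of-memory kill or eviction rather than an application error, which would fit the absence of any logged error. The pod name below is a placeholder; these are the standard commands for confirming how the container was terminated:

```sh
# Confirm why the container was terminated (pod name is a placeholder).
kubectl describe pod <prod-pod-name> | grep -A 6 'Last State'

# Or pull the last termination record directly:
kubectl get pod <prod-pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
```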
That this seemed to be introduced around the same time as the Laravel upgrades potentially points to some deeper change in how processing happens internally, but there may also have been other environmental configuration changes. Even the emails themselves getting larger may have pushed some process past a tipping point.
The underlying issue is that when a problem occurs, the entire pod is killed and restarted without warning. This happens quickly enough that end users generally won't notice anything, but it does stop long-running processes and truncates laravel.log.
resolution(s) ¶
There are two areas of resolution.
docker change ¶
docker-entrypoint.sh was patched so that if an error occurs while processing queued jobs (such as this mail run), only the queue processing is restarted, keeping the impact of such errors limited to the job queue rather than the entire pod.
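A minimal sketch of that pattern is below; the real docker-entrypoint.sh differs, and the exact artisan command and flags are assumptions. The idea is that a worker failure restarts only the worker loop, while the entrypoint (the container's main process) keeps running:

```sh
#!/bin/sh
# Sketch only: supervise the queue worker inside the entrypoint so a worker
# failure does not take down the container's main process (and the pod with it).
# Other entrypoint steps omitted.
while true; do
    php artisan queue:work
    status=$?
    if [ "$status" -ne 0 ]; then
        # Worker died (e.g. mid-way through the DailyOpenShiftEmail run);
        # log it and restart just the worker.
        echo "queue:work exited with status $status, restarting worker" >&2
        sleep 5
        continue
    fi
    break   # clean exit: stop supervising
done
```

With this in place, an error during the job run costs a worker restart instead of a full pod restart.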
This was an emergency change pushed to production the night of Jan 18, and the midnight Jan 19 email run appeared to complete successfully.
internal code change ¶
In EmployeeService::getOpenShiftsForPublishedDates, while looping over the found shift IDs, the original code was creating a shift object and then calling loadIncentives() on it. On production this became measurably slow: a list of 150 emails took around 7-8 minutes to loop over and generate. (This slowness did not occur locally in development.)
The code was changed to avoid the shift object retrieval, and a portion of the loadIncentives() method from the shift class was inlined directly, reducing the per-iteration overhead. After this change, 150 emails on production took under 2 minutes.
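An illustrative before/after sketch of that change is below; the model, table, and variable names are assumptions, as is the exact query that was inlined from loadIncentives():

```php
<?php

use App\Models\Shift;                 // assumed model name
use Illuminate\Support\Facades\DB;

// $shiftIds is assumed to come from earlier in the method.
$openShifts = [];

// Before (as described): hydrate a Shift model for each ID and call
// loadIncentives() on it, paying model-retrieval overhead on every iteration.
foreach ($shiftIds as $shiftId) {
    $shift = Shift::find($shiftId);
    $shift->loadIncentives();
    $openShifts[] = $shift;
}

// After (as described): skip the Shift model entirely and inline just the
// incentive lookup the email needs (table/column names are assumptions).
foreach ($shiftIds as $shiftId) {
    $openShifts[$shiftId] = DB::table('shift_incentives')
        ->where('shift_id', $shiftId)
        ->get();
}
```

The point of the sketch is the removed per-shift model hydration, which is the per-iteration overhead the description above attributes the 7-8 minute runtime to.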
This speed concern was not what was causing the crash: the inline change had been tested earlier, and the crash still happened, only sooner in the run.