In September, Mozilla release engineering started experiencing high pending counts on our test pools, notably Windows, but also Linux (and consequently Android). High pending counts mean that there are thousands of jobs queued to run on the machines that are busy running other jobs. The time developers have to wait for their test results is longer than ideal.
Usually, pending counts clear overnight because less code is pushed during the night (in North America), which triggers fewer builds and tests. However, as you can see from the graph above, the Windows test pending counts were flat last night. They did not clear up overnight. You will also note that try, which usually comprises 63% of our load, has the highest pending counts compared to other branches. This is because many people land on try before pushing to other branches, and tests aren’t coalesced on try.
The work to determine the cause of high pending counts is always an interesting mystery.
- Are tests being chunked into smaller jobs that increase end-to-end time due to the added startup time?
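
To make the chunking question concrete, here is a minimal sketch of the arithmetic, assuming each job pays a fixed startup cost before any tests run. The function name and every number are hypothetical and chosen only for illustration; they are not measurements from our pools.

```python
def total_compute_seconds(test_seconds, chunks, startup_seconds):
    """Machine time used when a suite is split into `chunks` jobs.

    Hypothetical model: the test work itself is constant, and every
    job adds one fixed startup cost on top of it.
    """
    return test_seconds + chunks * startup_seconds


# Illustrative numbers only, not measured values.
suite_seconds = 3600    # time spent actually running tests
startup_seconds = 300   # assumed per-job setup cost

for chunks in (1, 4, 8):
    total = total_compute_seconds(suite_seconds, chunks, startup_seconds)
    print(f"{chunks} chunk(s): {total} s of machine time")
```

Under this model each individual job gets shorter as the suite is split, but the total machine time per push grows with every extra chunk, which is what matters when the pool is already saturated.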
|Mystery by ©Stuart Richards, Creative Commons by-nc-sa 2.0|
Joel Maher and I looked at the data for this last week and discovered what we believe to be the source of the problem. We
|Increase in seconds that new jobs added to the total compute time per push. (Some existing jobs also reduced their compute time, for a total difference of about 2.5 more hours per push on Windows.)|
Release engineering is working to reduce these pending counts, given our current hardware constraints, with the following initiatives: