As I mentioned last week, we broke the performance and load testing down into several phases so that we could focus on one interface at a time and look for bottlenecks before involving multiple interfaces.
Performance Test 1
The first load test involved a large storm simulation that ran from 9am to 6pm and isolated the interface from the OMS to GSI Scout. The simulation automated 61,000 customer calls, generating 43,000 outages in the OMS and sending those outages to the GSI Scout System Bridge Web Service. We can scale up all of the GSI Scout services (databases, services, portals, etc.) in a matter of minutes, and we would normally scale them all up to their highest settings for a large storm, but for this particular test we left them at a “normal” performance setting. In that 9-hour timeframe, the GSI Scout System Bridge Web Service responded to almost 52,000 messages from the OMS with an average response time of 348 milliseconds. It kept up with the OMS with no problems at all.
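To give a feel for the load involved, here is a minimal sketch of how a call-automation driver might pace simulated calls evenly across the test window and track average response time. The function names and the stubbed `send` callable are hypothetical, not the actual automation tooling we used:

```python
import time
from statistics import mean

def call_interval_seconds(total_calls: int, duration_hours: float) -> float:
    """Spacing between simulated calls for an even load over the test window."""
    return duration_hours * 3600.0 / total_calls

def run_load(send, total_calls: int, duration_hours: float, dry_run: bool = True) -> float:
    """Drive `send` (a callable that posts one simulated call and returns its
    elapsed time in seconds) at a steady pace; returns the average response time."""
    interval = call_interval_seconds(total_calls, duration_hours)
    durations = []
    for _ in range(total_calls):
        durations.append(send())
        if not dry_run:
            time.sleep(interval)  # pace the next call
    return mean(durations)
```

At 61,000 calls over 9 hours, a call lands roughly every half second, which puts the 348-millisecond average response time in context: the service was answering each message faster than the next one arrived.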
We integrated a telemetry monitoring SDK, Visual Studio Application Insights, into all of our cloud-based components. Application Insights tracks every request made of the service and logs information about those requests to a storage account on Microsoft Azure. Using the Microsoft Azure Portal, you can mine the results for a particular service to see whether there were performance issues. This chart is one example of the information Application Insights can provide. It shows an overview of the GSI Scout System Bridge Web Service: the total number of requests, the dependency duration (in this case, the database), and the response time for all of the requests. Each bar on the chart represents 15 minutes, and the average response times for the entire period are listed to the right.
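Application Insights does this aggregation for you, but the idea behind the chart, bucketing request telemetry into 15-minute bins and averaging within each bin, can be sketched in a few lines. The `(timestamp, duration)` record layout here is a hypothetical stand-in for the exported telemetry, not the actual Application Insights schema:

```python
from collections import defaultdict

BUCKET_SECONDS = 15 * 60  # each chart bar covers 15 minutes

def average_by_bucket(requests):
    """requests: iterable of (timestamp_seconds, duration_ms) pairs.
    Returns {bucket_start_seconds: average request duration in ms}."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket -> [total_ms, count]
    for ts, dur in requests:
        bucket = int(ts // BUCKET_SECONDS) * BUCKET_SECONDS
        sums[bucket][0] += dur
        sums[bucket][1] += 1
    return {b: total / count for b, (total, count) in sums.items()}
```

A flat line across those per-bucket averages is exactly what you want to see: it means response time is not drifting upward as the test (and the data volume) grows.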
We’ve even integrated Application Insights into our web portal. In the portal, Application Insights monitors and records the performance of the client-side code running in the user’s browser. It keeps track of every dependent call, such as AJAX calls retrieving additional data, so it can provide a wealth of information for performance tuning. It also records any browser exceptions that occur, along with the call stack for each exception. It’s very nice information to have when troubleshooting. But that’s a topic for another article.
As you can see in the chart, throughout the day the server response time held steady at about 350 milliseconds, which is a really good indication that performance isn’t degrading as the amount of data in the database increases significantly. The dependency duration starts to climb a bit toward the end of the day, but we’re only talking about a 1 millisecond increase per command over the course of a 9-hour time period with very high loading. Increasing the database scaling level to a higher setting would significantly reduce the average dependency duration and that would help with this slight increase seen toward the end of the day.
Overall, the first performance test of the OMS to GSI Scout interface was successful, but this was an existing interface point that had been in production for several years, and we knew it already performed very well. The test validated that the changes we made to the GSI Scout System Bridge Web Services for the new interface points did not negatively impact the performance of this existing interface.
Performance Test 2
About 2 weeks after the first performance test, we executed the second, adding the interface from the utility’s FFA application used by the field crews to GSI Scout, alongside the previously tested OMS to GSI Scout interface. This test would run for about 4 hours and simulate about 20,000 customer calls creating about 5,000 additional outages. It would also simulate outages being dispatched to field crews for assessment, the crews “completing” those assessments in the field, and the assessment data being sent back to GSI Scout for processing. The outages created in the previous performance test would still be in the database, so at the start of this test the GSI Scout database would already hold about 51,000 outages. The net effect would be a large storm in which 61,000 customer calls had already taken place, with an additional 20,000 customer calls now generating another 5,000 outages. For this test we scaled all of the services (database, web services, portals, etc.) to the “moderate” performance level, right about the middle of the available performance spectrum. Still not the “large storm” settings we would normally use, but a reasonable performance increase nonetheless.
At the end of the test, about 13,700 requests were processed on both GSI Scout System Bridge Web Services (FFA and OMS) with an average response time of 295 milliseconds.
From here we were able to break down the requests by interface and view the details at a lower level. The GSI Scout System Bridge Web Service handling requests from the OMS processed about 12,900 requests with an average response time of 278 milliseconds, considerably lower than the 350 milliseconds of the first test. We attributed this to scaling the GSI Scout services up to the “moderate” level, which decreased processing time by about 25%.
The GSI Scout System Bridge Web Service handling requests from the FFA application processed about 813 requests with an average response time of 566 milliseconds. This was the first time this particular interface was involved, and at a little over half a second per request, it performed very well.
No requests failed, which was also a very good sign.
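As a sanity check, the combined 295-millisecond figure reported above is consistent with a request-weighted average of the two per-interface results (using the approximate counts from this test):

```python
def weighted_avg_ms(groups):
    """groups: iterable of (request_count, avg_response_ms) per interface.
    Returns the overall average response time weighted by request count."""
    total = sum(n for n, _ in groups)
    return sum(n * ms for n, ms in groups) / total

# approximate counts and averages from the second test
combined = weighted_avg_ms([(12_900, 278), (813, 566)])  # OMS, FFA
```

The OMS interface dominates the count, so its faster responses pull the combined average well below the FFA interface’s 566 milliseconds.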
However, this test uncovered a performance issue in the utility’s web service that processes messages from GSI Scout to FFA, and it began causing significant problems about 2 hours into the test. But that’s what these types of tests are for. While the test was still in progress, the developers of that web service were able to try out a remediation as a proof of concept, and it resolved the performance problem. The change would need to be incorporated into the web service before the next performance test.
Because of this issue, the automation script that generated messages from FFA to GSI Scout was turned off after only 813 requests were processed.
Performance Test 3
The third performance test was executed about a week later and included the fix for the performance issue discovered during the second test. It would involve both interfaces (OMS to GSI Scout and FFA to GSI Scout), but would focus on the FFA to GSI Scout and GSI Scout to FFA interface points, since those had not been exercised as much in the previous two tests. For this test no new outages would be created; instead, automations for dispatching outages to crews for assessment would generate messages from GSI Scout to FFA, and automations for completing assessments in the FFA tool would generate messages from FFA to GSI Scout. The automations would run at full speed, dispatching and completing outages as quickly as possible in an effort to find the maximum throughput of the two interfaces. The test would run for about 4 hours at full speed. The GSI Scout database would also start the test with all of the outages from the previous two performance tests, over 56,000 outages, a significant amount of data, more than a large storm’s worth, in fact. We scaled the GSI Scout services up to the “moderate” level, about midway up the performance scale.
During the test, 3,055 outages were dispatched and completed in the first 2 hours. The two GSI Scout System Bridge Web Services processed about 15,400 messages with an average response time of 318 milliseconds.
The GSI Scout System Bridge Web Service handling requests from OMS processed about 12,650 requests with an average response time of 160 milliseconds. This is considerably lower than the previous two tests, but in this test no new outages were created. The only messages received through this web service were “dispatch” messages, which basically assign outages to crews, so the processing load for each request was much lower. That was easily seen in the average response time over the course of the test.
The GSI Scout System Bridge Web Service that processed messages from FFA handled about 2,750 requests with an average response time of 863 milliseconds. For this test, the interface received new outage assessment data from the FFA system, so it had to update each outage in the database with all of the assessment data sent over from FFA. The processing requirements for each request were significant, and the fact that the web service could handle each request in less than 1 second was very positive.
The test itself was a success. The previously identified performance issue in the FFA web service was resolved and did not cause any additional problems in this test.
The overall result of this test was that we were able to identify the upper limit of completed assessments per hour through both interfaces, which was 1,682. This was one of the key metrics that needed to be identified.
Based on our initial performance requirements, we needed to be able to process 1,200 assessment completions per hour, so we officially met this requirement. That rate would allow 300 assessors to complete 4 outage assessments every hour, continuously. Obviously this would be very hard to sustain non-stop, but at least we now know that the system won’t fall over if that ever occurred. In fact, it turns out we could handle a little more, about 400 additional outage assessment completions per hour. The even better news is that this performance test started with over 56,000 outages in the GSI Scout database, which was a fairly large amount of data.
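The requirement math above can be written out explicitly; the figures come straight from the test results in this section:

```python
def required_rate(assessors: int, assessments_per_hour_each: int) -> int:
    """Completed assessments per hour needed to keep every assessor busy."""
    return assessors * assessments_per_hour_each

measured_max = 1_682                    # completed assessments/hour, from test 3
requirement = required_rate(300, 4)     # 300 assessors x 4 per hour = 1,200
headroom = measured_max - requirement   # extra completions/hour beyond the requirement
```

The measured ceiling of 1,682 per hour clears the 1,200-per-hour requirement with several hundred completions per hour to spare.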
A Bonus Test
As icing on the cake, about a week later one of the operating companies ran another large storm simulation in their OMS, and we ended up keeping the OMS to GSI Scout interface turned on. We left the GSI Scout database as is, with over 56,000 outages in it, and scaled the services up to the “moderate” level.
The storm simulation ran for 17 hours and created about 17,000 new outages. The GSI Scout System Bridge Web Service handling requests from OMS processed about 25,120 requests with an average response time of 314 milliseconds with no performance degradation identified.
After the test was complete, the GSI Scout database had over 73,000 outages in it.
Wrapping It All Up
There is one more performance test scheduled later this month that will include all of the automation components on a large scale. Before that test we plan on “resetting” the system by clearing out all of the existing outage data to simulate the true beginning of a large storm. Given that these GSI Scout interfaces have already performed very well with over 73,000 outages in the database, we should be in pretty good shape.
The current plan is to start rolling these new interfaces out in a couple of months. User Acceptance Testing is already underway, in parallel with Performance Testing, and both efforts seem to be going very well.
Performance and load testing is a key component of any large system integration project. It is very effort intensive, but the results of these tests are priceless. Every performance problem identified and fixed during testing is further insurance against system failure under even the highest loading in production. It certainly isn’t the real world, where anything can happen, but it is a chance to account for the known criteria and make sure the system will respond well to them.