Increased Latency
Incident Report for SmartyStreets
Postmortem

This incident was caused by traffic to our load balancers exceeding the configured maximum number of connections. The growth in traffic was steady and expected, and the resulting increase in connection count had been anticipated and planned for. However, one very specific configuration value for the load balancing tier was incorrect. It prevented traffic from rising above the misconfigured threshold, even though there was an overabundance of CPU and network capacity available to handle the load.
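To illustrate how a connection cap can raise latency while CPU and network sit mostly idle, here is a minimal sketch in Python. It applies Little's law: sustainable throughput equals the connection limit divided by the average time each connection is held. The function names and all numbers are hypothetical and chosen only for illustration; they are not taken from the actual load balancer configuration.

```python
def max_throughput(max_connections: int, hold_time_s: float) -> float:
    """Little's law: requests per second sustainable under a concurrency cap."""
    return max_connections / hold_time_s


def backlog_after(arrival_rps: float, max_connections: int,
                  hold_time_s: float, seconds: float) -> float:
    """Requests left waiting after `seconds` of sustained arrivals."""
    excess = arrival_rps - max_throughput(max_connections, hold_time_s)
    return max(0.0, excess) * seconds


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    cap = 1000     # assumed connection limit
    hold = 0.25    # assumed average seconds a connection stays open
    print("ceiling (req/s):", max_throughput(cap, hold))
    print("waiting after 10 s at 5000 req/s:", backlog_after(5000, cap, hold, 10))
```

In this model, once arrivals exceed the ceiling, requests queue and latency grows regardless of how much CPU is free, which is consistent with low utilization alongside rising external latency.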

The settings were carried over unchanged from the previous OS to the upgraded OS, as anticipated. However, the upgraded version of the operating system handled those settings in subtly different ways, which inadvertently capped the number of connections available from our load balancer fleet at a lower value.

As a result, we are now keenly aware of the many settings that have to be configured and reconfigured. The load balancer fleet is now more robust, and we have additional tests in place to verify that the settings are correct.
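As a rough sketch of what such a verification test could look like, the Python below compares the limits the running OS actually reports against required minimums, rather than trusting the values written in a configuration file. The report does not name the setting involved, so the two settings checked here (the open-file limit and the Linux listen backlog), the function names, and the thresholds in the usage note are all assumptions for illustration. The `resource` module is available on Unix-like systems only.

```python
import resource
from pathlib import Path


def read_effective_limits() -> dict:
    """Read limits as the running OS reports them, not as a config file states them."""
    soft_nofile, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_nofile == resource.RLIM_INFINITY:
        soft_nofile = float("inf")  # unlimited always satisfies a minimum
    limits = {"nofile_soft": soft_nofile}
    somaxconn = Path("/proc/sys/net/core/somaxconn")  # Linux-only path
    if somaxconn.exists():
        limits["somaxconn"] = int(somaxconn.read_text().strip())
    return limits


def find_violations(effective: dict, required_minimums: dict) -> list:
    """Return one message per setting that is missing or below its minimum."""
    problems = []
    for name, minimum in required_minimums.items():
        actual = effective.get(name)
        if actual is None:
            problems.append(f"{name}: not reported by this OS")
        elif actual < minimum:
            problems.append(f"{name}: effective {actual} < required {minimum}")
    return problems
```

A check of this kind could run after every OS upgrade, for example `find_violations(read_effective_limits(), {"nofile_soft": 65536, "somaxconn": 4096})` with hypothetical thresholds, and fail the deployment if the returned list is not empty.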

Posted 6 months ago. Dec 17, 2018 - 21:17 UTC

Resolved
We have identified the root cause of the increased latency and will publish more information on the incident soon.
Posted 6 months ago. Dec 12, 2018 - 23:31 UTC
Update
We are still actively monitoring the situation and working through available metrics and logs to fully uncover the root cause behind this incident.
Posted 6 months ago. Dec 12, 2018 - 18:58 UTC
Update
We are continuing to watch all available metrics to better understand the root cause of the latency spike.
Posted 6 months ago. Dec 12, 2018 - 16:34 UTC
Monitoring
Although all of our internal metrics show very low utilization of our system, we have brought significant additional capacity online, and this has returned all external latency metrics to normal levels. We are now investigating to better understand the root cause.
Posted 6 months ago. Dec 12, 2018 - 15:50 UTC
Investigating
We are seeing increased latency across our load balancing fleet.
Posted 6 months ago. Dec 12, 2018 - 15:20 UTC
This incident affected: Account Management Portal, US Extract API (us-east, us-central, us-west), US Autocomplete API (us-east, us-central, us-west), US ZIP Code API (us-east, us-central, us-west), International Street API (us-east, us-central, us-west), and US Street Address API (us-east, us-central, us-west).