Post-mortem on the SELECT incident of August 15, 2024

  • Niall Woodward
    Co-founder & CTO of SELECT

Incident Summary

On Thursday August 15, 2024, certain users of SELECT received around 50 emails reporting account connection errors for one of our partner’s Snowflake test accounts.

Incident timeline

  • At approximately 04:30 UTC, erroneous account connection emails began being sent to users.
  • At 06:45 UTC, the issue was identified and the email error notification process was disabled.
  • At 07:07 UTC, a communication was sent to all impacted customers via email, and a proactive in-app message was configured to notify users logged into the application. Our incident status page was also updated.
  • Some users continued to receive emails until 14:30 UTC due to rate limits on recipient mailboxes.

What happened?

A new email alerting mechanism was added to SELECT, intended to inform customers when a configuration issue prevents SELECT from accessing Snowflake account metadata. Notifications for one of our implementation partner’s test accounts were sent to the wrong email addresses.

Why did it happen?

A bug in our account error notification service resulted in the wrong account owners being notified of the failure. As a result, a larger volume of emails than expected was sent, which caused a timeout and multiple retries, so each recipient received several emails.
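To illustrate the failure mode described above, here is a minimal Python sketch of one common way to stop a retried batch from re-sending emails that already went out. It is not SELECT's actual code: the function names, the `sent_log` set (standing in for a persistent record), and the `send` callback are all hypothetical.

```python
from typing import Callable, Iterable, Set, Tuple

# Hypothetical record of (notification_id, recipient) pairs already delivered.
# A real service would persist this; an in-memory set keeps the sketch short.
SentLog = Set[Tuple[str, str]]


def send_batch(
    notification_id: str,
    recipients: Iterable[str],
    send: Callable[[str], None],
    sent_log: SentLog,
) -> None:
    """Send one notification per recipient, skipping any already delivered."""
    for recipient in recipients:
        key = (notification_id, recipient)
        if key in sent_log:
            continue  # an earlier attempt already delivered this email
        send(recipient)  # may raise TimeoutError part-way through the batch
        sent_log.add(key)  # record progress so a retry resumes from here


def send_with_retries(
    notification_id: str,
    recipients: Iterable[str],
    send: Callable[[str], None],
    sent_log: SentLog,
    max_attempts: int = 3,
) -> None:
    """Retry the whole batch on timeout; the log prevents duplicate sends."""
    recipients = list(recipients)  # allow the batch to be iterated again
    for attempt in range(1, max_attempts + 1):
        try:
            send_batch(notification_id, recipients, send, sent_log)
            return
        except TimeoutError:
            if attempt == max_attempts:
                raise
```

Without the `sent_log` check, every retry would start again from the first recipient, which is the general pattern by which a timeout plus retries multiplies emails. This sketch does not cover the case where a send succeeds but its acknowledgement is lost.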

Recipient mailboxes rate-limited SELECT’s email sends, so emails continued to be delivered for several hours after the bug responsible for the issue was resolved.

Why wasn’t this identified in testing?

Our staging environment was not representative of production conditions for this case, so the issue went undetected until it was released.

Impact

The primary impact of this incident was confusion for our customers, many of whom wondered whether one of their accounts had errors, whose account they were being emailed about, or whether this was a phishing attempt.

The emails contained the Snowflake account identifier for one of our partner’s testing accounts.

This incident was isolated to our account error notification service. Our core web application and authorization systems were not impacted.

Next steps and improvements

We will be improving our QA and staging environments to better represent production, and introducing additional testing conditions:

  • Staged roll-outs of new email-based functionality to verify expected email volumes before going live
  • Rate-limiting guardrails to restrict the volume of emails that can be sent
  • Enhancing our integration test suite for the email notification service to cover additional scenarios
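
As an illustration of what a volume guardrail like the one listed above could look like, the following Python sketch caps the number of emails sent to a single recipient within a sliding time window. It is a hypothetical example under assumed requirements, not SELECT's implementation; the class name `EmailVolumeGuard` and its parameters are invented for this sketch.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class EmailVolumeGuard:
    """Caps emails per recipient within a sliding time window (hypothetical)."""

    def __init__(self, max_per_recipient: int, window_seconds: float) -> None:
        self.max_per_recipient = max_per_recipient
        self.window_seconds = window_seconds
        # Send timestamps per recipient, oldest first.
        self._history: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, recipient: str, now: Optional[float] = None) -> bool:
        """Return True and record the send if the recipient is under the cap."""
        now = time.monotonic() if now is None else now
        timestamps = self._history[recipient]
        # Discard sends that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_per_recipient:
            return False  # over the cap: caller should drop or defer the email
        timestamps.append(now)
        return True
```

A sender would call `guard.allow(recipient)` before each send and skip or defer the email when it returns `False`. The `now` argument exists only so the behavior can be tested with fixed timestamps.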