Thomas August Ryan

Malicious Scraping and Account Number Schemes

On Monday morning, just before 11 AM, my supervisor walked into my cube. “Hey Tom, did you change anything about the voting precincts query lately?” Nope, I hadn’t changed that thing in months. We walked over to his office, where he was remoted into the VM that one of our databases runs on. Task Manager was showing 99% CPU usage. Emails from customers reporting the uncharacteristic slowness of Parcel Details, my web app, started pouring in.

My supervisor started the troubleshooting process by running a command against our MS-SQL server instance that reported which queries were consuming the most resources. This report showed that a query that looked up voting precincts in my app was eating up an order of magnitude more resources than any of the other queries the database was executing. As it turned out, this was a bit of a red herring, but it eventually led us down the right path.
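I don’t have the exact command he ran, but it was in the spirit of the standard check against SQL Server’s execution-stats DMVs. Here is a rough sketch of that kind of lookup, wrapped in a small C# console program; the connection string, class name, and output formatting are placeholders rather than anything from our environment.

    // Rough sketch of the kind of diagnostic my supervisor ran: ask SQL Server's
    // execution-stats DMVs which cached statements have consumed the most CPU.
    // Not the exact command he used; the connection string is a placeholder.
    using System;
    using Microsoft.Data.SqlClient; // NuGet package: Microsoft.Data.SqlClient

    class TopCpuQueries
    {
        const string Sql = @"
            SELECT TOP (10)
                   qs.total_worker_time / 1000 AS total_cpu_ms, -- DMV reports microseconds
                   qs.execution_count,
                   st.text AS batch_text
            FROM sys.dm_exec_query_stats AS qs
            CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
            ORDER BY qs.total_worker_time DESC;";

        static void Main()
        {
            using var conn = new SqlConnection(
                "Server=localhost;Database=master;Integrated Security=true;TrustServerCertificate=true");
            conn.Open();

            using var cmd = new SqlCommand(Sql, conn);
            using var reader = cmd.ExecuteReader();
            while (reader.Read())
            {
                var cpuMs = reader.GetInt64(0);
                var executions = reader.GetInt64(1);
                var text = reader.IsDBNull(2) ? "(no text)" : reader.GetString(2).Trim();
                Console.WriteLine($"{cpuMs,12} ms  x{executions}  {text}");
            }
        }
    }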

Next, we reviewed the database logs to look for errors or memory paging, of which there were none. Then we reviewed the logs for a related web app that my supervisor supports, which showed that we were experiencing a minor traffic spike, approximately double our normal volume for that day of the week. Still concerned about the extreme CPU usage on the database server, we poked around in the Windows Event Viewer for a few minutes to no avail.

Finally, I returned to my desk and opened the Application Insights dashboard for Parcel Details. It was immediately clear that something was very wrong. The average page load time was 1.5 minutes, and 14k failed requests had been reported in the last 24 hours. Server response times had degraded to 1.2 minutes per request, 10k exceptions had been thrown, and 6.5k requests to dependencies had failed. I headed straight to the Failures dashboard to investigate.

Three kinds of exceptions had been recorded: database request timed out, account number not found, and cannot open database connection. Looking at the first exception, there wasn’t much useful information in the stack trace beyond confirming that something had gone wrong with a SQL call to the overwhelmed database from earlier. But luckily for me, App Insights had captured the URL of the request to our web server that had triggered this exception.

I fired up a copy of the app from Visual Studio 2019 and slapped the offending URL onto the end of the localhost route. I was immediately able to reproduce the sluggishness that users were reporting, as the app took over a minute to bump me to an application exception page. Clearly this was a case of not failing fast enough.

Inspecting the URL more closely, I realized that the value of the ?parcel= query string was invalid. There are two formats for valid account/parcel numbers: one is fourteen characters long and the other is seven. Neither contains non-numeric characters, although there is a special case for a variant of the fourteen-digit format that includes dashes. The query string value that was causing this exception contained multiple letters.

I went to the controller for this page and added a small chunk of input validation code that redirected requests whose query strings contained characters other than digits and punctuation back to the default search page rather than running them against the database. This change short-circuits the lookup for account numbers that cannot exist and spares the database the hassle of pointlessly searching for them.
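The real controller isn’t public, so here is a minimal sketch of the guard, assuming an ASP.NET Core MVC controller; ParcelController, Details, IParcelService, and the Search/Index redirect target are hypothetical names standing in for the real ones.

    // Minimal sketch of the validation guard, assuming an ASP.NET Core MVC
    // controller. The controller, action, and service names are hypothetical.
    using System.Text.RegularExpressions;
    using Microsoft.AspNetCore.Mvc;

    public interface IParcelService
    {
        // Stand-in for the real lookup that hits the database.
        object GetParcelDetails(string parcel);
    }

    public class ParcelController : Controller
    {
        // Valid account numbers contain only digits, plus dashes in one
        // fourteen-digit variant, so anything else can be rejected outright.
        private static readonly Regex DigitsAndDashesOnly =
            new Regex(@"^[0-9-]+$", RegexOptions.Compiled);

        private readonly IParcelService _parcels;

        public ParcelController(IParcelService parcels) => _parcels = parcels;

        [HttpGet]
        public IActionResult Details(string parcel)
        {
            // Bounce obviously invalid account numbers back to the search page
            // instead of letting them tie up a database connection.
            if (string.IsNullOrWhiteSpace(parcel) || !DigitsAndDashesOnly.IsMatch(parcel.Trim()))
            {
                return RedirectToAction("Index", "Search");
            }

            return View(_parcels.GetParcelDetails(parcel.Trim()));
        }
    }

A stricter check against the seven- and fourteen-character formats would be a reasonable tightening, but even this coarse filter keeps requests full of letters from ever touching the database.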

I ran the test suite and did a bit of manual testing in IIS Express to verify that the fix worked and that nothing had regressed. Then I shipped the changes to the staging environment, where I again manually verified the fix before pushing the build to production. I waited a few minutes for something terrible to happen, and when nothing occurred, I left to walk around the block. Upon returning, I refreshed the dashboard and saw that the telemetry from the database and web server had returned to normal. The incident was resolved in less than an hour.

I wrote up a short postmortem describing the attack and sent it to my supervisor and the customers. Then I started working on this blog post.