The Antivirus API

The Antivirus API is a separate app that doesn’t interact directly with other digitalmarketplace apps. Instead, it operates on files held in S3 buckets and stores the results of its scans back in the S3 object’s “tags”.

The S3 scanning process operates on specific object versions rather than object keys. This simplifies things: the contents of an object-version are immutable, which reduces concerns around concurrent updates, stale scan results and so on. It also ensures we don’t have any old infected versions of things lying about. Each object-version does, however, have its own independent set of tags, and these tags can be updated after the object-version is created. The tags are what store the results of the virus scan, allowing us to mark object-versions as “clean” or “dirty”.

Given an S3 bucket name, object key and object version id, a broad overview of the process the API will follow is:

  • First the object version’s tags will be checked to see if this object has already been scanned. If it has, the API goes no further. At the time of writing there is no way to force a re-scan of a file, though this wouldn’t be hard to add.

  • The file’s contents are downloaded and passed to a clamd process running locally on the app.

  • The S3 object version’s tags are set according to the result:

    Tag key               Tag value
    avStatus.result       Either pass or fail
    avStatus.clamdVerStr  Metadata about scanning process, including clamd and virus definition version
    avStatus.ts           ISO timestamp of scan

    All avStatus.* tags are intended to belong together and describe a single scanning event, so an update of any of these tags should begin with all existing avStatus.* tags first being cleared out.

The actual procedure followed performs a few more steps than listed here in an attempt to reduce the possibility of race conditions arising from concurrent scanning requests.
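
The three broad steps above could be sketched roughly as follows. This is only an illustration, not the app’s actual code: it leaves out the extra race-condition precautions, the function name is made up, `s3` is assumed to behave like a boto3 S3 client, and `scan_bytes` is a hypothetical stand-in for whatever hands the contents to clamd.

```python
from datetime import datetime, timezone

AV_PREFIX = "avStatus."


def scan_object_version(s3, scan_bytes, bucket, key, version_id):
    """Scan one object-version unless it already carries a scan result.

    `s3` is expected to behave like a boto3 S3 client; `scan_bytes` is a
    hypothetical callable returning (is_clean, clamd_version_string).
    """
    target = {"Bucket": bucket, "Key": key, "VersionId": version_id}

    # Step 1: skip the work if a previous scan already tagged this version.
    tag_set = s3.get_object_tagging(**target)["TagSet"]
    existing = {tag["Key"]: tag["Value"] for tag in tag_set}
    if f"{AV_PREFIX}result" in existing:
        return existing[f"{AV_PREFIX}result"]

    # Step 2: download the immutable contents and hand them to the scanner.
    body = s3.get_object(**target)["Body"].read()
    is_clean, clamd_version = scan_bytes(body)

    # Step 3: drop any stale avStatus.* tags, keep unrelated tags, and write
    # the new set in one call so the three tags describe a single scan.
    kept = [tag for tag in tag_set if not tag["Key"].startswith(AV_PREFIX)]
    result = "pass" if is_clean else "fail"
    new_tags = [
        {"Key": f"{AV_PREFIX}result", "Value": result},
        {"Key": f"{AV_PREFIX}clamdVerStr", "Value": clamd_version},
        {"Key": f"{AV_PREFIX}ts", "Value": datetime.now(timezone.utc).isoformat()},
    ]
    s3.put_object_tagging(**target, Tagging={"TagSet": kept + new_tags})
    return result
```

Passing the client and the scanner in as arguments keeps the sketch easy to exercise against a fake bucket; a second call for the same object-version returns the stored result without downloading anything.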

How scans are initiated

There are two ways that scans are initiated for an object-version - using them together allows us to avoid the complexity of managing a stateful work queue while still making it quite hard for an uploaded object-version to “slip through the net” without being scanned.

Automatically triggered on new uploads via SNS

The Antivirus API receives HTTPS callbacks from Amazon SNS when a new object-version is uploaded to one of our buckets. Receiving one of these callbacks will initiate the scan procedure for the relevant object-version. There are, however, some caveats, because SNS is a notification system and not strictly a message queue.

  • Because the scanning procedure is performed during the callback’s request-handling cycle, it only has a limited amount of time in which to complete its work. An SNS callback allows a maximum of 15 seconds, after which time it will terminate the connection. In normal circumstances this should be plenty of time, so it’s unlikely to be an issue. Even if the procedure does end up taking more than the allowed time, the various layers of our web serving infrastructure don’t tend to propagate information about prematurely closed connections, so the actual procedure is likely to be given the full amount of time our wsgi server allows for response processing (typically 60s), long after SNS has hung up. In such a case, SNS will believe the notification attempt to have failed and will retry later. However, when retried, the procedure should discover the object-version to have already been scanned and return quickly with a successful response, ending the retries.
  • SNS will only retry delivery of a notification a limited number of times (currently 5), giving up after an hour. After this point, SNS will have effectively forgotten about this event and there is a danger that the upload in question could slip through the net. Because this is something we should probably be aware of, we’ve set up SNS to log delivery attempts to CloudWatch, firstly as a permanent record, but also as a means to trigger an alert when SNS gives up on a notification. SNS giving up is probably a sign that there is something seriously wrong with this mechanism.

To allow these SNS messages the Antivirus API has a non-IP-restricted URL path at https://antivirus-api.digitalmarketplace.service.gov.uk/callbacks mapped through to it by the router.
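
As an illustration of what the callback has to unpick, here is a minimal sketch of extracting the object-versions named in an SNS callback body. The function name is hypothetical and this is not the app’s actual handler; it relies only on the documented shape of SNS HTTP/S notifications and S3 event records, and it deliberately omits SNS signature verification and subscription confirmation.

```python
import json
from urllib.parse import unquote_plus


def object_versions_from_sns(raw_body):
    """Return (bucket, key, version_id) tuples named by an SNS callback body."""
    envelope = json.loads(raw_body)

    # SNS also posts subscription handshakes to the same URL; they carry
    # no S3 event, so there is nothing to scan.
    if envelope.get("Type") != "Notification":
        return []

    # The S3 event is itself JSON, serialised into the "Message" string.
    event = json.loads(envelope["Message"])
    found = []
    for record in event.get("Records", []):
        # Only newly created object-versions are of interest here.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3_info = record["s3"]
        found.append((
            s3_info["bucket"]["name"],
            # Keys arrive URL-encoded in S3 event notifications.
            unquote_plus(s3_info["object"]["key"]),
            s3_info["object"].get("versionId"),
        ))
    return found
```

Each tuple returned would then be fed to the scan procedure. Because a retried notification names the same object-version, a repeat delivery simply finds the existing tags and returns quickly.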

The catchup job

Because of the worry of missed SNS messages, we also have a “catchup” job running nightly to try and catch unscanned, recently created object-versions. As much as it would be nice to scan whole buckets for any unscanned object-versions, this is not practical: it seems to be impossible to fetch object-version tags from S3 in bulk, and separately requesting the tags for every object-version in a bucket on every run would be prohibitively slow.

The job runs every 24h and looks at object-versions uploaded in the last (just over) 48h. This means a single night’s run can fail without any uploads being missed, provided the following night’s run succeeds.

The actual job consists of a single script, virus_scan_s3_bucket.py, which sends requests to the Antivirus API’s “main” interface at https://antivirus-api.digitalmarketplace.service.gov.uk to perform the individual scans and report the results.
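
The selection of “recently created” object-versions might look something like the sketch below. It is not taken from virus_scan_s3_bucket.py: the function name and the exact size of the “just over 48h” margin are assumptions, and the input is assumed to be dicts shaped like the "Versions" entries returned by boto3’s list_object_versions.

```python
from datetime import datetime, timedelta, timezone

# "Just over" 48h; the exact margin used here is an assumption.
LOOKBACK = timedelta(hours=48, minutes=30)


def recent_versions(versions, now=None, lookback=LOOKBACK):
    """Yield (key, version_id) for object-versions modified within `lookback`.

    `versions` is an iterable of dicts with "Key", "VersionId" and a
    timezone-aware "LastModified" datetime.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - lookback
    for version in versions:
        if version["LastModified"] >= cutoff:
            yield version["Key"], version["VersionId"]
```

With a real client, the input could come from iterating the pages of `s3.get_paginator("list_object_versions").paginate(Bucket=...)` and passing each page’s "Versions" list. Listing versions this way returns modification times in bulk, which is why filtering by age is cheap even though fetching tags for every object-version is not.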

Actions on virus detection

When a virus is detected:

  • the Antivirus API tags the object-version as possibly infected in the S3 bucket
  • the Antivirus API emails the developers via the 2nd line mailing list
  • if the scan was run via the overnight catchup script, the Jenkins job will fail

No automatic attempt is currently made to rectify the situation - if no developer takes action, the file will stay in the S3 bucket as though nothing had happened. This is partly because of the complexity of performing some kind of file rollback in the face of possible concurrent scan requests for either the same version of an object or multiple versions of an object uploaded in quick succession. It is also because the best action to take isn’t clear in all circumstances, and it’s thought better to trust a developer to use their judgement.

Developers are advised to refer to ClamAV’s Interpreting Scan Alerts FAQ to help determine the next action.

Functional/Smoulder tests

Because an actual virus alert is a rare occurrence, and the outward appearance of an antivirus system that isn’t working is basically the same as that of an antivirus system which simply hasn’t been exposed to any viruses, it’s nice to have some reassurance that the Antivirus API is actually functioning correctly and able to produce alerts. What’s more, this is an “operational” property rather than just being a property of the code in use - therefore we really want to be able to run this check in production to ensure that a specific environment’s scanning loop is currently operational.

We also would rather ensure the actual scanning engine is working instead of simply triggering it to send a mock “virus found” response. Because it’s a very bad idea to have a “real” sample virus kicking about our systems, even for this purpose, we upload an EICAR test file to generate the alert.

We have a smoulder test that ensures:

  • The SNS callback to the Antivirus API is operational.
  • The Antivirus API is able to access the file on S3 to scan.
  • The Antivirus API’s clamd is actually functioning and equipped with a working virus definition file.
  • The Antivirus API is able to send an alert email through Notify.

To prevent these tests producing actual alerts while still allowing us to check that a particular alert email was sent to Notify, the Antivirus API’s alert-sending code has a specific branch it takes when a file is identified as infected with EICAR (and only EICAR). In this case, the alert email is instead sent to a “dummy” email address and uses a Notify reference based on the file’s hash for unique identifiability of emails generated for a specific run.
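
The EICAR-only branch described above could be sketched as follows. Everything named here is a hypothetical placeholder rather than the app’s real code: the email addresses, the reference format, the choice of SHA-256, and the set of signature names clamd is assumed to report for EICAR. The reference used for real alerts is not described here, so the sketch returns none for that case.

```python
import hashlib

# All names and addresses below are hypothetical placeholders.
EICAR_SIGNATURES = {"Eicar-Test-Signature", "Win.Test.EICAR_HDB-1"}
REAL_ALERT_ADDRESS = "alerts@example.com"
DUMMY_ALERT_ADDRESS = "eicar-test@example.com"


def alert_target(signature_name, file_bytes):
    """Pick the recipient and Notify reference for a virus alert.

    Returns (email_address, reference). Only a file identified as EICAR is
    diverted to the dummy address, with a reference derived from its hash.
    """
    if signature_name in EICAR_SIGNATURES:
        digest = hashlib.sha256(file_bytes).hexdigest()
        return DUMMY_ALERT_ADDRESS, f"eicar-found-{digest}"
    # Any other detection is treated as a genuine alert.
    return REAL_ALERT_ADDRESS, None
```

A test can compute the same hash for the file it uploaded and then look for a Notify email carrying that reference, which ties the alert to a specific test run.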