Friday, December 25, 2020

Build your own COVID-19 Data Analytics Leveraging Open Source - Serverless Fn & Autonomous Database

Firstly, a huge thanks to all the frontline workers, healthcare professionals, technology contributions, non-profit organizations and everyone else who are fighting the pandemic everyday amidst the risks involved and more importantly the sacrifices - including the most important thing that can never be retrieved - TIME.

As Oracle Cloud continues to aid COVID-19 vaccine research & trials and also help manage the COVID-19 vaccine distribution program, I came up with a simple (humble) contribution that helps analyze and report COVID-19 data.

Although there are several dashboards and reports readily available on the web, they cater to a specific region, locale & metric. For example, what if? one would like to understand the total COVID-19 cases reported in Belarus on/up-to a specific date sliced by total tests conducted, new cases that were tested positive, hospitalization ratio per million against a median age group?

Data is publicly available but making sense of it is the key. In this article, we will see how we can build our own COVID-19 analytics - leveraging simple & popular open source tools and technology with the help of public cloud (Eg. Autonomous Database in Oracle Cloud - that is auto-managed, auto-tuned & auto-patched) [PS: You can also replace Autonomous DB with MySQL]. What's more? - we can potentially achieve this for free with the Always Free cloud instance.

Let's quickly take a look at the high-level solution architecture;


Each component of this architecture is designed to be loosely coupled and can be replaced or further enhanced for functionality/convenience. For instance, I have leveraged Apache open-source serverless Fn project that natively runs on Oracle Cloud - this can be replaced with a Node.js or java code running on K8s-Docker container. Similarly, the Autonomous Database can be replaced with a MySQL DB 

Let's take a look at the key components of this solution;

1) Source dataset [Courtesy: Our World in Data]

"Our World in Data" offers COVID-19 data in a variety of data formats and most importantly offers daily updates - so we can keep our analytics up-to-date.

In our case, we get the dataset in csv format here

2) Python script deployed on Oracle Fn

I have leveraged the Oracle Cloud Functions (based on Apache Fn) serverless platform to deploy the simple python script to download the COVID-19 dataset into an object storage bucket.

The choice of Oracle Cloud Functions is particularly helpful in this case because I don't have to manage any infrastructure or deal with packaging the docker container and version control them. It lets me focus only on the business logic. Also, it supports a variety of programming languages natively including Python, Go & Java. Most importantly, it has built-in security and offers out of the box support for event notifications & triggers and ability to expose functions as APIs.

Pre-Req: 

Create a dynamic group for Oracle Functions and ensure you have a policy defined in your compartment/tenancy for Oracle Functions the ability to access / read csv in the object storage bucket

Instances that meet the criteria defined by any of these rules will be included in the dynamic group.

ALL {resource.type = 'fnfunc', resource.compartment.id = 'ocid1.compartment.oc1..abcdefgxyz'}

Allow dynamic-group FnDynamicGroup to manage objects in compartment sathya.ag

Let's create an Oracle Functions application;

Oracle Cloud -> Developer Services -> Functions -> Create Application & give it a name

Applications lets us group several Oracle Functions. To create a serverless function;

For quick setup, you can leverage the Cloud Shell under the Getting Started instructions to deploy the following python code. Oracle functions platform packages the code as a docker container, uploads the docker image to the default Oracle Cloud docker registry and automatically deploys it as a serverless function with an invoke endpoint.

import io
import json
import oci
import csv
import requests
import logging
import os
import urllib.request

from fdk import response

def progress_callback(bytes_uploaded):
    print("{} additional bytes uploaded".format(bytes_uploaded))

def handler(ctx, data: io.BytesIO=None):
    logging.getLogger().info("Got incoming request")
    signer = oci.auth.signers.get_resource_principals_signer()
    object_name = bucket_name = namespace = ""
    try:
        cfg = ctx.Config()
        input_bucket = cfg["input-bucket"]
        processed_bucket = cfg["processed-bucket"]
        input_csv = cfg["input-csv"]
        object_name = cfg["object-name"]
    except Exception as e:
        print('Missing function parameters: bucket_name', flush=True)
        raise
    logging.getLogger().info("before calling load data {0} {1} {2}".format(input_bucket,
 input_csv, object_name))
    
    
    logging.getLogger().info("download start!")
    filename, headers = urllib.request.urlretrieve(input_csv, filename="/tmp/covid.csv")
    logging.getLogger().info("download complete!")
    
    load_data(signer, namespace, input_bucket, filename, object_name)
    #move_object(signer, namespace, input_bucket, processed_bucket, object_name)

    return response.Response(
        ctx, 
        response_data=json.dumps({"status": "Success"}),
        headers={"Content-Type": "application/json"}
    )
    
def load_data(signer, namespace, bucket_name, input_csv, object_name):
    logging.getLogger().info("inside load data function {0} {1} {2}".format(signer, 
namespace, bucket_name))
    client = oci.object_storage.ObjectStorageClient(config={}, signer=signer)
    try:
        print("INFO - About to read object {0} from local folder and upload to bucket {1}
...".format(object_name, bucket_name), flush=True)
        namespace = client.get_namespace().data
        
        # Use UploadManager to do multi-part upload of file, with 3 Parallel uploads
        logging.getLogger().info("before calling uploadmanager")
        upload_manager = oci.object_storage.UploadManager(client, 
allow_parallel_uploads=True, parallel_process_count=3)
        response = upload_manager.upload_file(namespace, bucket_name, object_name, 
input_csv, progress_callback=progress_callback)
        logging.getLogger().info("response status {0}".format(response.status))
        if (response.status == 200):
            
            message = "Successfully  uploaded %s in bucket %s." % (object_name, 
bucket_name)
            return True

    except Exception as e:
        logging.getLogger().info("exception message {0}".format(e))
        message = " Image upload Failed  in bucket %s. " % bucket_name
        if "oci" in e.__class__.__module__:
            if hasattr(e, 'message'):
                message = message + e.message
            else:
                message = message + repr(e)
                print(message)
        return False

In the configuration section, provide the key value pairs that can be dynamically processed by Oracle Functions at invoke.

3) COVID Dataset in Object Storage

Let's verify if the COVID-19 dataset is successfully downloaded into our object storage bucket.

4) Loading COVID-19 dataset into Autonomous Database

Since we are leveraging Oracle Cloud Autonomous DB, we can leverage the OOTB cloud SQL package to load data from an external object storage bucket. Another variant of this approach could be to leverage the csv data in object storage as an external table and ADB lets you query on the csv data directly.

Again, the choice for Autonomous Database helps us focus only on loading and querying the dataset and not have to worry about the underlying infrastructure, patching & maintenance. I like to think of Autonomous DB as a "serverless" database.

Execute the DBMS_CLOUD.COPY_DATA procedure to load data from object storage into ADB. [Ensure you have the base table created to hold the COVID-19 dataset]

BEGIN
 DBMS_CLOUD.COPY_DATA(
    table_name =>'COVID',
    credential_name =>'JSON_CRED_AG',
    file_uri_list =>'YOUR OBJECT STORAGE BUCKET URI',
    format => json_object('type' value 'CSV', 'ignoremissingcolumns' value 'true', 
'delimiter' value ',', 'blankasnull' value 'true', 'skipheaders' value '1', 'dateformat' value 'YYYY-MM-DD')
 );
END;
/

5) Analytics Dashboard

Now that we have our data in a database, we can leverage any analytics / reporting engine to make sense of the data and generate dynamic reports. In our case, we leverage Oracle Analytics and/or Oracle Data Visualization.