Chicago Analytics Microservices, v1: A data engineering solution built with Go, Kubernetes, and AWS
Updated: Dec 31, 2021
GitHub Project: https://github.com/agentdanger/chicago-data-public
This post demonstrates a microservices deployment written in Go that uses Kubernetes, Docker, and GitHub, with PostgreSQL serving as the data lake.
So let's dive in!
As data engineers, our objective is to build end-to-end solutions that capture raw data, clean it, and prepare it for data scientists and analysts who will add value by applying algorithms, data visualization and statistical processes to uncover insight.
This application is therefore designed strictly for data delivery; it does not cover the value that analysts, algorithms, or tools add later in the process. Our solution will assist journalists, students, Chicago analysts, and data scientists as they interact with the city's data and produce their own insights.
The architecture diagram at the end of this document describes my initial design. We’ll leverage microservices to ingest and clean our data before storing it in our database. On the “front end,” we’ll have a series of microservices that users can call to request data in a predefined format. That predefined format can evolve over time based on user feedback and/or new requirements.
This architecture is useful because I don’t need to give analysts direct access to a database. I can make the data available to as many people as possible, so more people can use, analyze, and learn from the City of Chicago's open data.
This layer of abstraction is also useful in commercial applications: it demonstrates a way to deliver clean, useful reports to end users without provisioning direct access to databases or a company's internal infrastructure.
PostgreSQL in a Kubernetes Architecture
Our first challenge when working with a Kubernetes (K8s) cluster is the persistence of the microservices we're developing. K8s' container-based application architecture gives you great flexibility in spinning up resources to complete the work associated with your data engineering service. Many applications only need to run a job on demand, so that model works well. A database is different: you can't simply start and stop it, because it needs to be available at all times so that all of your services can store and access data.
K8s solves this by allowing you to create a persistent volume (PV) and a persistent volume claim (PVC) whose storage remains available within the cluster even when pods restart. The PV and PVC can be allocated to our PostgreSQL database so that its data persists and stays available to all of our other microservices.
PostgreSQL Persistent Volume YAML file:
PostgreSQL Persistent Volume Claim YAML file:
Deployment and Service for PostgreSQL
Next, we need to create YAML files that deploy our PostgreSQL database (so that it will automatically restart if/when any of our nodes fail) and provide an internal service that allows other applications running in our cluster to access the database. This is accomplished with the Deployment and Service APIs in K8s.
PostgreSQL Deployment YAML file:
PostgreSQL Service YAML file:
PostgreSQL Secret YAML file:
The secret file contains passwords that our PostgreSQL database and other applications can use. Note that, by default, Kubernetes stores Secret values base64-encoded rather than encrypted, so encryption at rest and access controls need to be configured separately. The values are only exposed to the containers in the cluster that we provision them to. The following example shows the structure of a properly formatted secrets file you can use as a template for your project.
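On the consuming side, a Go service needs to turn those secret values into a database connection string. The following is a minimal sketch, assuming the secret is exposed to the container as environment variables; the variable names (`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_HOST`, `POSTGRES_DB`) and the helper `buildDSN` are hypothetical and not taken from the project.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// buildDSN assembles a PostgreSQL connection URL from environment values.
// The variable names are assumptions for illustration; use whatever keys
// your own Secret and Deployment actually define.
func buildDSN(getenv func(string) string) (string, error) {
	user := getenv("POSTGRES_USER")
	password := getenv("POSTGRES_PASSWORD")
	host := getenv("POSTGRES_HOST") // e.g. the internal Service name and port
	dbname := getenv("POSTGRES_DB")
	if user == "" || password == "" || host == "" || dbname == "" {
		return "", fmt.Errorf("missing one or more POSTGRES_* environment variables")
	}
	u := url.URL{
		Scheme: "postgres",
		User:   url.UserPassword(user, password), // escapes special characters
		Host:   host,
		Path:   "/" + dbname,
	}
	return u.String(), nil
}

func main() {
	if _, err := buildDSN(os.Getenv); err != nil {
		fmt.Println("configuration error:", err)
		return
	}
	// In a real service the result would be passed to sql.Open together
	// with a PostgreSQL driver; it is deliberately not printed here.
	fmt.Println("connection string built")
}
```

Passing `os.Getenv` as a function keeps the helper easy to test and keeps the password out of the source code and the Docker image.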
Deploying a PostgreSQL database in Kubernetes:
To deploy the database in a Kubernetes cluster, we apply the YAML files with kubectl in the following order:
kubectl apply -f <name_of_file>
example-secrets-file.yaml (secrets must be in place before deployment)
postgres-db-pv.yaml (persistent volumes are claimed before deployment)
postgres-db-deployment.yaml (the database is created in this step)
postgres-db-service.yaml (services are deployed after the database is created)
City of Chicago Open Data Sources
In 2012, then-Mayor Rahm Emanuel signed an executive order that paved the way for the open data sources now available from the City of Chicago. Our goal with this project is to process this raw data further so that data scientists and analysts can use it out of the box.
For volume 1 of this project, I've hand-selected eight data sources that I think are useful for further analysis. I'll build data ingestion and ETL services designed around these eight sources:
City of Chicago 311 Service Requests
City of Chicago Business Licenses
City of Chicago Crime 2001-present
City of Chicago Taxi Trips
City of Chicago Traffic Crashes
City of Chicago Transportation Network Provider Trips (Uber, Lyft, etc.)
Chicago Transit Authority "L" Station Entries
Chicago Transit Authority List of "L" Stops
Let's see what one of these services (Chicago Crime 2001-present) looks like in Go, and how it is deployed in Kubernetes.
Crime Data Ingest Microservice (API Call Service)
The purpose of the ingest microservice is to call the City of Chicago's APIs and load the returned data into our PostgreSQL database. Once the data is in our database, we can process it further and deliver it to other microservices within our architecture.
In my example API call service, we are calling the City of Chicago's API and expecting it to return a JSON document. That JSON document is not in the format we need to store in PostgreSQL, so we'll transform it on the fly and then write that data to our database. Let's break this down into key steps.
Go's Struct - Adding structure to the JSON we collect from API services.
The first hurdle with our API call is that the data is returned as JSON. JSON is a semi-structured format, which is convenient for computers but less useful for humans and analysts. Since analysts will want more structure (likely a table of rows and columns), we want to transform the JSON from our API call into a structured format, namely a table in PostgreSQL.
We can accomplish this in Go by using structs. A Go struct is a way for data engineers to define a collection of fields they want to keep as a single unit. By defining a struct, we get a temporary, structured container for the JSON data we want to keep before writing it to our database. Each field declared in our struct represents a column in our final table, and each JSON object represents a row in our data. We process each JSON object by storing it in the struct and then writing that struct as a row in our PostgreSQL database.
Notice that I use pointers for fields that may contain NULL values. This highlights the need to perform exploratory data analysis (EDA) on our data sources to understand how best to structure our API calling service.
Crime data struct example:
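As an illustration of the pattern (not the project's actual struct), here is a minimal sketch that unmarshals a small JSON array into a struct with pointer fields. The struct name `CrimeRecord`, its fields, the JSON keys, and the sample records are all assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CrimeRecord is an illustrative struct; the fields and JSON keys are
// assumptions, not the project's real schema.
type CrimeRecord struct {
	ID          string  `json:"id"`
	CaseNumber  string  `json:"case_number"`
	PrimaryType string  `json:"primary_type"`
	Arrest      bool    `json:"arrest"`
	Latitude    *string `json:"latitude"`  // nil when the key is missing or null
	Longitude   *string `json:"longitude"` // nil when the key is missing or null
}

// parseCrimes converts a JSON array of objects into a slice of structs.
func parseCrimes(body []byte) ([]CrimeRecord, error) {
	var records []CrimeRecord
	if err := json.Unmarshal(body, &records); err != nil {
		return nil, fmt.Errorf("unmarshal crimes: %w", err)
	}
	return records, nil
}

func main() {
	// Made-up sample data: the second record has no coordinates.
	sample := []byte(`[
		{"id":"1","case_number":"AA000001","primary_type":"THEFT","arrest":false,"latitude":"41.88","longitude":"-87.63"},
		{"id":"2","case_number":"AA000002","primary_type":"BATTERY","arrest":true}
	]`)

	records, err := parseCrimes(sample)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, r := range records {
		if r.Latitude == nil || r.Longitude == nil {
			fmt.Printf("%s: no coordinates, would be stored as NULL\n", r.ID)
			continue
		}
		fmt.Printf("%s: %s, %s\n", r.ID, *r.Latitude, *r.Longitude)
	}
}
```

With a plain `string` field, a missing value would silently become an empty string; the pointer keeps "missing" distinguishable from "empty" so it can be written to the database as NULL.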
Writing to our PostgreSQL database.
Once we have a struct defined, we can make a call to our API and return the response. In Go, we unmarshal that response into our defined struct and write each record into our database. The complete process is defined below.
Note: My final script also includes the table definition for the first run of this script.
Full script, below:
Deploying the service to Kubernetes as a CronJob
Now that we have a Go script that can call the City of Chicago's crime API, we want to run this service every few minutes to check for new data and store it in our database. This is accomplished in Kubernetes by creating a Docker container that can run our script, and scheduling a CronJob to run our Docker container every few minutes.
Docker: creating a self-contained application that can be run anywhere.
Before we deploy our service as a CronJob in Kubernetes, we need to create a Docker image. The Docker image is a self-contained application that Kubernetes can spin up and run. After completion, Kubernetes can release those computing resources and allocate them to other services in our cluster. Let's briefly touch on how we created and deployed our crime service as a Docker image.
Our image is created with instructions embedded in a Dockerfile. The image, which is private, is called usfinthere/crime_data_service. Docker Hub access is provisioned within my Kubernetes cluster using this process: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
Our Docker image is pulled by Kubernetes within our CronJob and run as a container on the cadence we specify. In this case, the container runs every 4 minutes.
Crime Data Reporting Microservices
Now that data is being extracted, transformed, and loaded into our PostgreSQL database every 4 minutes, we want to create services that add value to that data. We then want to deliver the final result to analysts and data scientists in a predefined format.
Microservice 1: Creating useful new features in the Chicago Data.
Our API services are not useful unless we create new features and reports that cater to analysts and data scientists. For example, the original Chicago Crime data doesn't include the ZIP code as a feature, but does contain the latitude and longitude of each block that a crime was committed on.
With the latitude and longitude, we can leverage a microservice to process the data and add the ZIP code to the original data. I won't discuss the details of that microservice in this blog, but I want to point out that new value is being added to the original data, creating a unique data set that analysts will find useful. That service was built in Python, is also deployed within our Kubernetes cluster, and runs weekly.
Microservice 2: Crime Data Reporting API.
Finally, we need to deploy a service that can accept a request from an analyst and deliver data in a predefined format. The reporting API service handles this process. First, we create a useful view of our data; then we deploy a service that listens for requests and returns data that is processed through a Go struct and converted to JSON.
Full details can be found here, but I highlight how to create views in Go below.
Highlight 1: View creation with Go
Conveniently, we can write SQL and execute those scripts in Go. In our application, we can build the view that presents data in a useful way. We can create several different views of the data and return data in that format for analysts.
Here's an example of a single view created to represent crime details:
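For readers without the original listing at hand, the general pattern of running view-creation SQL from Go can be sketched like this. The view name `crime_details`, the `crimes` table, and every column are assumptions made for illustration, as are the `execer` interface and the in-memory recorder that stands in for a real `*sql.DB`.

```go
package main

import (
	"database/sql"
	"fmt"
)

// The view, table, and column names here are hypothetical.
const createCrimeDetailsView = `
CREATE OR REPLACE VIEW crime_details AS
SELECT case_number, primary_type, zip_code
FROM crimes
WHERE zip_code IS NOT NULL`

// execer is the subset of *sql.DB that is needed to run a statement.
type execer interface {
	Exec(query string, args ...interface{}) (sql.Result, error)
}

// createView runs the DDL statement; CREATE OR REPLACE makes it safe to
// call each time the service starts.
func createView(db execer) error {
	if _, err := db.Exec(createCrimeDetailsView); err != nil {
		return fmt.Errorf("create crime_details view: %w", err)
	}
	return nil
}

// recorder stands in for a database so the sketch runs anywhere.
type recorder struct{ queries []string }

func (r *recorder) Exec(query string, args ...interface{}) (sql.Result, error) {
	r.queries = append(r.queries, query)
	return nil, nil
}

func main() {
	rec := &recorder{}
	if err := createView(rec); err != nil {
		fmt.Println("view creation failed:", err)
		return
	}
	fmt.Printf("executed %d statement(s)\n", len(rec.queries))
}
```

Keeping the SQL in a named constant makes each view easy to review on its own, and additional views can follow the same shape.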
Deploying the API in Kubernetes
Deploying the API Service in Kubernetes
Once our service is live, analysts can pull data in JSON format and use it to perform analysis. Below is an example of how to use the crime data service and the types of reports that are now available.