MachinaNova 2.0: News Recommendations for All!
Earlier this year, C-K's data and engineering lead, Mo Omer, asked me to contribute to a technology meetup that helps employees see how data, technology and innovation can push them to think in new ways.
That request made me realize my previous projects deserved new life in 2020, especially the ones showcasing the power of machine learning in consumer experiences.
That led me back to MachinaNova, a solution that reduced the effort required to find news that interested me. I knew it could be better, faster and more useful to more people. Inspired by this call to action, I got to work upgrading the original news recommendation engine.
By upgrading to version 2.0, I hoped to achieve two things:
Expand the solution to enable friends and family to find news that interests them.
Reduce the cost of hosting this solution to a sustainable amount.
Let's be real: my inexperience in cloud hosting led to costly mistakes. MachinaNova on AWS ran at nearly $90 per month, way too expensive for me to keep up and running. With a little research, I found a new setup for the upgraded MachinaNova 2.0 that costs less than $5 per month.
Let's dive into the re-imagined, slimmer, cheaper and considerably more handsome version of MachinaNova: version 2.0.
News aggregation is an awful experience.
A little bit of background on the original inspiration first... My daily news routine is a sight to behold. It consists of grabbing breakfast and coffee, then sitting down to make my way through a jungle of digital news and content. My goal: emerge on the other side of the news jungle with a tidbit of information I didn't already have, and hopefully some of the coffee and bagel lands in my mouth.
In reality, I click dozens of bookmarks, scan headlines and eventually give up, surrendering to click fatigue. Every morning, I end up on the tried-and-true Harvard Business Review... because they're just great every time.
I needed a solution that could recommend the articles I'm interested in and surface content to further my career.
MachinaNova News 2.0 | Solution Overview
MachinaNova 2.0 is an application that finds news the user is interested in and presents it, sorted from most to least interesting, multiple times per day. In my case, MachinaNova predicts the likelihood that I will "Like" an article by reading it with Natural Language Processing (NLP). Along with the prediction, it produces a confidence score that tells users how confident it is that they are going to like the article. Each day, it presents the top 28 "Liked" articles sorted by confidence in descending order. To prevent an echo chamber and expand users' horizons, it also includes six of the latest articles chosen at random, preserving some wide coverage of the day's events.
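The daily selection described above (top 28 by confidence, plus six random recent articles) can be sketched as follows. The field names and data structures here are hypothetical stand-ins for the real database records:

```python
import random

def daily_selection(scored_articles, latest_articles, n_top=28, n_random=6):
    """Build the day's feed: top-confidence "Liked" articles plus a few
    random recent ones to avoid an echo chamber."""
    # Keep only articles the model predicts the user will "Like" (label == 1),
    # sorted by the model's confidence, highest first.
    liked = sorted(
        (a for a in scored_articles if a["label"] == 1),
        key=lambda a: a["confidence"],
        reverse=True,
    )
    picks = liked[:n_top]

    # Add a handful of random recent articles not already selected.
    chosen_links = {a["link"] for a in picks}
    pool = [a for a in latest_articles if a["link"] not in chosen_links]
    picks += random.sample(pool, min(n_random, len(pool)))
    return picks
```

The random additions are drawn after the confidence-sorted picks so the "wide coverage" articles never crowd out a high-confidence recommendation.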
So, how does the solution know what users want to read?
It's magic... A magical, news recommendation system.
MachinaNova is a web app with a built-in classification algorithm. The classifier recognizes articles that the user likes to read and presents them, and the algorithm refreshes six times daily with the latest and greatest news of the day.
I'll break down the solution into a few sections so it's easy to understand; but at a high level, MachinaNova is a recommendation system built on Linux, Apache, Django and PostgreSQL. It learns what you like because you tell it by "training" it. Using this training, it pre-reads the articles and presents what is likely to be an article you'll like, based on the words and subjects in the article's title and summary.
First stop: The Article Database
The PostgreSQL database currently contains nearly 50,000 articles (as of July 2020), pulled via a Python script that uses freely available RSS feeds to grab each article's link, headline, journalist, date of publication and a brief summary. Each article is cleaned and transformed to better fit downstream processing, while retaining the core pieces, and then loaded into the database. This occurs six times per day, with key updates occurring in the morning and evening, prime time for the U.S. news cycle.
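The six-times-daily refresh is driven by the system scheduler. A minimal crontab sketch might look like the following; the interpreter and script paths are hypothetical:

```shell
# Run the RSS import every four hours (six times per day).
# The script path below is a placeholder, not the real deployment path.
0 0,4,8,12,16,20 * * * /usr/bin/python3 /opt/machinanova/import_feeds.py
```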
The library I use to extract data from the RSS feeds is the more-than-capable feedparser, built by Kurt McKee. See the latest on the library at GitHub: https://github.com/kurtmckee/feedparser
Here's what a typical article looks like in the database:
The article database pulls from the following RSS feeds today (July 2020), and I've built a way to add new RSS feeds via the Django web application as I find them.
Journal of the American Medical Association (for a friend in medicine)
Bay News 9 (for a friend in Tampa Bay)
Wall Street Journal
Digiday | Advertising Top News
Dzone | Data Zone
Adexchanger | Advertising News
Harvard Business Review
MIT Technology Review
MIT Sloan Management Review
Verge Marketing and Technology
The Huffington Post
The New York Times
National Public Radio
The Trainer App
With articles in the database, a user needs the ability to tell the recommendation engine about their preferences. Typically, companies like Google do this passively using web behavior (such as click-throughs or search queries), but my web app is not coded to capture that information today.
Ain't nobody got time for that...
Instead of passive feedback, this part of the Django application is a simple interface that shows a random untrained article on the user's screen. The user is presented with two buttons, "more like this" and "less like this." With a simple press of a button, a 1 or 0 is stored along with the user ID so that a classification algorithm can process those ratings to learn individual user preferences. This easy solution lets a user interact securely with the database, giving the MachinaNova scoring algorithm the data it needs to make predictions.
After a few glasses of wine, and a few nights of scoring, a user can add hundreds of articles; or they can do a little at a time. Whatever fuels their wonderful training experience!
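The trainer's storage boils down to one table of (user, article, rating) rows. In production this lives in PostgreSQL behind Django, but the idea can be sketched with the standard library's sqlite3; the table and column names here are hypothetical:

```python
import sqlite3

# An in-memory database stands in for the PostgreSQL article store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE rating (
        user_id    INTEGER,
        article_id INTEGER,
        score      INTEGER CHECK (score IN (0, 1)),  -- 1 = "more like this"
        PRIMARY KEY (user_id, article_id)
    )
""")

def record_rating(user_id, article_id, liked):
    """Store the button press: 1 for "more like this", 0 for "less"."""
    conn.execute(
        "INSERT OR REPLACE INTO rating VALUES (?, ?, ?)",
        (user_id, article_id, 1 if liked else 0),
    )

# A user rates a couple of articles.
record_rating(1, 101, True)
record_rating(1, 102, False)

# The scoring pipeline later pulls each user's labels for training.
labels = conn.execute(
    "SELECT article_id, score FROM rating WHERE user_id = ?", (1,)
).fetchall()
print(labels)
```

The INSERT OR REPLACE means re-rating an article simply overwrites the old label, which keeps a user's training set consistent.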
The Scoring Algorithm
So, how does the algorithm understand what kind of news I like?
To understand the content of an article, MachinaNova reads it using Natural Language Processing (NLP). NLP is similar to what I do during my morning routine: I scan the source, read the headline and maybe a summary, move on if I don't like it, and maybe click to read if I do. The only difference is that MachinaNova can read and score thousands of articles in seconds.
But that's just the surface of what NLP does for us in this process...
With this version of the blog, I thought I'd dive a little deeper into the recommendation algorithm. The actual algorithm is a simple "bag of words" model using the feed source, headline and summary of the article (see the image above for a summary of the features used). Each article is classified according to the user's trained ratings (the 0s and 1s from the trainer).
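A "bag of words" simply counts how often each token appears across the text the model sees (source, headline, summary). In the real pipeline the tokens come from spaCy; this stdlib-only sketch, with a plain regex tokenizer standing in for spaCy, shows the counting itself:

```python
import re
from collections import Counter

def bag_of_words(source, headline, summary):
    """Count token occurrences across the features the model reads.
    A simple regex tokenizer stands in for spaCy's pipeline here."""
    text = " ".join([source, headline, summary]).lower()
    tokens = re.findall(r"[a-z']+", text)
    return Counter(tokens)

bag = bag_of_words(
    "Harvard Business Review",
    "How to run better meetings",
    "Better meetings start with a better agenda.",
)
print(bag.most_common(3))
```

These per-article counts become the feature vector the classifier trains on.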
The algorithm of choice is a support vector machine. I tried a variety of kernels but ended up with a linear kernel due to performance. The support vector machine from scikit-learn has a method called predict_proba that provides the probability of the two outcomes, "more like this" and "less like this." This feature is extremely useful for sorting how likely a user is to like each day's articles.
My algorithm works like this:
1. Extract all of a user's scored articles. If a user has not scored at least 25 articles, scoring for that user is skipped for the day.
2. Create a "bag of words" using spaCy's natural language model. Each word, part of speech and entity is counted by the "bag of words" process, creating a dictionary of features from each user's scored articles. (Note: spaCy's entity recognition is pretty good at recognizing named items like Apple the company, in the context of a news article, rather than apple the fruit.)
3. Using a support vector machine, learn how each user scores articles and store that trained model to predict how unscored articles will be classified.
4. Score the last 350 articles in the database with the trained SVM. Each new article gets a classification (1 for "more like this," 0 for "less like this") along with the probability of that score.
5. Repeat steps 1 through 4 for each user in the database. This process is fairly fast but has not been optimized as of today (July 2020).
6. Schedule these Python scripts in our Linux environment to run up to six times per day.
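The training-and-scoring core of the steps above can be sketched end to end with scikit-learn. Here CountVectorizer stands in for the spaCy-based bag of words, and the toy articles and labels are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Step 2: bag of words over the text the user has rated.
# (CountVectorizer stands in for spaCy's token/POS/entity counts.)
rated_texts = [
    "machine learning pipeline for news recommendation",
    "data science model training tips",
    "celebrity gossip and red carpet fashion",
    "reality tv drama recap",
] * 7  # repeated to mimic the 25-article minimum per user
labels = [1, 1, 0, 0] * 7  # 1 = "more like this", 0 = "less like this"

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(rated_texts)

# Step 3: linear-kernel SVM; probability=True enables predict_proba.
clf = SVC(kernel="linear", probability=True)
clf.fit(X, labels)

# Step 4: score unrated articles and sort by confidence in the "Like" class.
new_texts = [
    "new machine learning model for recommendation systems",
    "red carpet fashion highlights",
]
probs = clf.predict_proba(vectorizer.transform(new_texts))[:, 1]
ranked = sorted(zip(new_texts, probs), key=lambda p: p[1], reverse=True)
for text, p in ranked:
    print(f"{p:.2f}  {text}")
```

In the real system this train-and-score loop runs once per user with at least 25 ratings, against the last 350 articles in the database.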
There are a TON of possible improvements to this algorithm. A team of developers could spend a year making it better...
One of the better suggestions is to create a two-stage ensemble method that improves on "bag of words" by layering in a data-reduction step to identify topics rather than words. I may take this up in a future update, as I believe it would improve the model's accuracy. As of today, I'm fairly satisfied with the article recommendations I receive from version 2.0. My other users tend to disagree, though, so this improvement could raise their satisfaction.
The Presentation Layer
At this point, we have a PostgreSQL database powering a web framework that allows our users to train an algorithm to predict news articles they'll want to read. We also have a series of production RSS import scripts followed by a machine learning training and prediction pipeline that keeps users up to date on the latest and greatest news they're interested in.
Now it's time to present the information to the end user, with the experience updated six times daily and thousands of news articles scored each day.
I love Django and Materialize CSS!
During this project, I fell in love with a super easy-to-use framework that lets me build beautiful web applications in my favorite language, Python. That framework, Django, creates the database, gives me security features out of the box, and lets me bring in templating and CSS to make my application sharp, up to date and integrated with the data I'm collecting.
So, the solution I built is ready and available on my website at machinanova.news.
Multi-tenancy and other Considerations
I'm super proud to complete version 2.0 of this project. My skills have grown considerably, and this new version is one I can share with my friends and family. If you're interested in using MachinaNova, please send me a note or ask for access in the comments section below. I'm pretty confident you'll find it a timesaver!
The code name of the original project was Project Quito. Unfortunately, Quito came off of my travel list due to the Covid-19 pandemic. Next year, I'll get to the Galapagos.