MachinaNova: A Personal News Recommendation Engine
Updated: Nov 21, 2019
UPDATE: This story was written in 2018. Since that time, I have turned off MachinaNova in an attempt to save on compute and storage bills on AWS. It was a fun project, and still lives on as a local application on my home computer.
Original article follows...
A few months ago, I purchased a subscription to a tool called Paper.li, an application I could add to my website. Its promise was to "collect relevant content and deliver it where ever you want" - which is exactly what I need during my work morning.
Let me explain... I love my morning routine. It consists of grabbing breakfast, coffee and sitting down to make my way through a jungle of digital news and content. My goal: emerge on the other side of the news jungle with a tidbit of information I didn't already have; and hopefully some of the coffee and bagel I plan to eat lands in my mouth.
To be real, I click literally dozens of bookmarks, scan headlines and eventually give up, surrendering to click fatigue. Every morning, I end up on the tried and true Harvard Business Review... because they're just great every time.
If Paper.li does what they say they can do, it will cut time off of my routine, and I'll finally be able to enjoy my bagel instead of shoving it in my face. I NEEDED THIS SOLUTION!
The news aggregation reality: It sucks bagels.
Frankly, the Paper.li solution was terrible. It was effective at recommending articles, but rarely were they interesting. Usually, the recommended articles were blog posts or tweets from folks promoting their personal projects, and rarely could I find any real content I could apply to my career.
It's not that Paper.li (and other news aggregators - looking at you, Flipboard, Medium and every other solution out there) was bad. It just wasn't personally tailored to content I wanted to read. Which is fine if you're not a perfectionist like me...
However, I am a perfectionist... and I have the skills to solve this problem.
It was time to build MachinaNova.
MachinaNova, the news machine | Solution Overview
MachinaNova is an application that finds news I'm interested in and presents it in a format I can read every day. It begs the question... How does it know what I want to read?
It's magic... A magical, news recommendation system.
Well, it's actually software I trained to recognize articles I like. I'll break down the solution into a few sections so it's easy to understand, but at a high level, MachinaNova is a recommendation system.
In my case, MachinaNova reads an article and predicts the likelihood that I will "Like" it. Along with the prediction, it also produces a confidence score that tells me how confident it is that I'm going to like the article. Each day, it presents the top 12 "Like" articles sorted by its confidence.
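As a rough illustration of that daily selection step, here is a minimal Python sketch. The `ScoredArticle` class, the `daily_picks` function and the sample data are hypothetical names made up for this example, not the actual MachinaNova code; the sketch only assumes each article carries a Like/Dislike prediction and a confidence score.

```python
from dataclasses import dataclass


@dataclass
class ScoredArticle:
    title: str
    predicted_like: bool  # the model's Like / Dislike prediction
    confidence: float     # how confident the model is in that prediction


def daily_picks(articles, limit=12):
    """Return the `limit` most confident "Like" predictions, best first."""
    liked = [a for a in articles if a.predicted_like]
    return sorted(liked, key=lambda a: a.confidence, reverse=True)[:limit]


# Made-up sample data, purely for illustration.
sample = [
    ScoredArticle("Intro to cloud budgets", True, 0.64),
    ScoredArticle("Celebrity gossip roundup", False, 0.88),
    ScoredArticle("Managing remote teams", True, 0.91),
]
picks = daily_picks(sample, limit=2)
```

Articles predicted as "Dislike" are dropped before sorting, so a very confident "Dislike" never crowds out a less confident "Like."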
The solution is relatively simple, and super helpful. I've been using it for a few weeks now - and I LOVE IT!
Thank you, MachinaNova! You actually know what I like.
To understand the content of the article, MachinaNova reads the article using Natural Language Processing. I'll get into what that is in a second, but think of it as similar to what I might do during my morning routine. I scan the source, read the headline and maybe a summary of the article. In my routine I move on to the next article if I don't like it and maybe click to read if I do like it. The only difference is, MachinaNova can read and score thousands of articles in seconds.
I'm jealous... I might read thousands of articles a year... The AI is taking over.
MachinaNova, the news machine | Gathering News
To build MachinaNova, the first thing I needed to do was grab articles I like. Fortunately, there's a pre-built way to easily capture news from all over the web: RSS feeds. RSS feeds are services that news providers offer to allow for quick distribution of their content. In an RSS feed, you can find things like the name of the article, the author, a link, pictures, etc.
To capture the valuable RSS data, I built a Python script that calls my favorite RSS feeds and stores the information in a database on my local machine named Ocean. (The name Ocean is because of its 3TB SSD hard drive.)
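For readers who want to picture what a capture script like this does, here is a minimal sketch using only the Python standard library. It is not the original script: the feed content is a made-up sample string instead of a live feed, the in-memory `sqlite3` database is only a stand-in for the real database, and the function names `parse_rss` and `store_articles` are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A tiny made-up RSS 2.0 document standing in for a real feed response.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
      <description>Short summary.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
      <description>Another summary.</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Turn RSS XML into a list of dicts, one per article."""
    root = ET.fromstring(xml_text)
    source = root.findtext("channel/title", default="")
    rows = []
    for item in root.iter("item"):
        rows.append({
            "source": source,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return rows


def store_articles(conn, rows):
    """Insert articles, skipping any link that is already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "link TEXT PRIMARY KEY, source TEXT, title TEXT, summary TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO articles (link, source, title, summary) "
        "VALUES (:link, :source, :title, :summary)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # assumption: stand-in for the real database
store_articles(conn, parse_rss(SAMPLE_RSS))
store_articles(conn, parse_rss(SAMPLE_RSS))  # a second run adds no duplicates
article_count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```

Using the article link as the primary key is one simple way to let the script run repeatedly against the same feeds without storing an article twice.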
Ocean has been happily capturing articles since June 2018. At the time of this article, my local database contains over 20,000 articles from sources like the Wall Street Journal, Chicago Tribune and Harvard Business Review - to name a few.
As part of the ultimate MachinaNova solution, I needed to replicate the Postgres database on Amazon Web Services. I also needed to replicate my Python script in AWS Lambda and use CloudWatch to trigger it, so that it grabs data from the RSS feeds and stores it in my Postgres database on Amazon RDS.
In parallel, Lambda also stores logos from the RSS feeds in Amazon S3 so the MachinaNova web application can present those images as the source to the end user.
My local Python script was retired in early October 2018 after I migrated the Python script and Postgres database to AWS.
MachinaNova, the news machine | Web Application
After building a way to capture news articles, I needed to tell my application what articles I like and don't like. On my local computer (Ocean), I built a Django application that reads the articles I've stored in Amazon RDS.
This Django application, which is not available to the public, is an interface that shows a random article I haven't scored yet on my computer screen. In the app, I can see two buttons: "Like" and "Dislike." I select "Like" if I like the article and "Dislike" if I don't. The app stores "0" for "Dislike" and "1" for "Like" in my database.
After a few glasses of wine, and a few nights of scoring, I had approximately 1,200 articles labeled with my preferences.
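The scoring loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses an in-memory `sqlite3` table instead of Django and Amazon RDS, and the table layout and the function names `next_unlabeled` and `record_label` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: stand-in for the real database
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, liked INTEGER)"
)
conn.executemany(
    "INSERT INTO articles (title, liked) VALUES (?, NULL)",
    [("Headline A",), ("Headline B",), ("Headline C",)],
)


def next_unlabeled(conn):
    """Pick one random article with no label yet, or None when all are labeled."""
    return conn.execute(
        "SELECT id, title FROM articles WHERE liked IS NULL "
        "ORDER BY RANDOM() LIMIT 1"
    ).fetchone()


def record_label(conn, article_id, liked):
    """Store 1 for "Like" and 0 for "Dislike"."""
    conn.execute(
        "UPDATE articles SET liked = ? WHERE id = ?",
        (1 if liked else 0, article_id),
    )
    conn.commit()


# Simulate clicking through every article, liking only "Headline A".
while (row := next_unlabeled(conn)) is not None:
    article_id, title = row
    record_label(conn, article_id, liked=(title == "Headline A"))

labels = dict(conn.execute("SELECT title, liked FROM articles"))
```

Leaving the label empty until a button is clicked is what makes it possible to ask the database for "an article I haven't scored yet."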
Now, we have the beginnings of a true data science project - machine learning!
MachinaNova, the news machine | Machine Learning
This part could be its own article. So, I'm going to keep it at a very high level. This is a family friendly blog post.
Remember earlier when I said that my solution can read the articles using Natural Language Processing? To do that, MachinaNova uses a library called spaCy to "read" each word (and punctuation). spaCy comes with a built-in understanding of what words can mean in the context of other nearby words (for example, "Apple" is a company, not a fruit, when accompanied by other business words).
Using spaCy and its NLP capabilities, I built a bag of words model that counts how often each word is used in an article, its headline and even its source. It then compares those words and word counts against the training data set, looking for words that are common in articles I've liked in the past.
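To make "bag of words" concrete, here is a minimal sketch. It swaps in a plain regular expression for tokenizing instead of a full NLP library, so that it runs with only the Python standard library, and the `overlap` function is a deliberately simple illustration of comparing word counts, not the scoring method MachinaNova actually uses. All names and sample text are made up.

```python
import re
from collections import Counter


def bag_of_words(source, headline, body):
    """Count how often each word appears across source, headline and body."""
    text = f"{source} {headline} {body}".lower()
    tokens = re.findall(r"[a-z0-9']+", text)  # simple stand-in tokenizer
    return Counter(tokens)


def overlap(bag, liked_bag):
    """Count the word occurrences an article shares with liked articles."""
    return sum((bag & liked_bag).values())


liked = bag_of_words(
    "Harvard Business Review",
    "Leading teams through change",
    "Managers who lead teams well",
)
candidate = bag_of_words("Example Daily", "How managers lead teams", "")
shared = overlap(candidate, liked)  # "managers", "lead" and "teams" are shared
```

Word order is thrown away entirely, which is why it is called a "bag": only which words appear, and how many times, survives.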
Essentially, MachinaNova looks for articles with words that are similar to articles I've liked before.
To score each article, I use an algorithm called a Support Vector Machine to represent those words in "N" dimensions using a kernel trick in order to draw a line between articles I "Like" and "Dislike." This worked okay at first, but needed to be optimized. To optimize the algorithm, I used other tricks like TF-IDF (term frequency-inverse document frequency) and even created some phrasing using n-grams.
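For the curious, the ingredients named above (TF-IDF, n-grams and a kernel Support Vector Machine) fit together roughly like this in scikit-learn. This is a sketch under assumptions, not the actual MachinaNova model: the choice of scikit-learn, the RBF kernel, the bigram range, the use of the decision-function margin as a "confidence" value, and the tiny made-up training set are all mine.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up training data: 1 = "Like", 0 = "Dislike".
train_texts = [
    "how great managers lead teams through change",
    "building a data science team that delivers",
    "machine learning strategy for business leaders",
    "celebrity couple spotted on vacation",
    "ten shocking red carpet outfits",
    "reality show finale recap and gossip",
]
train_labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(
    # TF-IDF weighting over single words and two-word phrases (n-grams).
    TfidfVectorizer(ngram_range=(1, 2)),
    # Kernel SVM; the RBF kernel here is an assumption for illustration.
    SVC(kernel="rbf"),
)
model.fit(train_texts, train_labels)

new_texts = [
    "what business leaders should know about machine learning",
    "red carpet gossip from the finale",
]
predictions = model.predict(new_texts)
# Signed distance from the decision boundary; larger positive values lean
# more strongly toward "Like". Treating it as confidence is an assumption.
margins = model.decision_function(new_texts)
```

With only six training examples the numbers mean very little; the point is the shape of the pipeline, where text goes in one end and a label plus a margin comes out the other.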
Still with me?
Once my machine learning algorithm was performing to expectations, I deployed it to AWS using Docker, Amazon ECR and Amazon ECS, with autoscaling computing resources provided by AWS Fargate. The machine learning algorithm runs daily at 5:45am using CloudWatch and cron scheduling.
MachinaNova, the news machine | User Interface and Presentation
At this point we have a database on Amazon Web Services being updated with thousands of scored news articles a day. Now it's time to present that information to the end user.
I love Django, Jinja2 and Materialize CSS!
During this project, I fell in love with super easy-to-use software that allows me to build beautiful web applications using my favorite language, Python. The solution, Django, creates a database, gives me some security features and allows me to bring in templating and CSS to make my application pretty, updated and integrated with the data I'm collecting.
So, the solution I built is ready and available on my website at courtneyperigo.com/news; but the dirty truth is the solution is actually securely served at newsml.machinanova.ninja, with the domain routed through Amazon's Route 53.
I'm so proud to have completed this project. It challenged me to learn new tools, forced me to learn AWS and absolutely pummels other news recommendation services.
Suck bagels, Paper.li!
The code name of this project was Project Quito. Quito because once I complete a project like this, I reward myself with a vacation. Looks like I need to brush up on my Spanish.