Stats for Baseball Fans: The Single Metric for Pitching is ERA.
It's baseball time again. Where I'm writing from, Chicago, the snow has started to melt and the Cubs are giving us hope in Spring Training in Arizona.
Of course it's time to dust off the stats records and take a look at baseball and statistics with a data scientist.
As I described in my earlier blog on offensive statistics, Moneyball statisticians look at a different set of metrics to assess the skill of a player. Pitching is no different. Metrics like "Runs Allowed per 9 Innings" and "Adjusted Wins Above Replacement" allow those analysts to understand a pitcher's true impact by controlling for a team's defensive capability or the game state when the pitcher enters the game.
Unfortunately, those metrics are not available on national TV broadcasts and you won't be able to calculate them while enjoying the game live. So what should the average fan focus on when assessing MLB pitchers? In this post, I explain the single metric to focus on, with a little help from statistics.
Why did I write this?
What's a baseball article doing on a data scientist's CV and career website? It's because I have merged something I grew up loving (baseball) with something I built my career on (data science).
I also grew up as a baseball card collector. My Dad spent his hard-earned cash getting me a pack or two after work every now and then to share with a son who played Little League and took a liking to the hobby.
The thing about baseball cards is they are filled with numbers: metrics that break down every little detail about how a player plays the game. I'd compare and contrast my favorite players and some of the most obvious stats would jump out. Frank Thomas, a huge baseball player from Georgia, would crush a ton of home runs and his cards would show that. Nolan Ryan, a Texas hurler, would have a ton of strikeouts listed on his cards.
Even someone who looks at these stats as a hobby has trouble knowing what to focus on to understand the best players in the game.
Today's blog is a follow-up to my previous blog looking at the best batters in the game, and I'll make it simple for even the most casual fans out there to answer this question:
What is the one statistic you need to focus on to identify the best pitcher on your team? That stat is the player's ERA, but let's go on a journey to understand why that is, using statistics.
TLDR: The player with the lowest ERA statistic on your favorite team is VERY LIKELY the best pitcher they have. Remember to stand up and cheer when you hear their entry music.
Pitching stats = information overload
Pitching stats are insane. Major League Baseball uses high-powered cameras to collect the velocity and spin of pitches thrown at all major league stadiums. The same system also measures where each player is on the field at all times.
As an analyst and data scientist, this data fascinates me. As a fan looking to just enjoy the game, it looks cool on TV, but there's no way to crunch all that data. Surely there is one metric I should pay attention to when I want to understand the best pitcher on the team I'm rooting for.
Now I do realize it's very easy to understand who the best pitcher is on your team. You can tell from where the coach puts them in the rotation, who is called on in closing situations, or who is the starter for a crucial matchup with a rival team. That's great and all, but how can I cut through the noise and just focus on the players that, pound for pound, prevent runs from being scored against my team?
Stats will help us find that pitcher.
The Problem with Understanding Baseball Stats: Players are Human.
As I covered in my previous blog on this topic, there's a problem with using players' individual performances to understand the value of a statistic.
Players are human. They take days off. They have bad days.
For our pitcher analysis, we need to once again control for human variability. Instead of analyzing pitchers, we will analyze teams' pitching capability to identify the metric to focus on.
We focus on team performance because, on average, each team should have a similar number of innings pitched and equal opportunity to have runs scored against them. Because team-level data is normalized and less affected by individual human variability, it is the ticket for understanding the importance of a baseball statistic.
Here's a visual of player performance versus the same metric at the team level. One shows a wide variety of individual player performance, with a large number of players scoring very few runs and some amazing players scoring over 2,000! The other shows that team performance tends to be normally distributed around an average of about 720 runs per team in a season.
Pitching = Not Letting the Other Team Score Runs.
Dictionary.com defines baseball in this way:
(Baseball is) a game of ball between two nine-player teams played usually for nine innings on a field that has as a focal point a diamond-shaped infield with a home plate and three other bases, 90 feet (27 meters) apart, forming a circuit that must be completed by a base runner in order to score, the central offensive action entailing hitting of a pitched ball with a wooden or metal bat and running of the bases, the winner being the team scoring the most runs. - Dictionary.com
The best pitcher will be the one that does the best job of preventing runs from being scored against their team. - Me
We care about runs because that is the single most important objective of a player on offense: to hit runners in or turn themselves into a run by reaching home plate. The pitcher's job is to prevent this from happening. For this analysis of pitchers, that means the target variable (y) is the number of runs scored against the pitcher's team (i.e., runs against). The goal of the analysis is to find the pitching statistic that is most correlated with runs against. Our hypothesis is that a pitcher's ability to prevent hits without walking players will signal a strong pitcher. How do we capture that in a single metric?
<spoiler> ERA is Most Correlated with Runs Against. The best pitcher on your team likely has the LOWEST ERA.</spoiler>
We set off on this analysis by calculating some statistics that you'll find projected on scoreboards throughout MLB stadiums. Potential metrics include earned run average (ERA), walks plus hits per inning pitched (WHIP), hits per 9 innings (H/9), and strikeout percentage (K%), as well as other common counts of what a pitcher dealt or gave up (K, BB, H, 2B, 3B, HR, etc.).
Some of these metrics we had to create from scratch and their definitions are shown below:
WHIP = (walks + hits) / innings pitched
K/BB Ratio = strikeouts / walks
ERA = earned runs / innings pitched * 9
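For anyone who wants to try these at home, here is a minimal sketch of the three definitions above in Python. The function names and the sample season line are made up for illustration; they are not from my actual analysis code.

```python
def whip(walks: int, hits: int, innings_pitched: float) -> float:
    """Walks plus hits per inning pitched."""
    return (walks + hits) / innings_pitched


def k_bb_ratio(strikeouts: int, walks: int) -> float:
    """Strikeouts per walk (undefined when walks is 0)."""
    return strikeouts / walks


def era(earned_runs: int, innings_pitched: float) -> float:
    """Earned runs allowed per 9 innings pitched."""
    return earned_runs / innings_pitched * 9


# A made-up season line: 200 innings, 150 hits, 50 walks,
# 200 strikeouts and 60 earned runs.
print(f"WHIP: {whip(50, 150, 200.0):.2f}")
print(f"K/BB: {k_bb_ratio(200, 50):.2f}")
print(f"ERA:  {era(60, 200.0):.2f}")
```

All three are simple ratios, which is part of why they show up on scoreboards: you can work them out from a box score.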
Earlier, I mentioned we normalized our data by using team seasonal totals rather than individual player totals. I further normalized by removing some outlier seasons. Below are full details of my team outlier removal:
Removed teams before 1970: Several key metrics weren't tracked prior to the 1970 season (including sacrifice flies, hit by pitch and others). We also know that rule changes make 1970 a good break point to normalize the pitching data. (more info here)
Removed team seasons where the number of games played was below 158: This removes seasons that were cut short by strikes and other schedule oddities. (Goodbye, 1994 season.)
Removed teams that do not play in the National or American Leagues: We don't care about the minor leagues or spring training leagues for this analysis.
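The three filters above can be sketched in a few lines of Python. This is only an illustration: the field names ("year", "games", "league"), the league codes and the sample rows are assumptions, not the layout of the data set I actually used.

```python
# Hypothetical league codes for the American and National Leagues.
MAJOR_LEAGUES = {"AL", "NL"}


def keep_team_season(row: dict) -> bool:
    """Return True if a team season survives all three outlier filters."""
    return (
        row["year"] >= 1970                  # drop seasons before 1970
        and row["games"] >= 158              # drop shortened seasons
        and row["league"] in MAJOR_LEAGUES   # drop other leagues
    )


# Made-up rows, one per filter plus one that passes.
team_seasons = [
    {"team": "A", "year": 1965, "games": 162, "league": "NL"},
    {"team": "B", "year": 1994, "games": 113, "league": "AL"},
    {"team": "C", "year": 2005, "games": 162, "league": "AL"},
    {"team": "D", "year": 1980, "games": 160, "league": "XX"},
]

clean = [row for row in team_seasons if keep_team_season(row)]
print([row["team"] for row in clean])  # ['C']
```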
Once we have clean data, we run the analysis that correlates common pitching metrics with runs scored against the team. The results show that ERA is the most important metric, with a correlation of 0.982 (an extremely strong correlation with runs against). The lower the ERA, the better the pitcher.
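To make the idea concrete, here is a small sketch of that kind of ranking, assuming a Pearson correlation (the post above doesn't depend on which correlation coefficient you pick, and I'm using Pearson here as the most common choice). The helper names and the toy numbers are invented for illustration and will not reproduce the 0.982 figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_metrics(metrics, runs_against):
    """Sort metrics by the absolute strength of their correlation."""
    scores = {name: pearson(vals, runs_against) for name, vals in metrics.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy team-season data: one value per team season.
runs_against = [600, 650, 700, 750, 800]
metrics = {
    "ERA": [3.4, 3.7, 4.0, 4.3, 4.6],
    "K%": [0.24, 0.20, 0.23, 0.19, 0.21],
}

ranked = rank_metrics(metrics, runs_against)
for name, r in ranked:
    print(f"{name}: {r:+.3f}")
```

Sorting by absolute value matters because some metrics (like K%) should move in the opposite direction from runs against, and a strong negative correlation is just as informative as a strong positive one.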
Best Pitchers Ever by ERA with *adjustment*
Now that we know that we should pay most attention to ERA, the next question I would ask is who are the best pitchers according to this metric? The answers are easy to calculate...
Actually, they're not.
If I look at the lowest ERAs in baseball history, I notice a very interesting trend...
Modern era pitchers have higher ERAs than pitchers from 1871-1968. Why?
The players from the "dead ball" era (1871-1920) and even the "golden era" (1921-1968) were known for their use of pitching techniques that are now illegal.
Dead ball pitchers were known to use the spitball, a now-illegal pitch that moved unpredictably and hurt batter productivity.
Golden era pitchers had the advantage of larger strike zones. After the 1968 season, the strike zone shrank and even tighter restrictions were placed on pitching techniques.
To pick the best pitchers of any era, we'll need to adjust their ERA to account for these rule changes. Fortunately, our statistics "chops" will help do that.
The Linear Model
To identify how much we should adjust the dead ball era and golden era pitchers, we need a model that can account for the effect of the years in which a pitcher pitched. So, we return to our teams data and use a linear model to identify the proper adjustment for any pitcher who pitched primarily during each of those eras.
My method was to add a dummy variable. If a pitcher made more than 50% of their appearances during the dead ball era, they were assigned a “1” in the “ap_era_deadball” feature. If a pitcher made more than 50% of their appearances during the golden era, they were assigned a “1” in the “ap_era_golden” feature. All other pitchers were assigned a 0 as they were modern era pitchers.
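Here is a rough sketch of that dummy-variable rule. The feature names and the era boundaries (1871-1920 and 1921-1968) come from this post; the function name and the input format (a dict of appearances per season) are assumptions made just for this example.

```python
DEADBALL_YEARS = range(1871, 1921)  # 1871-1920 inclusive
GOLDEN_YEARS = range(1921, 1969)    # 1921-1968 inclusive


def era_dummies(appearances_by_year: dict) -> dict:
    """Flag a pitcher whose appearances fall mostly (>50%) in one era."""
    total = sum(appearances_by_year.values())
    deadball = sum(n for y, n in appearances_by_year.items() if y in DEADBALL_YEARS)
    golden = sum(n for y, n in appearances_by_year.items() if y in GOLDEN_YEARS)
    return {
        # "More than 50%" is the same as "twice the count exceeds the total".
        "ap_era_deadball": int(deadball * 2 > total),
        "ap_era_golden": int(golden * 2 > total),
    }


# A made-up pitcher with 30 appearances in 1915 and 10 in 1925.
print(era_dummies({1915: 30, 1925: 10}))
# {'ap_era_deadball': 1, 'ap_era_golden': 0}
```

A pitcher with exactly half of their appearances in an era gets a 0 for it, since the rule is "more than 50%".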
We use the linear regression model to understand how much we should adjust each pitcher that pitched during the dead ball and golden eras of baseball.
The output of my model is below:
You'll see that this model is okay. Each of our variables is statistically significant, but the model only accounts for 19.8% of the variance. The coefficients are the most important output, however: they give us a way to adjust earlier pitchers so modern pitchers have a fair chance at being the best pitcher in history.
"Dead ball" era pitchers will have 0.83529 added to their career ERA.
"Golden" era pitchers will have 0.15738 added to their career ERA.
"Modern" era pitchers will have no adjustment to their career ERA.
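Applying the adjustment is just addition. The sketch below uses the three coefficients listed above; the era labels, the function name and the three pitchers are made up to show how the ranking can change once the adjustment is applied.

```python
# Adjustments from the model coefficients above.
ERA_ADJUSTMENT = {"deadball": 0.83529, "golden": 0.15738, "modern": 0.0}


def adjusted_era(career_era: float, era_label: str) -> float:
    """Add the era adjustment to a pitcher's career ERA."""
    return career_era + ERA_ADJUSTMENT[era_label]


# Made-up pitchers: (name, career ERA, primary era).
pitchers = [
    ("A", 2.10, "deadball"),
    ("B", 2.80, "golden"),
    ("C", 2.90, "modern"),
]

ranked = sorted(pitchers, key=lambda p: adjusted_era(p[1], p[2]))
for name, raw, label in ranked:
    print(f"{name}: raw {raw:.2f} -> adjusted {adjusted_era(raw, label):.3f}")
```

In this toy example the modern pitcher has the worst raw ERA of the three but the best adjusted ERA, which is exactly the kind of reshuffling the adjustment is meant to allow.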
Without further ado, below is the list of the top 15 all-time greatest starters according to my analysis.
My childhood hero, Nolan Ryan, makes the cut according to ERA. There are no active starters in the top 15, and Grover Cleveland "Pete" Alexander is the king of the starters.
Here is the list of the top 15 relievers according to my analysis.
The pitcher with the fastest fastball ever recorded, Aroldis Chapman, is the top reliever. He’s also an active player with the New York Yankees, and a former Chicago Cub, ready to show us his stuff in 2021.
And finally the top 15 closers according to my analysis.
A controversial pick, Craig Kimbrel, sits at the top of the closers list. Most people would fight me for writing that, but Mariano Rivera is a close second — so give me a little credit. The best mustache in the game, owned by Rollie Fingers, sits at #11 on the top 15 closers list.
ERA is a tried-and-true metric that, when push comes to shove, tells you which pitchers are best at preventing runs from being scored.
My recommendation is to watch out for pitchers with the lowest career ERAs on your team (minimum 100 games played). Those pitchers are your best chance at stopping opposing batters from scoring runs.
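If you keep your own roster spreadsheet, that rule of thumb is easy to sketch in code. The field names and the roster below are invented for illustration.

```python
def best_pitcher(roster, min_games=100):
    """Return the eligible pitcher with the lowest career ERA, or None."""
    eligible = [p for p in roster if p["games"] >= min_games]
    if not eligible:
        return None
    return min(eligible, key=lambda p: p["career_era"])


# Made-up roster: the rookie has the lowest ERA but too few games.
roster = [
    {"name": "Veteran", "games": 310, "career_era": 3.45},
    {"name": "Ace", "games": 180, "career_era": 2.95},
    {"name": "Rookie", "games": 12, "career_era": 1.80},
]

print(best_pitcher(roster)["name"])  # Ace
```

The games minimum keeps a hot two-week stretch from a newcomer from beating out a pitcher who has done it for years.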