with The Buzz Contributor Adam McCleod
Predicting the results of a baseball game is an impossible pursuit. The possible outcomes of a single pitcher-versus-batter matchup are nearly innumerable. However, baseball's mathematical structure, with its clear distinctions between a “success” and a “failure” (win/loss, ball/strike, hit/out, chance/error, safe/out, and so on), just begs to be predicted. It has become an industry all its own, with teams and pundits using complicated mathematical models to try to get a picture of what a season will hold.
The Buzz has finished putting together our first 2021 MLB season standings projection. It leans on outside work (credit to Derek Carty and his projection system, The Bat) and on previous seasons’ predictions that contributor Adam McLeod and I made for other platforms.
Here are the resulting win/loss totals and division standings in the two separate prediction sets for 2021 that we ran, then a few words about the methodology we used, and then the full files themselves. Move the slider to compare the two sets.
For our model, the key to predicting wins is determining the relationship between runs scored, runs allowed, and wins. For this, we turned to sabermetric pioneer and nerd-celebrity Bill James and his Pythagorean Winning Percentage. Essentially, the formula is P = (runs scored ^ 2)/((runs scored ^ 2) + (runs allowed ^ 2)). However, in order to use James’ formula we needed to predict how many runs each team would score and allow.
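To make the arithmetic concrete, here is a minimal Python sketch of the formula as stated above. The function names, the 162-game schedule used to turn a percentage into a win total, and the sample run totals are our own illustrative assumptions, not values from the projection files.

```python
def pythagorean_win_pct(runs_scored: float, runs_allowed: float) -> float:
    """Bill James' Pythagorean Winning Percentage with an exponent of 2."""
    rs_sq = runs_scored ** 2
    ra_sq = runs_allowed ** 2
    if rs_sq + ra_sq == 0:
        raise ValueError("at least one run total must be non-zero")
    return rs_sq / (rs_sq + ra_sq)


def projected_wins(runs_scored: float, runs_allowed: float, games: int = 162) -> float:
    """Convert the winning percentage into a win total.

    games=162 is an assumption (a standard full MLB schedule).
    """
    return games * pythagorean_win_pct(runs_scored, runs_allowed)


# Hypothetical team totals, for illustration only.
example_pct = pythagorean_win_pct(800, 700)
example_wins = projected_wins(800, 700)
print(f"{example_pct:.3f} winning percentage, about {example_wins:.0f} wins")
```

A team that scores exactly as many runs as it allows comes out at .500, which is a quick sanity check on any implementation.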
The runs-allowed side of the equation is pretty simple. Earned Run Average (ERA) is the number of earned runs a pitcher gives up per nine innings, so a pitcher's projected earned runs equal ERA times innings pitched, divided by 9. Using The Bat’s projected ERA for the pitchers in a team’s rotation and bullpen, and throwing in a small adjustment for unearned runs, we were able to calculate the projected runs allowed for each team. The example below is from the projected starting rotation for the St. Louis Cardinals. The playing time percentage is a share of 200 innings and is The Buzz’s own projection.
|Name|Playing Time %|ERA|Innings|Total Runs|
|Daniel Ponce de Leon|0.50|4.54|100|50.44|
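The Ponce de Leon row can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function and constant names are ours, and because the size and form of the unearned-run adjustment are not given here, `unearned_factor` is a hypothetical multiplier that defaults to 1.0 (no adjustment).

```python
FULL_STARTER_INNINGS = 200  # playing time is expressed as a share of 200 innings


def projected_runs_allowed(era: float, innings: float, unearned_factor: float = 1.0) -> float:
    """Runs allowed = ERA * innings / 9, since ERA is earned runs per nine innings.

    unearned_factor is a hypothetical stand-in for the small unearned-run
    adjustment; 1.0 means earned runs only.
    """
    return era * innings / 9 * unearned_factor


# Values from the table row: 0.50 playing time, 4.54 ERA.
playing_time = 0.50
innings = playing_time * FULL_STARTER_INNINGS  # 100 innings
runs = projected_runs_allowed(4.54, innings)
print(f"{innings:.0f} innings, {runs:.2f} runs")  # 100 innings, 50.44 runs
```

Summing this figure over every pitcher in the rotation and bullpen gives the team's projected runs allowed.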
On the runs-scored side, things are a little more complicated. The real question we needed to answer is how many runs a player is worth or, more specifically, how many runs a player will create. There are several different runs-created formulas, but they require inputs such as hits, walks, total bases, hit by pitch, and other details. We wanted to step back and see if we could find a correlation that needs fewer data points. Using machine learning and a linear regression, we found a formula that transforms on-base percentage (OBA) and slugging percentage (SLG) into runs created. This example shows part of the Yankees offense, using The Bat’s projections and our own playing time projections.
|Name|PAs|OBA|SLG|Weighted OBA|Weighted SLG|Generated Runs|
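For readers who want to see the shape of that calculation, here is a rough Python sketch. It rests on several assumptions: that the weighted columns are each player's OBA and SLG weighted by his share of team plate appearances, that the regression has the simple linear form shown, and that the coefficients are supplied by the caller, because the fitted values are not published in this article. The player names and stat lines are placeholders, not projections.

```python
from typing import NamedTuple


class Hitter(NamedTuple):
    name: str
    pa: float   # projected plate appearances
    oba: float  # on-base percentage
    slg: float  # slugging percentage


def team_weighted_rates(hitters: list[Hitter]) -> tuple[float, float]:
    """Return team OBA and SLG, weighting each hitter by plate appearances."""
    total_pa = sum(h.pa for h in hitters)
    if total_pa <= 0:
        raise ValueError("total plate appearances must be positive")
    team_oba = sum(h.oba * h.pa for h in hitters) / total_pa
    team_slg = sum(h.slg * h.pa for h in hitters) / total_pa
    return team_oba, team_slg


def linear_runs(oba: float, slg: float, coef_oba: float, coef_slg: float, intercept: float) -> float:
    """Generic linear model: runs = coef_oba * OBA + coef_slg * SLG + intercept.

    The coefficients are hypothetical inputs; the article's fitted values
    are not given, so none are hard-coded here.
    """
    return coef_oba * oba + coef_slg * slg + intercept


# Placeholder hitters, for illustration only.
lineup = [
    Hitter("Player A", 600, 0.350, 0.500),
    Hitter("Player B", 300, 0.320, 0.410),
]
team_oba, team_slg = team_weighted_rates(lineup)
print(f"team OBA {team_oba:.3f}, team SLG {team_slg:.3f}")
```

With fitted coefficients in hand, passing `team_oba` and `team_slg` to `linear_runs` would yield the generated-runs figure for the lineup.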
We ran two sets of predictions this year and are looking forward to seeing which set is more accurate. In the first set, we used The Bat’s numbers across the board with a general template for playing time. The template assigns innings and plate appearances by role rather than by individual player, so playing time is very even from team to team. For the second set, we went through each roster and adjusted the playing time and some of the projected stats as we saw fit. In the embeds below, the set with the light blue shading is our adjusted projection; each light blue cell marks a data point we changed from the standard template.
For those who are interested, in 2019 we correctly picked 8 of 10 playoff teams and in 2020, 13 of 16. We were also within 10% of the runs allowed by the AL West when compared with the actuals from 2019. Check out the files below and let us know what you think in the comments!