Hands-On Tutorials
A Stat-by-Stat Look at the Best Predictors and a Quest to Predict the Outliers
By John Pette · May 9, 2021 · 15 min read

Sports world personalities love to bash analytics these days. It’s hard to go five minutes listening to baseball talk on sports radio without hearing someone make a derogatory comment about “the nerds taking over.” Ironically, they then immediately launch into sports betting ads, and guess what, folks…any time you place a bet because “the Giants always lose in Washington” or whatever, that’s a rudimentary form of analytics, minus the actual modeling. I’ve also noticed a prevalent opinion that advanced statistics and modeling will produce a single (read: boring) way of playing the game. If that happens, it is due to a lack of imagination. I firmly believe that the many facets of analytics represent one type of baseball knowledge. Traditional baseball instincts and experience are another, and you will get the best results when both fields are working in unison. I do think that there is room for analytics to expand in baseball and get significantly deeper into the data science realm.
This is to say that I finally got it together enough to construct machine learning models to build my baseball projections this year. Just a heads up: this article is going to be dense. I’m going to walk you through my methodology, thought process, and missteps as I worked through these models. It will be a little heavy on the data science for baseball-centric readers and a little heavy on the baseball for the data scientists. But it’s all pretty cool, if you care about either of those topics. This will be long, so I’m going to break it into two articles: one for hitting and one for pitching. For context, I built these models in March, prior to the season, so I have included no data from the first few weeks of this season. It has just taken some time to put the article together.
The Data: Where to Start
This is the easy part. FanGraphs. Always start at FanGraphs. Going into this, I wanted to consider every piece of data that might be available to me, and minimize preconceived notions about what would and would not be predictive…so I started with everything. Everything?
<paraphrase>
<garyoldman>
EEEEVVVERYTHIIIIING!
</garyoldman>
</paraphrase>
Seriously, though…you can do this. FanGraphs allows you to pull every statistic they keep and export it into a tidy CSV file. And that’s precisely what I did, for each year from 2015–2020. Doing this will get you a good chunk of repetitive data, including many columns that are perfectly correlated — we’ll sort that all out as we go.
Modeling Part 1: A Regression Approach
The point of this was to predict specific stats for 2021, which meant this had to be a regression approach rather than a classification approach (or so I thought…more to come on that). Regression problems restrict options substantially more than classification problems: this is an oversimplification, but I was mainly choosing from among linear regression, random forest, and XGBoost. Linear regression would not have been a great choice here, as it assumes independence among its input variables, and that was very much not the case here. It is also limited to finding linear relationships.
Random forest regression would have been a perfectly reasonable choice, but I selected XGBoost to build the models, as it is an improvement on random forest. XGBoost is a popular ensemble decision tree algorithm that combines many successive decision trees — each tree learns from its predecessors and improves upon the residual errors of previous trees. It also tends to perform very well in these types of problems.
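The residual-learning idea behind boosting can be shown in miniature. This toy sketch is not XGBoost itself — each "tree" is just a stump predicting the mean of the current residuals — but it illustrates how successive learners chip away at the errors of their predecessors:

```python
import numpy as np

# Gradient boosting in miniature: each base learner fits the residuals
# left behind by the ensemble so far, and the ensemble improves each round
rng = np.random.default_rng(0)
y = rng.normal(loc=50.0, scale=10.0, size=100)  # toy target values

learning_rate = 0.5
pred = np.zeros_like(y)
errors = []
for _ in range(20):
    residuals = y - pred
    stump = residuals.mean()        # a trivially simple base learner
    pred += learning_rate * stump   # each round corrects prior residuals
    errors.append(np.abs(residuals).mean())

# the mean absolute residual shrinks as boosting rounds accumulate
```

Real XGBoost fits full decision trees to the residual gradients and adds regularization, but the loop structure — predict, measure residuals, fit the next learner to them — is the same.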
My planned steps were:
1. Clean and prep data.
2. Identify target variables.
3. Fit models to the 2017 and 2018 data, trying to predict the subsequent year’s statistics.
4. Tune the models’ hyperparameters to their optimal settings.
5. Combine 2017 and 2018 data, retrain models, retune hyperparameters, and assess for differences.
6. Use the resulting models to predict 2021 statistics using a blended input data set from 2019 and 2020.
Let’s unpack this last piece. 2020 posed a problem in that, you may recall, we had a bit of a pandemic issue, so we only had 60 games. My initial plan for this was to make a blended data set from 2019 and 2020. I did this by preparing weighted averages of stats for the two years and then scaling them to 162 games. If a player did not play in 2020, I used 2019 totals. The benefit to this approach was that I had a data set that was not overly reliant on the abbreviated season surrounded by the most abnormal set of circumstances. The major drawback was that I lost a year of data for model training, so I had to use 2017 and 2018. Ultimately, I decided it was worse to lose the year of data. I ended up incorporating 2019 data and scaling 2020 data to 162 games (this was far from a perfect solution, but it worked better than I would have thought…we’ll get to that).
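The scaling step can be sketched roughly like this. The column names, the list of counting stats, and the round-to-integer choice are my assumptions for illustration, not the author's exact code:

```python
import pandas as pd

def scale_to_162(df, games_col="G",
                 counting_stats=("PA", "R", "HR", "RBI", "SB", "CS")):
    """Scale counting stats from a shortened season to a 162-game pace.
    Rate stats (AVG, OBP, OPS) need no scaling and are left untouched."""
    scaled = df.copy()
    factor = 162 / scaled[games_col]
    for col in counting_stats:
        scaled[col] = (scaled[col] * factor).round()
    return scaled

# Example: a 60-game 2020 line stretched to a full-season pace
season_2020 = pd.DataFrame({"G": [60], "PA": [250], "R": [40], "HR": [15],
                            "RBI": [45], "SB": [5], "CS": [2]})
full_pace = scale_to_162(season_2020)
```

Naive pace-scaling like this inflates small-sample noise (a hot 60 games becomes a monster 162), which is part of why the author flags it as an imperfect solution.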
Data Cleaning/Prep
The data created some challenges, but not of the kind one normally sees with real-world data sets. It was nice to know up front that these numbers were largely clean. There were some null values to deal with, but that was minimally painful. The vast majority of the gaps were in the Statcast fields, and it was fairly obvious that those were largely because the events in question did not occur. It’s a little tough to peg down a pitcher’s average slider velocity if they don’t throw sliders. My general approach to missing data was: if less than 60% of a column was populated, I cut the column, as it would not have been useful to impute that many missing values. Otherwise, I filled in gaps in two ways: for percentages, I filled nulls with zeroes. This seemed logical, as the percentages related to other fields containing events that did not occur. Otherwise, I filled in nulls with the median values in each column (for each year — I kept the years separate throughout data prep).
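Those missing-data rules might look something like this in pandas. The column names and the set of percentage columns are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

def clean_columns(df, pct_cols, min_populated=0.60):
    """Apply the missing-data rules described above:
    - drop any column populated in fewer than 60% of rows,
    - fill percentage columns with 0 (the events never occurred),
    - fill all other numeric columns with the column median."""
    out = df.copy()
    keep = [c for c in out.columns if out[c].notna().mean() >= min_populated]
    out = out[keep]
    for col in out.select_dtypes(include=np.number).columns:
        fill = 0.0 if col in pct_cols else out[col].median()
        out[col] = out[col].fillna(fill)
    return out

df = pd.DataFrame({
    "SLv": [85.0, np.nan, np.nan, np.nan],  # 25% populated -> dropped
    "SL%": [0.20, np.nan, 0.30, 0.10],      # percentage -> nulls become 0
    "HR":  [10, 20, np.nan, 30],            # counting stat -> null becomes median
})
cleaned = clean_columns(df, pct_cols={"SL%"})
```

Run per year, as the author did, so each season's medians reflect that season's run environment.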
There were a few fields I added manually. These will likely play more of a role in subsequent studies, but I wanted to see how they did in these models. I added variables for league (0 for AL, 1 for NL), whether the player changed teams and/or leagues in the prior offseason, and whether they changed teams/leagues during the season, accounting for trades/cuts/signings. There was some grey area with this last piece, as there were several examples of players moving among more than two teams during the year. I made judgment calls on these. If someone bounced around to four teams, I looked at where they played the most games. If a guy played for six years for an AL team, was signed by an NL team in the offseason, and played seven games for them before being traded back to the AL for the rest of the year…I did not count that as changing leagues. At any rate, these were all numerically encoded. I one-hot encoded players’ teams to see if there were any effects seen from being on a particular team.
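A minimal sketch of those encoding steps, with made-up player rows and assumed column names:

```python
import pandas as pd

players = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "League": ["AL", "NL", "AL"],
    "Team": ["NYY", "LAD", "HOU"],
    "ChangedTeamOffseason": [False, True, False],
    "ChangedTeamInSeason": [False, False, True],
})

# League: 0 for AL, 1 for NL, as described above
players["League"] = players["League"].map({"AL": 0, "NL": 1})

# Team-change flags become 0/1 integers
for col in ["ChangedTeamOffseason", "ChangedTeamInSeason"]:
    players[col] = players[col].astype(int)

# One-hot encode each player's team (for multi-team players, whichever
# team the judgment call assigned them to)
players = pd.get_dummies(players, columns=["Team"], prefix="Team")
```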
I also added lag variables for about 20 statistics (i.e. the values of those statistics from the prior year). I said I wanted to eliminate preconceived notions about what was predictive, but I also needed to make sure not to overlook simple concepts like, “RBI for each of the last two years were the most predictive measure for RBI in the following season.” The downside to including lag variables is there were quite a few players who did not accumulate statistics in the preceding seasons. I opted to remove those cases from the data set. This was a tough decision, but my assumption here was that there was more to be gained from using the lag variables than I would lose by excluding those without prior-year data. I also thought it would be incorrect to use median values in this case. If I had done that, for example, every single rookie season represented in the data would have assumed that the player had effectively league average performance in the prior year, and I think that would be a very incorrect assumption that would have affected the models more than cutting those rows.
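Lag variables like these can be built with a grouped shift. This toy example assumes consecutive seasons per player; gap years would need explicit handling:

```python
import pandas as pd

# Toy multi-season table; the real inputs are the per-year FanGraphs exports
stats = pd.DataFrame({
    "Name":   ["Betts", "Betts", "Trout", "Trout", "Trout"],
    "Season": [2018, 2019, 2017, 2018, 2019],
    "RBI":    [80, 85, 72, 79, 104],
})

stats = stats.sort_values(["Name", "Season"])
# Lag_1 = the player's value from the prior season
stats["RBI_Lag_1"] = stats.groupby("Name")["RBI"].shift(1)
# Players with no prior-year data are dropped rather than imputed,
# matching the decision described above
stats = stats.dropna(subset=["RBI_Lag_1"])
```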
Target Variables
What are we predicting, anyway? Initially, I focused on ten statistics: plate appearances (PA), runs (R), home runs (HR), runs batted in (RBI), stolen bases (SB), caught stealing (CS), batting average (AVG), on-base percentage (OBP), and on-base percentage + slugging percentage (OPS). I added target columns for each of those statistics to each dataset from 2016–2019. This just involved mapping the respective fields from the following year’s data sets. For 2019, I used a version of the 2020 data scaled to 162 games for the targets. This was initially for exploratory purposes. Stay tuned.
Any player that did not have data in the next year was cut from that year’s data set. The targets for this algorithm cannot contain null values. For fun, I also included the “dollar value” (Dol) figures calculated by FanGraphs. I did not expect these to perform well, but as they are based on overall production, I thought it would be good to test them. If the results had come out to similar quality levels as the individual stats, it would have been an interesting result.
Now, we can only predict one of these target variables at a time, so that means eleven different sets of models. No problem.
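The target-mapping step might look like this. Merging on Name is only for the toy example; real data would key on a stable player ID to avoid name collisions:

```python
import pandas as pd

def add_target(current, following, stat):
    """Attach next season's value of `stat` as this season's target column.
    Players with no data the following year are dropped, since the
    targets cannot contain null values."""
    nxt = following[["Name", stat]].rename(columns={stat: f"{stat}_target"})
    merged = current.merge(nxt, on="Name", how="left")
    return merged.dropna(subset=[f"{stat}_target"])

data_2018 = pd.DataFrame({"Name": ["A", "B", "C"], "HR": [30, 12, 25]})
data_2019 = pd.DataFrame({"Name": ["A", "C"], "HR": [28, 31]})  # B did not play
train_2018 = add_target(data_2018, data_2019, "HR")
```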
First Models: All In
The title says it all here. I ran a set of models including basically everything in the set of input variables and took a look at which of them carried the greatest influence. I tuned the hyperparameters to some degree at this stage, mainly to see if and by how much the influence of the input variables changed. I won’t go into too much more detail here — the upshot is I did some feature engineering and trimmed down my input dimensions. Essentially, I cut all of the Statcast fields, as they did not seem to move the needle in predicting any of the target variables. I also dropped all of the team variables for the same reason.
Final Regression Models
Using this pruned dataset, I ran new XGBoost regression models and tuned the hyperparameters with a little more scrutiny. I also tested their consistency from year to year: I made models for the 2017, 2018, and 2019 datasets separately, predicting the target variables for the following year.
I did my hyperparameter tuning in several steps. I started with some common midrange values:
param_dict = {'n_estimators':500, 'learning_rate':0.01, 'max_depth':5, 'subsample':0.1, 'colsample_bytree':0.3}
Then, I performed grid searches in a few successive rounds. First, I tuned maximum tree depth, a key hyperparameter for preventing model overfitting. Without a maximum depth, the algorithm can build many-level decision trees that can fit your training data perfectly, but will never generalize to other data (XGBoost does assume a default max_depth of 6, but it is still good practice to optimize it). Next, I tuned colsample_bytree and subsample together. These two are related. Subsampling selects a specified portion of your training data as the algorithm grows its decision trees, and that further helps prevent overfitting. After that, I tuned learning rate, which is as it sounds: you can specify how quickly the algorithm learns from each decision tree iteration. The final iteration was tuning number of estimators, which is the quantity of trees in the model. At each stage, I rewrote the parameter dictionary above to match the best parameters. I packaged all of the above in a single function and then ran it for each of the years of training data. To my surprise, the tuned hyperparameters produced exactly the same best parameters for each year (different for each statistic, but consistent in each year). That was both suspicious and encouraging.
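The staged-search mechanics can be sketched as follows. This uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost (it has no direct colsample_bytree analogue) with deliberately tiny grids and synthetic data; the real version would start from the parameter dictionary above and search much wider ranges:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Running parameter dict; each round's winner gets folded back in
params = {"n_estimators": 100, "learning_rate": 0.1,
          "max_depth": 5, "subsample": 0.8}

# One grid per tuning round, in the order described above
stages = [
    {"max_depth": [2, 3, 4]},
    {"subsample": [0.5, 0.8, 1.0]},
    {"learning_rate": [0.05, 0.1]},
    {"n_estimators": [50, 100]},
]

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
for grid in stages:
    fixed = {k: v for k, v in params.items() if k not in grid}
    search = GridSearchCV(GradientBoostingRegressor(random_state=0, **fixed),
                          grid, cv=3)
    search.fit(X, y)
    params.update(search.best_params_)  # carry the winner into the next round
```

Tuning one parameter (or one related pair) at a time is far cheaper than a full joint grid search, at the cost of possibly missing interactions between rounds.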
The precision and accuracy were quite similar for each year’s data. Interestingly, while the 2019 data was worse than the other years, it was not appreciably worse, despite being based on scaled data from a 60-game season played during the COVID-19 pandemic. It performed well enough that I was able to include it in my master data set, which combined 2016–2019.
Modeling Part 2: The Catch (Ha…Sorry)
Most of the models performed well. I would say the output was perfectly acceptable and would be comfortable putting them into production if we were going for general accuracy. Here’s the problem: machine learning, in general, does very well predicting values that are not outliers. Unfortunately, in baseball, we care most about predicting the outliers. We want to know who stands out above the rest and who falls flat. To this end, the XGBoost regression models did not cut it. Here’s an example. This is the tuned model’s performance for RBI:
As you can see, the predictions are in the middle. And it largely predicts well…except that it misses basically every 90+ RBI season as well as every one below 30. Not ideal.
So what do we do about this? My approach: I basically turned this into a hybrid regression/classification exercise. For each statistic, we know where the regression models do well and where they don’t, and we know generally what those outlier ranges are. Using those ranges, I was able to bucket the stats into tiers quite easily, and the tiers enabled me to build XGBoost classifier models to try to predict them. The plan was to combine the results from the regression exercise and the classification exercise in a way that produced a full range of projections.
Here’s an example: the regression model for home runs predicted anything from 6–29 reliably. It faltered with 0–5 and with 30+. I encoded my training data such that 0–5 was 0, 6–29 was 1, and 30+ was 2. The classifier algorithm would just try to predict 0, 1, and 2 for each player, based on all of the input stats used above. For anything predicted as a 1, I used the regression predictions. For 0 and 2, I used the 0–5 and 30+ ranges, respectively, and then mapped the output from the regression models to my assumptions of what those ranges would be.
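The bucketing itself is a one-liner with pd.cut; the 0–5 / 6–29 / 30+ HR cutoffs match the example above:

```python
import pandas as pd

# Encode HR totals into tiers: 0 = low outlier (0-5),
# 1 = the range the regression handled well (6-29), 2 = high outlier (30+)
hr = pd.Series([3, 12, 45, 28, 0, 31])
tiers = pd.cut(hr, bins=[-1, 5, 29, hr.max()], labels=[0, 1, 2]).astype(int)
```

These integer tiers become the classifier's target, with the same input features as the regression models.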
One additional complicating factor: even bucketed into tiers, these values are still outliers, meaning we are dealing with imbalanced data. This is a bit of an issue, as the algorithm can pick the majority class for everything and be largely accurate. I employed a method of oversampling the data to address this. Oversampling is a method that replicates the underrepresented data to balance out the data set for modeling. I tried a few different approaches, and ultimately settled on SMOTE (Synthetic Minority Oversampling Technique), which builds synthetic examples of underrepresented data rather than adding straight duplicates. I ran into a little trouble at first with this approach, as the cross-validation step in my models seemed to counteract the oversampling. I learned that I was oversampling incorrectly, applying it before splitting into cross-validation folds. This article gives a good/more in-depth explanation of how to implement this properly (and where you can go wrong). I ultimately implemented it via a pipeline with the imbalanced-learn package, which is specifically designed for this purpose.
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE

RANDOM_STATE = 120

# SMOTE sits inside the pipeline, so oversampling happens within each
# cross-validation fold rather than before the folds are split
imb_pipeline = make_pipeline(
    SMOTE(random_state=RANDOM_STATE),
    xgb.XGBClassifier(eval_metric='merror', **param_dict, verbosity=0,
                      use_label_encoder=False))

# xtrain, ytrain, param_dict, and kfold are defined earlier
scores = cross_val_score(imb_pipeline, xtrain, ytrain, scoring='f1_micro', cv=5)
kf_cv_scores = cross_val_score(imb_pipeline, xtrain, ytrain, scoring='f1_micro', cv=kfold)
print("Mean cross-validation score: %.2f" % scores.mean())
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())
I defined success with these classifier algorithms as their ability to predict outliers correctly. I looked at precision (the fraction of predicted outliers that were actually outliers) and recall (the fraction of all true outliers the models identified), and how those measures changed after applying the SMOTE technique. All showed some improvement. Overall, precision was very good: when the algorithms flagged a stat as an outlier, they were usually right. Recall was not great, though, meaning there were a lot of outliers the algorithms did not find. Here’s the summary:
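Restricting precision and recall to just the outlier tiers can be done with scikit-learn's labels argument. The toy labels below are illustrative, not the actual results, but they reproduce the pattern described: high precision, weaker recall:

```python
from sklearn.metrics import precision_score, recall_score

# Toy tier labels: 0 = low outlier, 1 = middle, 2 = high outlier
y_true = [0, 1, 1, 2, 2, 1, 0, 1]
y_pred = [0, 1, 1, 1, 2, 1, 1, 1]

# Score only the outlier classes; the majority middle tier is ignored
prec = precision_score(y_true, y_pred, labels=[0, 2], average="micro")
rec = recall_score(y_true, y_pred, labels=[0, 2], average="micro")
```

Here every outlier the model flagged was real (precision 1.0), but it found only half of the true outliers (recall 0.5).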
But good lord, man…where are the results?!? Tell us what happened! Okay, okay. Let’s take it stat by stat. I’ll show the best hyperparameters from tuning the algorithm, and then show you which input variables had the greatest impact on the model using the fantastic summary plots available in the SHAP package. I find these to be amazing explanatory tools: the input features are listed in order of significance to the models. For each feature, you see a red-to-blue spectrum: red is higher value, blue is lower value. The various points are laid out left to right showing the degree of impact on the model. So if all the red points for RBI Lag_1 (i.e. RBI from the year before last) are way on the right, it means higher RBI values from 2019 had stronger positive impact on the model than lower ones.
Some of these results will be intuitive. Take stolen bases, for example: the most predictive variables were prior year stolen bases, speed (Spd), stolen bases from two years prior, prior year caught stealing, base running (BsR), and caught stealing from two years prior. That all seems obvious. Some of the other stats were not. What I found most striking was how much age figured into these. I knew that it would be significant, but for half of these stats, it was the #1 predictor, and it was in the top five for most.
Plate Appearances (PA)
Best hyperparameters:
{'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 3,
'subsample': 0.5,
'colsample_bytree': 0.25}
Runs (R)
Best hyperparameters:
{'n_estimators': 400,
'learning_rate': 0.01,
'max_depth': 4,
'subsample': 0.5,
'colsample_bytree': 0.15}
Home Runs (HR)
Best hyperparameters:
{'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 4,
'subsample': 0.6,
'colsample_bytree': 0.3}
Runs Batted In (RBI)
Best hyperparameters:
{'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 6,
'subsample': 0.4,
'colsample_bytree': 0.3}
Stolen Bases (SB)
Best hyperparameters:
{'n_estimators': 400,
'learning_rate': 0.01,
'max_depth': 3,
'subsample': 0.4,
'colsample_bytree': 0.35}
Caught Stealing (CS)
Best hyperparameters:
{'n_estimators': 300,
'learning_rate': 0.01,
'max_depth': 3,
'subsample': 0.7,
'colsample_bytree': 0.35}
Batting Average (AVG)
Best hyperparameters:
{'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 3,
'subsample': 0.4,
'colsample_bytree': 0.25}
On-Base Percentage (OBP)
Best hyperparameters:
{'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 3,
'subsample': 0.5,
'colsample_bytree': 0.15}
On-Base Percentage + Slugging Percentage (OPS)
Best hyperparameters:
{'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 3,
'subsample': 0.7,
'colsample_bytree': 0.15}
Dollar Value (Dol)
Best hyperparameters:
{'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 3,
'subsample': 1.0,
'colsample_bytree': 0.35}
Once I had my completed projections, I calculated scores based on my own scoring formulas. I wanted to see how they compared to published projections, so I used Derek Carty’s fabulous THE BAT projections (I consider these to be the most technologically advanced of the projection systems). I then scored them the same way I scored my own so that I could examine the differences. I only looked at those players with average draft position of 300 or lower, because, well, those are more interesting. First, we have the players who my projections said would do better than THE BAT:
The first big takeaway here is that my projections do not take into account whether players have starting jobs. That’s why you see guys like Pillar and Villar on here. The algorithms don’t know that those players were signed as backups. We’ll see how these play out over the course of the 2021 season. Now, let’s look at where the models have predicted worse production than THE BAT:
The biggest takeaways that I see with this set are:
1) My models have punished poor 2020 performance more than THE BAT. As I discussed earlier, 2020 data has a lot of problems baked into it. There’s just so much that we don’t know from 60 games being played while the world was falling apart.
2) My models are much more pessimistic about players who have missed a lot of time to injuries. It will be really interesting to revisit these at the end of the year.
Wrap Up
If you’ve made it this far, congratulations. I hope you found this interesting. The pitching version of this will follow in the next week or so.
What did you think of my approach? What should I do differently in future iterations? I am making my code and data files available on GitHub.