A new model could have a “Moneyball-like” impact on the value of baseball players
UNIVERSITY PARK, Pa. — In the movie “Moneyball,” a young economics graduate and cash-strapped Major League Baseball general manager present a new way to assess the value of baseball players. Their innovative idea of calculating statistical data and player salaries allowed the Oakland A’s to recruit quality talent overlooked by other teams, completely revitalizing the team without going over budget.
New research at Penn State College of Information Sciences and Technology could have a similar impact on the sport. The team developed a machine learning model that could better measure the short- and long-term performance of baseball players and teams, compared to existing statistical analysis methods for sports. Building on recent advances in natural language processing and computer vision, their approach would completely change and potentially improve the way a game’s state and a player’s impact on the game are measured.
According to Connor Heaton, a PhD student at the College of IST, the existing family of methods, known as sabermetrics, relies on the number of times a player or team achieves a discrete event – such as hitting a double or a home run. However, it does not take into account the surrounding context of each action.
“Think of a scenario where a player recorded a single in his last plate appearance,” Heaton said. “He could have hit a dribbler down the third base line, advanced a runner from first to second and beat the throw to first, or hit a ball to deep left field and reached first base comfortably but didn’t have the speed to push for a double. Describing both situations as resulting in a ‘single’ is accurate but doesn’t tell the whole story.”
Heaton’s model instead learns the meaning of in-game events based on the impact they have on the game and the context in which they occur, and then produces numerical representations of players’ impact on the game by viewing the game as a sequence of events.
“We often talk about baseball in terms of ‘this player hit two singles and a double yesterday’ or ‘he went one for four,’” Heaton said of such summary statistics. “Our work tries to take a more holistic picture of the game and get a more nuanced computational description of the impact players have on the game.”
In Heaton’s new method, he exploits sequential modeling techniques used in natural language processing to help computers learn the role or meaning of different words. He applied this approach to teach his model the role or meaning of different events in a baseball game – for example, when a batter hits a single. Next, he modeled the game as a sequence of events to provide new insight into existing statistics.
“The impact of this work is the framework that’s offered for what I like to call ‘interrogating the game,’” Heaton said. “We view the game as a sequence and have all this computational scaffolding to model it.”
The output of the model can effectively describe a player’s influence on the game in the short term, or their form. Represented as 64-element vectors, an approach adapted from work in computer vision, these form embeddings capture a player’s in-game influence and can be used to depict short-term impact, such as over a span of 15 plate appearances, or averaged together to analyze longer periods, such as the course of a player’s career. Additionally, when combined with traditional sabermetrics, form embeddings can predict the winner of a game with over 59% accuracy.
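As a rough illustration of the averaging step, the Python sketch below takes the element-wise mean of several fixed-length vectors. The function name `average_embeddings` and the toy values are assumptions made for illustration and are not taken from the researchers' code; only the 64-element length comes from the description above.

```python
from typing import List, Sequence

EMBEDDING_SIZE = 64  # vector length reported in the article


def average_embeddings(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length embedding vectors."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    size = len(embeddings[0])
    if any(len(e) != size for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    # zip(*embeddings) walks the vectors position by position.
    return [sum(values) / len(embeddings) for values in zip(*embeddings)]


# Toy data (made up): three short-term form embeddings for one player.
short_term = [[float(i)] * EMBEDDING_SIZE for i in (1, 2, 3)]
long_term = average_embeddings(short_term)
```

In this sketch each short-term vector might cover one window of plate appearances, and the averaged vector stands in for a longer stretch such as a season or a career.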
Heaton described how the embeddings created by both his method and the traditional sabermetric method plot the same data. When viewed over time, sabermetric depictions of player impact can be somewhat sporadic, changing significantly from game to game. Heaton’s method helps “smooth out” the way players are portrayed over time, while allowing player performance to fluctuate.
“Both embeddings can help differentiate good players from bad players,” Heaton said. “But ours provides a lot more nuance on exactly how good players impact the game.”
To train their model, the researchers used data previously collected from systems installed in major league stadiums that track detailed information about every pitch thrown, such as player positioning on the field, base occupancy, and the velocity and spin of the pitch. They focused on two types of data: pitch-by-pitch data, to analyze information such as pitch type and launch angle; and season-by-season data, to investigate position-specific information such as walks and hits per inning pitched for pitchers and on-base plus slugging percentage for hitters.
Each pitch in the collected dataset has three identifying characteristics: the game in which it took place, the at-bat number within the game, and the pitch number within the at-bat. Using these three pieces of information, the researchers were able to completely piece together the sequence of events that make up an MLB game.
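The reconstruction described here amounts to grouping pitches by game and sorting them by the other two identifiers. The sketch below shows one way this could look in Python; the `Pitch` record, its field names and the sample rows are hypothetical and do not reflect the actual dataset schema.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Pitch(NamedTuple):
    # Hypothetical field names for the three identifiers.
    game_id: str
    at_bat_number: int
    pitch_number: int
    description: str


def rebuild_games(pitches: List[Pitch]) -> Dict[str, List[Pitch]]:
    """Group pitches by game, then order by at-bat and pitch number."""
    games: Dict[str, List[Pitch]] = defaultdict(list)
    for pitch in pitches:
        games[pitch.game_id].append(pitch)
    for game_pitches in games.values():
        game_pitches.sort(key=lambda p: (p.at_bat_number, p.pitch_number))
    return dict(games)


# Made-up rows arriving out of order.
shuffled = [
    Pitch("game-A", 2, 1, "ball"),
    Pitch("game-A", 1, 2, "single"),
    Pitch("game-B", 1, 1, "called strike"),
    Pitch("game-A", 1, 1, "foul"),
]
ordered = rebuild_games(shuffled)
```

After the call, `ordered["game-A"]` lists the pitches in the order they were thrown: the foul, the single, then the ball that opens the second at-bat.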
The researchers then identified 325 possible game changes that could occur when a pitch is thrown, such as changes in the ball-strike count and base occupancy. They combined this information with existing pitch-by-pitch data that describes pitching and batting action, then pulled in player records from sabermetrics to be able to describe what happened, how it happened and who was involved in each game.
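One way to think about a game change is as the difference between the game state before and after a pitch. The Python sketch below is only an assumption about how such a change might be represented; the article does not say how the 325 changes are encoded, and the `GameState` fields and `state_change` helper are hypothetical.

```python
from typing import NamedTuple, Tuple


class GameState(NamedTuple):
    # Hypothetical, simplified state: the count plus base occupancy.
    balls: int
    strikes: int
    on_first: bool
    on_second: bool
    on_third: bool


def state_change(before: GameState, after: GameState) -> Tuple[int, ...]:
    """Field-by-field difference between two states (booleans become 0/1)."""
    return tuple(int(a) - int(b) for a, b in zip(after, before))


# A pitch called a ball with a runner on first: only the ball count moves.
before = GameState(balls=1, strikes=1, on_first=True, on_second=False, on_third=False)
after = GameState(balls=2, strikes=1, on_first=True, on_second=False, on_third=False)
change = state_change(before, after)
```

A fixed vocabulary of such changes could then be treated like words in a sentence, which is the analogy to language modeling that the researchers draw.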
The work blends Heaton’s research focus on natural language processing with his interest in the historical statistical analysis of baseball.
“There’s this whole ecosystem built around modeling language and word sequence,” Heaton said. “It seemed there was potential for it to be adopted for modeling sequences of other things; to generalize a bit. I started thinking about sports analytics and it seemed like there was a lot to be done to improve both our understanding of the game and the way the game is modeled computationally.”
The researchers hope their work will serve as a solid starting point toward a new way of describing the impact that athletes in baseball and other sports have on the course of the game.
“This work has the potential to significantly advance the state of the art in sabermetrics,” said Prasenjit Mitra, professor of information sciences and technology and co-author of the paper. “To the best of our knowledge, ours is the first to capture and represent a nuanced state of the game and use this information as context to evaluate individual events that are counted by traditional statistics – for example, by automatically building a model that includes key moments and milestones.”
Heaton and Mitra’s paper, “Using Machine Learning to Describe Player Impact on Play in MLB,” was one of seven finalists in the 2022 Research Paper Competition at the MIT Sloan Sports Analytics Conference earlier this month.
More information about the competition, as well as links to the paper, its open source code and data, can be found at www.sloansportsconference.com/research-paper-competition.