One of the most interesting breakthroughs in the world of sports statistics was Bill James’s creation of the “Pythagorean Expectation”. This model predicts a given baseball team’s win percentage based on its number of runs scored (RS) and runs allowed (RA). The basic formula is: Predicted Win % = RS² / (RS² + RA²). Recently, Professor Abraham Wyner from the University of Pennsylvania came out with his modified version of James’s model. Wyner’s formula takes all the exponents out of the equation: Predicted Win % = (RS – RA) / (RS + RA). This simplification produces virtually the same predicted Win % and was created to make the calculations easier. With both models, one can determine a team’s predicted win total to within about ±10 wins most of the time.
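To make the comparison concrete, here is a minimal Python sketch of the two baseball formulas. The function names and the sample team are hypothetical. The exponent-free version is written with a 0.5 baseline, which is what you get from a first-order expansion of the squared formula around RS = RA; that baseline is an assumption of this sketch, not something taken from Wyner’s own write-up.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Bill James's Pythagorean Expectation: RS^k / (RS^k + RA^k), with k = 2."""
    rs_k = runs_scored ** exponent
    ra_k = runs_allowed ** exponent
    return rs_k / (rs_k + ra_k)


def linear_win_pct(runs_scored, runs_allowed):
    """Exponent-free form: 0.5 + (RS - RA) / (RS + RA).

    ASSUMPTION: the 0.5 baseline comes from linearizing the squared
    formula around RS == RA; it is not quoted from the article.
    """
    return 0.5 + (runs_scored - runs_allowed) / (runs_scored + runs_allowed)


# Hypothetical team: 800 runs scored, 700 allowed, 162-game season.
rs, ra, games = 800, 700, 162
james = pythagorean_win_pct(rs, ra)
linear = linear_win_pct(rs, ra)
print(f"James:  {james:.3f} -> {james * games:.1f} wins")
print(f"Linear: {linear:.3f} -> {linear * games:.1f} wins")
```

For this sample team the two versions differ by well under one win over a full season, which is the sense in which the simplified form is “virtually the same”.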
The Pythagorean Expectation has been applied to many other sports, including basketball and hockey. However, one sport that it never seemed to forecast correctly was soccer. One reason is that “points” are used instead of “wins”, and teams can also draw games, in which case each team receives 1 point. Another reason is that the various leagues around the world don’t all play the same number of games, which complicates building a universal forecasting model. However, while working with a friend of mine, Michael Berman, I believe I came across an extremely accurate model that predicts points for soccer. The formula I used directly mirrors Professor Wyner’s modification of James’s Pythagorean Expectation:
Points Per Game = 1.7 * ((Goals Scored – Goals Allowed) / (Goals Scored + Goals Allowed)) + 1.35
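As a quick illustration, the formula can be written as a small Python function. The function name and the sample goal totals below are hypothetical; only the constants 1.7 and 1.35 come from the formula itself.

```python
def predicted_points_per_game(goals_scored, goals_allowed):
    """Points per game = 1.7 * (GS - GA) / (GS + GA) + 1.35."""
    total_goals = goals_scored + goals_allowed
    if total_goals == 0:
        # The ratio is undefined before any goal has been scored or conceded.
        raise ValueError("need at least one goal scored or allowed")
    return 1.7 * (goals_scored - goals_allowed) / total_goals + 1.35


# Hypothetical team: 80 goals scored, 30 allowed, 38-game season.
gf, ga, games = 80, 30, 38
ppg = predicted_points_per_game(gf, ga)
print(f"{ppg:.3f} points/game, {ppg * games:.1f} points over {games} games")
```

A team with an even goal difference comes out at exactly 1.35 points per game, or 51.3 points over 38 games. At the extremes the formula runs from –0.35 (no goals scored) to 3.05 (no goals allowed), so it can land slightly outside the possible 0 to 3 range for very lopsided records.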
In this article, I am going to be testing this model against the top 5 leagues in Europe over the past 10 years.
To start out, I tested the model against the Italian Serie A. Using data from the last 10 years, I ran my forecast against every team’s actual performance. Here is what I found:
This model has a correlation coefficient of 0.9648 and a root mean square error (RMSE) of 0.1137 points/game, or 4.32 points over a 38-game season. Taking two RMSEs as an approximate 95% band, this means the model can predict a Serie A team’s final points to within 8.64 points about 95% of the time.
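For readers who want to reproduce these two statistics on their own data, here is a rough Python sketch. The five predicted and actual values are made up purely for illustration and are not our Serie A data. The “95%” reading assumes the errors are roughly normally distributed, so that about 95% of them fall within two RMSEs.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# MADE-UP points-per-game values for five teams (illustration only).
predicted_ppg = [2.10, 1.75, 1.40, 1.20, 0.95]
actual_ppg = [2.25, 1.65, 1.45, 1.05, 1.00]

error_per_game = rmse(predicted_ppg, actual_ppg)
r = pearson_r(predicted_ppg, actual_ppg)
print(f"toy data: r = {r:.4f}, RMSE = {error_per_game:.4f} points/game")

# Scaling the reported Serie A RMSE to a 38-game season.
season_rmse = 0.1137 * 38      # points per season
band_95 = 2 * season_rmse      # approximate 95% band (two RMSEs)
print(f"Serie A: {season_rmse:.2f} pts/season, within {band_95:.2f} pts")
```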
We then performed the same steps on the past 10 years of La Liga data. Here’s what the top division of Spain gave us:
La Liga had a correlation coefficient of 0.9589 and an RMSE of 0.1276 points/game, or 4.85 points per season. This forecasts a La Liga team’s final points to within 9.7 points about 95% of the time. Even so, by RMSE La Liga was the least accurate of the 5 leagues we tested.
As for the English Premier League, we were able to gather data from the past 24 years, and we once again got very encouraging results.
In this case, the correlation coefficient came out to 0.9546, and the RMSE was 0.1226 points/game, or 4.66 points per season. Therefore, the model predicted final points to within 9.32 points 95% of the time.
The Bundesliga was the only league we studied that had 34 games as opposed to the typical 38 played in other leagues. However, because our model operates in points per game, this was no problem.
In fact, our prediction for this league was one of the most accurate, with a correlation coefficient of 0.9547 and an RMSE of 0.1232 points/game, or 4.19 points per season. This means that 95% of the time, we predicted a German team’s final points to within just 8.38 points.
The final league we looked at was Ligue 1, the French top division.
Ligue 1 produced a correlation coefficient of 0.9508 and an RMSE of just 0.1145 points/game, or 4.35 points per season. This means that 95% of the time, our predicted values were within 8.7 points of the actual results.
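The per-season figures for all five leagues follow from the per-game RMSE in the same way, which the short Python sketch below walks through. The RMSE values are the ones reported above; the helper name is hypothetical, and using 38 games for every Premier League season in the sample is an assumption of this sketch.

```python
# (league, RMSE in points per game, games per season)
# ASSUMPTION: 38 games is used for every Premier League season.
leagues = [
    ("Serie A", 0.1137, 38),
    ("La Liga", 0.1276, 38),
    ("Premier League", 0.1226, 38),
    ("Bundesliga", 0.1232, 34),
    ("Ligue 1", 0.1145, 38),
]


def season_error(rmse_per_game, games):
    """Return (RMSE in points per season, approximate 95% band of two RMSEs)."""
    season_rmse = rmse_per_game * games
    return season_rmse, 2 * season_rmse


for name, rmse_per_game, games in leagues:
    season_rmse, band = season_error(rmse_per_game, games)
    print(f"{name:<15} {season_rmse:5.2f} pts/season   within {band:5.2f} pts")
```

Because the model works in points per game, the only league-specific input is the number of games, which is why the 34-game Bundesliga needs no special handling.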
This is not only an accurate Pythagorean model, but it is also very flexible, as we saw by running it through various leagues. This model can also be used mid-season to see whether a team is underperforming or overperforming its expected points per game, and that can help to predict whether it will improve or worsen in the latter part of a season. As I mentioned earlier, the baseball Pythagorean Expectation is usually accurate to within about 10 wins. With this model, an interval of around 8.5 points is just under 3 wins’ worth of points over the total season, and that holds about 95% of the time.
Up until now the most accurate model we saw had an RMSE of 4.7 pts/season (ours is around 4.25 on average), and that model only worked for leagues with 38 games. In addition, it could only be used after all games had been played. So, while creating the most accurate “Pythagorean model” for soccer, we also developed a tool that can be used to figure out which teams have been the “luckiest and unluckiest” given their performances, and also to forecast how a team will perform for the remainder of the season (using the assumption that a team will regress towards its expected points per game value).
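A mid-season check along these lines could be sketched in Python as follows. The function names, the dictionary keys, and the sample team are all hypothetical, and the projection simply assumes the team earns exactly its expected points per game for the rest of the season.

```python
def predicted_points_per_game(goals_scored, goals_allowed):
    """Points per game = 1.7 * (GS - GA) / (GS + GA) + 1.35."""
    return 1.7 * (goals_scored - goals_allowed) / (goals_scored + goals_allowed) + 1.35


def midseason_report(points, games_played, goals_scored, goals_allowed, season_games=38):
    """Compare actual points with expected points and project the final total.

    ASSUMPTION: for the remaining games the team performs at its
    expected points-per-game rate (full regression to the model).
    """
    expected_ppg = predicted_points_per_game(goals_scored, goals_allowed)
    expected_points = expected_ppg * games_played
    remaining = season_games - games_played
    return {
        "expected_points": expected_points,
        "luck": points - expected_points,  # positive = overperforming ("lucky")
        "projected_final": points + expected_ppg * remaining,
    }


# Hypothetical team at the halfway mark: 40 points from 19 games, 30 goals for, 20 against.
report = midseason_report(points=40, games_played=19, goals_scored=30, goals_allowed=20)
for key, value in report.items():
    print(f"{key}: {value:.2f}")
```

A positive “luck” value marks a team that has collected more points than its goals suggest, so under this assumption its pace would be expected to slow over the remaining games.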
It will be interesting to put this to the test in the upcoming 2017-18 seasons, and we expect to find high accuracy in leagues all around the world. While this model isn’t perfect, it’s very close.
Authors: Nikhil Mehta, Michael Berman