This post walks through the process we developed to rank the Top 500 soccer clubs in the CONCACAF region. As with our ranking of the top 100 soccer clubs in the United States and Canada, the Elo rating system drives the ranking. This time, however, we paired the Elo system with the Poisson distribution to set the crucial weights that the Elo system uses.
This rating includes the results of over 22,500 competitive games played in more than 35 countries, from the 2011 season through December 2014. Only league games count toward the final ranking, but tournament games were used to set the initial weights and the weights of individual games.
The Elo rating system, developed by Hungarian-American mathematician Dr. Árpád Élő, is used by FIDE, the international chess federation, to rate chess players. In 1997 Bob Runyan adapted the Elo rating system to international football and posted the results on the Internet. Given how the soccer intelligentsia romanticizes a connection between soccer and chess, the use of this formula makes sense.
Footballdatabase.com uses Bob Runyan's methodology to rank club teams around the world. FIFA uses a modified version of Bob Runyan's adaptation to rank Women's International Soccer teams. The Elo system was adapted for soccer by adding a weighting for the kind of match, an adjustment for the home team advantage, and an adjustment for goal difference in the match result.
Each team starts with an initial score which sets an expectation for the quality of the team and the competition they will face. This initial weighting is critical to the final outcome, so we compiled hundreds of games between the teams and leagues to get the weights as justifiable as possible.
After each team has an initial weight, each game results in an adjustment to the team's score based on the following factors:
- The team's old rating
- The importance or weight of the match
- The goal difference of the match
- The result of the match including home field advantage
- The expected result of the match
The details of the actual score are below but first we'll begin with how the model works and adjustments we've made to make the ranking as accurate as possible.
Strengths of the Rating System
The advantage of the Elo system is that it is a thorough assessment of a team's historical results. The system uses the results of every league match, including the goal difference, and weights the outcome based on the importance and the expected outcome.
Weaknesses of the Rating System
The importance of each match is pre-determined by the rating designer. Therefore the weighting of each match could potentially have a subjective element embedded in the score.
In addition, each team must start with a score. That score is pre-determined by the rating designer based on the perceived strength of the team's schedule. Teams with a stronger schedule will start with a higher score, and teams with a weaker schedule will start with a lower score. The difference between these starting scores is important and could be a subjective choice by the designer.
The goal difference factor as determined in the original adaptation was very strong. A team that wins a game by two goals will get 50% more points for that outcome than if they had won by 1 goal. A three goal difference gets 75% more than 1 goal difference.
Changes to the Elo System to Address Weaknesses
The key to a good statistical ranking is a solid estimate of the quality of the leagues the teams play in. Each league's strength is used to set the initial weights and game weights used by Elo. You can look at the methodology for how that was estimated here. The short version is that games from the CONCACAF Champions League, US Open Cup and Copa MX tournaments drive the majority of the weights used here. We created a goal difference index (GDI) to estimate the expected goal difference if average teams from two leagues played each other on a neutral site. For example, if an average Liga MX team played an average MLS team on a neutral site, our model estimates that the Liga MX team would win by 1.0 goals on average.
The use of the Poisson distribution
But linking goal difference to the critical Elo initial weights is challenging. This is where the Poisson distribution comes in. Using Poisson, we can use the expected goals of the two leagues to drive a likelihood of winning percentage. This is a common distribution used to predict the outcome of soccer matches. Just input expected goals scored and voila, you can estimate the probability of all scoring outcomes occurring. Elo scores are used to predict winning percentage as well. So we can use Poisson and goal expectation to get a winning percentage and then take that winning percentage and back into initial weights.
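The Poisson step can be sketched in a few lines of Python. This is a minimal illustration rather than the exact model behind the ranking: it assumes the two teams' goal counts are independent Poisson variables, and the expected-goal inputs in the example (a 1.0-goal gap split around an assumed 1.35 goals per team) are hypothetical values chosen only to show the mechanics.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k goals when lam goals are expected."""
    return exp(-lam) * lam ** k / factorial(k)


def win_expectancy(lam_a, lam_b, max_goals=15):
    """Win probability for team A plus half the draw probability.

    Assumes independent Poisson goal counts. Scorelines above
    max_goals per team are ignored (negligible for soccer scores).
    """
    win = 0.0
    draw = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, lam_a) * poisson_pmf(b, lam_b)
            if a > b:
                win += p
            elif a == b:
                draw += p
    return win + 0.5 * draw


# Hypothetical expected goals: the weaker side trails by 1.0 goals.
weaker_side = win_expectancy(0.85, 1.85)
print(f"{weaker_side:.0%}")
```

Counting a draw as half a win keeps the result on the same scale as the Elo expected result, which is what allows the two to be compared.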
Here is a table that walks through that process for a number of the leagues included.
Starting Elo Weight (Elo/Poisson) | League | GDI vs. Liga MX | Win % vs. Liga MX |
1550 | Liga MX | 0.0 | 50% |
1425 | Ascenso MX | 0.9 | 33% |
1400 | MLS | 1.0 | 30% |
1335 | Costa Rica | 1.5 | 23% |
1320 | Panama | 1.6 | 21% |
1310 | Honduras | 1.7 | 20% |
1300 | NASL | 1.8 | 19% |
1255 | Guatemala | 2.1 | 16% |
1185 | El Salvador | 2.4 | 11% |
1180 | USL | 2.4 | 11% |
1150 | Nicaragua | 2.7 | 9% |
1020 | Trinidad & Tobago | 3.1 | 5% |
970 | Haiti | 4.2 | 3% |
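Going from a win percentage back to a starting weight amounts to inverting the Elo expected-result formula. The sketch below assumes that is how the conversion was done, with Liga MX's 1550 as the reference point and no home advantage; the function name `starting_weight` is ours, for illustration only.

```python
from math import log10

LIGA_MX_START = 1550  # reference starting weight, taken from the table


def starting_weight(win_pct, reference=LIGA_MX_START):
    """Elo rating whose expected result against `reference` is win_pct."""
    # Solve W_e = 1 / (10 ** (-dr / 400) + 1) for dr on a neutral site.
    dr = 400 * log10(win_pct / (1 - win_pct))
    return reference + dr


for league, pct in [("Liga MX", 0.50), ("MLS", 0.30), ("Nicaragua", 0.09)]:
    print(league, round(starting_weight(pct)))
```

Because the percentages in the table are rounded to whole numbers, this lands within a few points of the published starting weights rather than matching them exactly.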
The initial and game weights of the 40 leagues included are as follows:
Country - Division | Elo Starting Weight | Elo Game Weight |
Mexico - Liga MX | 1550 | 30 |
Mexico - Ascenso MX | 1425 | 26 |
USA & Canada - MLS | 1400 | 25 |
Costa Rica - Primera | 1335 | 23 |
Guatemala - Liga Nacional | 1255 | 20 |
Panama - LPF | 1320 | 22 |
Honduras - Liga Nacional | 1310 | 22 |
USA & Canada - NASL | 1300 | 22 |
Costa Rica - Ascenso | 1110 | 15 |
Trinidad & Tobago - T&T Pro League | 1020 | 12 |
El Salvador - Primera Division | 1185 | 18 |
Nicaragua - Primera Division | 1150 | 17 |
USA & Canada - USL PRO | 1180 | 18 |
Guatemala - Primera Division | 1030 | 13 |
Jamaica - Premier League | 1000 | 12 |
Puerto Rico - LNFPR | 1000 | 12 |
Panama - Liga Nacional | 1095 | 15 |
Haiti - Championnat National | 1000 | 12 |
Guyana - Super League | 950 | 10 |
Mexico - Segunda Division | 1000 | 12 |
Antigua and Barbuda - Premier | 800 | 5 |
Aruba - Division di Honor | 800 | 5 |
Bahamas - Senior League | 800 | 5 |
Barbados - Premier | 800 | 5 |
Belize - Premier | 800 | 5 |
Bermuda - Premier | 800 | 5 |
British Virgin Islands - Premier | 800 | 5 |
Cayman Islands - Premier | 800 | 5 |
Cuba - Primera | 800 | 5 |
Turks and Caicos Islands - Football League | 800 | 5 |
Suriname - Hoofdklasse | 800 | 5 |
St. Kitts and Nevis - Premier League | 800 | 5 |
Martinique - Division d'Honneur | 800 | 5 |
Guadeloupe - Division d'Honneur | 800 | 5 |
Grenada - Premier Division | 800 | 5 |
French Guiana - Division d'Honneur | 800 | 5 |
Dominican Republic - Liga Mayor | 800 | 5 |
Dominica - Premiere League | 800 | 5 |
Curaçao - Sekshon Paga | 800 | 5 |
Suriname - Eerste Klasse | 770 | 4 |
A couple of quick notes. Second divisions, with the exception of Ascenso MX, which had good data, were weighted 225 points lower at the start. The low score of 800 was based on the expected win percentage between Liga MX teams and these teams. The game weights were calculated by assuming that game weights range from 5 to 30 and initial weights range from 800 to 1550, with everything in between interpolated. Suriname's second division was placed below these levels. Also, for playoff games in each league, 10 points were added to the game weight.
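The game-weight rule described here can be written as a straight-line interpolation between the two endpoints. The sketch below assumes rounding to the nearest whole number, which reproduces the game weights in the table; the function name `game_weight` is illustrative.

```python
MIN_START, MAX_START = 800, 1550  # range of initial weights
MIN_K, MAX_K = 5, 30              # range of game weights
PLAYOFF_BONUS = 10                # added to the game weight for playoff games


def game_weight(start, playoff=False):
    """Game weight for a league with the given starting weight."""
    slope = (MAX_K - MIN_K) / (MAX_START - MIN_START)
    k = round(MIN_K + (start - MIN_START) * slope)  # rounding is an assumption
    return k + PLAYOFF_BONUS if playoff else k


print(game_weight(1400))                # MLS regular season
print(game_weight(1400, playoff=True))  # MLS playoff game
```

The same line extended below 800 gives a weight of 4 for a starting weight of 770, matching the entry for Suriname's second division.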
This initial weight represents the value for R_o (below) for the team's first match in the database.
The Basic Calculation
The Elo system has one formula which takes into account the factors mentioned above.
The ratings are based on the following formula:
R_n = R_o + K*G (W - W_e)
Where:
R_n = The new team rating
R_o = The old team rating
K = Weight index regarding the tournament of the match
G = A number from the index of goal differences
W = The result of the match
W_e = The expected result
Goal Differential = G
The number of goals is taken into account through a goal difference index. G is increased by 25% if a game is won by two goals. If the game is won by three or more goals, G is calculated from the formula shown below:
If the game is a draw or is won by one goal
G = 1
If the game is won by two goals
G = 1.25
If the game is won by three or more goals
G = (11+N)/10
Where N is the goal difference
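The three cases translate directly into a small function. This is a sketch of the rules as stated above; the name `goal_index` is ours.

```python
def goal_index(goal_diff):
    """Goal difference index G for a match decided by goal_diff goals."""
    n = abs(goal_diff)  # the margin matters, not which side won
    if n <= 1:
        return 1.0      # draw or one-goal win
    if n == 2:
        return 1.25     # two-goal win
    return (11 + n) / 10  # three or more goals


for margin in range(6):
    print(margin, goal_index(margin))
```

Under these rules each additional goal beyond three adds 0.1 to G, so a blowout moves the ratings only modestly more than a comfortable win.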
Result of the Match = W
W is the result of the game (1 for a win, 0.5 for a draw, and 0 for a loss).
Expected Result of Match = W_e
W_e is the expected result (win expectancy with a draw counting as 0.5) from the following formula:
W_e = 1 / (10^(-dr/400) + 1)
Where dr equals the difference in ratings plus 100 points for a team playing at home. So dr of 0 gives 0.5, of 120 gives 0.666 to the higher ranked team and 0.334 to the lower, and of 800 gives 0.99 to the higher ranked team and 0.01 to the lower.
This formula is calculated for each team in each game and the resulting score is carried forward to set the expectation for each team's next match.
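Putting the pieces together, one rating update might look like the sketch below. The formulas are the ones given above; the match itself (two 1400-rated teams, K of 25, a two-goal home win) is a hypothetical example, and the function names are illustrative.

```python
HOME_ADVANTAGE = 100  # points added to the home team's rating in dr


def expected_result(dr):
    """W_e for a team whose adjusted rating difference is dr."""
    return 1 / (10 ** (-dr / 400) + 1)


def updated_rating(r_old, k, g, w, w_e):
    """R_n = R_o + K * G * (W - W_e)."""
    return r_old + k * g * (w - w_e)


# Hypothetical match: a 1400-rated home side beats a 1400-rated visitor by two goals.
home_old, away_old, k, g = 1400, 1400, 25, 1.25
home_we = expected_result((home_old + HOME_ADVANTAGE) - away_old)
away_we = expected_result(away_old - (home_old + HOME_ADVANTAGE))

home_new = updated_rating(home_old, k, g, 1.0, home_we)
away_new = updated_rating(away_old, k, g, 0.0, away_we)
print(round(home_new, 1), round(away_new, 1))
```

When both teams share the same K and G, the points one side gains are exactly the points the other side loses, because the two expected results always sum to 1.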
Calculating the Probability of Game Outcomes
The expected result formula can be used to predict the outcome of games between two scored teams.
Take an average USL PRO team with 1100 points at home against an average MLS team with 1400 points. Add 100 to the USL PRO team for home field advantage.
To calculate the USL PRO team's odds of winning (with a draw worth 0.5 points), use the formula above:
W_e = 1 / (10^(-dr/400) + 1)
The dr for the USL PRO team is 1200-1400 = -200. Plugging -200 into the equation yields a result of 24%. Plugging in 200 would yield the result of 76%, the odds of winning for the MLS side.
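Written out step by step, using only the numbers from the example above, the arithmetic for the USL PRO side is:

```latex
\begin{aligned}
dr  &= (1100 + 100) - 1400 = -200 \\
W_e &= \frac{1}{10^{-(-200)/400} + 1}
     = \frac{1}{10^{0.5} + 1}
     \approx \frac{1}{3.162 + 1}
     \approx 0.24
\end{aligned}
```

The MLS side's figure is the complement, 1 - 0.24 = 0.76, since the two expected results always sum to 1.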