Thirty-two of Europe’s best soccer teams recently competed in the 2021/22 UEFA Champions League season. Real Madrid won the competition after beating Liverpool in the final. The competition is over, but it is ‘immortalized’, if you will, in the form of statistics. We can look at the stats to get an idea of how the competition went and how the clubs performed.
In this analysis, I processed and visualized club statistics from the UEFA Champions League 2021/22. The questions I try to address are:
• How are the data distributed for each stat?
• How does a club’s final standing in the competition correlate with all other stats?
• How do individual stats correlate with each other?
• If we summarize all of the stats so that each club’s data are represented as coordinates in three-dimensional space (using the Principal Component Analysis technique), what would the data look like?
• How do the stats for each club compare with each other?
• Are these data useful to identify or guess the winning teams?
The dashboard of visualizations I built can be found below, followed by my findings (readers can jump directly to the findings part). Readers can also explore the data themselves using the visualizations and may discover different insights.
I analyzed 36 stats for each of the 32 clubs that participated in the finals round of the competition. The stats are:
• Key Stats: Won (total number of wins).
• Goals: Goals, Goals Right Foot (goals scored with the right foot), Goals Left Foot (goals scored with the left foot), Goals Head (goals scored with the head), Goals Other (goals scored with body parts other than the foot or head), Goals Inside Area (goals scored from inside the penalty area), Goals Outside Area (goals scored from outside the penalty area), Penalties Scored.
• Attempts: Total Attempts (on goal), Shots on Target, Shots Off Target, Shots Woodwork (shots hitting only the goal frame: post or crossbar), Shots Crossbar (shots hitting only the top part of the goal frame), Shots Post (shots hitting only the side part of the goal frame), Shots Blocked.
• Distribution: Passing Accuracy, Passes Attempted, Passes Completed, Possession, Crossing Accuracy, Crosses Attempted, Crosses Completed, Free-Kicks Taken.
• Attacking: Attacks, Assists, Corners Taken, Offsides, Dribbles.
• Defending: Balls Recovered, Tackles, Tackles Won, Tackles Lost, Clearances Attempted.
• Goalkeeping: Saves, Goals Conceded, Own Goals Conceded, Saves from Penalties, Clean Sheets, Punches Made (punches made by goalkeeper to block the ball).
• Disciplinary: Fouls Committed, Fouls Suffered, Yellow Cards, Red Cards.
The clubs are:
• England: Liverpool, Chelsea, Manchester City, Manchester United
• Austria: Salzburg
• Spain: Atletico, Real Madrid, Villarreal, Barcelona, Sevilla
• Netherlands: Ajax
• Germany: Bayern, Leipzig, Wolfsburg, Dortmund
• Portugal: Benfica, Porto, Sporting CP
• France: LOSC, Paris
• Switzerland: Young Boys
• Italy: Inter, Juventus, Milan, Atalanta
• Sweden: Malmo
• Ukraine: Shakhtar Donetsk, Dynamo Kyiv
• Moldova: Sheriff
• Turkey: Besiktas
• Belgium: Club Brugge
• Russia: Zenit
Stats Data Distribution
To visualize how the data are distributed across clubs, I divided each stat by the number of matches played (except for Passing Accuracy, Possession, and Crossing Accuracy, which are already percentages). The term ‘rate’ was then added to the name of each resulting stat. This rate is the average value of the stat per match for each club. Without this calculation, the stats would be heavily influenced by the number of matches played. For example, clubs that played more matches tend to have more total goals than clubs that played fewer.
My assumption is that the performance of each club is generally well reflected in the per-match averages. This assumption would be flawed if many of the clubs with few matches performed significantly worse or significantly better in those matches than they normally do. For example, a club that played only a handful of matches might have committed more fouls in them than it usually does in matches outside the Champions League.
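The per-match calculation can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the script from the repository: the function name `per_match_rates`, the `' Rate'` suffix format, and the example numbers are all assumptions made for this sketch.

```python
# Stats that are already percentages and are therefore not divided
# by the number of matches played.
PERCENTAGE_STATS = {"Passing Accuracy", "Possession", "Crossing Accuracy"}


def per_match_rates(club_stats, matches_played):
    """Return a new dict with count stats expressed per match.

    Count stats are divided by matches played and renamed with a
    ' Rate' suffix; percentage stats are copied unchanged.
    (Hypothetical helper, named for this example only.)
    """
    if matches_played <= 0:
        raise ValueError("matches_played must be positive")
    rates = {}
    for name, value in club_stats.items():
        if name in PERCENTAGE_STATS:
            rates[name] = value
        else:
            rates[f"{name} Rate"] = value / matches_played
    return rates


# Made-up numbers for illustration only, not taken from the UEFA data.
example_club = {"Goals": 12, "Fouls Committed": 60, "Possession": 55.0}
example_rates = per_match_rates(example_club, matches_played=6)
print(example_rates)
```

Keeping the percentage stats in a separate set makes the exception explicit, so a new count stat is converted to a rate by default.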
I calculated the correlations between the final standing (result) of each club and all other stats, as well as the correlations among the other stats themselves. The result variable encodes the final standing: a value of n means the club played its last match in the round of n teams. For example, Real Madrid and Liverpool both reached the final, the round of 2 teams, so both have a result of 2. The lower the result number, the higher the club’s position. The correlations were calculated using the scikit-learn (sklearn) Python library. The correlations with the result give an idea of which stats go hand in hand with good or bad results, and which do not seem to be related to the result at all. Correlation does not necessarily imply causation or predictive power. The correlations were calculated here only to give a general idea of the connections between the stats.
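To make the idea concrete, here is a small sketch of one correlation between the result variable and a single stat. It assumes the Pearson coefficient, which the text does not specify. It uses only the standard library rather than scikit-learn, and the five data points are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for five clubs, for illustration only.
# result: the round of n teams in which the club played its last match
# (lower is better).
result = [2, 4, 8, 16, 32]
goals_rate = [2.2, 2.0, 1.6, 1.3, 0.8]

r = pearson_r(result, goals_rate)
print(f"r = {r:.2f}")
```

Here r is negative because the made-up goals rate falls as the result number grows. Since a lower result number means a better standing, the sign of each coefficient has to be read with that encoding in mind.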
Summary of Stats Using Principal Component Analysis (PCA)
To summarize all the stats for each club, I applied Principal Component Analysis (PCA) to the data. PCA condenses the 36 stats, through a mathematical transformation, into only three variables for each of the 32 clubs. I then visualized these three variables in a three-dimensional plot to see whether any patterns appear, for example, whether the clubs cluster according to their results, countries, or playing styles. The PCA was conducted using the scikit-learn library. More details about Principal Component Analysis can be found at Wikipedia.
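The reduction from 36 stats to three coordinates can be sketched as below. This is an illustration of the technique rather than the code used for the article. It uses NumPy's SVD instead of scikit-learn. It assumes the columns are standardized before the projection, which the text does not state. The input is random placeholder data with the same shape as the article's table (32 clubs by 36 stats).

```python
import numpy as np


def pca_project(X, n_components=3):
    """Project the rows of X onto the first n_components principal components.

    Columns are standardized first (zero mean, unit variance); that step is
    an assumption of this sketch. The principal directions come from an SVD
    of the standardized matrix.
    """
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    coords = Z @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return coords, explained[:n_components]


# Random placeholder data, NOT the real UEFA figures.
rng = np.random.default_rng(seed=0)
X_demo = rng.normal(size=(32, 36))

coords, explained_ratio = pca_project(X_demo, n_components=3)
print(coords.shape)      # one row of three coordinates per club
print(explained_ratio)   # share of total variance kept by each component
```

With scikit-learn, `PCA(n_components=3).fit_transform(...)` from `sklearn.decomposition` performs the equivalent projection. The explained-variance ratios are worth checking, because three components may capture only part of the variation in 36 stats.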
Lastly, I plotted all the stats for each club in a way that enables quick visual comparison of each club’s performance. To produce these visualizations, I first standardized each stat so that all the stats are on the same scale, using the StandardScaler class provided in the scikit-learn library. After that, I visualized the standardized stats as a radar chart for each club.
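What standardization does to each stat can be shown with a few lines of standard-library Python. This sketch mirrors the default behaviour of StandardScaler (subtract the mean, divide by the population standard deviation), but the function name and the sample values are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Z-score a list of numbers: subtract the mean, divide by the std.

    Uses the population standard deviation, which matches the default
    behaviour of scikit-learn's StandardScaler. A constant column maps
    to all zeros.
    """
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up per-match rates for four clubs, for illustration only.
goals_rate = [2.0, 1.5, 1.0, 0.5]
passes_rate = [650.0, 500.0, 450.0, 400.0]

z_goals = standardize(goals_rate)
z_passes = standardize(passes_rate)
print([round(z, 2) for z in z_goals])
print([round(z, 2) for z in z_passes])
```

After this step, goals and passes are both expressed in standard deviations from the average club. That is what allows stats with very different raw magnitudes to share the axes of one radar chart.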
Finding #1: Most of the stats look roughly normally distributed. Some of them may reveal clusters.
Most of the stats show a rough bell-curve shape, which is characteristic of a normal distribution. Several stats look as if they contain identifiable clusters, for example Goals Outside Area, Passes Completed, Passing Accuracy, Balls Recovered, and Fouls Committed. Such patterns may suggest that a distinct group of clubs played more aggressively or relied more on a passing style of play.
Finding #2: The stats most highly correlated with a club’s final standing were mostly obvious factors. Several of the correlations may be worth more investigation.
Stats such as the number of wins and goals are obviously correlated with the final standing (result). Such obvious correlations are less insightful. One interesting correlation is that the saves rate is negatively correlated with the result. This may indicate that many of the clubs with poorer results had goalkeepers who performed better than their defenders.
Finding #3: In the PCA summary, many clubs with good final standings are grouped together. Other than this, there were no identifiable clusters.
Of the eight clubs with a final standing of less than 5, more than half (five clubs) sit relatively close to each other in the PCA plot. This may suggest that features commonly shared by the winning clubs are reflected in the stats.
I did not investigate the club profiles or the correlations among all of the stats in much depth. The many details in them would require more meticulous observation. I welcome readers to have a look at them and find insights in the details.
To conclude this analysis, let’s address the question: Are these data useful to identify or guess the winning teams?
Based on this analysis, I think the stats are useful for identifying or guessing the winning teams if you know which specific stats to look for. I think most of the stats were collected only to create talking points. Talking points are important too, though. Football would be less fun if there were fewer things to talk about, or if there were data that could straightforwardly predict the winning teams.
The visualizations for this analysis were built using Microsoft Power BI and the Seaborn Python library. The raw data were copied from the UEFA website. The data processing was done in Python. The data processing script, the raw data, and the processed data can be found in my GitHub repository.