FIFA 22 Dataset Analyzed

Selvyn Allotey
11 min read · Oct 2, 2021

Extra: Using R to explore FIFA 22 Dataset

Playstation

FIFA 22 Is Out!

FIFA 22 is finally out! Luckily, I pre-ordered the game, so I have been playing it for quite some time. Before the release, FIFA published its usual player ratings, and this got me thinking: how does FIFA rate its players?

The official FIFA 22 ratings list Messi as the highest-rated player in the game. As a die-hard Ronaldo fan, I decided to do a little sleuthing into the players and their ratings to figure out who the best player in the game is and where to find the best players.

I found a great data set on Kaggle covering all the players. It includes player information and in-game stats, which supported my exploratory analysis of the best player in the game.

Walk Through Of The Exploratory Analysis

Basic Data Prep

I started by importing my favourite R libraries. The tidyverse package gives you powerful data manipulation and visualization tools, since it includes both dplyr and ggplot2. I also loaded e1071 for its skewness() function. You can read about these libraries using the links attached if you have no experience with them.

library(tidyverse)
library(e1071)
#Useful Function: returns the most frequent value (the mode) of a vector
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
#Import Data
players <- read.csv("players_fifa22.csv")
teams <- read.csv('teams_fifa22.csv')

Alright, the code above imports the data set into R. Before I start, I usually like to take a quick look at the data and its structure. This is easily done with the summary() and str() functions in R. They let you quickly identify the measures of centrality, the NA values in the dataset, and the data types of its variables. I found a few NA values and decided to drop them for this analysis.

#Check Structure
summary(players)
str(players)
#Drop every row that contains at least one NA value
players <- players %>% drop_na()
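Because drop_na() removes every row with a missing value in any column, it can be worth checking how many rows that will cost before dropping. Here is a minimal sketch of that check using base R on a made-up toy data frame (the `toy` values are illustrative assumptions, not taken from the Kaggle file):

```r
# Toy stand-in for the players data; all values are made up for illustration.
toy <- data.frame(
  Name    = c("A", "B", "C", "D"),
  Age     = c(21, NA, 30, 27),
  Overall = c(70, 82, NA, 65),
  Club    = c("X", "Y", "Z", NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that would survive dropping every row with at least one NA
complete_rows <- sum(complete.cases(toy))
cat("Rows before:", nrow(toy), "| rows after dropping NAs:", complete_rows, "\n")
```

If a single column accounts for most of the missing values, dropping or ignoring that column may preserve far more players than dropping rows.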

Exploratory Analysis

I usually go through the variables one at a time to try and develop questions and find answers to them. Considering this dataset had many variables, I only selected a few I thought would be interesting to find answers to.

What is the distribution of player ages?

The distribution of player ages was quite symmetrical. I used the skewness() function to check the symmetry of the distribution and to decide which measure of centrality would be appropriate. Given the symmetry of the histogram, the mean is a suitable way to find a typical value in this set. I discovered that, on average, players in the game are about 27 years old.

#Histogram of Player ages
ggplot(data = players, aes(Age)) +
# Histogram with 2-year-wide bins running from 0 to 60
geom_histogram(breaks=seq(0, 60, by=2), colour = "black", fill = "dodgerblue")
#Checking Symmetry
skewness(players$Age)
#Finding Centrality
mean(players$Age)
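To make the symmetry check concrete, here is a small base-R sketch of a moment-based skewness calculation on made-up vectors. The helper `moment_skewness` is a hypothetical name of my own; the skewness() function in e1071 applies a sample-size adjustment depending on its `type` argument, so its output may differ slightly from this simplified version.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up ages for illustration only
symmetric_ages <- c(19, 22, 25, 27, 29, 32, 35)
right_skewed   <- c(18, 19, 19, 20, 21, 23, 40)

# Close to zero for symmetric data, so the mean is a fair "typical" value
moment_skewness(symmetric_ages)

# Clearly positive for right-skewed data, where one large value
# pulls the mean above the median
moment_skewness(right_skewed)
mean(right_skewed)
median(right_skewed)
```

A value near zero supports using the mean; a value far from zero suggests the median or mode is the safer summary.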

Besides this, I was also interested in the distributions of player potential, total stats, and overall rating. I explored them in practically the same way as the player ages. I found the average overall rating of all the players to be about 76. For the total stats variable I used the mode instead, because the skewness check showed it to be extremely skewed to the left. A typical value in that set would be 2227.

#Histogram for Potential
gr1 <- ggplot(data = players, aes(Potential)) + geom_histogram(bins = 12, colour = "black", fill = "dodgerblue")+
labs(title = "Histogram of Player Potential")
# Histogram for Overall
gr2 <- ggplot(data = players, aes(Overall)) + geom_histogram(bins = 12, colour = "black", fill = "dodgerblue") +
labs(title = "Histogram of Player Overall")
# Histogram for TotalStats
gr3 <- ggplot(data = players, aes(TotalStats)) + geom_histogram(bins = 12, colour = "black", fill = "dodgerblue") +
labs(title = "Histogram of Player Total Stats")
# Arrange the three histograms in one figure (requires the ggpubr package)
ggpubr::ggarrange(gr1, gr2, gr3)
#Checking symmetry of data
skewness(players$Potential)
skewness(players$Overall)
skewness(players$TotalStats)
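#Finding Centrality
mean(players$Potential)
mean(players$Overall)
#TotalStats is skewed, so look at the median and the mode rather than the mean
median(players$TotalStats)
getmode(players$TotalStats)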

Moreover, I wanted to find out who the oldest players in the game are. I did this using the dplyr package from the tidyverse to list the 10 oldest players. It did not come as a surprise that Cristiano Ronaldo was among them. Not everyone can stay at the top level as long as he has.

#Oldest Player
players %>%
select(Name, Age) %>%
arrange(desc(Age)) %>%
slice(1:10)

Furthermore, I was curious about the most and least represented countries in the game, so I created some bar plots to visualize them.

England was the most represented country in FIFA 22. This is probably due to the media coverage of the English Premier League. There might be players in other countries who deserve to be represented, but players who do not get media coverage would most likely not be in the game.

#Country with most players
players$Nationality <- as.factor(players$Nationality)
#Finding the top 10
top10_countries <- players %>%
group_by(Nationality) %>%
tally() %>%
arrange(desc(n)) %>%
slice(1:10)
#Finding the bottom 10
bottom10_countries <- players %>%
group_by(Nationality) %>%
tally() %>%
arrange(n) %>%
slice(1:10)
#Graph of most represented countries in FIFA 22
ggplot(top10_countries,aes(x=factor(Nationality),y=n))+
geom_col(color='black',fill='dodgerblue')+
labs(title = "Most represented countries in FIFA 22", x = "Countries", y = "Count")
#Graph of least represented countries in FIFA 22
ggplot(bottom10_countries,aes(x=factor(Nationality),y=n))+
geom_col(color='black',fill='red4')+
labs(title = "Least represented countries in FIFA 22", x = "Countries", y = "Count")

Russia was the least represented country in the game. I believe that media coverage of a country's league is a strong predictor of that country's representation.
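One caveat with the bottom-10 list: several countries may share the same lowest count, and arrange() followed by slice(1:10) simply keeps the first ten rows in whatever order the tied rows happen to appear. A hedged sketch of a tie-aware alternative, using dplyr's slice_min() on made-up data (the `toy` players and nationalities are illustrative assumptions):

```r
library(dplyr)

# Made-up players for illustration only
toy <- data.frame(
  Name = paste0("P", 1:9),
  Nationality = c("England", "England", "England", "England",
                  "Spain", "Spain", "Ghana", "Fiji", "Malta")
)

# Players per country, most represented first
country_counts <- toy %>% count(Nationality, sort = TRUE)

# slice_min() with with_ties = TRUE keeps every country tied at the lowest count
least_represented <- country_counts %>%
  slice_min(n, n = 1, with_ties = TRUE)

print(least_represented)
```

Here three countries share the lowest count, so naming a single "least represented" country would depend only on row order.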

Who is the best player?

In addition, I wanted to discover who the best player in FIFA is. Does the highest overall rating mean that a player also has the highest total stats? Would a player's potential be a better indicator of who the best player in the game is?

The Best Overall Ratings

#Highest Overall rating 
players %>%
select(Name, Overall) %>%
arrange(desc(Overall)) %>%
slice(1:10)
#Highest Total Stats
players %>%
select(Name, Overall, TotalStats) %>%
arrange(desc(TotalStats)) %>%
slice(1:10)
#Highest Potentials
players %>%
select(Name, Potential) %>%
arrange(desc(Potential)) %>%
slice(1:10)

The figure above shows the top 10 players with the highest overall ratings. One would naturally expect the top-rated player to also have the highest total stats, but to my surprise that was not the case. The figure below shows the players with the highest total stats alongside their overall ratings.

Bruno Fernandes, the Portuguese maestro, has an 88 overall but the highest total stats in the game. So this raises the question: how is FIFA calculating the overall rating? I looked this up and found an article by Goal explaining how the ratings are calculated, and it seems they factor in players' international recognition. It is unknown just how much this factor contributes to the overall rating.
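One way to put a number on how closely the two orderings agree is a rank correlation. The sketch below uses base R's cor() with the Spearman method on made-up ratings; the `toy` values are illustrative assumptions and not figures from the data set.

```r
# Made-up ratings for illustration only
toy <- data.frame(
  Name       = c("A", "B", "C", "D", "E"),
  Overall    = c(93, 91, 88, 85, 80),
  TotalStats = c(2219, 2208, 2341, 2150, 2100)
)

# Spearman correlation compares the two rankings rather than the raw values:
# 1 means identical ordering, 0 means no monotonic relationship
rank_agreement <- cor(toy$Overall, toy$TotalStats, method = "spearman")
print(rank_agreement)

# Rank 1 is the best player on each measure
toy$OverallRank <- rank(-toy$Overall)
toy$StatsRank   <- rank(-toy$TotalStats)
print(toy)
```

In this toy example player C is third by overall rating but first by total stats, which pulls the correlation below 1.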

The players with the highest potential ratings were also quite different from the players with the highest overall ratings, as shown in the figure below.

Kylian Mbappe has the highest potential in the game. So is he actually the best player in the game? It is difficult to tell: in my experience the game makes certain players overpowered, so the ratings are not really a reflection of their in-game abilities.

Top Leagues

Which league has the most players? Which league has the best players? I wondered about this as I continued looking at the data set. The answers would really help with FUT team selection and building good teams. This was the only time I used the teams data set, as I had to join it to the players data to find each player's league.

#Joining team data set to find their respective leagues
#Renaming column name to have a common column
names(teams)[2] <- 'Club'
#Joining the Dataset
players <- players %>% full_join(teams, by="Club")
#Removing unwanted columns from dataset
players <- players[, -91]
players <- players[, -c(92:102)]
#Rename Overall.x back to Overall
names(players)[9] <- 'Overall'

After joining the data set, I removed the unwanted columns and renamed a few columns to make the manipulation more consistent.
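Dropping and renaming columns by position works, but it breaks silently if the column order ever changes. A hedged alternative sketch is to select only the needed team columns by name before joining, so no clean-up is required afterwards. The toy frames below are made up, and `TeamName` is a hypothetical stand-in for whatever the second column of the teams file is actually called.

```r
library(dplyr)

# Made-up stand-ins for the two data sets
players_toy <- data.frame(
  Name    = c("P1", "P2", "P3"),
  Club    = c("Club A", "Club B", "Club A"),
  Overall = c(88, 79, 91)
)
teams_toy <- data.frame(
  TeamName = c("Club A", "Club B", "Club C"),   # hypothetical column name
  League   = c("League X", "League Y", "League X"),
  Overall  = c(85, 78, 74)                      # team rating, not needed here
)

# Keep only the join key and the league, renaming the key to match players
team_leagues <- teams_toy %>% select(Club = TeamName, League)

# left_join keeps exactly one row per player and adds no rows for
# clubs that have no players in the data
joined <- players_toy %>% left_join(team_leagues, by = "Club")
print(joined)
```

Because the team-level Overall column never enters the join, the players' Overall column keeps its name and no Overall.x or Overall.y columns are created.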

I used the tidyverse to count the players in each league, selected the top 10, and plotted the result as a bar chart with ggplot2.

#Finding Most represented leagues in the game
top10_leagues <- players %>%
group_by(League) %>%
tally() %>%
arrange(desc(n)) %>%
slice(1:10)
ggplot(top10_leagues,aes(x=factor(League),y=n))+
geom_col(color='black',fill='dodgerblue')+
labs(title = "Most represented leagues in FIFA 22", x = "Leagues", y = "Count") +
coord_flip()

As expected, the English Premier League is the most represented league in the game. Surprisingly, the only second-division league to make it into the top 10 is the English League Championship. This further illustrates just how much media coverage can increase your representation in games like FIFA 22.

Which leagues have the best players?

I found the players rated 85 and above in the top 5 leagues using the tidyverse for the manipulation and ggplot2 for the visuals.

#Players in the top 5 leagues 85 rated and above
the85andabove_league <- players %>%
filter(Overall >= 85) %>%
group_by(League) %>%
tally() %>%
arrange(desc(n))
#Leagues with high potentials
the85potentials_league <- players %>%
filter(Potential >= 85) %>%
group_by(League) %>%
tally() %>%
arrange(desc(n)) %>%
slice(1:5)
#Statistically the best
bestwithstats_league <- players %>%
filter(TotalStats >= 2100) %>%
group_by(League) %>%
tally() %>%
arrange(desc(n))%>%
slice(1:5)
#Graph representing the 85+ rating count by league
ggplot(the85andabove_league,aes(x=factor(League),y=n))+
geom_col(color='black',fill='dodgerblue')+
labs(title = "Leagues with the Best Rated Players (85+)", x = "Leagues", y = "Count")
#Graph representing the potentials by league
ggplot(the85potentials_league,aes(x=factor(League),y=n))+
geom_col(color='black',fill='dodgerblue')+
labs(title = "Leagues with High Potential Players", x = "Leagues", y = "Count")
#Graph representing best players with high total stats
ggplot(bestwithstats_league,aes(x=factor(League),y=n))+
geom_col(color='black',fill='dodgerblue')+
labs(title = "Leagues with Best Players by stats", x = "Leagues", y = "Count")

The EPL has the most highest-rated players. This makes me wonder whether the other leagues have worse players or whether the English media coverage helps with the international recognition factor when the players are rated.
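Since the EPL also has the most players overall, raw counts of 85+ players favour it by default. A minimal sketch of one way to separate the two effects is to compare the share of 85+ players per league alongside the count. The data below is made up for illustration.

```r
library(dplyr)

# Made-up ratings: League X is larger, League Y is smaller
toy <- data.frame(
  League  = c(rep("League X", 6), rep("League Y", 3)),
  Overall = c(90, 86, 85, 75, 70, 68, 89, 85, 72)
)

league_summary <- toy %>%
  group_by(League) %>%
  summarise(
    players       = n(),                  # league size
    rated_85_plus = sum(Overall >= 85),   # raw count of top-rated players
    share_85_plus = mean(Overall >= 85)   # proportion of top-rated players
  ) %>%
  arrange(desc(share_85_plus))

print(league_summary)
```

In this toy data League X has more 85+ players in absolute terms, yet League Y has the higher share, so the two views can rank leagues differently.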

Which clubs have the best players?

Manchester City appears to have the best-rated players, as well as the players with the highest potential. Surprisingly, however, the best players by total stats come from Paris Saint-Germain and Real Madrid. I would have assumed Paris Saint-Germain would rank highest, given their highly successful transfers.

#Clubs with the best players
#Clubs with players 85 rated and above
the85andabove_club <- players %>%
filter(Overall >= 85) %>%
group_by(Club) %>%
tally() %>%
arrange(desc(n)) %>%
slice(1:10)
#Clubs with high potentials
the85potentials_club <- players %>%
filter(Potential >= 85) %>%
group_by(Club) %>%
tally() %>%
arrange(desc(n)) %>%
slice(1:10)
#Clubs with Statistically the best players
bestwithstats_club <- players %>%
filter(TotalStats >= 2000) %>%
group_by(Club) %>%
tally() %>%
arrange(desc(n))%>%
slice(1:10)

Best Players by Position Using Overall Ratings

#Best Players by position
#Best ST
players %>%
select(Name, BestPosition, Overall) %>%
filter(Overall >= 85, BestPosition == "ST") %>%
arrange(desc(Overall)) %>%
slice(1:10)
#Best RW
players %>%
select(Name, BestPosition, Overall) %>%
filter(Overall >= 85, BestPosition == "RW") %>%
arrange(desc(Overall)) %>%
slice(1:10)
#Best LW
players %>%
select(Name, BestPosition, Overall) %>%
filter(Overall >= 85, BestPosition == "LW") %>%
arrange(desc(Overall)) %>%
slice(1:10)
#Best CAM
players %>%
select(Name, BestPosition, Overall) %>%
filter(Overall >= 85, BestPosition == "CAM") %>%
arrange(desc(Overall)) %>%
slice(1:10)
#Best CMs
players %>%
select(Name, BestPosition, Overall) %>%
filter(Overall >= 85, BestPosition == "CM") %>%
arrange(desc(Overall)) %>%
slice(1:10)
#Best CDM
players %>%
select(Name, BestPosition, Overall) %>%
filter(Overall >= 85, BestPosition == "CDM") %>%
arrange(desc(Overall)) %>%
slice(1:10)
#Best LB's
players %>%
select(Name, BestPosition, Overall) %>%
filter(Overall >= 85, BestPosition == "LB") %>%
arrange(desc(Overall)) %>%
slice(1:10)
#Best RB's
players %>%
select(Name, BestPosition, Overall) %>%
filter(Overall >= 85, BestPosition == "RB") %>%
arrange(desc(Overall)) %>%
slice(1:10)
#Best CB
players %>%
select(Name, BestPosition, Overall) %>%
filter(Overall >= 85, BestPosition == "CB") %>%
arrange(desc(Overall)) %>%
slice(1:10)
#Best GK
players %>%
select(Name, BestPosition, Overall) %>%
filter(Overall >= 85, BestPosition == "GK") %>%
arrange(desc(Overall)) %>%
slice(1:10)
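The ten pipelines above differ only in the position code, so they can be folded into one small helper. This is a sketch under stated assumptions: `top_by_position` and its arguments are hypothetical names of my own, and the toy data is made up.

```r
library(dplyr)

# Return the top players for one position, filtered by a minimum overall rating
top_by_position <- function(data, position, min_overall = 85, how_many = 10) {
  data %>%
    filter(BestPosition == position, Overall >= min_overall) %>%
    select(Name, BestPosition, Overall) %>%
    arrange(desc(Overall)) %>%
    head(how_many)
}

# Made-up players for illustration only
toy <- data.frame(
  Name         = c("P1", "P2", "P3", "P4", "P5", "P6"),
  BestPosition = c("ST", "ST", "LB", "ST", "LB", "GK"),
  Overall      = c(91, 84, 86, 88, 80, 90)
)

best_st <- top_by_position(toy, "ST")
print(best_st)

# Run the same query for several positions at once
positions <- c("ST", "LB", "GK")
best_by_position <- lapply(positions, function(p) top_by_position(toy, p))
names(best_by_position) <- positions
print(best_by_position)
```

With the real data, the positions vector could list all ten codes (ST, RW, LW, CAM, CM, CDM, LB, RB, CB, GK) to reproduce every table in a single call.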

ST

LW

Surprisingly, there are not many top-rated LW players: just 6 reached 85+.

RW

Likewise, just 4 RWs are rated 85+.

CAM

CM

CDM

Only 6 CDMs made this list.

LB

Shockingly, only 2 LBs are rated 85+. One would expect fullbacks to be rated higher given the work expected of a fullback defensively and, in the modern game, offensively.

RB

The same goes for the RB position. I would expect their numbers to be higher given the responsibility on them in the modern game.

CB

GK

This is the end of my exploratory analysis of this FIFA 22 data set. There were a lot of surprising results. I just might consider becoming a FIFA Talent Scout or Data Reviewer, as I believe some of the ratings are rather skewed by the international recognition factor FIFA uses. At the end of the day, I will still be playing the game regardless.
