PLdle Difficulties
PLdle is a game built on a comprehensive dataset of players sourced from Transfermarkt. This has resulted in a vast amount of data and a meticulous process to evaluate the difficulty of guessing any given player.
The initial version of this game relied solely on the number of matches played and the seasons in which the player participated. This approach is relatively simple, which makes the classification of players into difficulty levels intuitive for the person guessing.
After some testing, I was advised to implement a weighted scoring system that also accounts for the player's nationality. With over 1,500 players from England in the dataset, correctly guessing an English player on the first try becomes significantly harder due to the large pool.
The weighted scoring system led to the current difficulty calculation. I will explain this process, describe how each scoring function was determined, and provide examples of players at each difficulty level.
Overall Goal
The goal is to divide the dataset so that the easiest subset of players is categorized as Easy, the next subset as Medium, and so on. Using scoring functions for each metric, the final difficulty is calculated by assigning weights to each measure and normalizing the result to a score between 1 and 100, with 100 being the most challenging.
Games Played (Dg)
The first metric for determining difficulty is the number of matches played. Players with 0 matches are excluded, so the minimum is 1, while the maximum is 653, currently held by Gareth Barry.
To establish the scoring for this metric, I analyzed the distribution of players by the number of matches played. Many players have played only 1 match, fewer have played 2 matches, and so on, creating an inverse trend. After reviewing the data, I observed that players with around 150 matches or more tend to be more recognizable. I therefore modeled a drop-off in difficulty up to about 150 games, followed by a slower descent:
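As a rough illustration of that shape, here is a minimal Python sketch. The piecewise-linear form, the function name `games_score`, and the score of 30 at the 150-match knee are assumptions made for illustration, not the constants PLdle uses. Only the 1-match minimum, the 150-match knee, and the 653-match maximum come from the text above.

```python
def games_score(matches: int, knee: int = 150, max_matches: int = 653,
                knee_score: float = 30.0) -> float:
    """Hypothetical Dg: 100 at 1 match, knee_score at the knee, 1 at the maximum."""
    if matches < 1:
        raise ValueError("players with 0 matches are excluded")
    m = min(matches, max_matches)
    if m <= knee:
        # Steep segment: difficulty falls quickly over the first `knee` matches.
        frac = (m - 1) / (knee - 1)
        return 100.0 - frac * (100.0 - knee_score)
    # Shallow segment: difficulty falls slowly from the knee to the maximum.
    frac = (m - knee) / (max_matches - knee)
    return knee_score - frac * (knee_score - 1.0)
```

With these illustrative constants the score loses roughly 0.47 points per match before the knee and roughly 0.06 points per match after it, which is the "fast, then slow" behaviour described.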
Latest Season Played (Ds)
The second metric evaluates difficulty based on the player's most recent season. The dataset covers players from the inaugural Premier League season in 1992/1993 up to the 2023/2024 season. Approximately 100-200 players concluded their Premier League careers each season, except in 2023/2024, where around 800 players are considered to be in their final season because the dataset ends there.
Because the distribution of players by their final season is relatively linear, I chose a logistic function to model a slower "drop-off" in difficulty for the later seasons. The difficulty is already high for these players, and the function reflects this trend:
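For readers who want to experiment with a curve like this, a minimal sketch of a logistic score follows. It assumes that an earlier final season means a higher difficulty and that a season is identified by its starting year (1992 for 1992/1993). The midpoint of 2008 and the steepness of 0.25 are hypothetical values chosen for illustration only.

```python
import math

def season_score(last_season_start: int, midpoint: float = 2008.0,
                 steepness: float = 0.25) -> float:
    """Hypothetical Ds: a decreasing logistic curve scaled to the range 1..100."""
    # The logistic term is close to 1 for early seasons and close to 0 for late ones.
    logistic = 1.0 / (1.0 + math.exp(steepness * (last_season_start - midpoint)))
    return 1.0 + 99.0 * logistic
```

The curve flattens at both ends, so the score changes little between the most recent seasons and changes fastest around the midpoint.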
Nationality Uniqueness (Dn)
The final metric considers nationality uniqueness in the dataset. Determining which nationalities are harder to guess is complex, as some countries may have many players who are well-known, while a nation with only one player might be obscure to all but the most dedicated fans.
For simplicity, nationalities are treated independently. For instance, the single player from Guatemala is considered easy to guess due to the low representation, irrespective of their popularity. The scoring heavily favors countries with fewer players and increases rapidly:
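One way to get a score that is lowest for single-player nations and climbs quickly as the pool grows is a logarithmic scale. The sketch below is an assumption about the shape only. The function name `nationality_scores` and the choice of a logarithm are illustrative, not the formula PLdle uses.

```python
import math
from collections import Counter

def nationality_scores(nationalities: list[str]) -> dict[str, float]:
    """Hypothetical Dn per nation: 1 for a single-player nation, 100 for the largest pool."""
    counts = Counter(nationalities)
    if not counts:
        return {}
    largest = max(counts.values())
    if largest == 1:
        # Every nation has exactly one player, so all are equally easy.
        return {nation: 1.0 for nation in counts}
    # log(count) rises quickly for small counts and flattens for large ones.
    return {
        nation: 1.0 + 99.0 * math.log(count) / math.log(largest)
        for nation, count in counts.items()
    }
```

Each nation is scored only from its own player count, which matches the simplification of treating nationalities independently.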
Final Calculation (D)
After calculating the individual scores, the total score is weighted to give more importance to the number of matches played. The final formula is:
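A minimal sketch of such a weighted combination is shown below. The weights of 0.5, 0.3, and 0.2 are hypothetical and only reflect the statement that matches played counts the most. The sketch also assumes each component score already lies between 1 and 100.

```python
def final_difficulty(dg: float, ds: float, dn: float,
                     weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Hypothetical D: weighted average of Dg, Ds, and Dn, clamped to 1..100."""
    wg, ws, wn = weights
    total = wg + ws + wn
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the weight total keeps the result on the same 1..100 scale.
    score = (wg * dg + ws * ds + wn * dn) / total
    return min(100.0, max(1.0, score))
```

Because the result is a weighted average, a 10-point change in Dg moves the final score more than the same change in Ds or Dn.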
The resulting difficulty thresholds and corresponding dataset percentages are as follows: Easy (≤56, 15%), Medium (56-68, 20%), Hard (68-79, 25%), and Extreme (79-100, 40%).
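These thresholds translate directly into a lookup. The sketch below assumes that a score sitting exactly on a boundary (56, 68, or 79) belongs to the easier level, since the boundaries are listed without saying which side is inclusive. The function name `difficulty_label` is illustrative.

```python
def difficulty_label(score: float) -> str:
    """Map a 1..100 difficulty score to a level (upper bounds assumed inclusive)."""
    if not 1 <= score <= 100:
        raise ValueError("score must be between 1 and 100")
    if score <= 56:
        return "Easy"
    if score <= 68:
        return "Medium"
    if score <= 79:
        return "Hard"
    return "Extreme"
```

For example, `difficulty_label(60)` returns `"Medium"` and `difficulty_label(90)` returns `"Extreme"`.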
Examples
That was a lot of information. Here are some examples to demonstrate how the scoring affects well-known (and less well-known) players: