M Baas

I am a machine learning researcher at Camb.AI. I post about deep learning, electronics, and other things I find interesting.

13 December 2020

Heuristics for assessing Steam game reviews [Part 1]

by Matthew Baas

A few guidelines for judging a game’s Steam review score at a glance, by treating Steam reviews as random variables and processes.

TL;DR: Steam remains a gold-standard for gaming distribution platforms in 2020 due to its relatively excellent subscriber agreement and quality client. When looking for a game to buy on Steam, most people take the review score of a game into consideration. However, there exists a learning curve for a customer and their ability to interpret a game’s review score sensibly. In this post, I analyze various metrics about the total review scores on Steam by treating a few review metrics as random variables. After this post, hopefully you have accelerated where you are on the learning curve and have an easier time putting the Steam review score into context.

This post will assume knowledge of basic probability theory – such as Bayes rule, random variables, and random processes.

Caveat: I will not look at any individual reviews or individual games, nor at any special confounding events that may occur in a game’s lifetime. For example, some games may attract harsher critics, or some games might attract individuals that have a greater prior toward giving positive reviews. Or, as another example, a game might get review-bombed (positively or negatively) over a short period by activists if the game is associated with activity promoting or opposing that activist group’s ideology. Instead, I will only look at average metrics across all games, and trust that Gaben and co at Steam have ensured that these effects are sufficiently minimized to not majorly affect the mean metrics over all games.

Motivation and notation

Let’s state the objective of this analysis more concretely. I wish to improve the accuracy of a customer’s judgement of whether they will enjoy a game when they know nothing about it beyond its review score (and maybe its review score over time). To that end, let’s define the event that a particular game is worth your money as $a$. This event can have two outcomes – you either think the game is worth your money ($a=1$) or you think it isn’t ($a=0$). This assignment of outcomes to numbers (1 and 0) corresponds to a Bernoulli random variable $A(a)$ with a probability mass function $f_{A}(a)$ and cumulative distribution function $F_{A}(a)$. The probability that the game is worth your money is then $P(a = 1) = f_A(1)$.

And this post aims to, concretely, improve $P(a=1 | \text{review score})$ : the probability that a game is worth your money given only its review score (and maybe its review histogram). To expand this, we can use Bayes rule:

\[P(a=1 | \text{review score}) = \frac{P(\text{review score} | a=1) \cdot P(a=1) }{P(\text{review score})}\]

The term on the right hand side above consists of 3 factors:

$P(\text{review score} \vert a=1)$: the probability distribution of the review score of a game that you know is worth your money. This value is found by analyzing your own Steam library and checking the review scores on games you felt were worth your money. Then, using the techniques discussed further in this article, we can approximate this distribution.
$P(a=1)$: the probability that an arbitrary game is worth your money before you know anything about the game. If one is miserly or has little funds to spend on games, then this would be somewhat small, while if one readily spends significant amounts of money on games then this might be larger. In both cases, however, it will likely be very small. This is because there are thousands of games on Steam, and an arbitrary one sampled from them is unlikely to be the kind of game you enjoy.
$P(\text{review score})$: the probability distribution of the review score of an arbitrary game sampled from the Steam store.
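To make the three factors a little more tangible, here is a minimal sketch of the Bayes update over a handful of coarse review-score bins. Every number below is made up purely for illustration (the bins, both likelihoods, and the prior are assumptions, not values from the Steam data), and the marginal $P(\text{review score})$ is derived from the law of total probability so that the toy numbers stay consistent with each other.

```python
import numpy as np

# Hypothetical coarse score bins (% positive reviews).
bins = ["<70", "70-80", "80-90", ">=90"]

# Made-up numbers for illustration only (NOT estimated from Steam data).
p_score_given_worth = np.array([0.02, 0.08, 0.40, 0.50])      # P(score | a=1)
p_score_given_not_worth = np.array([0.21, 0.24, 0.35, 0.20])  # P(score | a=0)
p_worth = 0.05                                                 # P(a=1)

# Marginal P(score) via the law of total probability.
p_score = (p_score_given_worth * p_worth
           + p_score_given_not_worth * (1 - p_worth))

# Bayes' rule, applied to every bin at once.
posterior = p_score_given_worth * p_worth / p_score

for b, p in zip(bins, posterior):
    print(f"P(a=1 | score in {b}) = {p:.3f}")
```

With these toy numbers the posterior grows with the review score, which is the intuition the rest of the post tries to calibrate.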

Implicitly, when you judge whether a game is worth buying from the review scores, you are mentally finding approximations for these factors. To make better decisions, one might then be tempted to improve the approximations for each factor. The first two factors above are subjective and depend on whether a game is worth the money for you. Thus it is quite hard to construct a general heuristic around them that works for everyone. So, instead, we will look at improving the approximation of the third term – our knowledge about the distribution of Steam reviews in general.

The data

To determine the probability distribution of review scores, we will follow a data-driven approach and estimate it empirically.

To obtain the data, we will use a combination of the Steam API and scraping data from the Steam store. For this, I use Python with the requests package. Using this, I obtain the data in 3 steps:

3.1 Get a list of all applications on Steam

Each game (or application) on Steam has an associated numerical ID – the App ID of the game. We can use the Steam API to obtain a list of all applications on Steam together with their app IDs, which we will use to obtain detailed review data in the following steps. To do this, we use this short bit of Python code:

import requests

r = requests.get('https://api.steampowered.com/ISteamApps/GetAppList/v0002/?format=json')
applist = r.json()['applist']['apps'] # a long list of dicts, eg applist[20] = {'appid': 440, 'name': 'Team Fortress 2'}

Easy. According to the API there are 106 372 apps on Steam! Impressive. Thanks again Steam for having a subscriber agreement better for individual gamers than competitors.

3.2 Get the review summary of each app

We now use the API again to get the total number of positive and negative reviews for each of these games. For this we use the appreviews API call for each game. We simply specify the app ID and it returns the summary of all reviews for that app. Concretely, the code to collect all of these using applist from above is:

appids = [a['appid'] for a in applist]
summaries = []
for appid in appids:
    # num_per_page=0 requests only the review summary, without individual reviews
    r = requests.get(f'https://store.steampowered.com/appreviews/{appid}?json=1&language=all&num_per_page=0')
    data = r.json()
    if int(data['success']) != 1: raise ConnectionRefusedError()
    summaries.append(data['query_summary'])

After this, summaries will be a list of dicts. As an example, for Half-Life 2: Episode 2, the summary dictionary is

    {'num_reviews': 0,
    'review_score': 9,
    'review_score_desc': 'Overwhelmingly Positive',
    'total_positive': 16169,
    'total_negative': 563,
    'total_reviews': 16732}
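The loop above stops at the first failed request, which can be painful when iterating over roughly a hundred thousand apps. A minimal sketch of a more forgiving variant is given below. It is not the code used for this post: `collect_summaries`, its `fetch_json` argument (any callable that takes an app ID and returns the decoded JSON dict) and the `pause` parameter are hypothetical names introduced for illustration.

```python
import time

def collect_summaries(appids, fetch_json, pause=0.0):
    """Collect review summaries, recording failures instead of raising.

    `fetch_json` is a hypothetical callable: app ID -> decoded JSON dict.
    Returns (summaries keyed by app ID, list of app IDs that failed).
    """
    summaries, failed = {}, []
    for appid in appids:
        try:
            data = fetch_json(appid)
        except Exception:
            # Network or decoding problem: remember the ID and move on.
            failed.append(appid)
            continue
        if int(data.get('success', 0)) != 1:
            failed.append(appid)
            continue
        summaries[appid] = data['query_summary']
        time.sleep(pause)  # optional delay between requests
    return summaries, failed
```

In practice `fetch_json` could wrap the `requests.get(...).json()` call shown above, and the `failed` list can be retried later without repeating the requests that succeeded.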

3.3 Get the review timeseries for each app

If one visits the Steam store in recent times and scrolls to the review section, they will see the option to view a graph of positive and negative reviews for each month since the game was released. This data is super interesting for analysis as well, so we grab it too by scraping it from the appreviewhistogram endpoint:

hists = []
for appid in appids:
    r = requests.get(f'https://store.steampowered.com/appreviewhistogram/{appid}?l=english')
    data = r.json()
    if int(data['success']) != 1: raise ConnectionRefusedError()
    hists.append(data['results'])

Each histogram contains a list of dicts for each month since the game’s release, with each dict containing the total positive and negative reviews received during that month.
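As a rough illustration of what can be done with such a monthly list, the sketch below computes the running ratio of positive to total reviews after each month. The helper name `cumulative_positive_ratio` is hypothetical, and the key names `recommendations_up` and `recommendations_down` are assumptions about the payload that should be checked against the actual response; the example counts are made up.

```python
import numpy as np

def cumulative_positive_ratio(months,
                              up_key="recommendations_up",
                              down_key="recommendations_down"):
    """Running ratio of positive to total reviews after each month.

    `months` is a list of per-month dicts, oldest first. The default key
    names are assumptions and should be checked against the real payload.
    """
    up = np.cumsum([m[up_key] for m in months], dtype=float)
    down = np.cumsum([m[down_key] for m in months], dtype=float)
    total = up + down
    # Leave NaN for months before the first review, where the ratio is undefined.
    ratio = np.full_like(total, np.nan)
    np.divide(up, total, out=ratio, where=total > 0)
    return ratio

# Made-up monthly counts for illustration only.
example_months = [
    {"recommendations_up": 0, "recommendations_down": 0},
    {"recommendations_up": 8, "recommendations_down": 2},
    {"recommendations_up": 2, "recommendations_down": 8},
]
example_ratio = cumulative_positive_ratio(example_months)
print(example_ratio)
```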

Pruning

Now that we have all the data, we need to do some pruning. The list of apps found in step 3.1 above includes really random apps that no-one plays, test applications not intended for release, and otherwise extremely niche content that would act as massive outliers in our analysis. To solve this, we will filter down the apps we consider to only those apps/games which have at least 150 reviews. This is done fairly trivially with the function:

def filter_num_ratings(applist, summaries, hists, cutoff=150):
    new_datas = []
    for app, s, h in zip(applist, summaries, hists):
        # skip apps with too few reviews to give a meaningful ratio
        if s['total_reviews'] < cutoff: continue
        
        # each entry of applist is a dict like {'appid': 440, 'name': 'Team Fortress 2'}
        new_datas.append((app['appid'], app['name'], s, h))
    print("Filtered data contains only", len(new_datas), "applications ({:4.2f}%)".format(100*len(new_datas)/len(applist)))
    return new_datas

Filtering for a cutoff of 150 reviews brings the total number of games in our analysis down to 8825 games. We will use the review statistics from these games to construct the approximation of $P(\text{review score})$.
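The filtered tuples can then be reduced to one positive-review ratio per game. The following is a minimal sketch of that step; `review_ratios` is a hypothetical helper name, and it assumes the `(appid, name, summary, hist)` tuple layout returned by `filter_num_ratings` above. The example reuses the Half-Life 2: Episode 2 summary shown earlier, with a placeholder app ID and no histogram.

```python
import numpy as np

def review_ratios(filtered):
    """Ratio of positive to total reviews for each (appid, name, summary, hist) tuple."""
    positive = np.array([s['total_positive'] for _, _, s, _ in filtered], dtype=float)
    total = np.array([s['total_reviews'] for _, _, s, _ in filtered], dtype=float)
    # The review cutoff applied during pruning guarantees total > 0 here.
    return positive / total

# Summary values taken from the example earlier in the post;
# the app ID 0 and the missing histogram are placeholders.
example = [(0, 'Half-Life 2: Episode 2',
            {'total_positive': 16169, 'total_negative': 563, 'total_reviews': 16732},
            None)]
percs = review_ratios(example)
print(percs)
```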

Analysis

Given the data we have obtained above, what we concretely have is the total positive and negative reviews for a game for each month since that game’s release. This setup lends itself well to two analyses: one on the total reviews over all time, and one using the review timeseries for each game.

Total review score

To get specific about each of these, we need to concretely define notation and meaning for $P(\text{review score})$. We will start with looking at the total positive and negative reviews over all time.

Notation for analyzing total reviews

The actual numbers of positive and negative reviews are not of primary importance. Rather, when one visits a game’s store page, they are shown the ratio of positive reviews to total reviews over all time.

Let us define this ratio as $x$. From this definition follows the definition of a random variable $X(s)$ which maps a sampled game $s$ to a numerical value $x \in [0, 1]$ equal to the ratio of positive to total reviews. For example, if we have the outcome $s$ corresponding to Dota 2, then $X(s) = X(\text{Dota 2}) = 84\%$ – Dota 2’s Steam review score.

The goal then becomes to find the probability density function (PDF) $f_X(x)$ and cumulative distribution function (CDF) $F_X(x)$. Relating this to the previous notation, $f_X(x) = P(\text{review score})$.

Note: the distribution of review scores is a continuous distribution, since the ratio of positive to total reviews can take on any rational number between zero and one. If, however, we considered just the raw number of reviews as a random variable, it would be discrete as only an integer number of people can review a game.

Estimating the PDF and CDF

We estimate the CDF from the cumulative sum of the PDF, and the PDF from the histogram of review ratios $x$. From probability theory we know that, if given a list of counts in various bins for ranges of $x$ values, then we can estimate the probability that a new sample will fall into a bin by the proportion of all counts that fall into that bin. However, to transition from the probabilities that a new sample falls into each bin to a PDF, we need to normalize by the width of each bin, since

\[P(x_1 < x \le x_2) = \int_{x_1}^{x_2} f_X(x)\ dx\]

for a bin spanning between $x_1$ and $x_2$. Since we cannot obtain any more fine-grained detail about the PDF within this bin (by our construction we only know the term on the left hand side above), we must assume that $f_X(x)$ within this bin is identical for all values of $x$ within the bin. This means that we can simplify the above to:

\[P(x_1 < x \le x_2) \approx (x_2 - x_1) \cdot f_X(x_1)\]

\[\implies f_X(x_1) = \frac{1}{x_2 - x_1} P(x_1 < x \le x_2)\]

Thus we can estimate the PDF for each bin by finding $P(x_1 < x \le x_2)$ from the proportion of games with $x_1 < x \le x_2$, and then normalize this by the bin width.

Code and results

To make our lives exceedingly easy, let’s make the bin width 1% and use 100 bins, so we have one bin for each percent from 0% positive review score to 100% positive review score.

Using the data gathered earlier, we obtain the positive review ratios for all the 8825 games in our dataset in a variable percs. With this, we can then find the counts falling into each of our bins with NumPy’s np.histogram() method:

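import numpy as np

bin_edges = np.arange(0, 101)  # edges 0, 1, ..., 100, i.e. 100 bins of width 1
counts, _ = np.histogram(100*percs, bins=bin_edges)  # assumes percs is a NumPy array of ratios in [0, 1]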

Using the method described above, we can now define a function which takes in these counts and converts them to PDF:

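def counts_to_pdf(N, bin_edges):
    # all bins are assumed to have the same width
    delta = (bin_edges[1] - bin_edges[0])
    # proportion of samples per bin, normalized by the bin width
    pdf = (N/sum(N) ) / delta
    # x-coordinate of the middle of each bin, for plotting
    centered_bins = bin_edges[:-1] + (delta/2)
    return pdf, centered_bins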
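As a quick sanity check on this normalization, a properly estimated PDF should integrate to one, and it should agree with what np.histogram returns when called with density=True. The sketch below checks both on synthetic data; the Beta-distributed `percs` is a stand-in chosen only for illustration and is not the Steam dataset.

```python
import numpy as np

# Synthetic ratios in [0, 1], used only as a stand-in for the real data.
rng = np.random.default_rng(0)
percs = rng.beta(8, 2, size=5000)

bin_edges = np.arange(0, 101)
counts, _ = np.histogram(100 * percs, bins=bin_edges)

# Same normalization as in counts_to_pdf: proportion per bin / bin width.
delta = bin_edges[1] - bin_edges[0]
pdf = (counts / counts.sum()) / delta

# The area under the estimated PDF should be 1.
area = np.sum(pdf * delta)

# NumPy's built-in density normalization should give the same values.
density, _ = np.histogram(100 * percs, bins=bin_edges, density=True)
print(area, np.allclose(pdf, density))
```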

Now we can finally plot the PDF with matplotlib:

PDF of % positive review score

From the PDF we can immediately see that reviews are heavily slanted toward being positive – most games have a positive review ratio above 80%. I have also indicated the expected value of the PDF, which semantically is what we would expect the % positive review score of a game to be if we chose the game at random. Concretely, it is:

\[\mathbb{E}[X] = \int_0^1 x \cdot f_X(x)\ dx = 80.15\%\]
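With a binned PDF estimate, this expectation can be approximated by a sum over bins. Writing $c_i$ for the centre of bin $i$ and $\Delta$ for the bin width (both symbols are introduced here only for illustration), one reasonable approximation is:

\[\mathbb{E}[X] \approx \sum_{i} c_i \cdot f_X(c_i) \cdot \Delta\]

In terms of the outputs of counts_to_pdf, this would correspond to something like np.sum(centered_bins * pdf * delta), with delta being the bin width. Because each sample is effectively moved to the centre of its bin, the result can differ from the plain sample mean by at most half a bin width.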

We can also equivalently plot the CDF using the np.cumsum() function:

CDF of % positive review score

This is a very clean-looking CDF, and it shows that 50% of games have a % positive review score greater than 84% and 26% of games have a % positive review score greater than 90%. From these distributions we can now state some heuristics that well-versed Steam game buyers have learnt from experience:

Heuristics:

Reviews are pre-disposed to be very positive.
A % positive review score of 80-85% is entirely average.
A % positive review score below 80% is relatively bad, and a score below 70% is very bad (in the bottom 20% of games).
A % positive review score above 85% is above average, and a score above 92% is very good (in the top 20% of games).

So, when seeing an 85% positive review score for a game, remember to keep it in context and recall that 85% is about the average rating for a game.
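Thresholds like these can be read off the empirical CDF directly. The sketch below builds the CDF with np.cumsum and looks up the fraction of games at or above a given score; it runs on synthetic Beta-distributed ratios rather than the Steam data, so the printed numbers are not the ones quoted above, and `fraction_at_least` is a hypothetical helper name.

```python
import numpy as np

# Synthetic ratios in [0, 1], used only as a stand-in for the real data.
rng = np.random.default_rng(2)
percs = rng.beta(8, 2, size=5000)

bin_edges = np.arange(0, 101)
counts, _ = np.histogram(100 * percs, bins=bin_edges)

# cdf_at_edges[k] estimates P(score < k%) for k = 0, 1, ..., 100.
cdf_at_edges = np.concatenate(([0.0], np.cumsum(counts) / counts.sum()))

def fraction_at_least(score_percent):
    """Estimated fraction of games with a score of at least `score_percent`.

    `score_percent` must be an integer between 0 and 100. Scores of exactly
    100% fall into the last bin, so fraction_at_least(100) reports 0.
    """
    return 1.0 - cdf_at_edges[score_percent]

# First integer score k at which at least half of the games lie below k.
median_edge = int(np.searchsorted(cdf_at_edges, 0.5))

print(fraction_at_least(84), fraction_at_least(90), median_edge)
```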

Scaling by game size

The previous analysis might be seen as unfair, as a high rating for a super niche game with only a few thousand sales is weighted the same (in constructing the PDF) as that of a game with millions of sales. To account for this, let’s weight the counts described earlier by the total number of reviews for each game, using the weights argument of np.histogram. However, doing this results in the handful of massive games dominating the review scores, and chances are that most games you play aren’t the massive triple-A games; it is just that most people play them. So, to be fair to smaller games, let’s instead scale the counts by the square root of the total reviews for each game. This gives more popular games a greater weighting, but a sublinear one, so that smaller games do not become irrelevant in the final result.
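The weighting described above might look roughly like the following sketch, which passes the square root of each game’s review count to the weights argument of np.histogram. Both `percs` and `total_reviews` are synthetic stand-ins generated for illustration, and the exact code used for the plots in this post may differ.

```python
import numpy as np

# Synthetic stand-ins for the real data (illustration only).
rng = np.random.default_rng(3)
percs = rng.beta(8, 2, size=5000)                       # ratios in [0, 1]
total_reviews = rng.integers(150, 200_000, size=5000)   # reviews per game

bin_edges = np.arange(0, 101)

# Each game contributes sqrt(total reviews) to its bin instead of 1.
weights = np.sqrt(total_reviews)
weighted_counts, _ = np.histogram(100 * percs, bins=bin_edges, weights=weights)

# Normalize exactly as before to obtain a weighted PDF estimate.
delta = bin_edges[1] - bin_edges[0]
weighted_pdf = (weighted_counts / weighted_counts.sum()) / delta
```

Replacing np.sqrt(total_reviews) with total_reviews itself gives the linear weighting that lets the largest games dominate.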

Doing this yields a new PDF and CDF:

PDF of % positive review score

CDF of % positive review score

In this case, it seems that the expected review is even higher, with 50% of games being over 86% positive. For a game to be in the top fifth of games (in terms of positive review scores), it needs to have a % positive review score over 93%!

Analysis of the review histograms

There is a ton of valuable data in these histograms, and there is much to delve into. I will look into them in more detail in Part 2 of my analysis of Steam reviews, which will focus on the somewhat more theoretical properties of Steam reviews over time. In particular, we will treat the ratio of positive reviews over time as a random process and look at various properties associated with it, such as stationarity, autocorrelation, and more :).

Stay tuned, and I hope the PDF and CDF given above provide a little more context on what constitutes an average vs extraordinary review score for a Steam game.

Changelog:

2020-12-30: corrected a distinction in estimating PMFs vs PDFs from samples of a random variable. In particular the original version mistook the ratio of positive to total reviews as a discrete random variable, when it is actually a continuous random variable.
2021-01-03: Part 2 is now out! Go check it out under Posts!

tags: steam - culture - probability - gaming