M Baas

I am a machine learning researcher at Camb.AI. I post about deep learning, electronics, and other things I find interesting.

3 January 2021

Heuristics for assessing Steam game reviews [Part 2]

by Matthew Baas

More statistical analysis of Steam review trends to uncover some more hidden heuristics.

TL;DR: We continue with our look at Steam review scores to datamine some more heuristics for assessing them in practice. We’ll start by looking at various statistical measures based on the overall review scores and what they tell us. Afterwards, we will look at the monthly reviews as random processes and answer questions about stationarity, periodicity, and other interesting info. The hope is that, after this post, you will have a substantially better understanding of the trends in Steam review scores.

This post assumes you have read Part 1 and have knowledge of random variables, random processes, and some of the simpler measures that go along with them, such as autocovariance and autocorrelation.

The data layout, concretely

The timeseries data for each game’s ratings is given in the figure below. This is the data that was scraped from Steam’s website and downloaded to the hists variable in Part 1. Concretely, the data for each game is a list of the positive and negative reviews given in each month since the game’s release. The data for 4 arbitrary games is shown:

Sample timeseries of reviews

Since games are not all released at the same time, the moment when this data starts is different for each game, as shown in the figure. To complicate matters further, certain newer games released less than a year ago return a weekly breakdown as opposed to monthly data. These challenges will be overcome in the second section of this post. First, we will have a deeper look at the total positive and negative reviews for a game over its lifetime so far.

1. Analysis of total review numbers

In the previous post we only looked at the distribution of the review score (the percentage of total reviews that are positive), dubbed $X$. But as we saw in the previous figure, we have more detailed information: we know the precise number of positive and negative reviews for a game during each month and over its entire life so far.

So, to reason about this additional data we need to introduce some more notation. Concretely,

Definition: define $U$ as the random variable which maps a game (the outcome) to the total number of positive reviews for that game over its lifespan so far.

Definition: define $D$ as the random variable which maps a game (the outcome) to the total number of negative reviews for that game over its lifespan so far.

Since the total number of reviews (positive or negative) for a game must be an integer, both $D$ and $U$ are discrete random variables and engender probability mass functions (PMFs) $f_U(u)$, $f_D(d)$, and a joint PMF $f_{UD}(u, d)$.

Given this data, the question arises, are the positive and negative reviews for a game correlated? If so, to what extent and does this yield any interesting heuristics for judging reviews?

1.1 Estimating the PMFs for $U$ and $D$

We use a similar method as discussed in Part 1 to estimate the PMFs from the total number of positive and negative reviews for each game in our sample set. Namely, we make two small adjustments:

The number of positive and negative reviews for games varies very wildly, from only a few hundred to several hundred thousand, so attempting to count the number of samples which fall into uniformly sized bins would be a fairly hopeless endeavor as the samples become increasingly sparse for games with higher numbers of reviews. To solve this, we instead use exponentially sized bins with larger bin widths for higher $U$ or $D$. Specifically, we use 100 bins uniformly spaced between 0 and 15 and find the count of samples where the natural logarithm of reviews ($\ln(u)$ and $\ln(d)$) fall into each bin.
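
As a rough illustration of the binning just described, here is a minimal sketch using np.histogram on the log counts. The function name log_binned_counts, its parameters, and the example totals are hypothetical and only for illustration; dropping zero counts before taking the logarithm is also an assumption.

```python
import numpy as np

def log_binned_counts(review_totals, n_bins=100, log_max=15.0):
    """Count samples whose natural log falls into uniform bins on [0, log_max]."""
    review_totals = np.asarray(review_totals, dtype=float)
    # Assumption: zero counts have no logarithm, so they are dropped here.
    logs = np.log(review_totals[review_totals > 0])
    # n_bins uniform bins in log space need n_bins + 1 edges.
    bin_edges = np.linspace(0.0, log_max, n_bins + 1)
    counts, _ = np.histogram(logs, bins=bin_edges)
    return counts, bin_edges

# Hypothetical review totals, only for illustration.
example_totals = [160, 450, 1200, 98000, 350000]
counts, bin_edges = log_binned_counts(example_totals)
```

Uniform bins in $\ln(u)$ correspond to bins whose width grows exponentially in $u$, which is what keeps the sparse high-review region from being spread over thousands of empty bins.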

Since we are estimating PMFs instead of PDFs, we tweak the function counts_to_pdf() from Part 1 by no longer normalizing by the bin width in a new counts_to_pmf() function:

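def counts_to_pmf(N, bin_edges):
  """ Converts counts `N` in each bin demarcated by `bin_edges` to a valid PMF """
  # Bins are uniformly spaced (in log space), so any adjacent pair gives the width
  delta = (bin_edges[1] - bin_edges[0])
  # Unlike counts_to_pdf(), we do not divide by the bin width
  pmf = (N/sum(N))
  centered_bins = bin_edges[:-1] + (delta/2)
  return pmf, centered_bins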

Performing this process for both $U$ and $D$ yields the PMFs shown below. Note again that we are not yet using the timeseries data, but looking only at the total number of positive and negative reviews over the lifetime of a game.

PMF of U(u)

PMF of D(d)

Looking at these distributions we might be tempted to conclude that negative reviews are more evenly spread out and very few games have low numbers of positive reviews. However, we must be careful: recall that in Part 1 we only considered games with at least 150 reviews (the cutoff argument of filter_num_ratings()). Plotting the joint distribution $f_{UD}(u, d)$ makes this clearer.

To find the joint distribution we use the plt.hist2d() function from matplotlib to find the counts using a 100x100 bin grid based on the same exponential bins used for the first-order PMFs. To visualize this joint PMF, we can draw a 2D image with $\ln(u)$ and $\ln(d)$ as the horizontal and vertical axes, and then each pixel will be colored to reflect the magnitude of the PMF at that point. Performing this yields the joint PMF:

Joint PMF of U and D

Brighter colors indicate a larger PMF value at that point, and recall that the PMF is normalized so that summing the values over each pixel in the image will yield one. The joint PMF above has an immediately visible arc beyond which no samples are found. This arc is precisely the curve where the total number of reviews equals the cutoff of 150 ($u + d = 150 \implies u = 150 - d$); only games with at least 150 total reviews were kept. We can express this in terms of $\ln(u)$ and $\ln(d)$:

\[\ln(u) = \ln(150 - e^{\ln(d)})\]

If you plot this out you will see it looks precisely like the cutoff arc in the joint PMF!
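
If you want to trace the arc yourself, a minimal sketch is given here. The variable names are hypothetical; only the cutoff of 150 comes from Part 1. The resulting ln_d and ln_u arrays can be passed straight to a line plot on the same axes as the joint PMF.

```python
import numpy as np

cutoff = 150  # minimum total reviews kept in Part 1

# Sweep d from 1 to 149 negative reviews; at d = 150 there would be zero
# positive reviews and ln(u) is undefined.
ln_d = np.linspace(0.0, np.log(cutoff - 1), 200)

# ln(u) = ln(150 - e^{ln(d)}), the boundary curve u + d = 150 in log-log space
ln_u = np.log(cutoff - np.exp(ln_d))
```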

So, looking back at the PMFs for $U$ and $D$, we can’t take the values below 150 too seriously. What we can glean from this is the following: most games have relatively few reviews, and only a handful of games have hundreds of thousands of reviews. Unfortunately that is pretty obvious – only a handful of games are super popular, and most games are fairly small in audience – and not useful as a heuristic. The purpose of finding these, however, was to use them to find downstream statistics, which might be of more use:

1.2 Correlation $R_{UD}$

We can now use the joint PMF from earlier to determine the correlation between $U$ and $D$. We expect this to be positive and fairly large, as more popular games will most likely have a greater number of positive and negative reviews. This means that as $U$ (the number of positive reviews) increases for a game, we would expect $D$ (the number of negative reviews) to also increase to some extent.

To confirm this, we compute the correlation between $\ln(U)$ and $\ln(D)$, retaining the natural logarithm to make our calculations easier. Because $\ln$ is monotonic, this should not change the direction of the relationship, although the numerical values below apply to the log counts rather than the raw counts. Performing this yields:

\[R_{UD} = 32.82\]

Positive, as suspected. Unfortunately this still doesn’t tell us much – games with more positive reviews tend to have more negative reviews as well. This is not a super useful heuristic. So let’s continue to covariance and correlation coefficients.

1.3 Covariance $C_{UD}$

Here we use the same PMF for $\ln(U)$ and $\ln(D)$ from earlier to calculate the expectation / first moment about the origin $\overline{\ln(U)}$ and $\overline{\ln(D)}$. We also compute the variance $\sigma^{2}_{\ln(U)}$ and $\sigma^{2}_{\ln(D)}$. Performing such calculations yields:

\[\overline{\ln(U)} = 6.48 \\ \overline{\ln(D)} = 4.81 \\ \sigma^2_{\ln(U)} = 2.15 \\ \sigma^2_{\ln(D)} = 2.18\]

It is slightly interesting that negative reviews have a greater variance than positive reviews. I suspect that this is because there are quite a few games which get either very few negative reviews, or tons of negative reviews (e.g. if the 1.0 game is released as a buggy mess). With this, we can compute the (unnormalized) covariance $C_{UD}$ and Pearson correlation coefficient $\rho$:

\[C_{UD} = \mathbb{E}[ (\ln(U) - \overline{\ln(U)})\cdot (\ln(D) - \overline{\ln(D)})] \\ C_{UD} = 1.63 \\ \implies \rho = \frac{C_{UD}}{\sigma_{\ln(U)}\sigma_{\ln(D)}} = 0.75\]

This information is much more definitive. We see that the Pearson correlation coefficient $\rho$ is close to 1, indicating a strong positive correlation between the total positive and negative reviews for a game. The only heuristic that can sensibly be derived from this is that games with more positive reviews tend to have more negative reviews as well, and vice versa. If a game does not follow this trend then it is unusual in some way.
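
To make the chain from joint PMF to $\rho$ concrete, here is a minimal sketch of how these moments can be computed from a discrete joint PMF. The function joint_moments and the tiny 2x2 PMF are hypothetical and only for illustration; in the post the PMF would be the 100x100 grid over $\ln(u)$ and $\ln(d)$.

```python
import numpy as np

def joint_moments(pmf, u_centers, d_centers):
    """Correlation R, covariance C and Pearson rho from a joint PMF.

    pmf[i, j] is the probability mass at (u_centers[i], d_centers[j]).
    """
    pmf = np.asarray(pmf, dtype=float)
    u_centers = np.asarray(u_centers, dtype=float)
    d_centers = np.asarray(d_centers, dtype=float)
    f_u = pmf.sum(axis=1)  # marginal PMF of the first variable
    f_d = pmf.sum(axis=0)  # marginal PMF of the second variable
    mean_u = np.sum(u_centers * f_u)
    mean_d = np.sum(d_centers * f_d)
    var_u = np.sum((u_centers - mean_u) ** 2 * f_u)
    var_d = np.sum((d_centers - mean_d) ** 2 * f_d)
    corr = np.sum(np.outer(u_centers, d_centers) * pmf)  # R = E[UD]
    cov = corr - mean_u * mean_d                         # C = R - mean_u * mean_d
    rho = cov / np.sqrt(var_u * var_d)
    return corr, cov, rho

# Hypothetical toy PMF: all mass on the diagonal, so rho should be exactly 1.
toy_pmf = np.array([[0.5, 0.0], [0.0, 0.5]])
corr, cov, rho = joint_moments(toy_pmf, [0.0, 1.0], [0.0, 1.0])
```

As a sanity check on the reported numbers, $R_{UD} - \overline{\ln(U)}\cdot\overline{\ln(D)} = 32.82 - 6.48 \cdot 4.81 \approx 1.65$, which agrees with the reported $C_{UD} = 1.63$ to within the rounding of the means.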

This heuristic is not too useful in terms of actually browsing games in the Steam store, so I won’t linger on it. Let us continue our investigation to see if we can uncover any interesting and useful heuristics.

1.4 Interval conditioning $f_X(x)$ on popularity

When sorting through the data, the question arose “do more popular games have better reviews?”. One might hope that good games get better reviews and sales by word of mouth, and thus become more popular. We may also view it in a pessimistic light and hypothesize that more popular, mainstream games receive less scrutiny and less harsh critics due to the mainstream nature of the game compared to smaller, niche games. In both of these cases, we expect the average rating of more popular games to be higher than the average rating of less popular games.

So let us investigate whether this is the case. For this, we will return to the notation used in Part 1 for the % positive review score. That is, $X$ will be the random variable for the % of total reviews that are positive for a particular game. This $X$ has an associated PDF (since it is a continuous random variable) $f_X(x)$.

Determining conditional PDF

To investigate whether more popular games tend to have better review scores, we need to find the PDF of $X$ for games where the total number of reviews (which serves as a measure of popularity) is within some interval. We can then adjust this interval to see the PDF of games with different popularity. Concretely, we need to determine the conditional PDF $f_{X \vert \mathrm{total\ review}}(x | r_1 < \mathrm{total\ review} \leq r_2)$ for minimum and maximum total reviews $r_1$ and $r_2$ respectively. To find this for a given $r_1$ and $r_2$ and the number of positive and negative reviews for each game (as found in Part 1) we perform the following:

def filter_samples(pos_counts, neg_counts, r_1, r_2):
    # Keep only games with r_1 < total reviews <= r_2
    totals = pos_counts + neg_counts
    mapper = np.logical_and(totals > r_1, totals <= r_2)
    return pos_counts[mapper], neg_counts[mapper]

Given the positive and negative counts for only those games between $r_1$ and $r_2$ total reviews, we can simply compute the percentages, find the counts of percentages that fall into each bin, and compute the PDF using the precise same method as discussed in Part 1.

Sweeping $r_1$ and $r_2$ over wide range

Now that we can find $f_{X \vert \mathrm{total\ review}}(x | r_1 < \mathrm{total\ review} \leq r_2)$, let’s plot the PDF (like we did in Part 1) for various $r_1$ and $r_2$ to see the distribution of review scores for games with different popularity.

Concretely, let us plot $f_{X \vert \mathrm{total\ review}}(x | r_1 < \mathrm{total\ review} \leq r_2)$ for $r_1$ and $r_2$ chosen as a rolling 5-percentile window into the data, rolled two percentiles at a time. That is, if we order all the samples (games) by total reviews, we find the PDF for each 5-percentile slice of the data, starting with the PDF estimated from all games having between zero reviews and the 5th percentile of total reviews. The next PDF is then estimated using games between the $r_1=$ 2nd percentile and the $r_2=$ 7th percentile of total reviews, and so on.
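
The window bounds can be sketched with np.percentile as follows. The helper rolling_percentile_windows is hypothetical, and how the last windows near the 100th percentile are handled is an assumption, so the number of windows it produces may differ slightly from the one used for the plots in this post.

```python
import numpy as np

def rolling_percentile_windows(totals, width=5, step=2):
    """Return (r_1, r_2) total-review bounds for each rolling percentile window."""
    totals = np.asarray(totals, dtype=float)
    # Window start percentiles: 0, 2, 4, ... while start + width <= 100
    starts = np.arange(0, 100 - width + 1, step)
    r_1 = np.percentile(totals, starts)
    r_2 = np.percentile(totals, starts + width)
    # The first window starts from zero reviews rather than the smallest sample,
    # so that a strict r_1 < total comparison does not drop the smallest game.
    r_1[0] = 0.0
    return list(zip(r_1, r_2))

# Hypothetical total review counts, only for illustration.
windows = rolling_percentile_windows(np.arange(1, 1001))
```

Each (r_1, r_2) pair can then be used to select the games with r_1 < total reviews <= r_2 before estimating the conditional PDF.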

Performing this process gives us 50 PDFs conditioned on the game popularity. So all this talk about getting conditional PDFs, so what? Well, so we can make this cool plot where each frame shows one of those PDFs as we sweep from the PDF conditioned on less popular games to the PDF conditioned on the most popular 5% of games:

Cool right? See how indeed more popular games have, on average, higher ratings. You can see how the expectation moves to the right for games with more reviews, so our hypothesis is correct! However, just as we excluded very small games from our analysis to remove outliers and astroturfed review scores, the handful of absolutely massive games (Cyberpunk, CSGO,…) will likely also be outliers and buck this trend.

Now this does actually allow us to concretely state another useful heuristic:

Heuristic: more popular games tend to have a better review score, so the popularity of a game (excluding very very large games) is another indication of its quality and enjoyment, separate from its review score.

Now I think we have dug sufficiently into the total review metrics. To uncover more we need to dig down into the timeseries data for monthly review scores, up next.

2. Analysis of monthly review numbers

Before we can analyze the monthly review numbers, we need to ensure that we have data uniformly spaced in time. By that I mean that the timeseries review data scraped in Part 1 is given monthly for most games, but weekly for some newer games. So, as a preprocessing step we need to downsample all the weekly review data to monthly review data.

This process is fairly easy using Pandas and its pd.date_range() to generate a span of months that encompasses a game’s weekly data. Then, we just add each week’s review numbers to its corresponding month. The result of this process is a list of tuples in the format of (game_id, game name, total review metrics, monthly review metrics) stored in a variable downsampled_datas. The monthly review metrics simply contain a list of dicts, one for each month that the game has been on sale, containing the date, positive, and negative reviews for that month.
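
For illustration, here is one way such a downsampling step could look. Note that this sketch groups weeks by calendar month with to_period("M") rather than building the month span with pd.date_range(), so it is an alternative under stated assumptions and not the exact code used for this post. The helper names, the example rows, and the assumption that date is a Unix timestamp in seconds (UTC) are all hypothetical.

```python
import pandas as pd

def weekly_to_monthly(weekly_rollups):
    """Sum weekly review counts into calendar months."""
    df = pd.DataFrame(weekly_rollups)
    # Assumption: 'date' holds Unix timestamps in seconds.
    df["month"] = pd.to_datetime(df["date"], unit="s").dt.to_period("M")
    cols = ["recommendations_up", "recommendations_down"]
    return df.groupby("month")[cols].sum()

def ts(day):
    """Hypothetical helper: Unix timestamp (seconds, UTC) for a date string."""
    return int(pd.Timestamp(day, tz="UTC").timestamp())

# Hypothetical weekly data for one game, only for illustration.
weekly = [
    {"date": ts("2020-11-02"), "recommendations_up": 10, "recommendations_down": 2},
    {"date": ts("2020-11-09"), "recommendations_up": 20, "recommendations_down": 3},
    {"date": ts("2020-12-07"), "recommendations_up": 5, "recommendations_down": 1},
]
monthly = weekly_to_monthly(weekly)
```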

Using this data we will perform several investigations:

First, we will look at the random process visually as a PDF $f_X(x;t)$ – the PDF of the % positive review score for a given month.
Next, we will go on to investigate the stationarity of $f_X(x;t)$.
Further, we will make a comment about the periodicity and mean ergodicity of the random process.
Finally we will look at the power density spectrum (aka spectral power density) of $f_X(x;t)$ to further support our periodicity findings.

Let’s begin.

2.1 Finding the PDF $f_X(x;t)$

If we fix the time $t$ to, for example, June 2014, then the random process becomes a random variable and we have a regular PDF $f_X(x; \mathrm{June\ 2014})$. Estimating this distribution once we have specified a time becomes a fairly simple procedure:

1. Collate counts of all positive and negative review scores for all games that have been available to buy during the month $t$.
2. Compute the % positive review scores for the reviews in this month for each of these games.
3. Use this set of samples of % positive review scores to estimate a simple first-order PDF as was done in Part 1.

I decided to implement steps 1 and 2 as a Python function get_monthly_data(), where month is a Python datetime object representing the time $t$:

def get_monthly_data(month, datas):
    percs = []
    totals = []
    poss = []
    negs = []
    for _, _, _, r in datas:
        rollups = r['rollups']
        dts = [r['date'] for r in rollups]
        if int(month.timestamp()) in dts:
            ind = dts.index(int(month.timestamp()))
            s = rollups[ind]
            pos = s['recommendations_up']
            neg = s['recommendations_down']
            # only consider games where there are reviews in this month
            if pos == 0 and neg == 0: continue 
                
            percs.append(pos/(pos+neg))
            totals.append(pos+neg)
            poss.append(pos)
            negs.append(neg)
    percs = np.array(percs)
    totals = np.array(totals)
    poss = np.array(poss)
    negs = np.array(negs)
    return poss, negs, percs, totals
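
Step 3 can be sketched along the following lines. This is a minimal histogram-based estimate and not the exact Part 1 code; the function name estimate_pdf and the example scores are hypothetical. It takes the percs array returned by get_monthly_data() (fractions between 0 and 1) and returns a PDF over the 0-100% range together with its expected value.

```python
import numpy as np

def estimate_pdf(percs, n_bins=100):
    """Histogram-based PDF estimate of % positive scores on [0, 100]."""
    bin_edges = np.linspace(0.0, 100.0, n_bins + 1)
    counts, _ = np.histogram(100.0 * np.asarray(percs, dtype=float), bins=bin_edges)
    delta = bin_edges[1] - bin_edges[0]
    # Normalize by the bin width so that the PDF integrates to one
    pdf = counts / (counts.sum() * delta)
    centers = bin_edges[:-1] + delta / 2
    # Expected % positive score under the estimated PDF
    expected = np.sum(centers * pdf * delta)
    return pdf, centers, expected

# Hypothetical monthly scores for three games, only for illustration.
pdf, centers, expected = estimate_pdf([0.5, 0.75, 1.0])
```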

We can then combine this together with the method to perform step 3 from Part 1 to obtain the PDF at any arbitrary instant in time. To view how $X$ (the % positive review score) changes over time, let’s compute the PDF for each month since Steam released the review feature and plot each as a frame of an animation. Performing this (done with matplotlib again) yields:

We see that early on in Steam’s history games tended to receive much better review scores than they do these days, and in some spots in 2019 and 2018 the review scores are lower than they are at the time of posting. We also see several spikes in monthly review scores at certain key time periods. We will look at this more closely in the upcoming section on periodicity; for now let’s look at the longer-term trends.

2.2 Expected review score over time

To get a simpler metric to interpret, let’s plot how the expected % review score $\bar{X}(t)$ changes with time (i.e. plot the movement of the red line in the vid above):

Expectation of random process over time

This solidifies our observation: up till 2014 reviews on Steam tended to be much much more positive compared to reviews from 2014-2019. And, in 2020 we can see monthly % review scores rise again. If I had to guess, I would suspect this has to do with COVID and the increased focus on gaming by wider society since outdoor events are limited. Or maybe games have just become better, or more critics have been silenced – not enough data to form an accurate idea about it.

In any event, the previous two graphs lend themselves to two important heuristics when looking at a game’s review score:

Heuristics

If a game has received most of its reviews before 2014, then its review score will be much (10-15%) higher compared to those released after 2014. If you believe the average quality of games on Steam has increased or remained the same, then ratings for older games which have most of their reviews in this period should be understood as having review scores inflated compared to current games.
If a game has received most of its reviews between 2014 and mid 2019, its review score will be roughly 5% lower than comparable games with most reviews given after mid 2019. If you believe the average quality of games on Steam has roughly decreased or remained the same, then ratings for games which have most of their reviews during this time should be understood as having scores slightly deflated compared to current games.

Now this is more interesting indeed. Let’s continue.

2.3 Stationarity

Roughly speaking, a random process is stationary if its statistical properties do not change with time. Proving general ($N$-th order) stationarity is very hard, so let us just look at a more constrained notion of stationarity: wide-sense stationarity, which requires that three things hold for the random process in question:

The expectation / first moment about the origin $\bar{X}(t) = \mathbb{E}[X(t)]$ should be independent of time, i.e. $\bar{X}(t) = \bar{X}$.
The autocorrelation of $X$ needs to only depend on the relative offset $\tau$. In other words, the autocorrelation function $R_{XX}(t, t+\tau) = \mathbb{E}_X[X(t)X(t+\tau)]$ should satisfy $R_{XX}(t, t+\tau) = R_{XX}(\tau)$.
The second moment about the origin must be finite. With $t$ only taking on discrete values (the timeseries for each game gives us the positive and negative reviews for each month) and all the review scores being between zero and one, this is not a concern.

2.3.1 The expectation / first moment $\bar{X}(t)$

Good news and bad news. The good news is that we have already found and plotted this in the figure earlier in 2.2! The bad news is that we can clearly see from the plot of $\bar{X}(t)$ that $X(t)$ does not have a constant first moment, and thus it is not wide-sense stationary.

Well it is up to opinion whether review scores being stationary is good or not, but it certainly is less interesting since one would expect it to be non-stationary. Steam has changed their review system over time, and certain waves of better or worse games, or more generous or miserly reviewers have likely come and gone. Yet, deep inside, there was a small hope that some crazy Steam engineers might have specifically engineered their updates to the review system to try and keep the review score somewhat stationary.

Now, even though we know that $X(t)$ cannot be stationary, let’s still continue to evaluate the autocorrelation and autocovariance to see whether they can yield any further interesting results.

2.3.2 Getting the joint distribution

This is the trickiest bit of our analysis and requires us to find the second order joint distribution $f_X(x_1, x_2; t_1, t_2)$. This function can be stored as a 4D tensor and represents the probability density that a single game has a review score $x_1$ at $t_1$ and also a review score $x_2$ at $t_2$. If we sum over $x_1$ and $x_2$, it should total unity regardless of the choice of $t_1$ and $t_2$.

To find an estimate of this I expanded our previous method for estimating a first-order PDF by creating a 4D tensor to store the counts of games for all possible tuples $(x_1, x_2, t_1, t_2)$. So, for each unique tuple $(x_1, x_2, t_1, t_2)$ we look at each game which was available to buy at both $t_1$ and $t_2$, and then ask “did the game have a monthly review score of $x_1$ at $t_1$ and a monthly review score of $x_2$ at $t_2$?”. If yes, add one to the count. If no, do nothing.

Doing this naively like how I have explained has one big problem: it logically consists of 5 nested for loops, making it incredibly slow. Slow enough that I lost my patience and optimized it slightly to the following code:

# `global_rollups` is a list of rollups for all games
# `times` is a list of Python Datetime objects starting from the earliest 
#         reviewed game on steam to the current month. 
def find_nearest(array, value): 
    return (np.abs(array - value)).argmin()

f_xxtt = np.zeros((100, 100, len(times), len(times))) # x1, x2, t1, t2
timestamp_times = [int(t.timestamp()) for t in times]
for rollups in progress_bar(global_rollups):

    dts = [r['date'] for r in rollups]
    for i, d1 in enumerate(dts):
        ind1 = timestamp_times.index(d1)
        s1 = rollups[i]
        if s1['recommendations_up'] == 0 and s1['recommendations_down'] == 0: continue
            
        for ii, d2 in enumerate(dts):
            ind2 = timestamp_times.index(d2)
            s2 = rollups[ii]
            if s2['recommendations_up'] == 0 and s2['recommendations_down'] == 0: continue
                
            t1_perc = s1['recommendations_up']/(s1['recommendations_up'] + s1['recommendations_down'])
            t2_perc = s2['recommendations_up']/(s2['recommendations_up'] + s2['recommendations_down'])
            bin1_ind = find_nearest(bins, 100*t1_perc)
            bin2_ind = find_nearest(bins, 100*t2_perc)
            
            f_xxtt[bin1_ind, bin2_ind, ind1, ind2] += 1

Finally, to convert the counts to a valid 2nd order PDF, we divide each f_xxtt[:, :, t1, t2] slice by np.sum(f_xxtt[:, :, t1, t2]) to ensure a valid PDF regardless of the choice for $t_1$ and $t_2$.
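
A vectorized version of this normalization could look like the sketch below. The function name normalize_joint_counts is hypothetical, and leaving slices with no counts as all zeros (instead of dividing by zero) is an assumption about how to treat pairs of months that no game has in common.

```python
import numpy as np

def normalize_joint_counts(counts):
    """Normalize each counts[:, :, t1, t2] slice so that it sums to one."""
    counts = np.asarray(counts, dtype=float)
    # Total count in every (t1, t2) slice, kept broadcastable against counts
    totals = counts.sum(axis=(0, 1), keepdims=True)
    # Assumption: slices with zero total are left as zeros rather than NaN
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Hypothetical small count tensor (3 score bins, 2 months), only for illustration.
rng = np.random.default_rng(0)
demo_counts = rng.integers(1, 5, size=(3, 3, 2, 2)).astype(float)
demo_counts[:, :, 0, 1] = 0.0  # a month pair with no shared games
demo_pdf = normalize_joint_counts(demo_counts)
```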

2.3.3 Autocorrelation

With the second order joint distribution now constructed, finding the autocorrelation $R_{XX}(t, t+\tau)$ becomes a simple application of the definition:

\[R_{XX}(t, t+\tau) = \mathbb{E}[X(t)X(t+\tau)] = \sum_{\forall x_1, x_2}x_1 \cdot x_2 \cdot f_X(x_1, x_2; t, t+\tau)\]

We can compute this in code very simply using f_xxtt, recalling that bins is simply the list of 100 floats (as found in Part 1) which demarcate the center of each bin that we use to estimate a PDF of % positive review scores $X$:

def get_autocorr(t):
    """ Returns autocorrelation signal R_XX(t, t+tau) for specified t. i.e returned vector is indexed by tau"""
    R_xx = np.zeros(len(times))
    
    # tau runs over every offset for which t+tau is a valid month index
    for tau_ind, tau in enumerate(range(-t, len(times)-t)):
        for x1ind, x1 in enumerate(bins):
            for x2ind, x2 in enumerate(bins):
                R_xx[tau_ind] += x1*x2*f_xxtt[x1ind, x2ind, t, t+tau]
    return R_xx
R_xxs = np.array([get_autocorr(i) for i in range(len(times))])
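
The triple loop above is easy to follow but slow in pure Python. As a possible alternative, the same sum over $x_1$ and $x_2$ can be written as a single np.einsum contraction. This is a sketch under the assumption that f_xxtt and bins have the shapes described above; the function name autocorr_all and the small random tensor are hypothetical.

```python
import numpy as np

def autocorr_all(f_xxtt, bins):
    """R[t1, t2] = sum over x1, x2 of x1 * x2 * f_xxtt[x1, x2, t1, t2]."""
    bins = np.asarray(bins, dtype=float)
    # i, j index the score bins; a, b index the two months
    return np.einsum("i,j,ijab->ab", bins, bins, f_xxtt)

# Hypothetical tiny joint PDF (4 score bins, 3 months), only for illustration.
rng = np.random.default_rng(1)
demo_f = rng.random((4, 4, 3, 3))
demo_f /= demo_f.sum(axis=(0, 1), keepdims=True)
demo_bins = np.array([12.5, 37.5, 62.5, 87.5])
demo_R = autocorr_all(demo_f, demo_bins)
```

The resulting matrix is indexed the same way as R_xxs, i.e. by the absolute month indices t and t+tau.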

Now we can index $R_{XX}(t, t+\tau)$ with R_xxs[t, t+tau]. Now that we have it, let’s plot it. Concretely, let’s plot an animation where we have 123 frames and each frame corresponds to a month $t$ since the Steam review feature was released. At each frame, we will plot $R_{XX}(t, t+\tau)$ for free variable $\tau$ (recall $t$ is fixed in the current frame). $\tau$ will be the month offset added to $t$, thus for earlier $t$, most valid values of $\tau$ will be greater than 0, while for later $t$ (e.g. Dec 2020), most valid values for $\tau$ will be less than 0. Without further delay, such a plot yields:

This animation has lots of info to dive into:

$R_{XX}(t, t+\tau) \neq R_{XX}(\tau)$ as shown by the entire graph shifting up and down (at times fairly drastically) as $t$ is varied. Thus the autocorrelation does not depend only on $\tau$, and the process is certainly non-stationary.
One notices a local maximum at $\tau=0$, which moves along with the plot as $t$ advances. This is entirely expected, as we know that for a stationary random process the autocorrelation has a global maximum at $\tau=0$. However, since the monthly % review score $X(t)$ is certainly not stationary, we only observe a local maximum at $\tau=0$.
We again see echoes of the first order moment $\bar{X}(t)$ plotted earlier, where for $t+\tau$ far in the past (2012), the autocorrelation is much higher due to the, on average, higher review scores given to games back then.
The sharp, high local maxima present in the plot of $\bar{X}(t)$ are again present and even more prominent in $R_{XX}(t, t+\tau)$. This further indicates some periodicity in the signal at somewhat regular intervals. We will leave a proposed hypothesis for such periodicity to our analysis of the spectral power density. All the evidence of periodicity collected so far will support our final findings there and yield a very interesting heuristic.

It appears as though we cannot independently draw any interesting heuristics from this piece of information, but it is interesting nonetheless.

2.3.4 Autocovariance

Proceeding to the autocovariance of monthly review scores $C_{XX}(t, t+\tau)$, we can compute it again according to its (unnormalized) definition:

\[C_{XX}(t, t+\tau) = \mathbb{E}[(X(t) - \bar{X}(t))\cdot(X(t+\tau) - \bar{X}(t+\tau)) ] = \sum_{\forall x_1, x_2} (x_1 - \bar{X}(t))\cdot (x_2 - \bar{X}(t+\tau))\cdot f_{X}(x_1, x_2; t, t+\tau)\]

And the code to compute it is nearly identical to that for the autocorrelation:

means = np.zeros(len(times))
for j in range(len(times)):
    # Marginalize out x2 with .sum(axis=1) to get the first-order PMF at month j
    means[j] = sum([bins[i]*f_xxtt[:, :, j, j].sum(axis=1)[i] for i in range(len(bins))])

def get_autocov(t):
    """ Returns autocovariance signal C_XX(t, t+tau) for specified t. i.e returned vector is indexed by tau"""
    C_xx = np.zeros(len(times))
    
    for tau_ind, tau in enumerate(range(-t, len(times)-t)):
        for x1ind, x1 in enumerate(bins):
            for x2ind, x2 in enumerate(bins):
                C_xx[tau_ind] += (x1-means[t])*(x2-means[t+tau])*f_xxtt[x1ind, x2ind, t, t+tau]
    return C_xx
C_xxs = np.array([get_autocov(i) for i in range(len(times))])
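
For larger tensors, the per-month means and the autocovariance can also be computed without explicit Python loops. The sketch below follows the same definition, taking the means from the f_xxtt[:, :, j, j] slices; the function name autocov_all and the random test tensor are hypothetical.

```python
import numpy as np

def autocov_all(f_xxtt, bins):
    """C[t1, t2] = sum over x1, x2 of (x1 - m[t1]) * (x2 - m[t2]) * f_xxtt[x1, x2, t1, t2]."""
    bins = np.asarray(bins, dtype=float)
    idx = np.arange(f_xxtt.shape[2])
    # First-order PMF per month: take the (j, j) slices and marginalize out x2
    first_order = f_xxtt[:, :, idx, idx].sum(axis=1)  # shape (n_bins, n_times)
    means = bins @ first_order                        # shape (n_times,)
    # Deviation of every bin center from every month's mean
    dev = bins[:, None] - means[None, :]              # shape (n_bins, n_times)
    return np.einsum("ia,jb,ijab->ab", dev, dev, f_xxtt), means

# Hypothetical tiny joint PDF (4 score bins, 3 months), only for illustration.
rng = np.random.default_rng(2)
cov_f = rng.random((4, 4, 3, 3))
cov_f /= cov_f.sum(axis=(0, 1), keepdims=True)
cov_bins = np.array([12.5, 37.5, 62.5, 87.5])
demo_C, demo_means = autocov_all(cov_f, cov_bins)
```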

Now, like with the autocorrelation, the autocovariance $C_{XX}(t, t+\tau)$ can be indexed with C_xxs[t, t+tau]. And we can follow the exact same reasoning for plotting $C_{XX}(t, t+\tau)$ as we did for $R_{XX}(t, t+\tau)$, which yields:

Now this is rather interesting. Recall that in each frame we are plotting the covariance between the % positive review scores for games at $t$ and $t+\tau$. So if a point at $\tau$ has a large autocovariance, it means that games with a higher % positive score $x_1$ at $t$ tended to also have a higher monthly % positive review score $x_2$ at $t+\tau$. With this in mind, we can make some interesting observations:

There is always a global maximum at $\tau=0$. This is as expected, since we are correlating $X(t) - \bar{X}(t)$ with $X(t+\tau) - \bar{X}(t+\tau)$, where both signals are zero-mean. And if we correlate two zero-mean signals with one another then the correlation with no offset $\tau=0$ will be a global maximum.

To get some intuition for this, if we are essentially considering a zero-mean signal when computing $C_{XX}$, then we are considering a first order stationary signal, and this can in many circumstances be approximated by a wide-sense stationary signal. And, we can prove that wide-sense stationary signals have a maximum autocorrelation at $\tau=0$. So, when we are looking at the autocovariance of the zero-mean $X(t) - \bar{X}(t)$ we will expect a maximum at $\tau=0$.
Apart from the maximum at $\tau=0$, for most of the time the rest of the values for $C_{XX}(t, t+\tau)$ are flat, or roughly constant at a somewhat small positive value. This can be interpreted as a game’s monthly review score being correlated with its future or past monthly review scores. I.e. if the game’s % review score for reviews this month is, say, 5% above the expected review score (e.g. 82%), then we would expect next month’s review score to also be roughly 5% above the mean (82%).
At December 2013 we can see the right side of the graph shoot up – this is the lull in monthly review scores we observed in the plot of $\bar{X}(t)$. As the monthly review scores for most games started dropping around this time, it means that the future monthly review scores from after this drop correlate more strongly with other monthly review scores from after this drop (higher covariance) than with review scores prior (lower covariance).

This allows us to draw up one more slightly useful heuristic:

Heuristic: if a game has a good % positive review score in the last month, it is likely to continue having a good % positive review score in the next month. Or conversely, if a game has a bad review score last month it is likely to continue its poor review score next month.

Some cool data we have found so far, let’s investigate the last two avenues I wanted to look at.

2.4 Mean ergodicity

To be mean ergodic means that the time average of the monthly % positive review scores for any particular game converges to the expected value of the random process $\bar{X}(t)$ if we consider the game for enough months. So, is the monthly review score $X(t)$ mean ergodic? No, because it is not stationary. Since the random process is non-stationary it is not mean ergodic.

If we consider a very highly rated game (e.g. Factorio), then its monthly % positive review scores certainly do not converge to the average $\bar{X}(t)$, especially since the average $\bar{X}(t)$ changes with time!

2.5 Power density spectrum

To fully understand the frequency content and thus periodicity of $X(t)$, we need to find its power density spectrum $S_{XX}(f)$. Consider the definition of $S_{XX}(f)$ for a continuous-time random process with autocorrelation $R_{XX}(t, t+\tau)$:

\[S_{XX}(f) = \mathbb{F}\bigg\{ \lim_{T \rightarrow \infty} \frac{1}{2T} \int_{-T}^T R_{XX}(t, t+\tau) dt \bigg\}\]

Where $\mathbb{F}$ is the Fourier transform (in this case on the variable $\tau$). Note that we could not use the simpler Wiener-Khinchin theorem to compute the spectral density because the random process is non-stationary. Since we have a discrete-time random process and autocorrelation, we may adapt the equation above to find an estimate of $S_{XX}(f)$ by swapping out the integral for a sum and using a discrete Fourier transform for $\mathbb{F}$:

\[S_{XX}(f) \approx \mathbb{F}\bigg\{ \lim_{T \rightarrow \infty} \frac{1}{2T} \sum_{t=-T}^T R_{XX}(t, t+\tau) \bigg\}\]

Using the code variables introduced earlier, we can compute this using the code:

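# Uses np, times and R_xxs as defined earlier in this post.
# Accumulator for the time average of the autocorrelation matrix.
inner_signal = np.zeros(len(times))

for t in range(len(times)):
    # tau covers every lag that keeps t + tau inside the observed window,
    # so ti == t + tau runs over 0 .. len(times) - 1.
    for ti, tau in enumerate(range(-t, len(times) - t)):
        inner_signal[ti] += R_xxs[t, t+tau]

# Divide by the number of time steps to turn the sum into an average.
inner_signal /= len(times)
# Discrete Fourier transform of the averaged signal.
S_xx = np.fft.fft(inner_signal)
# Matching frequency bins; d=1 because samples are one month apart.
freqs = np.fft.fftfreq(S_xx.size, d=1)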
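One way to read the estimator is to index the time average explicitly by the lag $\tau$, averaging only over the pairs $(t, t+\tau)$ that fall inside the observed window. The following is a minimal sketch of that reading on a synthetic autocorrelation matrix. The function name `time_averaged_autocorr` and the toy matrix `R` are assumptions for illustration, not the variables from this post.

```python
import numpy as np

def time_averaged_autocorr(R):
    """Average R[t, t + tau] over t for each lag tau in -(N-1)..(N-1)."""
    n = R.shape[0]
    lags = np.arange(-(n - 1), n)
    # np.diagonal(R, offset=tau) holds exactly the entries R[t, t + tau].
    avg = np.array([np.diagonal(R, offset=int(tau)).mean() for tau in lags])
    return lags, avg

# Toy autocorrelation with a constant offset and a 12-month cycle:
# R[t, s] = 0.64 + 0.01 * cos(2*pi*(s - t)/12). Values are made up.
n = 96
t = np.arange(n)
R = 0.64 + 0.01 * np.cos(2 * np.pi * (t[None, :] - t[:, None]) / 12)

lags, r_avg = time_averaged_autocorr(R)

# Remove the DC offset so the periodic component is easy to find,
# then take the magnitude of the one-sided DFT.
spectrum = np.abs(np.fft.rfft(r_avg - r_avg.mean()))
freqs_demo = np.fft.rfftfreq(r_avg.size, d=1)  # cycles/month

peak_freq = freqs_demo[np.argmax(spectrum)]
print(f"peak at {peak_freq:.4f} cycles/month, "
      f"about {1 / peak_freq:.1f} months/cycle")
```

Subtracting the mean before the transform is a design choice for this toy case only: it suppresses the large DC component so that the peak search lands on the periodic term.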

Since the time span between samples for monthly review metrics is 1 month, the computed frequencies are in cycles/month. Now that we have S_xx we can plot the power density spectrum!

Spectral power density of X(t)

Periodicity

From the plot above we can see that the majority of power is contained in the very low frequencies, which is what we expect. Remember the animation of % positive review scores for each month – they were all greater than 50%. In other words they all had a large DC offset, thus we expect the DC component of $S_{XX}(f)$ to be large – which it is.

Now look at the peaks of $S_{XX}(f)$: the first peak is at roughly 0.04-0.045 cycles/month, which indicates a periodic component of 22.2-25 months/cycle. The second sidelobe peak is at 0.06-0.07 cycles/month (roughly 14.3-16.7 months/cycle), and the third sidelobe peak is at roughly 0.165 cycles/month (about 6.06 months/cycle).

Notice the trend? These are roughly biennial (every two years), annual, and semi-annual! In other words there is a strong frequency component aligned with annual boundaries. But what event that happens at regular half-year to full-year intervals could drastically influence how well people rate games? Steam sales.
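The conversion from peak frequency to period is just a reciprocal. As a quick sanity check, the sketch below redoes that arithmetic for the approximate peak locations quoted above. The dictionary `peaks` and the helper `period_months` are names introduced here for illustration.

```python
# Peak locations read off the plot, in cycles/month (approximate).
peaks = {
    "first peak": (0.04, 0.045),
    "second peak": (0.06, 0.07),
    "third peak": (0.165, 0.165),
}

def period_months(freq_cycles_per_month):
    """Period in months for a frequency given in cycles/month."""
    return 1.0 / freq_cycles_per_month

for name, (f_lo, f_hi) in peaks.items():
    # A higher frequency gives a shorter period, so the bounds swap.
    p_short, p_long = period_months(f_hi), period_months(f_lo)
    print(f"{name}: {p_short:.1f} to {p_long:.1f} months/cycle")
```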

In short: Steam sales mean cheaper games. Cheaper games mean more games are worth people's money (read: they feel their enjoyment of more games justified the price they paid). This means that compared to non-sale times, more people are happy with the value exchange for games during this time, and thus leave more positive reviews.

To get more concrete with this, let's plot $\bar{X}(t)$ again but indicate certain events at the local maxima:

Mean monthly review score $\bar{X}(t)$ with events marked at local maxima

Now the pattern stands out clearly. During some Steam sales, reviews tend to be much more positive. What is peculiar is that not all Steam sales follow this trend equally. The Steam autumn sales appear to have always followed this trend since 2016, and the summer sale sometimes follows it. However, the winter sale and Halloween sales do not seem to make much of an appearance since 2016.

This could be because discounts during the autumn sale are much better than those during other periods, or because people are simply feeling more generous in their reviews and spending during the autumn sale compared to other sales (at least since 2016). In any event, this data gives us two final important heuristics:

Heuristics:

Steam review scores for games are roughly periodic with the Autumn and Summer sales, having spikes during these times. If the majority of a game’s positive reviews are given during sales, it likely means that the game is not worth one’s money outside of a sale. Conversely, if a game receives consistent positive review % scores each month even outside of a sale, then it is likely a good game worth one’s money regardless of whether it is on sale or not.
Steam reviews tend to be more positive during sales, especially Autumn sales. If you are buying a game during an Autumn sale, the recent review scores will likely be higher than the game typically gets, while if you are buying outside of a sale the recent review scores will look deflated compared to those during a sale.
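To apply these heuristics to a single game, one could compare its pooled % positive score inside and outside sale months. The sketch below does this for a hypothetical game. The review counts are made up, and treating June, July and November as the Summer and Autumn sale months is an assumption for illustration.

```python
import numpy as np

# Hypothetical monthly review counts for one game over two years
# (January of year 1 through December of year 2). Values are made up.
months = np.tile(np.arange(1, 13), 2)  # calendar month 1..12, twice
positive = np.array([80, 75, 70, 72, 68, 150, 140, 66, 64, 70, 210, 90,
                     60, 58, 55, 57, 54, 120, 110, 52, 50, 55, 180, 70])
negative = np.array([30, 29, 28, 30, 27, 30, 28, 27, 26, 28, 32, 30,
                     26, 25, 24, 25, 24, 26, 24, 23, 22, 24, 28, 27])

# Assumed sale months: Summer sale (June, July), Autumn sale (November).
sale_months = [6, 7, 11]
in_sale = np.isin(months, sale_months)

def pct_positive(pos, neg):
    """Pooled % positive over the selected months."""
    return 100.0 * pos.sum() / (pos.sum() + neg.sum())

sale_score = pct_positive(positive[in_sale], negative[in_sale])
base_score = pct_positive(positive[~in_sale], negative[~in_sale])
sale_share = positive[in_sale].sum() / positive.sum()

print(f"% positive during sales:  {sale_score:.1f}")
print(f"% positive outside sales: {base_score:.1f}")
print(f"share of positive reviews given during sales: {sale_share:.0%}")
```

A large gap between the two scores, together with a high share of positive reviews landing in sale months, is the pattern the first heuristic warns about.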

Conclusion

Thanks for making it to the end! I hope the heuristics I have given are useful to you, or failing that, that the figures were at least interesting to look at. Even though the data was somewhat tricky to get and work with, since I suspect Steam does not really want people analyzing their data this much, I am thankful to Steam for allowing me some way to access it so that I could perform this analysis. And for Steam having a subscriber agreement much much better than alternative gaming platforms. And for the great games Steam makes.

Anywho, thanks for your time, and if you spot any errors I have made or would just like to ask a question, please get in contact with me via the About page. Cheers.

tags: steam - culture - probability - gaming