M Baas

I am an E&E engineering PhD student at Stellenbosch University. I post about deep learning, electronics, and other things I find interesting.

18 October 2022

Hierarchical Global Style Tokens - A Practical Introduction

by Matthew Baas

A short practical introduction to the hierarchical global style token (HGST) layer occasionally used in speech processing tasks.

TL;DR: global style tokens (GST) are a fairly well-known information bottleneck layer used in speech processing models. A less well-known extension to GSTs by the same authors is that of hierarchical global style tokens (HGST). In this post I give a practical introduction to HGSTs and my code implementation of the HGST layer. This post assumes you vaguely know about deep learning and speech processing, and have maybe heard of GSTs before.

HGST was introduced in “Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis” by Xiaochun An and others. Their idea was to add a hierarchical extension to the previously introduced GST technique in “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis” to allow for more fine-grained control of the synthesized utterance.

1. Background: global style tokens

The GST layer is fundamentally an information bottleneck layer. It projects an input vector to an output vector using a fixed, small number of learned basis vectors. The GST was introduced for speaker embeddings to attempt to better learn a disentangled representation of speaker identity by enforcing an information bottleneck on the incoming speaker embedding, thereby forcing the model to remove all unnecessary (i.e. linguistic) information entangled in the input vector.

This is done using an attention operation, whereby the input vector acts as a single query and the learned set of embeddings act as both the key and value sequences. In diagram form, the GST layer is defined:

Figure: Global Style Token layer

Here \(E\) is an \(N \times d\) matrix of \(N\) learnt embedding vectors, each of dimension \(d\). Typically \(d\) is in the range of 128-1024, while \(N\) is typically very small, between 5 and 32. The number of embeddings \(N\) determines how severe the information bottleneck is: the lower it is, the fewer basis vectors the layer can use to represent the input vector \(\mathbf{v}\). So, if \(N\) is too large, the purpose of the layer is defeated, as you are no longer imposing a meaningful information bottleneck. The output vector \(\mathbf{s}\) is then a convex combination of the embedding vectors.
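
To make the convex combination concrete, here is a minimal pure-Python sketch of single-query dot-product attention over a toy embedding matrix. It is an illustration under simplifying assumptions rather than the layer itself: the helper name `gst_attention` and the toy values are made up, and the learned query/key/value projections of a real attention layer are left out.

```python
import math


def gst_attention(v, E):
    """Single-query dot-product attention; rows of E act as both keys and values."""
    d = len(v)
    # scaled dot-product score between the query and every embedding vector
    scores = [sum(vi * ei for vi, ei in zip(v, e)) / math.sqrt(d) for e in E]
    # softmax, shifted by the max score for numerical stability
    m = max(scores)
    exps = [math.exp(sc - m) for sc in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # output: weighted sum of the embedding vectors
    s = [sum(w * e[j] for w, e in zip(weights, E)) for j in range(d)]
    return s, weights


# toy example (assumed values): N = 4 embeddings of dimension d = 3
E = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
v = [2.0, 0.1, -1.0]
s, weights = gst_attention(v, E)
```

Because the weights come out of a softmax, they are all positive and sum to one, so \(\mathbf{s}\) always lies inside the convex hull of the rows of \(E\), whatever \(\mathbf{v}\) is.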

Intuitively, the GST layer attempts to represent the entangled input vector \(\mathbf{v}\) containing speaker information as a convex combination of the GST layer’s style embeddings \(E\). This confines \(\mathbf{s}\) to the span of the embedding vectors, a subspace of dimension at most \(N\), enforcing a low-dimensional constraint on the output. So, if the model can obtain some information more easily elsewhere (namely content or linguistic information), then it will learn to remove that information from the output embedding, retaining only the critical information that cannot easily be obtained from elsewhere.

The original GST authors found that, after training a voice conversion model where the speaker embedding was fed through a GST layer, the GST embedding vectors corresponded to recognizable characteristics of speaker identity. For example, scaling the weighting of one row of \(E\) would control the masculinity or femininity of the speaker, while another would control the overall loudness to a substantial extent.

2. HGST: recursively approximating the residual

While GSTs work well and have been used fairly widely since the original publication, one particular extension considers whether we can impose a hierarchical constraint on the information encoded in these embedding vectors \(E\). Concretely, Xiaochun An et al. noted that GST embedding vectors still contain a mixture of style attributes and do not consider the hierarchical nature of speaker identity. For example, we can imagine that a speaker identity can be broken up at the top level into male/female, then into a specific male/female voice, and then into more fine-grained attributes like speaker intonation (e.g. slow, fast, whisper). GST vectors are not capable of capturing this coarse-to-fine decomposition of speaking style.

To remedy this, Xiaochun An et al. introduced the Hierarchical Global Style Token (HGST) layer in “Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis” in 2019 to model different levels of abstract concepts of speaker identity with several continuous latent spaces. Essentially, the HGST layer is a series of stacked GST layers where the vector being approximated by each GST layer is the residual, i.e. the difference between the input vector and the prior approximation. Like the GST, the HGST layer accepts a single vector as input and produces a single vector as output. The aim is to disentangle the speaker information contained in the input vector such that the linear directions in the vector space spanned by its output vectors are associated with common factors of speaker identity variation.

Figure: Hierarchical Global Style Token layer

The concrete specification of the HGST layer is given above. The layer consists of \(l\) global style token (GST) sublayers, each of which has a trainable matrix of \(N\) style vectors. Intuitively, the first GST layer attempts to represent the entangled input vector \(\mathbf{v}\) containing speaker information as a convex combination of the first GST layer’s style embeddings \(E_1\). This confines the output vector of the first GST layer to a subspace of dimension at most \(N\) (since it is expressed as a linear combination of \(N\) style vectors), enforcing a low-dimensional subspace on the output. The residual difference between the original vector \(\mathbf{v}\) and the approximation produced by the first GST layer is then fed to a second GST layer for another round of approximation. This process continues until the last sublayer \(l\), at which point the vector \(\mathbf{v}\) has been recursively approximated by low-dimensional subspaces.
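
Written out as equations, the recursion looks roughly as follows. The symbols \(\mathbf{s}_i\) (running approximation after sublayer \(i\)) and \(\mathrm{Attn}(\mathbf{q}, E_i)\) (attention output of sublayer \(i\) for query \(\mathbf{q}\)) are notation introduced here for illustration, following the residual convention of the implementation later in this post rather than the paper's own notation:

\[
\mathbf{s}_1 = \mathrm{Attn}(\mathbf{v}, E_1), \qquad
\mathbf{s}_i = \mathbf{s}_{i-1} + \mathrm{Attn}(\mathbf{v} - \mathbf{s}_{i-1}, E_i) \quad \text{for } i = 2, \dots, l, \qquad
\mathbf{s} = \mathbf{s}_l
\]

Each sublayer only ever sees what the previous sublayers have failed to capture, \(\mathbf{v} - \mathbf{s}_{i-1}\), and adds its own convex combination on top of the running approximation.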

If the latent space in which the final output style vector \(\mathbf{s}\) exists is to be disentangled, then the residual approximated by each of these GST layers – acting as information bottlenecks – must encode common factors of speech variation in a hierarchical fashion, hence the name HGST. An important question to ask is ‘why must the embeddings of each GST layer approximate the input vector?’ While tricky to observe at first glance, if one looks closely at the form of each GST layer, the output of the attention layer (i.e. a convex combination of the value and thus embedding vectors) is summed with the residual. For further-on GST layers to effectively learn from this summed output, they must encode similar information about speech variation.

More technically, during model design, if we enforce the design constraint that speaker information can only be obtained through this final style vector \(\mathbf{s}\) (as in many voice conversion systems), and the model is trained in some way to match the synthesized speech with a provided ground-truth sample, then we can show that each GST sublayer learns to recursively approximate the entangled input vector \(\mathbf{v}\).

In this way, a single vector \(\mathbf{v}\) is expressed as the sum of \(l\) convex combinations. The original HGST authors found that the embeddings and residuals for each sublayer in the hierarchy encode different aspects of speech. For example, for \(l = 3\) sublayers with \(N = 5\), they found the first layer broadly encodes speaker gender and coarse speaker characteristics, while the second encodes more fine-grained speaker identity, and the last sublayer encodes fine-grained details such as speaker microphone quality and noise. This ability to learn low-dimensional subspaces corresponding to common speaker characteristics makes this a key technique when designing models to learn a disentangled latent space over speaker identity (or another characteristic of interest).
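
As a rough numerical illustration of the "sum of \(l\) convex combinations" view, the pure-Python sketch below stacks three toy sublayers with random, untrained embeddings. The helper names `convex_combo` and `hgst_sum` are hypothetical, and the learned attention projections are omitted, so this only shows the structure of the computation, not what a trained layer produces.

```python
import math
import random


def convex_combo(query, E):
    """Softmax attention of one query over the rows of E; returns the weighted sum."""
    d = len(query)
    scores = [sum(q * e for q, e in zip(query, row)) / math.sqrt(d) for row in E]
    m = max(scores)
    exps = [math.exp(sc - m) for sc in scores]
    z = sum(exps)
    weights = [x / z for x in exps]
    return [sum(w * row[j] for w, row in zip(weights, E)) for j in range(d)]


def hgst_sum(v, sublayers):
    """Each sublayer attends to the residual left over by the previous ones."""
    s = [0.0] * len(v)  # running approximation, so the first residual is v itself
    parts = []
    for E in sublayers:
        residual = [vi - si for vi, si in zip(v, s)]
        part = convex_combo(residual, E)
        parts.append(part)
        s = [si + pi for si, pi in zip(s, part)]
    return s, parts


random.seed(0)
l, N, d = 3, 5, 4  # toy sizes; the embeddings are random, not trained
sublayers = [[[random.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
             for _ in range(l)]
v = [random.uniform(-1, 1) for _ in range(d)]
s, parts = hgst_sum(v, sublayers)
```

The final `s` is exactly the element-wise sum of the entries in `parts`, one convex combination per sublayer.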

3. Practical implementation in PyTorch

The theory of HGSTs is very interesting, and the original HGST paper is likely the best place to get the full details. There is a lack, however, of practical guides on how to implement an HGST layer in code. Here we aim to provide a basic PyTorch implementation of an HGST layer.

The first step is to define the GST layer with hooks for residual connections as shown in the HGST figure earlier. Below is one such implementation in PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GST(nn.Module):
    """ Global style token layer """

    def __init__(self, N, dim, num_heads=1):
        super().__init__()
        if num_heads != 1: raise AssertionError("Only 1 head supported for GST layers")
        self.embed = nn.Parameter(torch.FloatTensor(N, dim // num_heads))
        torch.nn.init.normal_(self.embed, mean=0, std=0.5)
        self.stl_dim = dim
        self.attention = nn.MultiheadAttention(dim, num_heads)

    def forward(self, inputs, residual=True, r=None, return_weights=False):
        """ The forward method for GST with residual connection, assumes `inputs` (and `r`, if given) of shape (bs, dim) """
        bs = inputs.size(0)
        # attention input: get input/residual vector
        if residual: query = (r - inputs).unsqueeze(0)  # (L=1, bs, dim)
        else: query = inputs.unsqueeze(0) # (L=1, bs, dim), skip residual subtraction if first sublayer.
        # keys are tanh of learnt embeddings
        keys = torch.tanh(self.embed).unsqueeze(1).expand(-1, bs, -1)  # (N, bs, dim) 
        # compute the style embedding
        style_embed, w = self.attention(query, keys, keys) # (1, bs, dim)
        style_embed = style_embed.squeeze(0) # (bs, dim)
        # output: sum input with output for residual connection
        if residual:
            if return_weights: return (style_embed + inputs, w)
            else: return style_embed + inputs
        else:
            # skip sum if first sublayer to save an operation
            if return_weights: return style_embed, w
            else: return style_embed

The full HGST layer is then simply a stack of these modified GST layers:

class HGST(nn.Module):
    def __init__(self, l, N, dim):
        super().__init__()
        self.stls = nn.ModuleList([GST(N, dim) for i in range(l)])
        self.stl_dim = dim

    def forward(self, v):
        """ Forward through HGST layer with `v` input of shape (bs, dim) """
        for i, layer in enumerate(self.stls):
            if i == 0: style_embed = layer(v, residual=False)
            else: style_embed = layer(style_embed, residual=True, r=v)
        return style_embed # output s style vector, (bs, dim)

Not too hard! It is also numerically stable and works well with fp32/fp16 mixed precision training. You can also go on and add utility functions to manually specify the attention weighting of each embedding vector at inference to get a feel for the kind of characteristics encoded by each embedding vector.
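
One possible shape for such a utility is sketched below. The function name `style_from_weights` is hypothetical, and the sketch assumes a single-head `nn.MultiheadAttention` with its default packed projection (`in_proj_weight`, `in_proj_bias`) and `out_proj`, which is how the GST layer above constructs it. It applies the value and output projections by hand so that hand-picked weights go through the same path as learned ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def style_from_weights(embed, attention, weights):
    """Hypothetical helper: style vector from hand-picked attention weights.

    embed:     (N, dim) learnt style embeddings of one GST sublayer
    attention: that sublayer's single-head nn.MultiheadAttention
    weights:   (bs, N), each row non-negative and summing to 1
    """
    dim = embed.size(1)
    values = torch.tanh(embed)  # same squashing as in the forward pass
    # the packed projection stacks query/key/value weights; the last `dim`
    # rows are the value projection
    w_v = attention.in_proj_weight[2 * dim:]
    b_v = attention.in_proj_bias[2 * dim:]
    values = F.linear(values, w_v, b_v)  # (N, dim)
    mixed = weights @ values  # (bs, dim)
    return attention.out_proj(mixed)  # (bs, dim)


# sanity check on a toy, untrained layer: feeding back the weights that the
# attention itself produced should reproduce its output
torch.manual_seed(0)
N, dim, bs = 5, 16, 2
embed = torch.randn(N, dim) * 0.5
attention = nn.MultiheadAttention(dim, num_heads=1)
query = torch.randn(1, bs, dim)
keys = torch.tanh(embed).unsqueeze(1).expand(-1, bs, -1)  # (N, bs, dim)
with torch.no_grad():
    expected, w = attention(query, keys, keys)  # w: (bs, 1, N)
    manual = style_from_weights(embed, attention, w.squeeze(1))
```

With a trained layer you would pass `layer.embed` and `layer.attention`, and replace `w` with, say, a one-hot row to hear what a single embedding vector contributes. Note that this gives only the attention output of one sublayer; the residual sum from earlier sublayers still has to be added separately.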

For an example of this code in action in a trained model, my paper at Interspeech 2022 uses HGSTs to disentangle speaker identity from both language and linguistic content to allow for voice conversion on unseen languages without any text information using \(N=5\) and \(l=3\). After having trained my Tacotron2-like model, like the original HGST authors, I observed that embeddings \(E\) in earlier GST sublayers encoded information about coarse speaker identity/style, while later GST sublayers encoded more fine-grained information about speaking style. In summary, the HGST layer can be a useful layer to use if you want to train a network to hierarchically disentangle some information contained in a vector.


Global style token layers are a good way to impose information bottlenecks in networks and provide a nifty method to disentangle common factors of variation in the input vector. Hierarchical style token layers take this a step further by cascading GST layers to learn a hierarchy of disentangled embeddings corresponding to coarse (for earlier sublayers) to fine (later sublayers) variations of the input vector. In the context of voice conversion or speech synthesis, this is perfect for disentangling speaking style information from the actual linguistic content (i.e. specific words spoken) – the primary reason for the introduction of both GST and HGST layers. However, they may be useful elsewhere where you wish to impose a structured information bottleneck and learn a disentangled representation of a particular latent space.

Hopefully the code snippets above help you along your way to implementing the HGST layer, and may your training go well!

Thank you for reading.

tags: machine learning - reimplementation - speech processing - convex approximation - attention