M Baas

I am an E&E student at Stellenbosch University. I post about deep learning, electronics, and other things I find interesting.

8 July 2019

Danbooru2018 pytorch pretrained models

by Matthew Baas

Pretrained PyTorch Resnet models for anime images using the Danbooru2018 dataset.

TL;DR:

Results

result output image

The models were trained using 90% of the Danbooru2018 dataset (or the 36GB Kaggle subset), with the remaining 10% being used for validation. Note: the data was split at random, using a seed of 15 for np.random.seed() (see training notebook for how to reproduce the exact split).

The results for each network are shown in the table below, with the metrics being evaluated from the validation set.

Model Cross entropy loss Accuracy (0.4 threshold) F2 score (0.4 threshold) #tags Training time GPU
Resnet18 0.2793 0.8980 0.2827 100 ~2 days 1 Tesla K80
Resnet34 0.1147 0.9665 0.3307 500 ~2 days 1 Tesla K80
Resnet50 0.0137 0.9964 0.3831 6000 ~7 days 1 Tesla V100

For best results please use the Resnet50 model, since it is trained on the full dataset and generally performs much better. The #tags is the number of most popular tags (in the dataset) that the networks were trained to predict. e.g #tags being 6000 means the networks were trained to predict tags using the top 6000 most frequently occurring tags in the Danbooru2018 dataset.

Note that since the number of tags that each model is trained with is different, comparing their metrics is not straightforward, however qualitatively the resnet50 model performs significantly better than the other two.

Resnet50 bonus features

The resnet50 model has the age rating treated as a tag, so the resnet50 model doubles up as a ‘safe’/’questionable’/’explicit’ classifier using the corresponding tags of age_raitng_s, age_rating_q, age_rating_e.

The score attribute of each image in the Danbooru2018 metadata is also treated as a tag, having the set of associated tags meta_score_x where x is the score from 1-5 (e.g images with a score: 5 in the metadata will have the tag meta_score_5). Although I have found this feature to not be particularly effective just yet.

Getting started

First we download a sample image to try on

import urllib, urllib.request
url, filename = ("https://github.com/RF5/danbooru-pretrained/raw/master/img/egpic1.png", "egpic1.png")
try: urllib.URLopener().retrieve(url, filename)
except: urllib.request.urlretrieve(url, filename)

1. The pure pytorch way

PS: this can be done out of the box in google colab, or any python environment with pytorch installed.

import torch
model = torch.hub.load('RF5/danbooru-pretrained', 'resnet50')
# or 
# model = torch.hub.load('RF5/danbooru-pretrained', 'resnet18')
# or
# model = torch.hub.load('RF5/danbooru-pretrained', 'resnet34')
model.eval()

All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 64. The images have to be loaded in to a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225] for the resnet18/resnet34 models, and a mean = [0.714, 0.663, 0.652] and std = [0.297, 0.302, 0.298] for the resnet50 model.

Here’s a sample execution using the example resnet50 model and the image from earlier:

from PIL import Image
import torch
import torch.nn.functional as F
from torchvision import transforms
input_image = Image.open(filename)
preprocess = transforms.Compose([
    transforms.Resize(360),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.7137, 0.6628, 0.6519], std=[0.2970, 0.3017, 0.2979]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the model

if torch.cuda.is_available():
    input_batch = input_batch.to('cuda')
    model.to('cuda')

with torch.no_grad():
    output = model(input_batch)

# The output has unnormalized scores. To get probabilities, you can run a sigmoid on it.
probs = torch.sigmoid(output[0]) # Tensor of shape 6000, with confidence scores over Danbooru's top 6000 tags

We can now plot the outputs next to the input image

import matplotlib.pyplot as plt
import json

with urllib.request.urlopen("https://github.com/RF5/danbooru-pretrained/raw/master/config/class_names_6000.json") as url:
    class_names = json.loads(url.read().decode())

plt.imshow(input_image)
plt.grid(False)
plt.axis('off')

def plot_text(thresh=0.2):
    tmp = probs[probs > thresh]
    inds = probs.argsort(descending=True)
    txt = 'Predictions with probabilities above ' + str(thresh) + ':\n'
    for i in inds[0:len(tmp)]:
        txt += class_names[i] + ': {:.4f} \n'.format(probs[i].cpu().numpy())
    plt.text(input_image.size[0]*1.05, input_image.size[1]*0.85, txt)

plot_text()
plt.tight_layout()
plt.show()

result output image

2. The fastai way

If you already have fastai installed, you can use their load_learner() function after downloading the exported model of your choice from the github release page:

from fastai.vision import *
# if you put the resnet50 file in the current directory:
learn = load_learner(path='./', file='fastai_danbooru_resnet50.pkl')
# or any of these variants


model = learn.model # the pytorch model
mean_std_stats = learn.data.stats # the input means/standard deviations
class_names = learn.data.classes # the class names

# Predict on an image
img = open_image('egpic1.png')
predicted_classes, y, probs = learn.predict(img)
print(probs)
>> tensor([1.0000e+00, 1.0000e+00, 2.2001e-13,  ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00])

Training

Training procedure

Training was done using the various tricks of the fastai library, namely:

Networks

The networks are all standard Resnets with the network’s body defined the same as the torchvision.models definition.

Each model consists of a body, the output of which is fed into a ‘head’ network which predicts the final tags. For each resnet, the body is the full resnet except the last 2 layers, which are removed since they are specific to the 1000 ImageNet classes. In particular the final pooling and linear layer are removed and replaced by a head to accommodate for the 100, 500, or 6000 tag outputs. Specifically, the original last two layers (the non-convolutional part) of torchvision’s ResNet models:

  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=<some number>, out_features=1000, bias=True)

Becomes the following for the danbooru resnet models:

  (1): Sequential(
    (0): AdaptiveConcatPool2d(
      (ap): AdaptiveAvgPool2d(output_size=1)
      (mp): AdaptiveMaxPool2d(output_size=1)
    )
    (1): Flatten()
    (2): BatchNorm1d(<some number>, ...)
    (3): Dropout(p=0.25)
    (4): Linear(in_features=<some number>, out_features=512, bias=True)
    (5): ReLU(inplace)
    (6): BatchNorm1d(512, ...)
    (7): Dropout(p=0.5)
    (8): Linear(in_features=512, out_features=<#tags>, bias=True)
  )

For details see the Fastai create_head() function, which replaces these final two layers with an AdaptiveConcatPool2d (see this link for details) followed by a Flatten layer, fed into 2 rounds of batch normalization, dropout, linear, and ReLU. The first linear layer maps the number of final output features from the AdaptiveConcatPool2d to double the number of output activations of the body, and the second linear layer maps these activations to 512 activations, then a final linear layer is added which maps the 512 activations down to the final number of classes — #tags in the table.

Reproducing, comparing, and source code

For ease of training, I only used .png and .jpg image files, and not gifs or other format. Since processing all the labels into a nice usable form is in itself a reasonable amount of compute for the full dataset, I make the tag labels that I used during training for the top-6000 tag resnet50 available as a zipped csv at the link below:

class_labels_6000.zip

With this together with the numpy seed used for the training and validation splits, is should be possible to get the exact same train/validation split. If you have any trouble getting it to work, please

  1. Check the notebooks in the GitHub repo – they are the ones I used to base my training/data preparation on.
  2. Contact me in the ‘About’ page.

The GitHub repo contains the class names for the various #tags. For example, the third output activation of the 6000-tag resnet 50 model corresponds to the score for the third tag in the class_names_6000.json file in the repo. The repo also has the source notebooks I used to train the networks and the full precision mean and standard deviation constants needed to normalize an input image.

Acknowledgements

  1. Thanks a ton for the organizers of the Danbooru2018 dataset! Their citation is:

    Anonymous, The Danbooru Community, Gwern Branwen, & Aaron Gokaslan; “Danbooru2018: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset”, 3 January 2019. Web. Accessed 2019-06-24.

  2. For more info on the ResNet architecture, see Deep Residual Learning for Image Recognition
  3. Fastai for their super useful library.

Citing

If you use the pretrained models and feel it was of value, please do give some form of shout-out, or if you prefer you may use the bibtex entry:

@misc{danbooru2018resnet,
    author = {Matthew Baas},
    title = {Danbooru2018 pretrained resnet models for PyTorch},
    howpublished = {\url{https://rf5.github.io}},
    url = {https://rf5.github.io/2019/07/08/danbuuro-pretrained.html},
    type = {pretrained model},
    year = {2019},
    month = {July},
    timestamp = {2019-07-08},
    note = {Accessed: DATE}
}
tags: deep learning - pytorch - danbooru - anime