Using internal tools for datasets

In this tutorial, we will show you how to use the internal tools for datasets in DetoxAI. Note, that you don’t actually have to ever use them if you are using the library as a regular user. The detoxai.debias interface does not require you to use internal tools for datasets. You can just pass your regular torch dataloader to the detoxai.debias function and it will work just fine, as long as it returns batches of (image, label, protected_attribute) tuples.

However, if you are a developer who wants to do experiments or some other crazy stuff within the library, you might want to use the internal tools for datasets. They come in handy for experiments and for downloading datasets on the target machines (e.g., they were incredible useful for us in our ClearML-based experiments and running experiments on multiple machines).

Download supported datasets

Run download.py. It will discover all directories in datasets/catalog, download them and properly structure them, given that they have a links.yaml file and a handler.py file.

At the moment, DetoxAI is shipped with the following datasets supported:

CelebA
FairFace
Cifar10
Cifar100
Caltech101

Add a new dataset

Create a folder under datasets/catalog/<dataset_name>

2. In this folder add links.yml. In there, you want to put all the download links that let you fetch the desired dataset. The general structure of this file is as follows:

link1:
   url: [https://drive.google.com/uc?id=](https://drive.google.com/uc?id=)<file_id>
   output: data.zip
   type: google_drive
link2:
   url: [https://drive.google.com/uc?id=](https://drive.google.com/uc?id=)<file_id>
   output: l_train.csv
   type: google_drive

An exception is made for datasets from torchvision where you just put torchvision in links.yaml as it has a pretty standard interface, so it is easy to handle it.

Here are a few examples.

For CelebA:

link1:
url: https://www.kaggle.com/api/v1/datasets/download/jessicali9530/celeba-dataset
output: kaggle.zip
type: curl

For Cifar10:

torchvision

For FairFace:

link1:
url: https://drive.google.com/uc?id=1Z1RqRo0_JiavaZw2yzZG6WETdZQ8qX86
output: data.zip
type: google_drive
link2:
url: https://drive.google.com/uc?id=1i1L3Yqwaio7YSOCj7ftgk8ZZchPG7dmH
output: l_train.csv
type: google_drive
link3:
url: https://drive.google.com/uc?id=1wOdja-ezstMEp81tX1a-EYkFebev4h7D
output: l_val.csv
type: google_drive

3. You also need to implement a handler.py script, which will handle all the downloaded raw files and transform them into the format used by our system. The target format is as follows:

<dataset_name>/
   -> label_names.yaml
   -> labels.csv
   -> data/
      -> 0.jpg
      -> 1.jpg
      -> ...

label_names.yaml

attribute1:
   0: label1
   1: label2
   2: label3
   <...>
attribute2:
   0: label1
   1: label2
   <...>

labels.csv

image_id, attribute1, attribute2, <...>
0.jpg, 0, 1, <...>
1.jpg, 1, 0, <...>
<...>

There might be various ways to implement the handler.py script, and it depends on the dataset you are using. We highly recommend to check out the handler.py scripts for the datasets we already support.

In case you don’t want to browse the repo for yourself, here is one of them for CelebA, we have the following implementation:

import os
import shutil
import zipfile
import pandas as pd
import yaml

home = os.environ.get("DETOXAI_DATASET_PATH", os.path.expanduser("~"))
directory = os.path.join(home, "celeba")

tmp_directory = os.path.join(directory, "tmp")
data_path = os.path.join(tmp_directory, "kaggle.zip")


# Extract data to directory
with zipfile.ZipFile(data_path, "r") as zip_ref:
    zip_ref.extractall(tmp_directory)
    print("Done")

# Read csvs
df = pd.read_csv(os.path.join(tmp_directory, "list_attr_celeba.csv"))

# Transform all -1 to 0
df = df.replace(-1, 0)

# Create mapping for each attribute 1 - present, 0 - not present
mapping = {column: {1: "present", 0: "not present"} for column in df.columns[1:]}


mapping_path = os.path.join(directory, "labels_mapping.yaml")
with open(mapping_path, "w") as f:
    yaml.dump(mapping, f)

# Save concatenated csv
df.to_csv(os.path.join(directory, "labels.csv"), index=False)


# Move data from tmp to directory under data/
# it has now valid and train subdirectories but we don't need them
# so we will move the files to the data directory

data_directory = os.path.join(directory, "data")
os.makedirs(data_directory, exist_ok=True)

# Move files
img_dir = os.path.join(tmp_directory, "img_align_celeba")
for subdir, dirs, files in os.walk(img_dir):
    for file in files:
        source, extension = os.path.splitext(file)
        target = os.path.join(data_directory, f"{source.zfill(6)}{extension}")
        shutil.move(os.path.join(subdir, file), os.path.join(data_directory, target))

# Remove tmp directory
shutil.rmtree(tmp_directory)