Toronto COCO-QA DataSet and DataModules#

class configilm.extra.DataSets.COCOQA_DataSet.COCOQADataSet#
__init__(data_dirs, split=None, transform=None, max_len=None, img_size=(3, 120, 120), selected_answers=None, num_classes=430, tokenizer=None, seq_length=64, return_extras=False)#

This class implements the COCO-QA dataset. It is a subclass of ClassificationVQADataset and provides some dataset-specific functionality.

Parameters:
  • data_dirs (Mapping[str, Path]) – A mapping from file key to file path, where the key identifies the role of the file or directory. Each path can be either a string or a Path object. Required keys are “images”, “train_data” and “test_data”. The two “_data” keys each point to a directory that contains the data files, which are named “questions.txt”, “answers.txt”, “img_ids.txt” and “types.txt”.

  • split (Optional[str]) –

    The name of the split to use. Can be either “train” or “test”. If None is provided, all splits will be used.

    default:

    None

  • transform (Optional[Callable]) –

    A callable that is used to transform the images after loading them. If None is provided, no transformation is applied.

    default:

    None

  • max_len (Optional[int]) –

    The maximum number of question-answer pairs to use. If None or -1 is provided, all question-answer pairs are used.

    default:

    None

  • img_size (tuple) –

    The size of the images.

    default:

    (3, 120, 120)

  • selected_answers (Optional[list]) –

    A list of answers that should be used. If None is provided, the num_classes most common answers are used. If selected_answers is not None, num_classes is ignored.

    default:

    None

  • num_classes (Optional[int]) –

    The number of classes to use. Only used if selected_answers is None. If set to None, all answers are used.

    default:

    430

  • tokenizer (Optional[Callable]) –

    A callable that is used to tokenize the questions. If set to None, the default tokenizer (from configilm.util) is used.

    default:

    None

  • seq_length (int) –

    The maximum length of the tokenized questions.

    default:

    64

  • return_extras (bool) –

    If True, the dataset will return the type of the question in addition to the image, question and answer.

    default:

    False
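
The directory layout described for data_dirs above can be sketched as follows. This is an illustrative, self-contained example using a temporary directory; the actual paths on your system will differ.

```python
import tempfile
from pathlib import Path

# Build a directory structure matching the description above and construct
# the data_dirs mapping the dataset expects. All paths here are illustrative.
root = Path(tempfile.mkdtemp())
(root / "images").mkdir()
for split_dir in ("train_data", "test_data"):
    d = root / split_dir
    d.mkdir()
    # Each "_data" directory holds the four COCO-QA text files.
    for fname in ("questions.txt", "answers.txt", "img_ids.txt", "types.txt"):
        (d / fname).touch()

data_dirs = {
    "images": root / "images",
    "train_data": root / "train_data",
    "test_data": root / "test_data",
}
```

This mapping can then be passed as the data_dirs argument of COCOQADataSet.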

load_image(key)#

This method should load the image with the given name and return it as a tensor.

Parameters:

key (str) – The name of the image to load

Returns:

The image as a tensor

Return type:

Tensor

prepare_split(split)#

This method should return a list of tuples, where each tuple contains the following elements:

  • The key of the image at index 0

  • The question at index 1

  • The answer at index 2

  • additional information at index 3 and higher

Parameters:

split (str) – The name of the split to prepare

Returns:

A list of tuples, each tuple containing the elements described
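
Since the four text files in each split directory are parallel (line i of each file belongs to the same QA pair), the tuple format described above can be produced by zipping them. The following is a minimal sketch of that idea, not the actual implementation; in particular, how the image key on index 0 is formatted may differ from the raw img_ids.txt entries.

```python
import tempfile
from pathlib import Path

def prepare_split_sketch(split_dir: Path):
    # The four parallel text files line up by line index, so zipping them
    # yields (image key, question, answer, type) tuples.
    questions = (split_dir / "questions.txt").read_text().splitlines()
    answers = (split_dir / "answers.txt").read_text().splitlines()
    img_ids = (split_dir / "img_ids.txt").read_text().splitlines()
    types = (split_dir / "types.txt").read_text().splitlines()
    return list(zip(img_ids, questions, answers, types))

# Demo with a single hand-written QA pair:
d = Path(tempfile.mkdtemp())
(d / "questions.txt").write_text("what is on the table\n")
(d / "answers.txt").write_text("cat\n")
(d / "img_ids.txt").write_text("123456\n")
(d / "types.txt").write_text("0\n")
samples = prepare_split_sketch(d)
```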

split_names()#

Returns the names of the splits that are available for this dataset. The default implementation returns {“train”, “val”, “test”}. If you want to use different names, you should override this method.

Returns:

A set of strings, each string being the name of a split

Return type:

set[str]
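
For a dataset like COCO-QA, which only ships “train” and “test” splits, the override described above might look like the following. The class here is a hypothetical stand-in for a ClassificationVQADataset subclass, reduced to the one relevant method.

```python
class MyCOCOQAStyleDataset:
    # Hypothetical stand-in for a ClassificationVQADataset subclass;
    # only the split_names override is shown.
    def split_names(self):
        # COCO-QA only provides train and test data, so the default
        # {"train", "val", "test"} is narrowed here.
        return {"train", "test"}
```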

configilm.extra.DataSets.COCOQA_DataSet.resolve_data_dir(data_dir, allow_mock=False, force_mock=False)#

Helper function that tries to resolve the correct data directories.

Parameters:
  • data_dir (Optional[Mapping[str, Path]]) – The currently suggested data paths.

  • allow_mock (bool) –

    If True, mock data will be used if no real data is found.

    default:

    False

  • force_mock (bool) –

    If True, only mock data will be used.

    default:

    False

Returns:

A dict with all paths to the data.

Return type:

Mapping[str, Union[str, Path]]
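
The resolution behavior described above can be sketched as follows. This is an illustrative re-creation, not the real implementation: the function name, the explicit mock_dir parameter, and the error type are assumptions made for the example.

```python
from pathlib import Path

def resolve_data_dir_sketch(data_dir, mock_dir, allow_mock=False, force_mock=False):
    # Illustrative sketch of the resolution logic; mock_dir is a hypothetical
    # stand-in for the library's bundled mock data location.
    if force_mock:
        # Only mock data is allowed, regardless of what exists on disk.
        return mock_dir
    if data_dir is not None and all(Path(p).exists() for p in data_dir.values()):
        # All suggested paths exist: use the real data.
        return data_dir
    if allow_mock:
        # Fall back to mock data when no real data is found.
        return mock_dir
    raise FileNotFoundError("No real data found and mock data not allowed")
```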

class configilm.extra.DataModules.COCOQA_DataModule.COCOQADataModule#
__init__(data_dirs, batch_size=16, img_size=(3, 120, 120), num_workers_dataloader=4, shuffle=None, max_len=None, tokenizer=None, seq_length=64, pin_memory=None)#

This class implements the DataModule for the COCO-QA dataset.

Parameters:
  • data_dirs (Mapping[str, Path]) – A dictionary containing the paths to the data directories. Should contain the keys “images”, “train_data”, and “test_data”. The “images” directory should contain two subdirectories, “train2014” and “val2014”, with the images for the training and validation set, respectively. The “train_data” and “test_data” directories should each contain the four txt files “questions.txt”, “answers.txt”, “img_ids.txt”, and “types.txt”, following the COCO-QA dataset format.

  • batch_size (int) –

    The batch size to use for the dataloaders.

    default:

    16

  • img_size (tuple) –

    The size of the images.

    default:

    (3, 120, 120)

  • num_workers_dataloader (int) –

    The number of workers to use for the dataloaders.

    default:

    4

  • shuffle (Optional[bool]) –

    Whether to shuffle the data in the dataloaders. If None is provided, the data is shuffled for training and not shuffled for validation and test.

    default:

    None

  • max_len (Optional[int]) –

    The maximum number of question-answer pairs to use. If None or -1 is provided, all question-answer pairs are used.

    default:

    None

  • tokenizer (Optional[Callable]) –

    A callable that is used to tokenize the questions. If set to None, the default tokenizer (from configilm.util) is used.

    default:

    None

  • seq_length (int) –

    The maximum length of the tokenized questions. If the tokenized question is longer than this, it will be truncated. If it is shorter, it will be padded.

    default:

    64

  • pin_memory (Optional[bool]) –

    Whether to use pinned memory for the dataloaders. If None is provided, it is set to True if a GPU is available and False otherwise.

    default:

    None
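
The None-resolution semantics of shuffle and pin_memory described above can be sketched with two small helpers. This is an illustration of the documented behavior, not the actual code; gpu_available is a stand-in for a device check such as torch.cuda.is_available().

```python
def resolve_shuffle(shuffle, split):
    # If shuffle is None, shuffle only the training split; validation and
    # test data keep their order.
    if shuffle is None:
        return split == "train"
    return shuffle

def resolve_pin_memory(pin_memory, gpu_available):
    # If pin_memory is None, pin host memory only when a GPU is available.
    if pin_memory is None:
        return gpu_available
    return pin_memory
```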

prepare_data()#

Use this to download and prepare data. Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures this method is called only within a single process, so you can safely add your downloading logic within.

Warning

DO NOT set state to the model (use setup instead) since this is NOT called on every device

Example:

def prepare_data(self):
    # good
    download_data()
    tokenize()
    etc()

    # bad
    self.split = data_split
    self.some_state = some_other_state()

In a distributed environment, prepare_data can be called in two ways (using prepare_data_per_node)

  1. Once per node. This is the default and is only called on LOCAL_RANK=0.

  2. Once in total. Only called on GLOBAL_RANK=0.

Example:

# DEFAULT
# called once per node on LOCAL_RANK=0 of that node
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = True


# call on GLOBAL_RANK=0 (great for shared file systems)
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = False

This is called before requesting the dataloaders:

model.prepare_data()
initialize_distributed()
model.setup(stage)
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()
model.predict_dataloader()
setup(stage=None)#

Prepares the data sets for the specific stage.

  • “fit”: train and validation data set

  • “test”: test data set

  • None: all data sets

Parameters:

stage (Optional[str]) –

None, “fit”, or “test”

default:

None
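
The stage-to-splits mapping described above can be sketched as a small helper. This is purely illustrative of the documented behavior, not the real implementation.

```python
def splits_for_stage(stage):
    # Maps a Lightning setup() stage to the dataset splits that get prepared.
    if stage == "fit":
        return ["train", "val"]
    if stage == "test":
        return ["test"]
    if stage is None:
        return ["train", "val", "test"]
    raise ValueError(f"Unknown stage: {stage}")
```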

test_dataloader()#

Returns the dataloader for the test data.

Raises:

AssertionError if the test dataset is not set up. This can happen if the setup() method is not called before this method or the dataset has no test data.

train_dataloader()#

Returns the dataloader for the training data.

Raises:

AssertionError if the training dataset is not set up. This can happen if the setup() method is not called before this method or the dataset has no training data.

val_dataloader()#

Returns the dataloader for the validation data.

Raises:

AssertionError if the validation dataset is not set up. This can happen if the setup() method is not called before this method or the dataset has no validation data.