HRVQA#
This page describes the usage of the dataloader and datamodule for HRVQA, a high-resolution aerial VQA dataset based on open data provided by Dutch governmental organizations. It was first published by Li et al. [2]. The dataset as well as further details can be found on their website. A small example of the data is distributed with this package.
This module contains two classes: a standard torch.utils.data.Dataset and a pytorch_lightning.LightningDataModule that encapsulates the Dataset for easy use in pytorch_lightning applications. The Dataset uses a BENLMDBReader to read images from an LMDB file. Questions and answers are read from JSON files.
HRVQADataSet#
In its most basic form, the Dataset only needs the base path of the image and JSON files. The paths should follow the same structure as obtained by downloading and extracting the data from the official website, with the question and answer files next to each other. The official file naming is expected.
The full expected data path structure is:
datapath = {
"images": "/path/to/images",
"train_data": "/path/to/jsons",
"val_data": "/path/to/jsons",
"test_data": "/path/to/jsons"
}
Note that the keys have to match exactly, while the paths can be chosen freely.
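Such a dictionary can be assembled like this (a minimal sketch; the base directory is a placeholder for the real download location):

```python
from pathlib import Path

# Hypothetical base directory; replace with the real download location.
base = Path("/path/to/HRVQA")
my_data_path = {
    "images": str(base / "images"),
    "train_data": str(base / "jsons"),
    "val_data": str(base / "jsons"),
    "test_data": str(base / "jsons"),
}
```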
from configilm import util
util.MESSAGE_LEVEL = util.MessageLevel.INFO # use INFO to see all messages
from configilm.extra.DataSets import HRVQA_DataSet
ds = HRVQA_DataSet.HRVQADataSet(
data_dirs=my_data_path # path to dataset
)
img, question, answer = ds[4]
Size: torch.Size([3, 1024, 1024])
Question: what is the main transportation of this image?
Question (start): [101, 2054, 2003, 1996, 2364, 5193, 1997, 2023, 3746, 1029, 102, 0, 0, 0, 0]
Answer: car
Answer (start): tensor([0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.])
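The answer is returned one-hot encoded over the selected answer classes. With a hypothetical answer vocabulary (the real one is exposed as ds.answers), the label can be recovered from such a vector in plain Python:

```python
# Hypothetical answer vocabulary -- the real one is exposed as ds.answers.
answers = ["0.0m2", "bridge", "building", "bus", "farmland", "bike", "car"]

# One-hot answer vector as returned by the dataset (shown here as a list).
one_hot = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

# The class index is the position of the single 1.0 entry.
answer = answers[one_hot.index(1.0)]
print(answer)  # -> car
```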
Tokenizer and Tokenization#
As we can see, this Dataset uses a tokenizer to generate the question tensor from natural-language text. If no tokenizer is provided, a default one is used; however, this may lead to poor performance if not accounted for. The tokenizer can be configured via an input parameter.
from configilm.ConfigILM import _get_hf_model
tokenizer, _ = _get_hf_model("prajjwal1/bert-tiny")
ds = HRVQA_DataSet.HRVQADataSet(
data_dirs=my_data_path, # path to dataset
tokenizer=tokenizer
)
img, question, answer = ds[0]
Tip
Usually this tokenizer is provided by the model itself during dataset creation, as shown in the VQA example.
During tokenization, a sequence of tokens (integers) of a specific length is generated. The length of this sequence can be set with the parameter seq_length. If the generated tokens are shorter than the sequence length, the sequence is padded with zeros; if it is longer, it is truncated.
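The length handling can be sketched in plain Python. This is an illustration only, not the library's implementation; in particular, a real tokenizer keeps the end-of-sequence token as the last non-padded token even when truncating:

```python
def fit_to_length(token_ids, seq_length, pad_id=0):
    """Pad token_ids with pad_id up to seq_length, or truncate to seq_length."""
    if len(token_ids) < seq_length:
        return token_ids + [pad_id] * (seq_length - len(token_ids))
    return token_ids[:seq_length]

tokens = [101, 2515, 2311, 4839, 1999, 2023, 3746, 1029, 102]
padded = fit_to_length(tokens, 16)    # 9 tokens -> padded with 7 zeros
truncated = fit_to_length(tokens, 8)  # 9 tokens -> cut down to 8
print(len(padded), len(truncated))    # 16 8
```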
Note
Most tokenizers use an ‘End of Sequence’ token that is always the last token of the non-padded sequence.
ds = HRVQA_DataSet.HRVQADataSet(
data_dirs=my_data_path, # path to dataset
tokenizer=tokenizer,
seq_length=16
)
_, question1, _ = ds[0]
print(question1)
[101, 2515, 2311, 4839, 1999, 2023, 3746, 1029, 102, 0, 0, 0, 0, 0, 0, 0]
ds = HRVQA_DataSet.HRVQADataSet(
data_dirs=my_data_path, # path to dataset
tokenizer=tokenizer,
seq_length=8
)
_, question2, _ = ds[0]
print(question2)
[101, 2515, 2311, 4839, 1999, 2023, 3746, 102]
The tokenizer can also be used to reconstruct the input question from the token IDs, including the special tokens:
print(f"Question 1: '{tokenizer.decode(question1)}'")
print(f"Question 2: '{tokenizer.decode(question2)}'")
Question 1: '[CLS] does building exist in this image? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
Question 2: '[CLS] does building exist in this image [SEP]'
or without:
print(f"Question 1: '{tokenizer.decode(question1, skip_special_tokens=True)}'")
print(f"Question 2: '{tokenizer.decode(question2, skip_special_tokens=True)}'")
Question 1: 'does building exist in this image?'
Question 2: 'does building exist in this image'
Splits#
It is possible to load only a specific split ('train' or 'val') of the dataset. Images are loaded on demand based on the requested question. By default (None), both splits are loaded into the same Dataset.
_ = HRVQA_DataSet.HRVQADataSet(
data_dirs=my_data_path, # path to dataset
split="val",
tokenizer=tokenizer
)
Because the dataset does not define a 'test' split, an option to subdivide the 'val' split was added. This option divides the split at random, based on a provided seed and ratio. Using 'test' means that the same data as 'val' is used.
To actually split the dataset, use 'val-div' and 'test-div'. Given the same div_seed and split_size (the rounded-down proportion of the samples assigned to the 'val' part), these two options fully divide the samples without information leakage.
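The division behaviour can be illustrated with a seeded shuffle in plain Python (an illustration of the idea, not the library's exact code):

```python
import random

def divide(samples, div_seed, split_size):
    """Deterministically split samples into a 'val' part and a 'test' part."""
    shuffled = list(samples)
    random.Random(div_seed).shuffle(shuffled)
    n_val = int(len(shuffled) * split_size)  # rounded-down proportion for 'val'
    return shuffled[:n_val], shuffled[n_val:]

samples = list(range(5))
val_part, test_part = divide(samples, div_seed=42, split_size=0.66)
print(len(val_part), len(test_part))   # 3 2
print(set(val_part) & set(test_part))  # set() -> no leakage
```

With the same seed and ratio, the two parts are disjoint and their union is the full set, which matches the intersection/union output shown below.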
ds_v = HRVQA_DataSet.HRVQADataSet(
data_dirs=my_data_path, # path to dataset
split="val",
tokenizer=tokenizer,
div_seed=42,
split_size=0.66
)
print(f">> val split length {len(ds_v)}")
ds_vd = HRVQA_DataSet.HRVQADataSet(
data_dirs=my_data_path, # path to dataset
split="val-div",
tokenizer=tokenizer,
div_seed=42,
split_size=0.66
)
qv = set(ds_vd.qa_data)
print(f">> val-div split length {len(ds_vd)}")
ds_td = HRVQA_DataSet.HRVQADataSet(
data_dirs=my_data_path, # path to dataset
split="test-div",
tokenizer=tokenizer,
div_seed=42,
split_size=0.66
)
qt = set(ds_td.qa_data)
print(f">> test-div split length {len(ds_td)}")
print(f"\n>> intersection length {len(qv.intersection(qt))}")
print(f"\n>> union length {len(qv.union(qt))}")
5 QA-pairs indexed
5 QA-pairs used
>> val split length 5
3 QA-pairs indexed
3 QA-pairs used
>> val-div split length 3
2 QA-pairs indexed
2 QA-pairs used
>> test-div split length 2
>> intersection length 0
>> union length 5
Restricting the number of loaded images#
It is also possible to restrict the number of questions indexed. By setting max_len = n, only the first n QA-pairs will be used. A max_len of None, -1, or larger than the number of available QA-pairs (here 10) results in load-all behaviour.
_ = HRVQA_DataSet.HRVQADataSet(
data_dirs=my_data_path, # path to dataset
max_len=5,
tokenizer=tokenizer
)
10 QA-pairs indexed
5 QA-pairs used
Select Number of Classes or specific Answers#
For some applications, it is relevant to have only a certain number of classes as valid output. To prevent a dimension explosion, the number of classes can be limited. For the ‘train’ split, the combination of classes that retains the largest share of the dataset is then determined automatically.
train_ds = HRVQA_DataSet.HRVQADataSet(
data_dirs=my_data_path, # path to dataset
split="train",
tokenizer=tokenizer,
num_classes=3
)
5 QA-pairs indexed
5 QA-pairs used
These selected answers can be re-used in other splits, or further limited if only a subset is required.
Note
The number of classes does not necessarily match the number of answers. If there are fewer answers than classes, the last classes will never be encoded in the one-hot answer vector. If there are more, an IndexError is raised when accessing an element whose answer cannot be encoded.
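The note above can be illustrated with a toy one-hot encoder (plain Python with an invented answer list, not the library's code):

```python
answers = ["0.0m2", "highdensity", "topleft", "yes"]  # toy answer list

def encode(answer, num_classes):
    """One-hot encode an answer into a vector of length num_classes."""
    vec = [0.0] * num_classes
    vec[answers.index(answer)] = 1.0  # IndexError if the answer index >= num_classes
    return vec

print(encode("highdensity", num_classes=3))  # [0.0, 1.0, 0.0]
try:
    encode("yes", num_classes=3)  # "yes" has index 3 -> outside the 3 classes
except IndexError:
    print("IndexError for an answer outside the selected classes")
```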
print(f"Train DS: {train_ds.answers}")
ds = HRVQA_DataSet.HRVQADataSet(
data_dirs=my_data_path, # path to dataset
split="val",
tokenizer=tokenizer,
selected_answers=train_ds.answers
)
print(f"Val DS 1: {ds.answers}")
ds = HRVQA_DataSet.HRVQADataSet(
data_dirs=my_data_path, # path to dataset
split="val",
tokenizer=tokenizer,
selected_answers=train_ds.answers[:2],
)
print(f"Val DS 2: {ds.answers}")
Train DS: ['0.0m2', 'highdensity', 'topleft']
Val DS 1: ['0.0m2', 'highdensity']
Val DS 2: ['0.0m2', 'highdensity']
HRVQADataModule#
This class is a PyTorch Lightning DataModule that wraps the HRVQADataSet. It automatically generates a DataLoader per split, with augmentation, shuffling, etc. depending on the split. All images are resized and normalized; images in the train set are additionally augmented with noise and flipping/rotation. The train split is also shuffled, but this can be overridden (see below).
To use a DataModule, the setup() function has to be called. This populates the Dataset splits inside the DataModule. Note that this dataset does not provide a 'test' split with answers, so no test split can be used here.
from configilm.extra.DataModules import HRVQA_DataModule
dm = HRVQA_DataModule.HRVQADataModule(
data_dirs=my_data_path # path to dataset
)
print("Before:")
print(dm.train_ds)
print(dm.val_ds)
print(dm.test_ds)
Before:
None
None
None
dm.setup(stage="fit")
print("After:")
print(dm.train_ds)
print(dm.val_ds)
print(dm.test_ds)
After:
<configilm.extra.DataSets.HRVQA_DataSet.HRVQADataSet object at 0x7f454aed7880>
<configilm.extra.DataSets.HRVQA_DataSet.HRVQADataSet object at 0x7f454aeafdf0>
None
Afterwards, the pytorch DataLoader objects can be accessed easily. Note that \(len(DL) = \lceil \frac{len(DS)}{batch\_size} \rceil\); here, with 5 QA-pairs and the default batch_size of 16, this gives \(\lceil 5/16 \rceil = 1\).
train_loader = dm.train_dataloader()
print(len(train_loader))
1
In addition to the DataLoader settings, the DataModule has the parameters data_dirs, image_size and max_len, which are passed through to the Dataset.
DataLoader settings#
The DataLoaders have four settable parameters: batch_size, num_workers_dataloader, shuffle and pin_memory, with defaults of 16, os.cpu_count() / 2, None and None respectively.
A shuffle of None means that the train set is shuffled while the validation and test sets are not.
Pinned memory is enabled if a CUDA device is found and disabled otherwise; this behaviour can be overridden with pin_memory. Changing some of these settings results in a printed message hint.
dm = HRVQA_DataModule.HRVQADataModule(
data_dirs=my_data_path, # path to dataset
batch_size=4,
tokenizer=tokenizer
)
dm.setup(stage="fit")
print(len(dm.train_dataloader()))
2
_ = HRVQA_DataModule.HRVQADataModule(
data_dirs=my_data_path, # path to dataset
shuffle=False
)
/home/runner/work/ConfigILM/ConfigILM/configilm/extra/DataModules/ClassificationVQADataModule.py:109: UserWarning: Shuffle was set to False. This is not recommended for most configuration. Use shuffle=None (default) for recommended configuration.
warn(
_ = HRVQA_DataModule.HRVQADataModule(
data_dirs=my_data_path, # path to dataset
num_workers_dataloader=2
)
_ = HRVQA_DataModule.HRVQADataModule(
data_dirs=my_data_path, # path to dataset
pin_memory=False
)