RSVQA high resolution#

This page describes the usage of the Dataset and DataModule for the high resolution version of RSVQA, a VQA dataset based on high resolution aerial RGB images over the USA. It was first published by Lobry et al. [4]. The dataset is available on Zenodo. A small example of the data used is distributed with this package.

This module contains two classes, a standard torch.utils.data.Dataset and a pytorch_lightning.LightningDataModule that encapsulates the Dataset for easy use in pytorch_lightning applications. Questions and answers are read from JSON files.

RSVQAHRDataSet#

In its most basic form, the Dataset only needs the base paths to the image and JSON files. The directory layout should match the one obtained by downloading the archives from the official Zenodo page and extracting them, with images, questions and answers next to each other. The official file names are expected.

The full data path structure expected is

datapath = {
    "images": "/path/to/Images/Data",
    "train_data": "/path/to/jsons",
    "val_data": "/path/to/jsons",
    "test_data": "/path/to/jsons",
    "test_phili_data": "/path/to/jsons"
}

Note that the keys have to match exactly, while the paths can be chosen freely.
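A small check before constructing the Dataset can catch a misspelled key early. The following is only a sketch and not part of ConfigILM: `check_data_dirs` and `REQUIRED_KEYS` are hypothetical names, and the key list is simply copied from the structure shown above.

```python
from typing import Mapping

# Keys copied from the path structure above; the helper itself is hypothetical.
REQUIRED_KEYS = ("images", "train_data", "val_data", "test_data", "test_phili_data")


def check_data_dirs(data_dirs: Mapping[str, str]) -> None:
    """Raise a KeyError that names every missing or unexpected key."""
    missing = [key for key in REQUIRED_KEYS if key not in data_dirs]
    unexpected = [key for key in data_dirs if key not in REQUIRED_KEYS]
    if missing or unexpected:
        raise KeyError(f"missing keys: {missing}, unexpected keys: {unexpected}")


example_dirs = {
    "images": "/path/to/Images/Data",
    "train_data": "/path/to/jsons",
    "val_data": "/path/to/jsons",
    "test_data": "/path/to/jsons",
    "test_phili_data": "/path/to/jsons",
}
check_data_dirs(example_dirs)  # returns None, no exception raised
```

Rejecting unexpected keys is stricter than strictly necessary, but it makes typos such as "test_philli_data" visible immediately.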

from configilm import util
util.MESSAGE_LEVEL = util.MessageLevel.INFO  # use INFO to see all messages

from configilm.extra.DataSets import RSVQAHR_DataSet
 
ds = RSVQAHR_DataSet.RSVQAHRDataSet(
    data_dirs=my_data_path  # path to dataset
)

img, question, answer = ds[4]
img = img[:3] # only choose RGB channels
Size: torch.Size([3, 256, 256])
Question: are there less buildings than large parkings?
Question (start): [101, 2024, 2045, 2625, 3121, 2084, 2312, 5581, 2015, 1029, 102, 0, 0, 0, 0]
Answer: no
Answer (start): tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Tokenizer and Tokenization#

As we can see, this Dataset uses a tokenizer to generate the Question from natural language text. If no tokenizer is provided, a default one is used; however, this may lead to bad performance if it is not accounted for. The tokenizer can be passed as an input parameter.

from configilm.ConfigILM import _get_hf_model

tokenizer, _ = _get_hf_model("prajjwal1/bert-tiny")

ds = RSVQAHR_DataSet.RSVQAHRDataSet(
    data_dirs=my_data_path,  # path to dataset
    tokenizer=tokenizer
)
img, question, answer = ds[0]

Tip

Usually this tokenizer is provided by the model itself as shown in the VQA example during dataset creation.

During tokenization, a sequence of tokens (integers) of a specific length is generated. The length of this sequence can be set with the parameter seq_length. If the generated sequence is shorter than the sequence length, it is padded with zeros. If it is longer, it is truncated.

Note

Most tokenizers use an ‘End of Sequence’ token that is always the last token of the non-padded sequence.
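The padding and truncation behaviour can be illustrated without a real tokenizer. The sketch below is a simplified stand-in, not the tokenizer's actual implementation: `fit_to_length` is a hypothetical helper, and the ids 101, 102 and 0 for [CLS], [SEP] and padding are the BERT-style values visible in the outputs on this page, so other tokenizers may differ.

```python
from typing import List

# BERT-style special token ids as they appear in the outputs on this page
# (assumption: other tokenizers may use different ids).
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def fit_to_length(token_ids: List[int], seq_length: int) -> List[int]:
    """Wrap token_ids in [CLS] ... [SEP], then truncate or zero-pad to seq_length."""
    if seq_length < 2:
        raise ValueError("seq_length must leave room for [CLS] and [SEP]")
    body = token_ids[: seq_length - 2]  # keep room for the two special tokens
    sequence = [CLS_ID] + body + [SEP_ID]
    return sequence + [PAD_ID] * (seq_length - len(sequence))


# Question ids without special tokens, taken from the first example output.
question_ids = [2024, 2045, 2625, 3121, 2084, 2312, 5581, 2015, 1029]

padded = fit_to_length(question_ids, 15)    # short question: zeros are appended
truncated = fit_to_length(question_ids, 8)  # long question: cut, [SEP] stays last
```

In both cases the ‘End of Sequence’ token is the last non-padding entry, which matches the truncated outputs shown in the following cells.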

ds = RSVQAHR_DataSet.RSVQAHRDataSet(
    data_dirs=my_data_path,  # path to dataset
    tokenizer=tokenizer,
    seq_length=16
)
_, question1, _ = ds[0]
print(question1)
[101, 2024, 2045, 2625, 3121, 2012, 1996, 2327, 1997, 1996, 7027, 2181, 2084, 3121, 1999, 102]
ds = RSVQAHR_DataSet.RSVQAHRDataSet(
    data_dirs=my_data_path,  # path to dataset
    tokenizer=tokenizer,
    seq_length=8
)
_, question2, _ = ds[0]
print(question2)
[101, 2024, 2045, 2625, 3121, 2012, 1996, 102]

The tokenizer can also be used to reconstruct the input/question from the IDs including the special tokens:

print(f"Question 1: '{tokenizer.decode(question1)}'")
print(f"Question 2: '{tokenizer.decode(question2)}'")
Question 1: '[CLS] are there less buildings at the top of the retail area than buildings in [SEP]'
Question 2: '[CLS] are there less buildings at the [SEP]'

or without:

print(f"Question 1: '{tokenizer.decode(question1, skip_special_tokens=True)}'")
print(f"Question 2: '{tokenizer.decode(question2, skip_special_tokens=True)}'")
Question 1: 'are there less buildings at the top of the retail area than buildings in'
Question 2: 'are there less buildings at the'

Selecting Bands#

Unlike the BigEarthNet v1.0 DataSet, which supports several preconfigured band combinations, this DataSet only supports RGB images. Which bands are used is defined by the number of channels set in the Dataset. Requesting an unsupported number of channels raises an AssertionError, as the following faulty configuration shows.

try:
    ds = RSVQAHR_DataSet.RSVQAHRDataSet(
        data_dirs=my_data_path,  # path to dataset
        img_size=(-1, 120, 120)
    )
except AssertionError as a:
    print(a)
RSVQA-HR only supports RGB images.

Splits#

It is possible to load only a specific split ('train', 'val', 'test' or 'test_phili') in the dataset. The images loaded are specified by the JSON files in the given path. By default (None), all four splits are loaded into the same Dataset.

_ = RSVQAHR_DataSet.RSVQAHRDataSet(
    data_dirs=my_data_path,  # path to dataset
    split="test",
    tokenizer=tokenizer
)

Restricting the number of loaded images#

It is also possible to restrict the number of QA-pairs used. By setting max_len = n, only the first n QA-pairs are used. A max_len of None, -1 or larger than the number of indexed QA-pairs (in this case 1,201) results in load-all behaviour.

_ = RSVQAHR_DataSet.RSVQAHRDataSet(
    data_dirs=my_data_path,  # path to dataset
    max_len=10,
    tokenizer=tokenizer
)
       1,201 QA-pairs indexed
          10 QA-pairs used
_ = RSVQAHR_DataSet.RSVQAHRDataSet(
    data_dirs=my_data_path,  # path to dataset
    max_len=100,
    tokenizer=tokenizer
)
       1,201 QA-pairs indexed
         100 QA-pairs used
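The effect of the `max_len` argument used in the cells above can be mimicked on a plain list. This is a sketch of the described behaviour only, under the assumption that the first entries are kept in their indexed order; `limit_pairs` is a hypothetical helper, not a ConfigILM function.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def limit_pairs(pairs: Sequence[T], max_len: Optional[int]) -> List[T]:
    """Keep the first max_len items; None, -1 or a too-large value keeps all."""
    if max_len is None or max_len == -1 or max_len >= len(pairs):
        return list(pairs)
    if max_len < 0:
        # Other negative values are not described on this page, so reject them.
        raise ValueError("max_len must be None, -1 or non-negative")
    return list(pairs[:max_len])


qa_pairs = list(range(1201))  # stand-in for the 1,201 indexed QA-pairs

used_10 = limit_pairs(qa_pairs, 10)
used_100 = limit_pairs(qa_pairs, 100)
used_all = limit_pairs(qa_pairs, None)
```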

Select Number of Classes or specific Answers#

For some applications, it is relevant to have only a certain number of classes as valid output. To prevent a dimension explosion if there are too many possible classes, the number of classes can be limited. For the ‘train’ split, it is then automatically determined which combination of classes results in the highest reduction of the dataset.

train_ds = RSVQAHR_DataSet.RSVQAHRDataSet(
    data_dirs=my_data_path,  # path to dataset
    split="train",
    tokenizer=tokenizer,
    num_classes=3
)
         400 QA-pairs indexed
         400 QA-pairs used

These selected answers can be re-used in other splits or limited if only a subset is required.

Note

The number of classes does not necessarily match the number of answers. If there are fewer answers than classes, the last classes will never be encoded in the one-hot encoded answer vector. If there are more, an IndexError is raised when an element that cannot be encoded is accessed.
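The two cases in this note can be reproduced with a minimal one-hot encoder. The sketch is not the library's implementation: `encode_answer` is a hypothetical function, and mapping answers outside the selected list to an all-zero vector is an assumption made for illustration.

```python
from typing import List


def encode_answer(answer: str, selected_answers: List[str], num_classes: int) -> List[float]:
    """One-hot encode answer by its position in selected_answers."""
    vector = [0.0] * num_classes
    if answer in selected_answers:  # assumption: unknown answers stay all-zero
        # Raises IndexError when the position does not fit into num_classes.
        vector[selected_answers.index(answer)] = 1.0
    return vector


# Fewer answers (2) than classes (3): the last class is never set.
fewer = encode_answer("8", ["6", "8"], num_classes=3)

# More answers (3) than classes (2): the third answer cannot be encoded.
try:
    encode_answer("between 11m2 and 100m2", ["6", "8", "between 11m2 and 100m2"], num_classes=2)
    overflow_raised = False
except IndexError:
    overflow_raised = True
```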

print(f"Train DS: {train_ds.answers}")

ds = RSVQAHR_DataSet.RSVQAHRDataSet(
    data_dirs=my_data_path,  # path to dataset
    split="val",
    tokenizer=tokenizer,
    selected_answers=train_ds.answers
)
print(f"Val DS 1: {ds.answers}")

ds = RSVQAHR_DataSet.RSVQAHRDataSet(
    data_dirs=my_data_path,  # path to dataset
    split="val",
    tokenizer=tokenizer,
    selected_answers=train_ds.answers[:2],
)
print(f"Val DS 2: {ds.answers}")
Train DS: ['6', '8', 'between 11m2 and 100m2']
Val DS 1: ['6', '8']
Val DS 2: ['6', '8']

RSVQAHRDataModule#

This class is a Lightning DataModule that wraps the RSVQAHRDataSet. It automatically generates a DataLoader per split with augmentations, shuffling, etc., depending on the split. All images are resized and normalized; images in the train set are additionally augmented with noise and flipping/rotation. The train split is also shuffled by default; this can be overridden (see below). To use a DataModule, the setup() function has to be called. This populates the Dataset splits inside the DataModule. Depending on the stage (‘fit’, ‘test’ or None), setup prepares only the train and validation Datasets, only the test Dataset, or all three.

from configilm.extra.DataModules import RSVQAHR_DataModule

dm = RSVQAHR_DataModule.RSVQAHRDataModule(
    data_dirs=my_data_path  # path to dataset
)
print("Before:")
print(dm.train_ds)
print(dm.val_ds)
print(dm.test_ds)
Before:
None
None
None
dm.setup(stage="fit")
print("After:")
print(dm.train_ds)
print(dm.val_ds)
print(dm.test_ds)
After:
<configilm.extra.DataSets.RSVQAHR_DataSet.RSVQAHRDataSet object at 0x7f56a1e25a50>
<configilm.extra.DataSets.RSVQAHR_DataSet.RSVQAHRDataSet object at 0x7f56a335c430>
None

Afterwards, the pytorch DataLoader can be easily accessed. Note that \(len(DL) = \lceil \frac{len(DS)}{batch\_size} \rceil\); therefore, with the 400 train QA-pairs and the default batch_size of 16: 400/16 -> 25.

train_loader = dm.train_dataloader()
print(len(train_loader))
25
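The number of batches can be checked by hand with the ceiling formula. The sketch assumes that the last, possibly incomplete batch is kept (no dropping of the remainder) and uses the 400 train QA-pairs reported in the class-selection example; `num_batches` is a hypothetical helper.

```python
import math


def num_batches(num_samples: int, batch_size: int) -> int:
    """Number of batches when an incomplete last batch is kept."""
    return math.ceil(num_samples / batch_size)


batches_default = num_batches(400, 16)  # 400 / 16 = 25 exactly
batches_small = num_batches(400, 4)     # 400 / 4 = 100 exactly
batches_uneven = num_batches(400, 32)   # 12.5 is rounded up to 13
```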

In addition to the DataLoader settings, the DataModule has one parameter each for data_dirs, img_size and max_len, which are passed through to the DataSet.

DataLoader settings#

The DataLoaders have four settable parameters: batch_size, num_workers_dataloader, shuffle and pin_memory, with 16, os.cpu_count() / 2, None and None as their default values.

A shuffle of None means that the train set is shuffled but validation and test are not. Pinned memory is enabled if a CUDA device is found; otherwise it is off. However, this behaviour can be overridden with pin_memory. Changing some of these settings is accompanied by a printed message hint.
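How the None defaults are resolved can be written down in a few lines. This is only a sketch of the rules described above, not the DataModule's code: `resolve_shuffle` and `resolve_pin_memory` are hypothetical helpers, and CUDA availability is passed in as a plain boolean so the example runs without torch.

```python
from typing import Optional


def resolve_shuffle(shuffle: Optional[bool], split: str) -> bool:
    """None shuffles only the train split; an explicit value applies as given."""
    if shuffle is None:
        return split == "train"
    return shuffle


def resolve_pin_memory(pin_memory: Optional[bool], cuda_available: bool) -> bool:
    """None follows CUDA availability; an explicit value overrides it."""
    if pin_memory is None:
        return cuda_available
    return pin_memory


train_shuffled = resolve_shuffle(None, "train")  # default: train is shuffled
val_shuffled = resolve_shuffle(None, "val")      # default: validation is not
pinned_on_cpu = resolve_pin_memory(None, cuda_available=False)
```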

dm = RSVQAHR_DataModule.RSVQAHRDataModule(
    data_dirs=my_data_path,  # path to dataset
    batch_size=4,
    tokenizer=tokenizer
)
dm.setup(stage="fit")
print(len(dm.train_dataloader()))
100
_ = RSVQAHR_DataModule.RSVQAHRDataModule(
    data_dirs=my_data_path,  # path to dataset
    shuffle=False
)
/home/runner/work/ConfigILM/ConfigILM/configilm/extra/DataModules/ClassificationVQADataModule.py:109: UserWarning: Shuffle was set to False. This is not recommended for most configuration. Use shuffle=None (default) for recommended configuration.
  warn(
_ = RSVQAHR_DataModule.RSVQAHRDataModule(
    data_dirs=my_data_path,  # path to dataset
    num_workers_dataloader=2
)
_ = RSVQAHR_DataModule.RSVQAHRDataModule(
    data_dirs=my_data_path,  # path to dataset
    pin_memory=False
)

Different Test split#

As the original DataSet contains different test splits (Test Set 1 is from the same data source as training and validation, Test Set 2 is from Philadelphia and a different distribution), a switch was added to the DataModule to decide which one to use at test time. By default, Test Set 1 ("test" in the DataSet) is used.

dm = RSVQAHR_DataModule.RSVQAHRDataModule(
    data_dirs=my_data_path,  # path to dataset
    batch_size=4,
    tokenizer=tokenizer,
    use_phili_test=False
)
dm.setup(stage="test")
         300 QA-pairs indexed
         300 QA-pairs used
  Total test samples:      300
dm = RSVQAHR_DataModule.RSVQAHRDataModule(
    data_dirs=my_data_path,  # path to dataset
    batch_size=4,
    tokenizer=tokenizer,
    use_phili_test=True
)
dm.setup(stage="test")
         201 QA-pairs indexed
         201 QA-pairs used
  Total test samples:      201
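The interplay of the setup stage and the test switch can be summarised as a small lookup. The sketch below only restates the behaviour described on this page; `splits_for_stage` is a hypothetical function, and the split names are the ones accepted by the DataSet.

```python
from typing import List, Optional


def splits_for_stage(stage: Optional[str], use_phili_test: bool = False) -> List[str]:
    """Return the DataSet splits a setup call would prepare for a stage."""
    test_split = "test_phili" if use_phili_test else "test"
    if stage == "fit":
        return ["train", "val"]
    if stage == "test":
        return [test_split]
    if stage is None:
        return ["train", "val", test_split]
    # Other stage names are not covered on this page.
    raise ValueError(f"unsupported stage: {stage!r}")


fit_splits = splits_for_stage("fit")
phili_splits = splits_for_stage("test", use_phili_test=True)
all_splits = splits_for_stage(None)
```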