BigEarthNet v1.0#

This page describes the usage of the Dataset and DataModule for BigEarthNet v1.0, a multi-spectral, multi-label remote sensing land-use/land-cover classification dataset.

The official paper of the BigEarthNet v1.0 (BigEarthNet-S2) dataset was initially published in Sumbul et al. [5]; the dataset was later updated to the multi-modal BigEarthNet v1.0 in Sumbul et al. [6].

For detailed information on the dataset itself, please refer to the publications and the BigEarthNet Guide.

The dataset functionality is divided into two modules, which contain one class each: a standard torch.utils.data.Dataset and a pytorch_lightning.LightningDataModule that encapsulates the Dataset for easy use in pytorch_lightning applications. The Dataset uses a BENLMDBReader to read images and labels from an LMDB file. Labels are returned in their 19-label version as a one-hot vector.

BENv1DataSet#

In its most basic form, the Dataset only needs the base path of the LMDB file and the csv files, if that path is not “./”. The LMDB file name is assumed to be BigEarthNetEncoded.lmdb (note that, from an OS point of view, LMDB files are folders). This Dataset will load 12 channels (10m + 20m Sentinel-2 + 10m Sentinel-1).

The full folder structure expected is

.
├── BigEarthNetEncoded.lmdb
│   ├── data.mdb
│   └── lock.mdb
├── test.csv
├── train.csv
└── val.csv
from configilm.extra.DataSets import BENv1_DataSet
from configilm.extra.DataModules import BENv1_DataModule

ds = BENv1_DataSet.BENv1DataSet(
    data_dirs=my_data_path  # path to the dataset
)

img, lbl = ds[26]
img = img[:3]  # only choose RGB channels
print("Size:", img.shape)
print("Labels:")
print(lbl)
Size: torch.Size([3, 120, 120])
Labels:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1.])
(Figure: RGB visualization of the loaded sample)
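
Since labels are returned as a 19-dimensional one-hot vector, the indices of the active classes can be read directly off the tensor. A minimal sketch (mapping an index to its class name follows the dataset's 19-label nomenclature and is not shown here):

active_classes = lbl.nonzero(as_tuple=True)[0].tolist()
print(active_classes)  # for the sample above: [18]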

Selecting Bands#

The Dataset also supports different channel configurations; however, setting the selected channels is only supported via the image size, and only a limited number of combinations is available. To see the available combinations, call BENv1_DataSet.BENv1DataSet.get_available_channel_configurations(). Alternatively, a faulty configuration will display the possibilities as well whilst raising an AssertionError.

Each configuration works like setting the respective number of channels as the bands parameter of the LMDBReader.

BENv1_DataSet.BENv1DataSet.get_available_channel_configurations()
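
For example, a 3-channel (RGB) configuration can be requested via the image size. A sketch, assuming the img_size parameter follows the (channels, height, width) convention of the 12-channel default:

ds_rgb = BENv1_DataSet.BENv1DataSet(
    data_dirs=my_data_path,  # path to the dataset
    img_size=(3, 120, 120),  # assumed: 3 channels selects the RGB configuration
)
img, _ = ds_rgb[0]
print(img.shape)  # expected: torch.Size([3, 120, 120])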

Splits#

It is possible to load only a specific split ('train', 'val' or 'test') of the dataset. The images to load are specified via the csv files in the same folder as the LMDB file. By default (None), all three splits are loaded into the same Dataset.

_ = BENv1_DataSet.BENv1DataSet(
    data_dirs=my_data_path,  # path to the dataset
    split="train"
)

Restricting the number of loaded images#

It is also possible to restrict the number of images indexed. By setting max_len = n, only the first n images (in alphabetical order based on their S2 name) will be loaded. A max_len of None, -1 or larger than the number of images in the csv file(s) (in this case 25) equals the load-all-images behaviour.

_ = BENv1_DataSet.BENv1DataSet(
    data_dirs=my_data_path,  # path to the dataset
    split="train",
    max_len=10
)
_ = BENv1_DataSet.BENv1DataSet(
    data_dirs=my_data_path,  # path to the dataset
    split="train",
    max_len=100
)

BENv1DataModule#

This class is a Lightning DataModule that wraps the BENv1DataSet. It automatically generates a DataLoader per split with augmentations, shuffling, etc., depending on the split. All images are resized and normalized; images in the train set are additionally augmented with basic noise and flipping/rotation. The train split is also shuffled, however this can be overridden (see below).
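
The default pipelines can be replaced with custom ones. A sketch, assuming the parameter names train_transforms and eval_transforms (inferred from the "[WARNING] Using default train transform." messages shown below; verify against your configilm version):

from torchvision import transforms

# Hypothetical custom training pipeline operating on image tensors.
my_train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),              # basic augmentation
    transforms.Resize((120, 120), antialias=True),  # match the expected image size
])
_ = BENv1_DataModule.BENv1DataModule(
    data_dirs=my_data_path,                # path to the dataset
    train_transforms=my_train_transforms,  # assumed parameter name
)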

To use a DataModule, the setup() function has to be called first. This populates the Dataset splits inside the DataModule. Depending on the stage ('fit', 'test' or None), setup will prepare only the train & validation Datasets, only the test Dataset, or all three.

dm = BENv1_DataModule.BENv1DataModule(
    data_dirs=my_data_path  # path to the dataset
)
print("Before:")
print(dm.train_ds)
print(dm.val_ds)
print(dm.test_ds)

print("\n=== SETUP ===")
dm.setup(stage="fit")
print("=== END SETUP ===\n")
print("After:")
print(dm.train_ds)
print(dm.val_ds)
print(dm.test_ds)
[WARNING] Using default train transform.
[WARNING] Using default eval transform.
Before:
None
None
None

=== SETUP ===
=== END SETUP ===

After:
<configilm.extra.DataSets.BENv1_DataSet.BENv1DataSet object at 0x7fe315538b80>
<configilm.extra.DataSets.BENv1_DataSet.BENv1DataSet object at 0x7fe317faa320>
None

Afterwards the pytorch DataLoaders can easily be accessed. Note that \(len(DL) = \lceil \frac{len(DS)}{batch\_size} \rceil\); with the default batch_size of 16, the train split used here therefore fits into a single batch.

train_loader = dm.train_dataloader()
print(len(train_loader))
1
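
The relation can be verified manually once setup() has populated train_ds (assuming len() on the Dataset returns the number of indexed images):

import math

# ceil(len(train_ds) / batch_size) should equal len(train_loader)
print(math.ceil(len(dm.train_ds) / 16))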

In addition to the DataLoader settings, the DataModule has a parameter each for data_dirs, the image size and max_len, which are passed through to the DataSet.

DataLoader settings#

The DataLoaders have three settable parameters: batch_size, num_workers_dataloader and shuffle, with 16, os.cpu_count() / 2 and None as their respective default values. A shuffle of None means that the train set is shuffled but validation and test sets are not. Changing this setting is accompanied by a printed hint message.

The use of pinned memory is not changeable; it is set to True if a CUDA-enabled device is found and False otherwise.
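
The resulting setting can be inspected on a created loader, since pin_memory is a standard attribute of torch.utils.data.DataLoader:

import torch

loader = dm.train_dataloader()
# The loader pins memory exactly when a CUDA device is available.
print(loader.pin_memory, torch.cuda.is_available())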

dm = BENv1_DataModule.BENv1DataModule(
    data_dirs=my_data_path,  # path to the dataset
    batch_size=4
)
print("\n=== SETUP ===")
dm.setup(stage="fit")
print("=== END SETUP ===\n")
print(len(dm.train_dataloader()))
[WARNING] Using default train transform.
[WARNING] Using default eval transform.

=== SETUP ===
=== END SETUP ===

3
_ = BENv1_DataModule.BENv1DataModule(
    data_dirs=my_data_path,  # path to the dataset
    shuffle=False
)
[WARNING] Shuffle was set to False. This is not recommended for most configuration. Use shuffle=None (default) for recommended configuration.
[WARNING] Using default train transform.
[WARNING] Using default eval transform.
_ = BENv1_DataModule.BENv1DataModule(
    data_dirs=my_data_path,  # path to the dataset
    num_workers_dataloader=2
)
[WARNING] Using default train transform.
[WARNING] Using default eval transform.
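
Putting everything together, the DataModule can be passed directly to a pytorch_lightning Trainer. A sketch, where MyLitModel stands in for any LightningModule suited to 19-class multi-label classification (it is not part of configilm):

import pytorch_lightning as pl

dm = BENv1_DataModule.BENv1DataModule(
    data_dirs=my_data_path  # path to the dataset
)
model = MyLitModel()  # hypothetical LightningModule
trainer = pl.Trainer(max_epochs=1)
trainer.fit(model, datamodule=dm)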