BigEarthNet v2.0 / refined BigEarthNet#

This page describes the usage of the Dataloader and Datamodule for refined BigEarthNet (also known as reBEN or BigEarthNet v2.0), a multi-spectral, multi-label Remote Sensing Land-Use/Land-Cover classification dataset.

The refined BigEarthNet dataset was initially published in Clasen et al. [1].

For detailed information on the dataset itself, please refer to the publications.

The dataset integration is split into two modules containing two classes: a standard torch.utils.data.Dataset and a pytorch_lightning.LightningDataModule that encapsulates the Dataset for easy use in pytorch_lightning applications. The Dataset uses a BENv2LMDBReader to read images and labels from an LMDB file. Labels are returned in their 19-label version as a one-hot vector.

BENv2DataSet#

In its most basic form, the Dataset only needs the paths to the LMDB file and the metadata parquet files. Note that from an OS point of view, LMDB files are folders. By default, this Dataset will load 3 channels (10 m RGB Sentinel-2).

The full data path structure expected is

my_data_path = {
    "images_lmdb": "/path/to/BigEarthNetEncoded.lmdb",
    "metadata_parquet": "/path/to/metadata.parquet",
    "metadata_snow_cloud_parquet": "/path/to/metadata_snow_cloud.parquet",
}

Note that the keys have to match exactly, while the paths can be chosen freely.

from configilm import util
util.MESSAGE_LEVEL = util.MessageLevel.INFO  # use INFO to see all messages

from configilm.extra.DataSets import BENv2_DataSet
from configilm.extra.DataModules import BENv2_DataModule


ds = BENv2_DataSet.BENv2DataSet(
    data_dirs=my_data_path  # dict of paths as defined above
)

img, lbl = ds[0]  # load the first sample
print(f"Size: {img.shape}")
print(f"Labels:\n{lbl}")
[INFO]    Loading BEN data for None...
[INFO]        18 patches indexed
[INFO]        18 pre-filtered patches indexed
[INFO]        18 filtered patches indexed
[INFO]    Merged metadata with snow/cloud metadata
[INFO]    Loaded 24 labels
[INFO]    Loaded 24 keys
[INFO]    Loaded mapping created
[INFO]    Opening LMDB environment ...
Size: torch.Size([3, 120, 120])
Labels:
tensor([0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
        0.], dtype=torch.float64)
(Figure: RGB visualization of the loaded patch)
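
The returned label is a 19-dimensional multi-hot vector. As a minimal sketch using only plain torch (no configilm-specific API assumed), the indices of the active classes can be recovered like this:

# Recover the indices of the active classes from the multi-hot label vector.
active_classes = lbl.nonzero(as_tuple=True)[0].tolist()
print(active_classes)  # for the sample above: [1, 3, 12, 15]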

Selecting Bands#

The Dataset also supports different channel configurations; however, setting the selected channels is only supported via the image size selection, and only a limited number of combinations are available. To see the available combinations, call BENv2_DataSet.BENv2DataSet.get_available_channel_configurations(). Alternatively, a faulty configuration will display the possibilities as well while raising an AssertionError.

The configurations work like setting the respective number as the bands parameter of the LMDBReader.

BENv2_DataSet.BENv2DataSet.get_available_channel_configurations()
[HINT]    Available channel configurations are:
[HINT]          2 -> Sentinel-1
[HINT]          3 -> RGB
[HINT]          4 -> 10m Sentinel-2
[HINT]         10 -> 10m + 20m Sentinel-2 (in original order)
[HINT]         12 -> Sentinel-1 + 10m + 20m Sentinel-2 (in original order)
[HINT]         14 -> Sentinel-1 + 10m + 20m + 60m Sentinel-2 (in original order)
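
As a sketch (assuming the channel count is selected via the first entry of the img_size parameter, in (channels, height, width) order), a 12-channel Sentinel-1 + Sentinel-2 Dataset could be requested like this:

# Sketch: request the 12-channel configuration (Sentinel-1 + 10m + 20m Sentinel-2).
# img_size in (channels, height, width) order is an assumption.
ds_12 = BENv2_DataSet.BENv2DataSet(
    data_dirs=my_data_path,
    img_size=(12, 120, 120),
)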

Splits#

It is possible to load only a specific split ('train', 'val' or 'test') into the Dataset. The images loaded are specified using the parquet files in the data_dirs parameter. By default (None), all three splits are loaded into the same Dataset.

_ = BENv2_DataSet.BENv2DataSet(
    data_dirs=my_data_path,  # dict of paths as defined above
    split="train"
)
[INFO]    Loading BEN data for train...
[INFO]        6 patches indexed
[INFO]        6 pre-filtered patches indexed
[INFO]        6 filtered patches indexed
[INFO]    Merged metadata with snow/cloud metadata
[INFO]    Loaded 24 labels
[INFO]    Loaded 24 keys
[INFO]    Loaded mapping created

Restricting the number of loaded images#

It is also possible to restrict the number of images indexed. By setting max_len = n, only the first n images (in alphabetical order based on their S2-name) will be loaded. A max_len of None, -1, or larger than the number of images in the parquet file(s) (in this case 6) equals the load-all-images behaviour.

_ = BENv2_DataSet.BENv2DataSet(
    data_dirs=my_data_path,  # dict of paths as defined above
    split="train",
    max_len=5
)
[INFO]    Loading BEN data for train...
[INFO]        6 patches indexed
[INFO]        6 pre-filtered patches indexed
[INFO]        5 filtered patches indexed
[INFO]    Merged metadata with snow/cloud metadata
[INFO]    Loaded 24 labels
[INFO]    Loaded 24 keys
[INFO]    Loaded mapping created
_ = BENv2_DataSet.BENv2DataSet(
    data_dirs=my_data_path,  # dict of paths as defined above
    split="train",
    max_len=100
)
[INFO]    Loading BEN data for train...
[INFO]        6 patches indexed
[INFO]        6 pre-filtered patches indexed
[INFO]        6 filtered patches indexed
[INFO]    Merged metadata with snow/cloud metadata
[INFO]    Loaded 24 labels
[INFO]    Loaded 24 keys
[INFO]    Loaded mapping created

BENv2DataModule#

This class is a LightningDataModule that wraps the BENv2DataSet. It automatically generates a DataLoader per split with augmentations, shuffling, etc., depending on the split. All images are resized and normalized; images in the train set are additionally augmented with noise and flipping/rotation. The train split is also shuffled; however, this can be overwritten (see below).

To use a DataModule, the setup() function has to be called. This populates the Dataset splits inside the DataModule. Depending on the stage ('fit', 'test' or None), setup will prepare only the train & validation Datasets, only the test Dataset, or all three.

dm = BENv2_DataModule.BENv2DataModule(
    data_dirs=my_data_path  # dict of paths as defined above
)
print("Before:")
print(dm.train_ds)
print(dm.val_ds)
print(dm.test_ds)

print("\n=== SETUP ===")
dm.setup(stage="fit")
print("=== END SETUP ===\n")
print("After:")
print(dm.train_ds)
print(dm.val_ds)
print(dm.test_ds)
[WARNING] Using default train transform.
[WARNING] Using default eval transform.
Before:
None
None
None

=== SETUP ===
[INFO]    Loading BEN data for train...
[INFO]        6 patches indexed
[INFO]        6 pre-filtered patches indexed
[INFO]        6 filtered patches indexed
[INFO]    Merged metadata with snow/cloud metadata
[INFO]    Loaded 24 labels
[INFO]    Loaded 24 keys
[INFO]    Loaded mapping created
[INFO]    Loading BEN data for validation...
[INFO]        6 patches indexed
[INFO]        6 pre-filtered patches indexed
[INFO]        6 filtered patches indexed
[INFO]    Merged metadata with snow/cloud metadata
[INFO]    Loaded 24 labels
[INFO]    Loaded 24 keys
[INFO]    Loaded mapping created
[INFO]      Total training samples:        6  Total validation samples:        6
=== END SETUP ===

After:
<configilm.extra.DataSets.BENv2_DataSet.BENv2DataSet object at 0x7f13e4704340>
<configilm.extra.DataSets.BENv2_DataSet.BENv2DataSet object at 0x7f14c8fcd4b0>
None

Afterwards, the pytorch DataLoader can be accessed easily. Note that \(\mathrm{len}(DL) = \left\lceil \frac{\mathrm{len}(DS)}{\mathrm{batch\_size}} \right\rceil\); therefore, with the default batch_size of 16 here: \(\lceil 6/16 \rceil = 1\).

train_loader = dm.train_dataloader()
print(len(train_loader))
1
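
The same relation can be verified directly with the standard library:

import math

# len(DataLoader) = ceil(len(Dataset) / batch_size)
print(math.ceil(len(dm.train_ds) / 16))  # ceil(6 / 16) = 1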

In addition to the DataLoader settings, the DataModule has one parameter each for data_dirs, img_size and max_len, which are passed through to the DataSet.
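
A minimal sketch of passing these through (img_size and its (channels, height, width) order are assumptions, analogous to the Dataset usage above):

# Sketch: Dataset options are forwarded by the DataModule to the wrapped DataSet.
dm = BENv2_DataModule.BENv2DataModule(
    data_dirs=my_data_path,
    img_size=(3, 120, 120),  # assumption: (channels, height, width)
    max_len=5,
)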

DataLoader settings#

The DataLoaders have three settable parameters: batch_size, num_workers_dataloader and shuffle, with 16, os.cpu_count() / 2 and None as their respective default values. A shuffle of None means that the train set is shuffled but the validation and test sets are not. Changing this setting will be accompanied by a printed warning message.

The use of pinned memory is not changeable; it is set to True if a CUDA-enabled device is found and False otherwise.

dm = BENv2_DataModule.BENv2DataModule(
    data_dirs=my_data_path,  # dict of paths as defined above
    batch_size=4
)
print("\n=== SETUP ===")
dm.setup(stage="fit")
print("=== END SETUP ===\n")
print(len(dm.train_dataloader()))
[WARNING] Using default train transform.
[WARNING] Using default eval transform.

=== SETUP ===
[INFO]    Loading BEN data for train...
[INFO]        6 patches indexed
[INFO]        6 pre-filtered patches indexed
[INFO]        6 filtered patches indexed
[INFO]    Merged metadata with snow/cloud metadata
[INFO]    Loaded 24 labels
[INFO]    Loaded 24 keys
[INFO]    Loaded mapping created
[INFO]    Loading BEN data for validation...
[INFO]        6 patches indexed
[INFO]        6 pre-filtered patches indexed
[INFO]        6 filtered patches indexed
[INFO]    Merged metadata with snow/cloud metadata
[INFO]    Loaded 24 labels
[INFO]    Loaded 24 keys
[INFO]    Loaded mapping created
[INFO]      Total training samples:        6  Total validation samples:        6
=== END SETUP ===

2
_ = BENv2_DataModule.BENv2DataModule(
    data_dirs=my_data_path,  # dict of paths as defined above
    shuffle=False
)
[WARNING] Shuffle was set to False. This is not recommended for most configuration. Use shuffle=None (default) for recommended configuration.
[WARNING] Using default train transform.
[WARNING] Using default eval transform.
_ = BENv2_DataModule.BENv2DataModule(
    data_dirs=my_data_path,  # dict of paths as defined above
    num_workers_dataloader=2
)
[WARNING] Using default train transform.
[WARNING] Using default eval transform.
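
Finally, a minimal sketch of how the DataModule plugs into a pytorch_lightning training run. MyModel is a hypothetical LightningModule placeholder, not part of configilm:

import pytorch_lightning as pl

# MyModel is a hypothetical LightningModule; any module consuming batches of
# (images, 19-dim multi-hot labels) works here.
dm = BENv2_DataModule.BENv2DataModule(data_dirs=my_data_path)
trainer = pl.Trainer(max_epochs=1)
trainer.fit(model=MyModel(), datamodule=dm)  # Lightning calls dm.setup() internally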