BigEarthNet v1.0#
This page describes the usage of the Dataloader and Datamodule for BigEarthNet v1.0, a multispectral, multi-label Remote Sensing Land-Use/Land-Cover classification dataset.
The BigEarthNet v1.0 (BigEarthNet-S2) dataset was initially published in Sumbul et al. [6] and later updated to the multi-modal BigEarthNet v1.0 in Sumbul et al. [7].
For detailed information on the dataset itself, please refer to the publications and the BigEarthNet Guide.
The dataset is divided into two modules that contain two classes: a standard torch.utils.data.Dataset and a pytorch_lightning.LightningDataModule that encapsulates the Dataset for easy use in pytorch_lightning applications. The Dataset uses a BENv1LMDBReader to read images and labels from an LMDB file. Labels are returned in their 19-label version as a one-hot vector.
BENv1DataSet#
In its most basic form, the Dataset only needs the base path of the LMDB file and the csv files. Note that, from an OS point of view, LMDB files are folders. This Dataset will load 12 channels (10m + 20m Sentinel-2 + 10m Sentinel-1).
The full data path structure expected is
datapath = {
    "images_lmdb": "/path/to/BigEarthNetEncoded.lmdb",
    "train_data": "/path/to/train.csv",
    "val_data": "/path/to/val.csv",
    "test_data": "/path/to/test.csv",
}
Note that the keys have to match exactly, while the paths can be chosen freely.
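All examples below use a variable my_data_path, which is assumed to be exactly such a dictionary (a sketch with placeholder paths; substitute your own locations):
my_data_path = {
    "images_lmdb": "/path/to/BigEarthNetEncoded.lmdb",  # LMDB folder, not a single file
    "train_data": "/path/to/train.csv",
    "val_data": "/path/to/val.csv",
    "test_data": "/path/to/test.csv",
}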
from configilm import util
util.MESSAGE_LEVEL = util.MessageLevel.INFO # use INFO to see all messages
from configilm.extra.DataSets import BENv1_DataSet
from configilm.extra.DataModules import BENv1_DataModule
ds = BENv1_DataSet.BENv1DataSet(
    data_dirs=my_data_path  # path to the dataset (set by the user)
)
img, lbl = ds[26]
img = img[:3]  # only choose RGB channels
print(f"Size: {img.shape}")
print(f"Labels:\n{lbl}")
[INFO] Loading BEN data for None...
[INFO] 30 patches indexed
[INFO] 30 pre-filtered patches indexed
[INFO] 30 filtered patches indexed
Size: torch.Size([3, 120, 120])
Labels:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
1.])
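Since labels come back as a 19-dimensional one-hot vector, the indices of the active classes can be recovered with a plain torch operation (a minimal sketch, using no configilm-specific API):
import torch

# indices of all classes set to 1 in the one-hot label vector
active_classes = torch.nonzero(lbl, as_tuple=True)[0]
print(active_classes)  # for the label vector above: tensor([18])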
Selecting Bands#
The Dataset also supports different channel configurations; however, setting the selected channels is only supported via the image size selection, and only limited combinations are available. To see the available combinations, call BENv1_DataSet.BENv1DataSet.get_available_channel_configurations(). Alternatively, a faulty configuration will display the possibilities as well while raising an AssertionError.
The configurations work like setting the respective number as the bands parameter of the BENv1LMDBReader.
BENv1_DataSet.BENv1DataSet.get_available_channel_configurations()
[HINT] Available channel configurations are:
[HINT] 2 -> Sentinel-1
[HINT] 3 -> RGB
[HINT] 4 -> 10m Sentinel-2
[HINT] 10 -> 10m + 20m Sentinel-2
[HINT] 12 -> 10m + 20m Sentinel-2 + 10m Sentinel-1
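For example, assuming the channel configuration is selected via the first entry of an img_size parameter of the form (channels, height, width), consistent with the image-size-based selection described above (the exact parameter name is an assumption here), loading only the 10m Sentinel-2 bands could look like this:
_ = BENv1_DataSet.BENv1DataSet(
    data_dirs=my_data_path,
    img_size=(4, 120, 120)  # assumption: 4 channels -> 10m Sentinel-2 (see HINT output above)
)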
Splits#
It is possible to load only a specific split ('train', 'val' or 'test') in the dataset. The images loaded are determined by the csv files given via the data_dirs parameter. By default (None), all three splits are loaded into the same Dataset.
_ = BENv1_DataSet.BENv1DataSet(
    data_dirs=my_data_path,  # path to the dataset
    split="train"
)
[INFO] Loading BEN data for train...
[INFO] 10 patches indexed
[INFO] 10 pre-filtered patches indexed
[INFO] 10 filtered patches indexed
Restricting the number of loaded images#
It is also possible to restrict the number of images indexed. By setting max_len = n, only the first n images (in alphabetical order based on their S2 name) will be loaded. A max_len of None, -1, or any value larger than the number of images in the csv file(s) (in this case 10) results in the load-all-images behaviour.
_ = BENv1_DataSet.BENv1DataSet(
    data_dirs=my_data_path,  # path to the dataset
    split="train",
    max_len=5
)
[INFO] Loading BEN data for train...
[INFO] 10 patches indexed
[INFO] 10 pre-filtered patches indexed
[INFO] 10 filtered patches indexed
_ = BENv1_DataSet.BENv1DataSet(
    data_dirs=my_data_path,  # path to the dataset
    split="train",
    max_len=100
)
[INFO] Loading BEN data for train...
[INFO] 10 patches indexed
[INFO] 10 pre-filtered patches indexed
[INFO] 10 filtered patches indexed
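The restriction is reflected in the length of the Dataset itself, which can be checked directly (a small sketch, assuming the usual torch Dataset __len__ behaviour):
ds_small = BENv1_DataSet.BENv1DataSet(
    data_dirs=my_data_path,
    split="train",
    max_len=5
)
print(len(ds_small))  # expected: 5, the restricted number of samples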
BENv1DataModule#
This class is a Lightning DataModule that wraps the BENv1DataSet. It automatically generates a DataLoader per split with augmentation, shuffling, etc., depending on the split. All images are resized and normalized; images in the train set are additionally basic-augmented via noise and flipping/rotation. The train split is also shuffled, however this can be overridden (see below).
To use a DataModule, the setup() function has to be called. This populates the Dataset splits inside the DataModule. Depending on the stage ('fit', 'test' or None), the setup will prepare only the train & validation Datasets, only the test Dataset, or all three.
dm = BENv1_DataModule.BENv1DataModule(
    data_dirs=my_data_path  # path to the dataset
)
print("Before:")
print(dm.train_ds)
print(dm.val_ds)
print(dm.test_ds)
print("\n=== SETUP ===")
dm.setup(stage="fit")
print("=== END SETUP ===\n")
print("After:")
print(dm.train_ds)
print(dm.val_ds)
print(dm.test_ds)
[WARNING] Using default train transform.
[WARNING] Using default eval transform.
Before:
None
None
None
=== SETUP ===
[INFO] Loading BEN data for train...
[INFO] 10 patches indexed
[INFO] 10 pre-filtered patches indexed
[INFO] 10 filtered patches indexed
[INFO] Loading BEN data for val...
[INFO] 10 patches indexed
[INFO] 10 pre-filtered patches indexed
[INFO] 10 filtered patches indexed
[INFO] Total training samples: 10 Total validation samples: 10
=== END SETUP ===
After:
<configilm.extra.DataSets.BENv1_DataSet.BENv1DataSet object at 0x75c137134e50>
<configilm.extra.DataSets.BENv1_DataSet.BENv1DataSet object at 0x75c13192d3f0>
None
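Analogously, setup(stage="test") populates only the test Dataset, and stage=None prepares all three (a sketch mirroring the 'fit' example above; output not shown):
dm.setup(stage="test")
print(dm.test_ds)  # now a BENv1DataSet object instead of None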
Afterwards, the pytorch DataLoader can easily be accessed. Note that \(len(DL) = \lceil \frac{len(DS)}{batch\_size} \rceil\); therefore, with the default batch_size of 16 here: \(\lceil 10/16 \rceil = 1\).
train_loader = dm.train_dataloader()
print(len(train_loader))
1
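The same number can be computed by hand (plain Python, independent of configilm):
import math

# len(DataLoader) = ceil(len(DataSet) / batch_size)
print(math.ceil(10 / 16))  # -> 1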
In addition to the DataLoader settings, the DataModule has a parameter each for data_dirs, img_size and max_len, which are passed through to the DataSet.
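A combined call could look like this (a sketch; it assumes the pass-through parameters keep the same names as in the DataSet examples above, which is not verified here):
_ = BENv1_DataModule.BENv1DataModule(
    data_dirs=my_data_path,  # passed through to the DataSets
    max_len=5  # assumed pass-through parameter name, matching the DataSet
)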
DataLoader settings#
The DataLoaders have three settable parameters: batch_size, num_workers_dataloader and shuffle, with 16, os.cpu_count() / 2 and None as their respective default values. A shuffle of None means that the train set is shuffled but the validation and test sets are not. Changing this setting will be accompanied by a printed message hint.
Not changeable is the usage of pinned memory, which is set to True if a cuda-enabled device is found and False otherwise.
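This fixed pinned-memory behaviour is presumably equivalent to the following check (a sketch, not the library's actual code):
import torch

pin_memory = torch.cuda.is_available()  # True on a cuda-enabled device, False otherwise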
dm = BENv1_DataModule.BENv1DataModule(
    data_dirs=my_data_path,  # path to the dataset
    batch_size=4
)
print("\n=== SETUP ===")
dm.setup(stage="fit")
print("=== END SETUP ===\n")
print(len(dm.train_dataloader()))
[WARNING] Using default train transform.
[WARNING] Using default eval transform.
=== SETUP ===
[INFO] Loading BEN data for train...
[INFO] 10 patches indexed
[INFO] 10 pre-filtered patches indexed
[INFO] 10 filtered patches indexed
[INFO] Loading BEN data for val...
[INFO] 10 patches indexed
[INFO] 10 pre-filtered patches indexed
[INFO] 10 filtered patches indexed
[INFO] Total training samples: 10 Total validation samples: 10
=== END SETUP ===
3
_ = BENv1_DataModule.BENv1DataModule(
    data_dirs=my_data_path,  # path to the dataset
    shuffle=False
)
[WARNING] Shuffle was set to False. This is not recommended for most configuration. Use shuffle=None (default) for recommended configuration.
[WARNING] Using default train transform.
[WARNING] Using default eval transform.
_ = BENv1_DataModule.BENv1DataModule(
    data_dirs=my_data_path,  # path to the dataset
    num_workers_dataloader=2
)
[WARNING] Using default train transform.
[WARNING] Using default eval transform.
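As a final sanity check, one batch can be drawn from the train DataLoader; the shapes below follow the defaults discussed above and the standard pytorch collate behaviour (a sketch, expected rather than verified output):
dm = BENv1_DataModule.BENv1DataModule(
    data_dirs=my_data_path,
    batch_size=4
)
dm.setup(stage="fit")
imgs, lbls = next(iter(dm.train_dataloader()))
print(imgs.shape)  # expected: torch.Size([4, 12, 120, 120]) with the default 12 channels
print(lbls.shape)  # expected: torch.Size([4, 19]) one-hot label vectors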