ILMConfiguration#
Framework for combining vision and language models
- class configilm.ConfigILM.ConfigILM#
- __init__(config)#
Creates a ConfigILM model according to the provided ILMConfiguration
- Parameters:
config (ILMConfiguration) – Configuration of the model. See ILMConfiguration for details
- Returns:
self
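A minimal construction sketch; the ILMConfiguration fields are taken from the dataclass documented below, while the import path for ILMType is an assumption:

```python
from configilm.ConfigILM import ConfigILM, ILMConfiguration, ILMType  # ILMType path assumed

# Image-only classifier: only the timm image model is configured.
config = ILMConfiguration(
    timm_model_name="resnet18",   # any name from timm.list_models()
    image_size=120,
    channels=3,
    classes=10,
    network_type=ILMType.IMAGE_CLASSIFICATION,
)
model = ConfigILM(config)
```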
- forward(batch)#
Model forward function. Decides which parts of the model to use based on the configuration, after checking that the input is compatible with this kind of network.
- Parameters:
batch – Input batch of a single modality or a list of batches of multiple modalities
- Note: The text input of the VQA model will be automatically masked based on the padding tokens of the tokenizer.
- Returns:
logits of the network
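A sketch of both call patterns. The exact batch layout (a single image tensor for image classification; a list of image batch and token ids, in that order, for VQA) is an assumption based on the parameter description above, and vqa_model is a hypothetical instance configured with a hf_model_name and ILMType.VQA_CLASSIFICATION:

```python
import torch

# IMAGE_CLASSIFICATION: a single-modality batch of shape [B, C, H, W];
# `model` is the instance from the construction sketch above.
images = torch.rand(4, 3, 120, 120)
logits = model(images)                    # -> [4, classes]

# VQA_CLASSIFICATION: a list of batches, one per modality (order assumed).
# `vqa_model` is a hypothetical VQA-configured ConfigILM instance.
# Padding tokens in the text batch are masked automatically, as noted above.
token_ids = torch.ones(4, 32, dtype=torch.long)  # [B, max_sequence_length]
logits = vqa_model([images, token_ids])   # -> [4, classes]
```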
- get_tokenizer()#
Returns the tokenizer of the text model, if applicable.
- Returns:
Tokenizer of the specified huggingface text model.
- Raises:
AttributeError – if no text model is used
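A sketch of preparing VQA text input with the returned tokenizer, assuming it behaves like a standard huggingface tokenizer (the keyword arguments shown are plain huggingface API, not ConfigILM-specific):

```python
tokenizer = model.get_tokenizer()   # raises AttributeError without a text model
encoded = tokenizer(
    "What is shown in the image?",
    padding="max_length",
    max_length=32,                  # should match max_sequence_length
    truncation=True,
    return_tensors="pt",
)
token_ids = encoded["input_ids"]    # usable as the text part of a VQA batch
```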
- to(*args, **kwargs)#
Moves the model parts as well as the fusion method and activation to a device, or casts them to a different type.
- Parameters:
args – device, dtype, or other accepted formats. See torch.nn.Module.to()
kwargs – device, dtype, or other accepted formats. See torch.nn.Module.to()
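For example, with standard torch.nn.Module.to() semantics:

```python
import torch

# Moves image model, text model, fusion method, and activation together.
model = model.to(torch.device("cuda:0"))

# Or cast all parts to half precision:
model = model.to(torch.float16)
```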
- class configilm.ConfigILM.ILMConfiguration#
Configuration dataclass that defines all properties of ConfigILM models. The datatypes within the dataclass are selected to be compatible with JSON serialization and deserialization.
- Parameters:
channels –
Number of input channels for the image model.
- Default:
3
class_names –
Names of the classes in the classifier. Used for class-specific performance reporting. If None, classes are enumerated with numbers.
- Default:
None
classes –
Number of classes for the output of IMAGE_CLASSIFICATION classifier or VQA_CLASSIFICATION classifier.
- Default:
10
custom_fusion_activation –
Activation function inside all classification head layers. Only used during initialization; afterwards, the activation is accessible via the fusion_activation property. Standard options from torch.nn or torch.nn.functional can be passed as strings (e.g. "nn.Tanh()") or as callables (e.g. nn.Tanh()). A custom function has to be a callable with a single tensor input and a single tensor output, where the tensor is one-dimensional (plus batch dimension). Such a custom activation is passed as a tuple (str, m), where str is the string representation of the function and m is the callable, which has to be a torch.nn.Module initialized with the correct parameters, e.g. ("tanh", nn.Tanh()). A sketch follows the parameter list below.
- Default:
nn.Tanh()
custom_fusion_method –
Fusion method that combines text and image features. Callable with two tensor inputs and a single tensor output, where each tensor is one-dimensional (plus batch dimension). The first input is the flattened output of the image model with dimension fusion_in; the second input is the flattened output of the text model, also with dimension fusion_in. The output should have dimension fusion_out. Only used during initialization; afterwards, the fusion method is accessible via the fusion_method property. A custom function is passed as a tuple (str, f), where str is the string representation of the function and f is the callable, e.g. ("mul", torch.mul). See the sketch following this parameter list.
- Default:
torch.mul
drop_rate –
Dropout rate for timm models.
- Default:
0.2
drop_path_rate –
Drop path rate for timm models.
- Default:
None
fusion_dropout_rate –
Drop rate inside all classification head layers.
- Default:
0.25
fusion_hidden –
Number of neurons inside the hidden layer of the classification head.
- Default:
256
fusion_in –
Input dimension to the fusion method.
- Default:
512
fusion_out –
Output dimension of the fusion method. If None, the output dimension equals the input dimension (e.g. for point-wise operations).
- Default:
None
hf_model_name –
Name of the text model from huggingface if applicable. The model has to be a model for text sequence classification.
- Default:
None
image_size –
Size of input images for image models. Only applicable for some specific models.
- Default:
120
load_pretrained_hf_if_available –
Load pretrained weights for huggingface model.
- Default:
True
load_pretrained_timm_if_available –
Load pretrained weights for timm model.
- Default:
False
max_sequence_length –
Maximum sequence length of huggingface models. Sequences that are shorter will be padded, longer ones are cropped to this maximum length.
- Default:
32
network_type –
Type of ILM-network. Available types are listed in the ILMType enum.
- Default:
ILMType.IMAGE_CLASSIFICATION
t_dropout_rate –
Dropout rate of the mapping from the huggingface text model to the dimension of the fusion method.
- Default:
0.25
timm_model_name –
(required) Name of the image model as defined in timm.list_models()
use_pooler_output –
Use the pooler output of the huggingface model if applicable and available. Otherwise, the last hidden features are flattened and used instead.
- Default:
True
v_dropout_rate –
Dropout rate of the mapping from the timm image model to the dimension of the fusion method.
- Default:
0.25
visual_features_out –
Output dimension of the timm image model. Dimension will be linearly mapped to fusion_in dimension with activation and dropout as specified.
- Default:
512
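As referenced in the custom_fusion_method and custom_fusion_activation descriptions above, both accept a (str, callable) tuple. A minimal sketch; the concatenation-based fusion and the huggingface model name are purely illustrative:

```python
import torch
from torch import nn
from configilm.ConfigILM import ILMConfiguration, ILMType  # ILMType path assumed

def concat_fusion(v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # Inputs are [B, fusion_in] each; output is [B, 2 * fusion_in],
    # so fusion_out has to be set accordingly.
    return torch.cat([v, t], dim=1)

config = ILMConfiguration(
    timm_model_name="resnet18",
    hf_model_name="prajjwal1/bert-tiny",    # illustrative text model
    network_type=ILMType.VQA_CLASSIFICATION,
    fusion_in=512,
    fusion_out=1024,                         # 2 * fusion_in after concatenation
    custom_fusion_method=("concat", concat_fusion),
    custom_fusion_activation=("gelu", nn.GELU()),
)
```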
- __init__(timm_model_name, hf_model_name=None, image_size=120, channels=3, classes=10, class_names=None, network_type=ILMType.IMAGE_CLASSIFICATION, visual_features_out=512, fusion_in=512, fusion_out=None, fusion_hidden=256, v_dropout_rate=0.25, t_dropout_rate=0.25, fusion_dropout_rate=0.25, _fusion_method='torch.mul', _fusion_activation='nn.Tanh()', drop_rate=0.2, drop_path_rate=None, use_pooler_output=True, max_sequence_length=32, load_pretrained_timm_if_available=False, load_pretrained_hf_if_available=True, custom_fusion_method=None, custom_fusion_activation=None)#
- Parameters:
timm_model_name (str) –
hf_model_name (Optional[str]) –
image_size (int) –
channels (int) –
classes (int) –
class_names (Optional[Sequence[str]]) –
network_type (ILMType) –
visual_features_out (int) –
fusion_in (int) –
fusion_out (Optional[int]) –
fusion_hidden (int) –
v_dropout_rate (float) –
t_dropout_rate (float) –
fusion_dropout_rate (float) –
_fusion_method (str) –
_fusion_activation (str) –
drop_rate (Optional[float]) –
drop_path_rate (Optional[float]) –
use_pooler_output (bool) –
max_sequence_length (int) –
load_pretrained_timm_if_available (bool) –
load_pretrained_hf_if_available (bool) –
custom_fusion_method (Optional[Union[Callable[[Tensor, Tensor], Tensor], str, Tuple[str, Callable]]]) –
custom_fusion_activation (Optional[Union[Callable[[Tensor], Tensor], str, Tuple[str, Callable]]]) –
- Return type:
None
- as_dict()#
Returns the configuration as a dictionary.
- channels: int = 3#
- class_names: Optional[Sequence[str]] = None#
- classes: int = 10#
- custom_fusion_activation: Optional[Union[Callable[[Tensor], Tensor], str, Tuple[str, Callable]]] = None#
- custom_fusion_method: Optional[Union[Callable[[Tensor, Tensor], Tensor], str, Tuple[str, Callable]]] = None#
- dif(other)#
Compares two ILMConfigurations and returns a dictionary with the differences.
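A sketch; the exact layout of the returned difference dictionary is an assumption:

```python
from configilm.ConfigILM import ILMConfiguration

cfg_a = ILMConfiguration(timm_model_name="resnet18")
cfg_b = ILMConfiguration(timm_model_name="resnet34", classes=25)

# Expected to report only the fields that differ here,
# i.e. timm_model_name and classes.
print(cfg_a.dif(cfg_b))
```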
- drop_path_rate: Optional[float] = None#
- drop_rate: Optional[float] = 0.2#
- classmethod from_json(json_string)#
Loads a configuration from a JSON string.
- property fusion_activation: Callable[[Tensor], Tensor]#
- fusion_dropout_rate: float = 0.25#
- fusion_in: int = 512#
- property fusion_method: Callable[[Tensor, Tensor], Tensor]#
- fusion_out: Optional[int] = None#
- hf_model_name: Optional[str] = None#
- image_size: int = 120#
- load_pretrained_hf_if_available: bool = True#
- load_pretrained_timm_if_available: bool = False#
- max_sequence_length: int = 32#
- t_dropout_rate: float = 0.25#
- timm_model_name: str#
- to_json()#
Returns the configuration as a JSON string.
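Together with from_json() above, this gives a serialization round trip, sketched here:

```python
from configilm.ConfigILM import ILMConfiguration

cfg = ILMConfiguration(timm_model_name="resnet18", classes=5)

json_string = cfg.to_json()                          # serialize
restored = ILMConfiguration.from_json(json_string)   # deserialize

# Dataclass equality; expected to hold as long as all fields are
# JSON-serializable (see the class docstring above).
assert restored == cfg
```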
- use_pooler_output: bool = True#
- v_dropout_rate: float = 0.25#
- visual_features_out: int = 512#