ILMConfiguration#

Framework for combining vision and language models

class configilm.ConfigILM.ConfigILM#
__init__(config)#

Creates a ConfigILM model according to the provided ILMConfiguration.

Parameters:

config (ILMConfiguration) – Configuration of the model. See ILMConfiguration for details.

Returns:

self
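
A minimal construction sketch; "resnet18" stands in for any image model listed by timm.list_models():

    from configilm.ConfigILM import ConfigILM, ILMConfiguration, ILMType

    # Pure image classification: only a timm image model is configured.
    cfg = ILMConfiguration(
        timm_model_name="resnet18",
        image_size=120,
        channels=3,
        classes=10,
        network_type=ILMType.IMAGE_CLASSIFICATION,
    )
    model = ConfigILM(cfg)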

forward(batch)#

Model forward function that decides, based on the configuration, which parts of the model to use after checking that the input is compatible with this kind of network.

Parameters:

batch – Input batch of a single modality or a list of batches of multiple modalities.

Note: The text input of the VQA model will be automatically masked based on the padding tokens of the tokenizer.

Returns:

Logits of the network.
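
A sketch of both call patterns; the exact batch layout, a single tensor for image classification and a two-element list for VQA, is assumed from the parameter description above:

    import torch
    from configilm.ConfigILM import ConfigILM, ILMConfiguration

    cfg = ILMConfiguration(timm_model_name="resnet18")  # defaults: 120x120 images, 3 channels, 10 classes
    model = ConfigILM(cfg)

    # Single modality: one image batch shaped (batch, channels, image_size, image_size).
    images = torch.rand(4, 3, 120, 120)
    logits = model(images)  # -> shape (4, 10)

    # For a VQA model, the batch is assumed to be a list of the two modalities,
    # e.g. model([images, token_ids]), where token_ids holds tokenized questions
    # of length max_sequence_length.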

get_tokenizer()#

Getter for the tokenizer of the text model, if applicable.

Returns:

Tokenizer of the specified huggingface text model.

Raises:

AttributeError – If no text model is used.
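
For example (a sketch; "prajjwal1/bert-tiny" stands in for any huggingface text sequence classification model and may be downloaded on first use):

    from configilm.ConfigILM import ConfigILM, ILMConfiguration, ILMType

    cfg = ILMConfiguration(
        timm_model_name="resnet18",
        hf_model_name="prajjwal1/bert-tiny",
        network_type=ILMType.VQA_CLASSIFICATION,
        max_sequence_length=32,
    )
    model = ConfigILM(cfg)
    tokenizer = model.get_tokenizer()

    # Pad/truncate a question to the configured maximum sequence length.
    token_ids = tokenizer(
        "How many buildings are in the image?",
        padding="max_length",
        truncation=True,
        max_length=cfg.max_sequence_length,
        return_tensors="pt",
    )["input_ids"]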

to(*args, **kwargs)#

Moves the model parts as well as the fusion methods and activations to a device or casts to a different type.

Parameters:
  • args – device, dtype or other specified formats. See nn.Module.to()

  • kwargs – device, dtype or other specified formats. See nn.Module.to()
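
A short sketch; since one call moves or casts every part, fusion method and activation included, the usual nn.Module idiom applies:

    import torch
    from configilm.ConfigILM import ConfigILM, ILMConfiguration

    model = ConfigILM(ILMConfiguration(timm_model_name="resnet18"))
    if torch.cuda.is_available():
        model = model.to(torch.device("cuda:0"))  # move all parts to the GPU
    model = model.to(torch.float64)               # or cast to a different dtype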

class configilm.ConfigILM.ILMConfiguration#

Configuration dataclass that defines all properties of ConfigILM models. The datatypes within the dataclass are selected to be compatible with json serialization and deserialization.

Parameters:
  • channels

    Number of input channels for the image model.

    Default:

    3

  • class_names

    Names of the classes in the classifier, usable for class-wise performance reporting. If None, classes will be enumerated with numbers.

    Default:

    None

  • classes

    Number of output classes of the IMAGE_CLASSIFICATION or VQA_CLASSIFICATION classifier.

    Default:

    10

  • custom_fusion_activation

    Activation function inside all classification head layers. Only used for initialization; after initialization, the activation is accessible via the fusion_activation property. Standard options from torch.nn or torch.nn.functional can be passed as strings (e.g. "nn.Tanh()") or as callables (e.g. nn.Tanh()). A custom function has to be a callable with a single input (tensor) and a single output (tensor), where the input tensor is one-dimensional (plus batch dimension). It is then passed as a tuple (str, m), where str is the string representation of the function and m is the callable, which has to be a torch.nn.Module initialized with the correct parameters, e.g. ("tanh", nn.Tanh()). See the sketch after this parameter list.

    Default:

    nn.Tanh()

  • custom_fusion_method

    Fusion method that combines text and image features: a callable with two inputs (tensor, tensor) and a single output (tensor), where each tensor is one-dimensional (plus batch dimension). The first input is the flattened output of the image model with dimension fusion_in; the second input is the flattened output of the text model, also with dimension fusion_in. The output should have dimension fusion_out. Only used for initialization; after initialization, the fusion method is accessible via the fusion_method property. A custom function is passed as a tuple (str, f), where str is the string representation of the function and f is the callable, e.g. ("mul", torch.mul). See the sketch after this parameter list.

    Default:

    torch.mul

  • drop_rate

    Dropout rate for timm models.

    Default:

    0.2

  • drop_path_rate

    Drop path rate for timm models.

    Default:

    0.2

  • fusion_dropout_rate

    Dropout rate inside all classification head layers.

    Default:

    0.25

  • fusion_hidden

    Number of neurons inside the hidden layer of the classification head.

    Default:

    256

  • fusion_in

    Input dimension to the fusion method.

    Default:

    512

  • fusion_out

    Output dimension of the fusion method. If None, output will be same as input (e.g. for point-wise operations).

    Default:

    None

  • hf_model_name

    Name of the text model from huggingface if applicable. The model has to be a model for text sequence classification.

    Default:

    None

  • image_size

    Size of input images for image models. Only applicable for some specific models.

    Default:

    120

  • load_pretrained_hf_if_available

    Load pretrained weights for huggingface model.

    Default:

    True

  • load_pretrained_timm_if_available

    Load pretrained weights for timm model.

    Default:

    False

  • max_sequence_length

    Maximum sequence length of huggingface models. Sequences that are shorter will be padded, longer ones are cropped to this maximum length.

    Default:

    32

  • network_type

    Type of ILM-network. Available types are listed in ILMType enum.

    Default:

    ILMType.IMAGE_CLASSIFICATION

  • t_dropout_rate

    Dropout rate of the mapping from the huggingface text model to the dimension of the fusion method.

    Default:

    0.25

  • timm_model_name – (required) Name of the image model as defined in timm.list_models()

  • use_pooler_output

    Use the pooler output of the huggingface model if applicable and available; otherwise, the last hidden features are flattened and used instead.

    Default:

    True

  • v_dropout_rate

    Dropout rate of the mapping from the timm image model to the dimension of the fusion method.

    Default:

    0.25

  • visual_features_out

    Output dimension of the timm image model. Dimension will be linearly mapped to fusion_in dimension with activation and dropout as specified.

    Default:

    512
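
A sketch of the custom fusion options described above, for a VQA setup; the concatenation fusion, its dimensions and the model names are illustrative assumptions:

    import torch
    import torch.nn as nn
    from configilm.ConfigILM import ILMConfiguration, ILMType

    # Custom fusion passed as a (name, callable) tuple: concatenation of the
    # two flattened feature vectors, so fusion_out = 2 * fusion_in.
    def cat_fusion(v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        return torch.cat((v, q), dim=-1)

    cfg = ILMConfiguration(
        timm_model_name="resnet18",                    # illustrative model names
        hf_model_name="prajjwal1/bert-tiny",
        network_type=ILMType.VQA_CLASSIFICATION,
        fusion_in=512,
        fusion_out=1024,                               # 2 * fusion_in for concatenation
        custom_fusion_method=("cat", cat_fusion),
        custom_fusion_activation=("relu", nn.ReLU()),  # (name, initialized nn.Module)
    )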

__init__(timm_model_name, hf_model_name=None, image_size=120, channels=3, classes=10, class_names=None, network_type=ILMType.IMAGE_CLASSIFICATION, visual_features_out=512, fusion_in=512, fusion_out=None, fusion_hidden=256, v_dropout_rate=0.25, t_dropout_rate=0.25, fusion_dropout_rate=0.25, _fusion_method='torch.mul', _fusion_activation='nn.Tanh()', drop_rate=0.2, drop_path_rate=None, use_pooler_output=True, max_sequence_length=32, load_pretrained_timm_if_available=False, load_pretrained_hf_if_available=True, custom_fusion_method=None, custom_fusion_activation=None)#
Parameters:
  • timm_model_name (str) –

  • hf_model_name (Optional[str]) –

  • image_size (int) –

  • channels (int) –

  • classes (int) –

  • class_names (Optional[Sequence[str]]) –

  • network_type (ILMType) –

  • visual_features_out (int) –

  • fusion_in (int) –

  • fusion_out (Optional[int]) –

  • fusion_hidden (int) –

  • v_dropout_rate (float) –

  • t_dropout_rate (float) –

  • fusion_dropout_rate (float) –

  • _fusion_method (str) –

  • _fusion_activation (str) –

  • drop_rate (Optional[float]) –

  • drop_path_rate (Optional[float]) –

  • use_pooler_output (bool) –

  • max_sequence_length (int) –

  • load_pretrained_timm_if_available (bool) –

  • load_pretrained_hf_if_available (bool) –

  • custom_fusion_method (Optional[Union[Callable[[Tensor, Tensor], Tensor], str, Tuple[str, Callable]]]) –

  • custom_fusion_activation (Optional[Union[Callable[[Tensor], Tensor], str, Tuple[str, Callable]]]) –

Return type:

None

as_dict()#

Returns the configuration as a dictionary.

channels: int = 3#
class_names: Optional[Sequence[str]] = None#
classes: int = 10#
custom_fusion_activation: Optional[Union[Callable[[Tensor], Tensor], str, Tuple[str, Callable]]] = None#
custom_fusion_method: Optional[Union[Callable[[Tensor, Tensor], Tensor], str, Tuple[str, Callable]]] = None#
dif(other)#

Compares two ILMConfigurations and returns a dictionary with the differences.

drop_path_rate: Optional[float] = None#
drop_rate: Optional[float] = 0.2#
classmethod from_json(json_string)#

Loads a configuration from a json string.

property fusion_activation: Callable[[Tensor], Tensor]#
fusion_dropout_rate: float = 0.25#
fusion_hidden: int = 256#
fusion_in: int = 512#
property fusion_method: Callable[[Tensor, Tensor], Tensor]#
fusion_out: Optional[int] = None#
hf_model_name: Optional[str] = None#
image_size: int = 120#
load_pretrained_hf_if_available: bool = True#
load_pretrained_timm_if_available: bool = False#
max_sequence_length: int = 32#
network_type: ILMType = ILMType.IMAGE_CLASSIFICATION#
t_dropout_rate: float = 0.25#
timm_model_name: str#
to_json()#

Returns the configuration as a json string; see the round-trip sketch after this listing.

use_pooler_output: bool = True#
v_dropout_rate: float = 0.25#
visual_features_out: int = 512#
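
A round-trip sketch of the serialization helpers and dif; the exact contents of the returned difference dictionary are an assumption:

    from configilm.ConfigILM import ILMConfiguration

    cfg = ILMConfiguration(timm_model_name="resnet18")

    # JSON round trip: the dataclass fields are json-serializable by design.
    restored = ILMConfiguration.from_json(cfg.to_json())

    # dif reports the fields that differ between two configurations.
    other = ILMConfiguration(timm_model_name="resnet18", classes=25)
    print(cfg.dif(restored))  # expected: no differences
    print(cfg.dif(other))     # expected: a difference in `classes`
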
class configilm.ConfigILM.ILMType#

Enum of the different architecture types supported by ILMConfiguration.

IMAGE_CLASSIFICATION = 0#
VQA_CLASSIFICATION = 1#