ILMConfiguration#
Framework for combining vision and language models
- class configilm.ConfigILM.ConfigILM#
- __init__(config)#
Creates a ConfigILM model according to the provided ILMConfiguration
- Parameters:
config (ILMConfiguration) – Configuration of the model. See ILMConfiguration for details
- Returns:
self
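A minimal construction sketch; the ILMConfiguration fields are taken from the dataclass documented below, while the import path for ILMType is an assumption:

```python
from configilm.ConfigILM import ConfigILM, ILMConfiguration, ILMType  # ILMType path assumed

# Image-only classifier: only the timm image model is configured.
config = ILMConfiguration(
    timm_model_name="resnet18",   # any name from timm.list_models()
    image_size=120,
    channels=3,
    classes=10,
    network_type=ILMType.IMAGE_CLASSIFICATION,
)
model = ConfigILM(config)
```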
- forward(batch)#
Model forward function. Decides which parts of the model to use based on the configuration, after checking that the input is compatible with this kind of network.
- Parameters:
batch – Input batch of a single modality or a list of batches of multiple modalities
- Note: The text input of the VQA model will be automatically masked based on the padding tokens of the tokenizer.
- Returns:
logits of the network
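A sketch of both call patterns. The exact batch layout (a single image tensor for image classification; a list of image batch and token ids, in that order, for VQA) is an assumption based on the parameter description above, and vqa_model is a hypothetical instance configured with a hf_model_name and ILMType.VQA_CLASSIFICATION:

```python
import torch

# IMAGE_CLASSIFICATION: a single-modality batch of shape [B, C, H, W];
# `model` is the instance from the construction sketch above.
images = torch.rand(4, 3, 120, 120)
logits = model(images)                    # -> [4, classes]

# VQA_CLASSIFICATION: a list of batches, one per modality (order assumed).
# `vqa_model` is a hypothetical VQA-configured ConfigILM instance.
# Padding tokens in the text batch are masked automatically, as noted above.
token_ids = torch.ones(4, 32, dtype=torch.long)  # [B, max_sequence_length]
logits = vqa_model([images, token_ids])   # -> [4, classes]
```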
- get_tokenizer()#
Returns the tokenizer of the text model, if applicable.
- Returns:
Tokenizer of the specified huggingface text model.
- Raises:
AttributeError – if no text model is used
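A sketch of preparing VQA text input with the returned tokenizer, assuming it behaves like a standard huggingface tokenizer (the keyword arguments shown are plain huggingface API, not ConfigILM-specific):

```python
tokenizer = model.get_tokenizer()   # raises AttributeError without a text model
encoded = tokenizer(
    "What is shown in the image?",
    padding="max_length",
    max_length=32,                  # should match max_sequence_length
    truncation=True,
    return_tensors="pt",
)
token_ids = encoded["input_ids"]    # usable as the text part of a VQA batch
```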
- to(*args, **kwargs)#
Moves the model parts as well as the fusion method and activation to a device, or casts them to a different type.
- Parameters:
args – device, dtype, or other accepted formats. See torch.nn.Module.to()
kwargs – device, dtype, or other accepted formats. See torch.nn.Module.to()
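For example, with standard torch.nn.Module.to() semantics:

```python
import torch

# Moves image model, text model, fusion method, and activation together.
model = model.to(torch.device("cuda:0"))

# Or cast all parts to half precision:
model = model.to(torch.float16)
```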
- class configilm.ConfigILM.ILMConfiguration#
Configuration dataclass that defines all properties of ConfigILM models. The datatypes within the dataclass are selected to be compatible with JSON serialization and deserialization.
- Parameters:
channels –
Number of input channels for the image model.
- Default:
3
class_names –
Names of the classes in the classifier. Used for class-specific performance reporting. If None, classes are enumerated with numbers.
- Default:
None
classes –
Number of classes for the output of IMAGE_CLASSIFICATION classifier or VQA_CLASSIFICATION classifier.
- Default:
10
custom_fusion_activation –
Activation function inside all classification head layers. Only used during initialization; afterwards, the activation is accessible via the fusion_activation property. Standard options from torch.nn or torch.nn.functional can be passed as strings (e.g. "nn.Tanh()") or as callables (e.g. nn.Tanh()). A custom function has to be a callable with a single tensor input and a single tensor output, where the tensor is one-dimensional (plus batch dimension). Such a custom activation is passed as a tuple (str, m), where str is the string representation of the function and m is the callable, which has to be a torch.nn.Module initialized with the correct parameters, e.g. ("tanh", nn.Tanh()). A sketch follows the parameter list below.
- Default:
nn.Tanh()
custom_fusion_method –
Fusion method that combines text and image features. Callable with two tensor inputs and a single tensor output, where each tensor is one-dimensional (plus batch dimension). The first input is the flattened output of the image model with dimension fusion_in; the second input is the flattened output of the text model, also with dimension fusion_in. The output should have dimension fusion_out. Only used during initialization; afterwards, the fusion method is accessible via the fusion_method property. A custom function is passed as a tuple (str, f), where str is the string representation of the function and f is the callable, e.g. ("mul", torch.mul). See the sketch following this parameter list.
- Default:
torch.mul
drop_rate –
Dropout rate for timm models.
- Default:
0.2
drop_path_rate –
Drop path rate for timm models.
- Default:
None
fusion_dropout_rate –
Drop rate inside all classification head layers.
- Default:
0.25
fusion_hidden –
Number of neurons inside the hidden layer of the classification head.
- Default:
256
fusion_in –
Input dimension to the fusion method.
- Default:
512
fusion_out –
Output dimension of the fusion method. If None, the output dimension equals the input dimension (e.g. for point-wise operations).
- Default:
None
hf_model_name –
Name of the text model from huggingface if applicable. The model has to be a model for text sequence classification.
- Default:
None
image_size –
Size of input images for image models. Only applicable for some specific models.
- Default:
120
load_pretrained_hf_if_available –
Load pretrained weights for huggingface model.
- Default:
True
load_pretrained_timm_if_available –
Load pretrained weights for timm model.
- Default:
False
max_sequence_length –
Maximum sequence length of huggingface models. Sequences that are shorter will be padded, longer ones are cropped to this maximum length.
- Default:
32
network_type –
Type of ILM-network. Available types are listed in the ILMType enum.
- Default:
ILMType.IMAGE_CLASSIFICATION
t_dropout_rate –
Dropout rate of the mapping from the huggingface text model to the dimension of the fusion method.
- Default:
0.25
timm_model_name –
(required) Name of the image model as defined in timm.list_models()
use_pooler_output –
Use the pooler output of the huggingface model if applicable and available. Otherwise, the last hidden features are flattened and used instead.
- Default:
True
v_dropout_rate –
Dropout rate of the mapping from the timm image model to the dimension of the fusion method.
- Default:
0.25
visual_features_out –
Output dimension of the timm image model. Dimension will be linearly mapped to fusion_in dimension with activation and dropout as specified.
- Default:
512
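As referenced in the custom_fusion_method and custom_fusion_activation descriptions above, both accept a (str, callable) tuple. A minimal sketch; the concatenation-based fusion and the huggingface model name are purely illustrative:

```python
import torch
from torch import nn
from configilm.ConfigILM import ILMConfiguration, ILMType  # ILMType path assumed

def concat_fusion(v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # Inputs are [B, fusion_in] each; output is [B, 2 * fusion_in],
    # so fusion_out has to be set accordingly.
    return torch.cat([v, t], dim=1)

config = ILMConfiguration(
    timm_model_name="resnet18",
    hf_model_name="prajjwal1/bert-tiny",    # illustrative text model
    network_type=ILMType.VQA_CLASSIFICATION,
    fusion_in=512,
    fusion_out=1024,                         # 2 * fusion_in after concatenation
    custom_fusion_method=("concat", concat_fusion),
    custom_fusion_activation=("gelu", nn.GELU()),
)
```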
- __init__(timm_model_name, hf_model_name=None, image_size=120, channels=3, classes=10, class_names=None, network_type=ILMType.IMAGE_CLASSIFICATION, visual_features_out=512, fusion_in=512, fusion_out=None, fusion_hidden=256, v_dropout_rate=0.25, t_dropout_rate=0.25, fusion_dropout_rate=0.25, _fusion_method='torch.mul', _fusion_activation='nn.Tanh()', drop_rate=0.2, drop_path_rate=None, use_pooler_output=True, max_sequence_length=32, load_pretrained_timm_if_available=False, load_pretrained_hf_if_available=True, custom_fusion_method=None, custom_fusion_activation=None)#
- Parameters:
timm_model_name (str) –
hf_model_name (Optional[str]) –
image_size (int) –
channels (int) –
classes (int) –
class_names (Optional[Sequence[str]]) –
network_type (ILMType) –
visual_features_out (int) –
fusion_in (int) –
fusion_out (Optional[int]) –
fusion_hidden (int) –
v_dropout_rate (float) –
t_dropout_rate (float) –
fusion_dropout_rate (float) –
_fusion_method (str) –
_fusion_activation (str) –
drop_rate (Optional[float]) –
drop_path_rate (Optional[float]) –
use_pooler_output (bool) –
max_sequence_length (int) –
load_pretrained_timm_if_available (bool) –
load_pretrained_hf_if_available (bool) –
custom_fusion_method (Optional[Union[Callable[[Tensor, Tensor], Tensor], str, Tuple[str, Callable]]]) –
custom_fusion_activation (Optional[Union[Callable[[Tensor], Tensor], str, Tuple[str, Callable]]]) –
- Return type:
None
- as_dict()#
Returns the configuration as a dictionary.
- channels: int = 3#
- class_names: Optional[Sequence[str]] = None#
- classes: int = 10#
- custom_fusion_activation: Optional[Union[Callable[[Tensor], Tensor], str, Tuple[str, Callable]]] = None#
- custom_fusion_method: Optional[Union[Callable[[Tensor, Tensor], Tensor], str, Tuple[str, Callable]]] = None#
- dif(other)#
Compares two ILMConfigurations and returns a dictionary with the differences.
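A sketch; the exact layout of the returned difference dictionary is an assumption:

```python
from configilm.ConfigILM import ILMConfiguration

cfg_a = ILMConfiguration(timm_model_name="resnet18")
cfg_b = ILMConfiguration(timm_model_name="resnet34", classes=25)

# Expected to report only the fields that differ here,
# i.e. timm_model_name and classes.
print(cfg_a.dif(cfg_b))
```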
- drop_path_rate: Optional[float] = None#
- drop_rate: Optional[float] = 0.2#
- classmethod from_json(json_string)#
Loads a configuration from a JSON string.
- property fusion_activation: Callable[[Tensor], Tensor]#
- fusion_dropout_rate: float = 0.25#
- fusion_in: int = 512#
- property fusion_method: Callable[[Tensor, Tensor], Tensor]#
- fusion_out: Optional[int] = None#
- hf_model_name: Optional[str] = None#
- image_size: int = 120#
- load_pretrained_hf_if_available: bool = True#
- load_pretrained_timm_if_available: bool = False#
- max_sequence_length: int = 32#
- t_dropout_rate: float = 0.25#
- timm_model_name: str#
- to_json()#
Returns the configuration as a JSON string.
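Together with from_json() above, this gives a serialization round trip, sketched here:

```python
from configilm.ConfigILM import ILMConfiguration

cfg = ILMConfiguration(timm_model_name="resnet18", classes=5)

json_string = cfg.to_json()                          # serialize
restored = ILMConfiguration.from_json(json_string)   # deserialize

# Dataclass equality; expected to hold as long as all fields are
# JSON-serializable (see the class docstring above).
assert restored == cfg
```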
- use_pooler_output: bool = True#
- v_dropout_rate: float = 0.25#
- visual_features_out: int = 512#