This site is in BETA status and an active work in progress!

The ConfigILM library is a tool for Python developers who want to rapidly and iteratively develop image and language models within the PyTorch framework. This open-source library provides a convenient implementation for seamlessly combining models from two of the most popular PyTorch libraries, timm and huggingface🤗. With nearly 1,000 image models, over 100 language models, and an additional 120,000 community-uploaded models in the huggingface🤗 model collection, ConfigILM offers a diverse range of model combinations that require minimal implementation effort. This breadth makes it a versatile resource for developers seeking to create innovative image-language models with ease.
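To illustrate the underlying idea, the sketch below shows the general pattern of an image-language model: a vision backbone and a text encoder are wrapped behind one interface and their features are fused. This is a conceptual sketch only, not the ConfigILM API; every class here is an invented stand-in (e.g. for a timm backbone or a huggingface🤗 encoder).

```python
# Conceptual sketch of the image-language fusion pattern that a library
# like ConfigILM automates. None of these classes belong to ConfigILM;
# they are illustrative stand-ins.

class ImageEncoderStub:
    """Stand-in for a vision backbone (e.g. a timm model)."""
    def __init__(self, out_dim):
        self.out_dim = out_dim

    def __call__(self, image):
        # A real backbone returns learned features; here we just emit a
        # fixed-size vector derived from the input pixels.
        return [float(sum(image)) / len(image)] * self.out_dim


class TextEncoderStub:
    """Stand-in for a language model (e.g. a huggingface encoder)."""
    def __init__(self, out_dim):
        self.out_dim = out_dim

    def __call__(self, tokens):
        return [float(len(tokens))] * self.out_dim


class ImageLanguageModel:
    """Wraps both encoders and fuses their features by concatenation."""
    def __init__(self, image_encoder, text_encoder):
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder

    def __call__(self, image, tokens):
        img_feat = self.image_encoder(image)
        txt_feat = self.text_encoder(tokens)
        # Fusion: simple concatenation here; real models may instead use
        # attention or element-wise combination.
        return img_feat + txt_feat


model = ImageLanguageModel(ImageEncoderStub(4), TextEncoderStub(4))
fused = model([0.1, 0.2, 0.3], ["what", "is", "this", "?"])
print(len(fused))  # 4 image + 4 text = 8 fused feature dimensions
```

Swapping in a different backbone then only means replacing one of the encoder objects, which is exactly the kind of exchange ConfigILM makes configurable.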

Furthermore, ConfigILM provides a user-friendly interface for exchanging model components, opening up many possibilities for creating novel models. The package also offers pre-built, throughput-optimized PyTorch DataLoaders and Lightning DataModules, which let developers test their models in diverse application areas such as Remote Sensing (RS). Finally, the comprehensive documentation includes installation instructions, tutorial examples, and a detailed overview of the framework's interface, ensuring a smooth and hassle-free development experience.
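The pre-built datamodules follow the usual Lightning convention of bundling dataset setup with dataloader creation. The following is a minimal plain-Python sketch of that pattern (the class and method names mirror the Lightning convention, not ConfigILM's actual datamodules, and the toy data is invented):

```python
# Minimal sketch of the Lightning-style datamodule pattern, with no
# lightning dependency. ConfigILM's actual datamodules are pre-built for
# specific datasets; this only shows the interface shape.

class ToyDataModule:
    def __init__(self, batch_size=2):
        self.batch_size = batch_size
        self.train_data = None

    def setup(self):
        # A real datamodule would download/load and split the dataset
        # here (e.g. an RS visual question answering dataset).
        self.train_data = list(range(10))

    def train_dataloader(self):
        # Yield batches; a real implementation returns a
        # torch.utils.data.DataLoader with worker count and memory
        # pinning tuned for throughput.
        data = self.train_data
        for i in range(0, len(data), self.batch_size):
            yield data[i : i + self.batch_size]


dm = ToyDataModule(batch_size=4)
dm.setup()
batches = list(dm.train_dataloader())
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Keeping dataset handling behind this interface is what allows the same model code to be tested against different application areas without modification.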

This documentation describes the installation of the ConfigILM framework on the next page; the individual components are then explored and exemplified in detail on the following pages. Should any issues arise, users are encouraged to visit the project's dedicated GitHub page, where they can receive assistance from the community of users and developers.

Feature requests can also be submitted via the GitHub page.

If you use this work, please cite:

  @article{hackel2024configilm,
    title={ConfigILM: A general purpose configurable library for combining image and language models for visual question answering},
    author={Hackel, Leonard and Clasen, Kai Norman and Demir, Beg{\"u}m}
  }

and the version of the software you used, e.g., the current version with

  @software{configilm_v0_6_4,
    author       = {lhackel-tub and
                    Kai Norman Clasen},
    title        = {lhackel-tub/ConfigILM: v0.6.4},
    month        = jun,
    year         = 2024,
    publisher    = {Zenodo},
    version      = {v0.6.4},
    doi          = {10.5281/zenodo.12095668},
    url          = {}
  }

This work is supported by the European Research Council (ERC) through the ERC-2017-STG BigEarth Project under Grant 759764 and by the European Space Agency through the DA4DTE (Demonstrator precursor Digital Assistant interface for Digital Twin Earth) project and by the German Ministry for Economic Affairs and Climate Action through the AI-Cube Project under Grant 50EE2012B.