Accessibility & AI: Generating Alt Texts for Images with LLMs?
Gregor Sieber
May 27, 2024 · 6 min reading time

A comparison of currently available tools and methods by Niklas Großmann.

Background

With the upcoming Accessibility Act 2025, which is based on EU Directive 2019/882 (the "European Accessibility Act", EAA), the discussion around alternative texts for images is gaining importance. The law comes into force on June 28, 2025 and obliges website operators as well as companies in the media, education, and book sectors to make their digital content accessible to everyone.

What are Alternative Texts and Tags?

Alternative texts are short descriptions that explain the content of an image. They are used in HTML and other document formats and are shown when the image cannot be loaded, or read aloud when, for example, a screen reader presents the page to users with visual impairments. These texts should be brief and concise, yet informative enough to support users as well as possible. Tags, on the other hand, are keywords or phrases assigned to an image to categorize it and make it searchable in search engines or media archives.
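To illustrate where an alternative text ends up, here is a minimal Python sketch that writes a generated description into the alt attribute of an HTML img element; the file name and description are purely illustrative.

```python
import html

def img_tag_with_alt(src: str, alt_text: str) -> str:
    """Build an HTML <img> element whose alt attribute carries the image description."""
    # html.escape prevents quotes or angle brackets in the description from breaking the markup
    return f'<img src="{html.escape(src, quote=True)}" alt="{html.escape(alt_text, quote=True)}">'

# Illustrative values only
print(img_tag_with_alt(
    "reschensee.jpg",
    "An old church tower rises from a lake, surrounded by wooded hills.",
))
# -> <img src="reschensee.jpg" alt="An old church tower rises from a lake, surrounded by wooded hills.">
```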


Where is the Challenge for Organizations and Companies?

Generating such image descriptions, especially for large amounts of existing data, involves considerable effort, which is why many organizations are considering using AI to automate the process at least partially. At the same time, the quality requirements are high.

Analysis

Against this backdrop, we at EBCONT conducted a comprehensive comparison of various artificial intelligence solutions. We examined leading commercial large language models (LLMs) such as GPT-4 Vision from OpenAI and Claude 3 Sonnet from Anthropic, as well as open-source models such as LLaVA and out-of-the-box cloud services such as Azure Computer Vision and AWS Rekognition. All systems were tested with the same media assets in order to better understand the strengths and weaknesses of each model and to identify a suitable AI solution for specific use cases.
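As an illustration of the kind of invocation used in such a comparison, here is a minimal sketch of requesting an alternative text from GPT-4 Vision via the OpenAI Python SDK; the model name, prompt wording, and image URL are assumptions and not the exact configuration of our tests.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_alt_text(image_url: str) -> str:
    """Ask a vision-capable model for a short, screen-reader-friendly image description."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name; any vision-capable model works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in one or two sentences as an alternative text "
                         "for visually impaired users."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=120,
    )
    return response.choices[0].message.content

print(generate_alt_text("https://example.com/reschensee.jpg"))
```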


The data used varied in complexity. We analyzed images with and without text and with and without people, in various qualities, sizes, and formats, including photographs, drawn images, and generated content. Part of the evaluation was carried out on behalf of customers using their copyrighted data; below, we provide examples using freely available content.

Reschensee with the church tower of the old parish church of St. Katharina

Example 1: Photo

This image was described with varying degrees of precision by the different models. Alternative texts from Azure Computer Vision, such as "A tower in the water" and "Church tower in calm waters," are factually correct but not informative enough to describe the image in sufficient detail.

Open-source large language models like LLaVA generally describe images correctly but occasionally produce faulty details. For this image, which shows bright green trees, the incorrect detail "yellow bushes" was generated: "The image shows a small white tower with a red tip standing in a quiet lake. Behind the tower are mountains with green and yellow bushes."

Commercial large language models provided appropriate and concise alternative texts: "An old church tower with a pointed roof rises from a lake, surrounded by wooded hills and mountains under a partly cloudy sky."

The tags generated by the AI solutions, such as "bell tower," "lake," "nature," and "mountain," were all accurate.


This is a great day

Example 2: Photo with Text

The special feature of this image is that it contains text, which a good image description should include. Our analysis showed that out-of-the-box services and open-source models often struggle with optical character recognition (OCR), leading to alternative texts like "A light box with text on it" or "A bright light panel with the text TAKE HIS AY, 0 E T!". Commercial LLMs like GPT-4 and Claude 3, on the other hand, showed significantly fewer problems in this area and provided more accurate descriptions: "A light panel bears the inscription 'MAKE THIS DAY GREAT!' in colorful letters. The background is blurred and in bright tones."

For this image, the various AI solutions generated precise tags like "text," "electronics," "display panel," and "sign." However, some models also returned incorrect tags like "number" or "cell phone"; most of these came from open-source models. The commercial services Azure Computer Vision and AWS Rekognition provided not only the most accurate but also the fewest erroneous tags.
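For comparison, here is a minimal sketch of retrieving a caption and tags from Azure Computer Vision, assuming the azure-ai-vision-imageanalysis Python package; the endpoint, key, and image URL are placeholders, not values from our tests.

```python
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",  # placeholder
    credential=AzureKeyCredential("<your-key>"),                      # placeholder
)

# Request both a caption (usable as an alternative text) and tags in one call
result = client.analyze_from_url(
    "https://example.com/lightbox.jpg",
    visual_features=[VisualFeatures.CAPTION, VisualFeatures.TAGS],
)

if result.caption is not None:
    print("Caption:", result.caption.text, f"({result.caption.confidence:.2f})")
if result.tags is not None:
    for tag in result.tags.list:
        print("Tag:", tag.name, f"({tag.confidence:.2f})")
```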


A caricature of a boy and a girl

Example 3: Image with Text

Images like this one, with increased complexity, led to less precise and sometimes incorrect results from the tested AI solutions. Out-of-the-box cloud services in particular tended to over-generalize, producing inadequate descriptions such as "A cartoon of a boy and a girl." The best results for complex images were achieved by LLMs. Although the open-source models delivered quite good results here, this example again showed that commercial LLMs provide the best results: "A cartoon of two children with speech bubbles showing various ways to factorize the number 990. One child seems to be explaining or teaching, while the speech bubbles show various factorization equations."

Despite the increased complexity of the images, the AI solutions still returned appropriate tags such as "cartoon" and "illustration."


Results

The analysis of this comprehensive comparison showed that artificial intelligence is a suitable tool for creating alternative texts and tags. At the same time, significant differences in the accuracy of the various AI models emerged, especially in the alternative texts. Some systems handled photographic content particularly well, while others were more effective at processing drawn or generated images. The decisive factors, however, were the complexity of the images and whether they contained text.

Simple images were mostly described correctly by all tested AI solutions, albeit with varying levels of detail. For images with text or more complex content, the study showed that only generative large language models (LLMs) could provide adequate alternative texts, with commercial LLMs such as GPT-4 Vision and Claude 3 Sonnet achieving the best results.

It is important to mention that all generative LLMs occasionally produced inaccurate or faulty alternative texts. However, the study showed that this occurred significantly less frequently with commercial LLMs than with open-source models.

Another aspect of LLMs is that they sometimes fail to generate an alternative text at all. This happens when one of the content filters is triggered, which blocks the creation of an alternative text. GPT-4, for example, has filters for "hate," "self-harm," "sexual content," and "violence." Our study found that these filters are sometimes configured too sensitively: for an image of a woman sleeping in bed, the "sexual content" filter was mistakenly triggered and no alternative text was generated.
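In practice, a processing pipeline should treat a blocked request as a distinct outcome rather than a failure. The following sketch, assuming the OpenAI/Azure OpenAI chat completions API and the hypothetical prompt from the earlier example, checks the finish reason and flags the asset for manual review when the content filter fires; depending on the service, a blocked prompt may instead raise an exception.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def alt_text_or_review_flag(image_url: str) -> dict:
    """Return the generated alt text, or mark the asset for manual review if a filter fired."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image as a short alternative text."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=120,
    )
    choice = response.choices[0]
    if choice.finish_reason == "content_filter":
        # The output was blocked by a content filter -> hand the image to a human editor
        return {"image": image_url, "alt_text": None, "needs_manual_review": True}
    return {"image": image_url, "alt_text": choice.message.content, "needs_manual_review": False}
```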

Tag generation produced satisfactory results for most images across the various AI models. For more complex images, Azure Computer Vision proved to be the best solution for our media assets, as it had the lowest error rate.

Conclusion

The results of the investigation make it clear that using AI to generate alternative texts and tags is a very sensible approach. With the right models, correct and meaningful results can be generated automatically. However, it must be emphasized that no model delivers error-free results, which is why manual review of the output remains necessary. Despite these limitations, machine learning models offer an efficient and useful way to comply with the Accessibility Act.

This conclusion aligns with the view of the German Publishers and Booksellers Association, which is also convinced that AI is a suitable solution for complying with the Accessibility Act (Source: https://www.boersenverein.de/beratung-service/barrierefreiheit/faq/#accordion--42).


Architectures for Using AI for Alternative Texts and Tags in an Enterprise Context

Given the rapid development of multimodal generative AI models and, as the test cases show, the fact that different models are suited to different scenarios, it is advisable to make the invocation of a specific AI model interchangeable. In practice this means setting up a processing infrastructure for your own data in which the model to be used can be selected via configuration or swapped in as a module. Typically, generic pre- and post-processing of the data is necessary (e.g., reading the assets, format conversion, writing metadata back); in addition, model-specific processing steps may be required. Because the models evolve quickly, the infrastructure should be built so that tests can be run easily and with a high degree of automation. It is also important to ensure that manual review by humans can be carried out with minimal effort.
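As a rough illustration of such an interchangeable setup, here is a minimal Python sketch: each model sits behind a common interface, and which adapter runs is decided by configuration. All class and configuration names are illustrative assumptions, not part of an existing framework.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ImageAnnotation:
    alt_text: str | None
    tags: list[str]
    needs_manual_review: bool = False


class AltTextModel(Protocol):
    """Common interface every model adapter has to implement."""
    def annotate(self, image_path: str) -> ImageAnnotation: ...


class Gpt4VisionAdapter:
    def annotate(self, image_path: str) -> ImageAnnotation:
        # the actual call to the commercial LLM (see earlier sketch) would go here
        raise NotImplementedError


class AzureVisionAdapter:
    def annotate(self, image_path: str) -> ImageAnnotation:
        # the call to Azure Computer Vision and mapping of caption/tags would go here
        raise NotImplementedError


# The active model is chosen via configuration, so new adapters can be added without
# touching the generic pipeline.
ADAPTERS: dict[str, type] = {
    "gpt-4-vision": Gpt4VisionAdapter,
    "azure-computer-vision": AzureVisionAdapter,
}


def process(image_path: str, config: dict) -> ImageAnnotation:
    """Generic pipeline: pre-process, run the configured model, post-process."""
    model: AltTextModel = ADAPTERS[config["model"]]()
    # generic pre-processing (loading, format conversion) would happen here
    annotation = model.annotate(image_path)
    # generic post-processing (writing metadata back, queueing manual review) would happen here
    return annotation
```

A setup like this keeps model-specific code isolated in the adapters, which also makes it straightforward to re-run the same test assets against a new model and compare results automatically.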