Background
With the upcoming Accessibility Act 2025, which implements Directive (EU) 2019/882, the "European Accessibility Act" (EAA), the discussion around alternative texts for images is gaining importance. The law comes into force on June 28, 2025, and obliges website operators as well as companies in the media, education, and book sectors to make their digital content accessible to everyone.
Where is the Challenge for Organizations and Companies?
Generating such image descriptions, especially for large amounts of existing data, involves considerable effort, prompting many organizations to consider using AI to automate the process at least partially. At the same time, the quality requirements are high.
Analysis
Against this backdrop, we at EBCONT conducted a comprehensive comparison of various artificial intelligence solutions. We examined leading commercial large language models (LLMs) such as GPT-4 Vision by OpenAI and Claude 3 Sonnet by Anthropic, as well as open-source models like LLaVA and out-of-the-box cloud services such as Azure Computer Vision and AWS Rekognition. All systems were tested with the same media assets to better understand the strengths and weaknesses of each model and to identify a suitable AI solution for specific use cases.
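To keep such a comparison fair, every model has to receive the same images with the same instructions. The sketch below shows one way this could be organized; the function and provider names are illustrative, not the actual EBCONT test harness, and real providers would be wired in as API clients behind the callables.

```python
from typing import Callable, Dict, List

# One provider-neutral instruction, so result differences reflect
# the models rather than differing prompts.
PROMPT = (
    "Describe this image as concise alt text (max 250 characters). "
    "Include any legible text verbatim; do not guess at details you cannot see."
)

def compare_providers(
    image_paths: List[str],
    providers: Dict[str, Callable[[str, str], str]],
) -> Dict[str, Dict[str, str]]:
    """Send the same images with the same prompt to every provider.

    Each provider is a callable taking (image_path, prompt) and returning
    an alternative text, e.g. a thin wrapper around a cloud API client.
    """
    results: Dict[str, Dict[str, str]] = {}
    for name, describe in providers.items():
        results[name] = {path: describe(path, PROMPT) for path in image_paths}
    return results
```

With a stub provider this runs without any network access, which also makes the harness easy to test before real API clients are plugged in.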
The data used had different levels of complexity. We analyzed images with and without text and people in various qualities, sizes, and formats, including photographs, drawn images, and generated content. The evaluation was partly carried out on behalf of customers using copyrighted data from EBCONT customers. Below, we provide examples using freely available content.
Example 1: Photo
This image was described with varying precision by the different models. Alternative texts from Azure Computer Vision, such as "A tower in the water" and "Church tower in calm waters," are factually correct but not informative enough to describe the image in sufficient detail.
Open-source large language models like LLaVA generally describe images correctly but occasionally produce faulty descriptions. For example, for an image showing bright green trees, the incorrect description "yellow bushes" was generated: "The image shows a small white tower with a red tip standing in a quiet lake. Behind the tower are mountains with green and yellow bushes."
Alternative texts from commercial large-language models provided appropriate and concise results: "An old church tower with a pointed roof rises from a lake, surrounded by wooded hills and mountains under a partly cloudy sky."
The tags generated by the AI solutions, such as "bell tower," "lake," "nature," and "mountain," were all accurate.
Example 2: Photo with Text
What distinguishes this image is that it contains text, which a good image description should include. Our analysis showed that out-of-the-box services and open-source models often struggle with Optical Character Recognition (OCR), leading to alternative texts like "A light box with text on it" or "A bright light panel with the text TAKE HIS AY, 0 E T!". Commercial LLMs like GPT-4 and Claude 3, on the other hand, showed significantly fewer problems in this area, providing more accurate descriptions: "A light panel bears the inscription 'MAKE THIS DAY GREAT!' in colorful letters. The background is blurred and in bright tones."
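OCR failures of this kind can be flagged automatically: if a separate OCR pass (or a human-provided transcript) yields the words actually visible in the image, one can check what fraction of them reappear in the generated alt text. This is a minimal sketch of such a check; the function name and scoring are our own illustration, not part of any provider's API.

```python
import re

def ocr_coverage(alt_text: str, ocr_tokens: list) -> float:
    """Fraction of OCR-detected words that reappear in the alt text.

    A low score flags descriptions that ignored or garbled the text
    in the image and should be routed to manual review.
    """
    if not ocr_tokens:
        return 1.0  # no text in the image, nothing to cover
    alt_words = set(re.findall(r"\w+", alt_text.lower()))
    hits = sum(1 for token in ocr_tokens if token.lower() in alt_words)
    return hits / len(ocr_tokens)
```

For the example above, the garbled "TAKE HIS AY" description scores 0.0 against the tokens MAKE/THIS/DAY/GREAT, while the commercial LLM's description scores 1.0.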
For this image, precise tags like "text," "electronics," "display panel," and "sign" were created by the various AI solutions. However, some models also returned incorrect tags like "number" or "cell phone," most of which were generated by open-source models. Commercial services like Azure Computer Vision and AWS Rekognition provided not only the most accurate but also the fewest erroneous tags.
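Because the erroneous tags differed between models, one practical mitigation is to cross-check the providers against each other and keep only tags that several of them report with sufficient confidence. The following is a sketch of that idea under our own assumptions (the function name, thresholds, and input shape are illustrative; real tag responses would first be normalized into (tag, confidence) pairs).

```python
from collections import defaultdict

def consensus_tags(provider_tags: dict, min_conf: float = 0.6, min_votes: int = 2) -> list:
    """Keep a tag only if enough providers report it confidently.

    provider_tags maps a provider name to a list of (tag, confidence)
    pairs. Requiring agreement between providers suppresses one-off
    errors such as "cell phone" or "number".
    """
    votes = defaultdict(int)
    for tags in provider_tags.values():
        for tag, confidence in tags:
            if confidence >= min_conf:
                votes[tag.lower()] += 1
    return sorted(tag for tag, count in votes.items() if count >= min_votes)
```

The trade-off is recall: a correct tag seen by only one provider is dropped as well, so the voting threshold should be tuned against the media assets at hand.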
Example 3: Image with Text
Images like this one, with increased complexity, led to less precise and sometimes incorrect results across the tested AI solutions. Out-of-the-box cloud services in particular tended to over-generalize, producing inadequate descriptions such as "A cartoon of a boy and a girl". The best results for complex images were achieved by LLMs. While the open-source models delivered quite good results here, this example again showed that commercial LLMs produce the best output: "A cartoon of two children with speech bubbles showing various ways to factorize the number 990. One child seems to be explaining or teaching, while the speech bubbles show various factorization equations."
Despite the increased complexity of the images, the AI solutions still returned appropriate tags such as "cartoon" and "illustration."
Results
The analysis of this comprehensive comparison showed that artificial intelligence is a suitable solution for creating alternative texts and tags. Significant differences in the accuracy of various AI models emerged, especially in the alternative texts. Some systems were particularly good at handling photographic content, while others were more efficient at processing drawn or generated images. However, the key factors were the complexity and text content of the images.
Simple images were mostly described correctly by all tested AI solutions, albeit with varying levels of detail. For images with text or more complex content, the study showed that only generative large language models could provide adequate alternative texts. It concluded that commercial LLMs such as GPT-4 Vision and Claude 3 Sonnet achieved the best results.
It is important to mention that all generative LLMs occasionally produced inaccurate or faulty alternative texts. However, the study showed that this occurred significantly less frequently with commercial LLMs than with open-source models.
Another limitation of LLMs is that they sometimes fail to generate an alternative text at all. This happens when one of the content filters is triggered, which blocks the generation. GPT-4, for example, has filters for "hate," "self-harm," "sexuality," and "violence." The study found that these filters are sometimes configured too sensitively: for an image of a woman sleeping in bed, the "sexuality" filter was mistakenly triggered and no alternative text was generated.
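In an automated pipeline, such blocked requests need to be caught rather than silently producing empty alt texts. Assuming a chat-completion-style response in which a blocked generation is signalled via a finish reason of "content_filter" (as e.g. Azure OpenAI reports it; other providers may differ), a sketch of the handling could look like this:

```python
def alt_text_or_fallback(response: dict, fallback: str = "") -> tuple:
    """Extract the alt text from a chat-completion-style response dict.

    If the provider's content filter blocked generation, return the
    fallback text and a needs_review flag so the image is queued for
    manual description instead of being published without alt text.
    """
    choice = response["choices"][0]
    if choice.get("finish_reason") == "content_filter":
        return fallback, True  # needs_review = True
    return choice["message"]["content"], False
```

Routing the flagged images to a human reviewer keeps the filter's false positives, like the sleeping-woman example, from leaving content inaccessible.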
The generation of tags produced satisfactory results for most images through various AI models. For higher complexity, Azure Computer Vision proved to be the best solution for our media assets, as the error rate was the lowest.
Conclusion
The results of the investigation make clear that using AI to generate alternative texts and tags is a sensible approach: with the right models, correct and meaningful results can be generated automatically. However, no model delivers error-free results, which is why manual review of the output remains necessary. Despite this limitation, AI models offer an efficient and useful way to comply with the Accessibility Act.
This conclusion aligns with the view of the German Publishers and Booksellers Association, which is also convinced that AI is a suitable solution for complying with the Accessibility Act (Source: https://www.boersenverein.de/beratung-service/barrierefreiheit/faq/#accordion--42).