Some people keep insisting that AI (which is a term I try to avoid, here I am calling them VLMs) cannot be good sources of alt text. My belief is that they can, even if it’s not as good as a human written one, it can be made to be better than one that’s never written, or form a good base for someone to edit. Both of those would be a net positive. I personally want to write more alt text but often don’t, and felt that something like this could be a good base for me to just read and confirm, or edit and push forward.

I tested this first on witchsky.app where I wasn’t impressed (edit: I believe they adjusted to the prompt I suggested), the result was too distracted and too verbose, I suspected this was a prompt issue so I added a feature to it (not upstreamed, but an open issue) to change the prompt.

My belief was that VLM generated alt text so far been bad not because it has to be bad, but because there’s not been enough thought put into it. So I did a wee bit of experimentation with vision-language models, some iterative tweaking, and (spoiler alert) I think it came out pretty good for an hour’s worth of effort. This isn’t meant as a “hey it’s perfect”, but rather as a “hey, there’s promise here as accessibility tech”.

(I also have the aim of using open weight models for this experiment, both for cost, and as I believe that this can be done fully offline in the future.)

FYI

I have providers that may train on inputs excluded on openrouter (you can set that here), and my training settings are off on claude as well (you can set that here). The sample data is not going to be trained on solely because of this experiment.

ground truth collection

I had claude code write a script to collect images with alt text from the bluesky firehose (a websocket that just gives you all the events, basically), with a filter on very short alt text, nsfw posts, selfies and post language english.

I let that script collect ~120 pictures, then deleted ~60 of them that had bad alt text, and was left with 68 pictures. Some that I removed were hyper-specific (required knowing character names), most had bad alt text (spam or just “a post of x” type thing), one was untagged porn.

prompt and model tweaking

I gave claude the list of most popular image-input models on openrouter, told it to exclude free models and non-open weight ones (especially openai/anthropic/google/xai).

It tested mistralai/mistral-small-2603 (mistral small 4), qwen/qwen3.5-122b-a10b, qwen/qwen3.5-27b, moonshotai/kimi-k2.5, bytedance-seed/seed-2.0-lite and qwen/qwen3.5-35b-a3b. I also had it test gemini-2.5-flash-lite as that’s what witchsky.app uses. Later I also did tests with google/gemma-4-31b-it and google/gemma-4-26b-a4b-it as well.

I contemplated on using an LLM judge vs using more traditional metrics, but as we’re working with fuzzy values, I stuck to an LLM judge.

For the LLM judge I picked mistralai/mistral-small-2603 for no reason other than “I like mistral”. Metrics to rate were Accuracy, Completeness, Conciseness, Accessibility, rated out of 5.

Overall, qwen-122b and 27b were pretty close, but 122b did better and was almost identical in cost on openrouter, so we proceeded with that (I suspect same prompt on a locally hosted 27b would be also pretty good). We then tweaked the prompt to perform the best against the weaker cases, and that got us to a prompt that strikes a very good balance of descriptiveness, minimal hallucinations and length. Aim was to get good results for both regular pictures and for pictures of text, and to minimize hallucinations.

My suggested model and prompt

My current recommendation is qwen/qwen3.5-122b-a10b with a prompt of “Write alt text for this image. Be concise — 1-2 sentences for simple images. If the image contains readable text, transcribe it rather than describing it. Only describe what you can clearly see; do not guess at names or details.”.

That got a score of Accuracy: 4.03, Completeness: 3.99, Conciseness: 4.18, Accessibility: 3.99, for an overall score of 4.04 (out of 5). All that experimentation cost me $3.359 of API fees on openrouter (that’s 3 dollars, not 3359 dollars).

Each alt text I generate costs me roughly $0.0024, so $5 of credits pay for roughly 2083 alt texts. That’s a lot!

Due to the nature of open weight models, you could also run this model locally on a server or a computer. As open weight models get better, we might be able to do this fully locally on the device as well, which is pretty exciting. Being able to do this on-device fully offline would be great.

result examples

Overall I’m pretty happy with it. I think each example I’ve seen from it so far would only require minimal tweaking to be useful. I do genuinely think that if bluesky attached these by default, it’d be significantly better than nothing.

These screenshots are from cope.works, a friend’s bluesky client where we also added support for auto-generated alt text. I believe witchsky.app is also adapting to my suggested prompt now. FWIW, here’s what the UI for it looks like within the alt text view:

bluesky alt text entry screen, but above save button is a button saying "Generate alt text with AI"

As a note, I’m trying to intersperse the good and bad cases within the results here. No point in overselling you something, I’m moreso interested in showing what works and what doesn’t.

memes

These are generally fine, but generally subjects aren’t named. For really complex ones, it may mix up details.

screenshots and diagrams

These are pretty good. It can miss points of diagrams or be unsure what to focus on with screenshots when there’s too much happening.

irl pics

Bit too focused on text. Otherwise pretty good!

video games

Same as irl pics, kind of too focused on text, but otherwise pretty good! All following screenshots are from Alan Wake 2.