Diffusion Local Time: Stable Diffusion running on a Raspberry Pi as an AI Art Timepiece
What is this
Diffusion Local Time is a timepiece with a Generative AI display. A Raspberry Pi locally generates and displays pareidolic clock faces every few minutes, using open-source code, open-source typography, and freely available models.
Easy to customize
The clock faces are generated from four text prompts that default to California landscapes. The prompts are easily field-serviceable: swap in a new one, like “kittens in the park”, for a 3PM viewing.
Fast: 6 minutes per image on a Pi
Using a latent consistency model derived from Stable Diffusion 1.5, and the Monster Labs QR Code Monster, a Raspberry Pi 4 can generate a 480x360 image in 9.5 minutes, suitable for a residential display. The newer Raspberry Pi 5 can generate a 480x360 image in under 6 minutes. On a Mac Studio with an M1 Ultra chip, the controlnet takes 1.1 seconds on GPU.
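As a rough illustration of the generation step described above, here is a minimal sketch using Hugging Face Diffusers. It is not the project's actual code: the two model repository IDs, the step count, and the function names are assumptions chosen for illustration, and `control_image` is assumed to be a PIL image of the clock face to hide in the landscape.

```python
# Sketch only: model IDs and parameter values are assumptions,
# not the settings Diffusion Local Time ships with.
CONTROLNET_ID = "monster-labs/control_v1p_sd15_qrcode_monster"  # assumed repo ID
LCM_ID = "SimianLuo/LCM_Dreamshaper_v7"  # assumed LCM derived from SD 1.5
WIDTH, HEIGHT = 480, 360  # 4:3 image size, from the text


def latent_size(width, height):
    """Stable Diffusion 1.5 denoises latents 8x smaller than the image per side."""
    if width % 8 or height % 8:
        raise ValueError("width and height must be multiples of 8")
    return width // 8, height // 8


def generate_clock_face(prompt, control_image, seed, conditioning_scale=1.0, steps=4):
    """Generate one clock face on the CPU in 32-bit floating point."""
    # Imported lazily so latent_size() works without the heavy dependencies.
    import torch
    from diffusers import (
        ControlNetModel,
        LCMScheduler,
        StableDiffusionControlNetPipeline,
    )

    controlnet = ControlNetModel.from_pretrained(
        CONTROLNET_ID, torch_dtype=torch.float32
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        LCM_ID, controlnet=controlnet, torch_dtype=torch.float32, safety_checker=None
    )
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    pipe.enable_attention_slicing()  # compute attention in slices to save memory

    generator = torch.Generator("cpu").manual_seed(seed)
    result = pipe(
        prompt,
        image=control_image,
        width=WIDTH,
        height=HEIGHT,
        num_inference_steps=steps,  # LCMs need only a handful of steps
        controlnet_conditioning_scale=conditioning_scale,
        generator=generator,
    )
    return result.images[0]
```

A call might look like `generate_clock_face("redwoods at dawn", control_image, seed=1234)`. The few-step sampling of a latent consistency model is what makes minutes-per-image plausible on a CPU at all.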
Diffusion Local Time was designed to use a greyscale 4:3 eink HDMI monitor, and can easily be adapted to fit other dimensions and other display technologies.
While the defaults in software and hardware for Diffusion Local Time prioritize fast and cheap at the expense of image quality, with 22.1 seconds of runtime on the GPU of an M1 Ultra you can get a much more detailed and beautiful result:
Tricky to balance and optimize
The Raspberry Pi is a well-tested deployment target, and 8 GB of memory is enough to run the Hugging Face Diffusers code unchanged.1 But math takes time, and adding up smaller numbers is faster. There is no GPU on a Raspberry Pi to run 16-bit floating point math, so the easiest way to reduce precision is to quantize to 8-bit integers. That is 4x less data to read from main memory into each layer than at 32-bit precision. On the Pi 4 this gave a 10% speedup, from 480 to 430 seconds, even with the overhead of dequantizing into floating point numbers. On the Pi 5 it led to no time savings, potentially because of improved memory bandwidth, so it is off by default.
The image on the bottom is in 8-bit precision, while the image on top is in full 32-bit precision.
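To make the 4x figure concrete, here is a minimal sketch of 8-bit quantization in plain Python. It assumes a simple symmetric, per-tensor scheme; the project's actual quantization method may differ, and the function names and sample weights are made up for illustration.

```python
import array
import random


def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> (int8 values, scale)."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    clamped = (max(-127, min(127, round(w / scale))) for w in weights)
    return array.array("b", clamped), scale  # "b" = signed 8-bit integer


def dequantize_int8(q, scale):
    """Recover approximate 32-bit floats; this is the per-layer overhead."""
    return array.array("f", (v * scale for v in q))  # "f" = 32-bit float


# Hypothetical layer weights, stored as 32-bit floats.
rng = random.Random(0)
weights = array.array("f", (rng.gauss(0.0, 0.05) for _ in range(1000)))

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

bytes_fp32 = len(weights) * weights.itemsize
bytes_int8 = len(q) * q.itemsize
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{bytes_fp32} bytes -> {bytes_int8} bytes, max error {max_error:.6f}")
```

Each weight moves by at most half a quantization step, which is small per weight but adds up across the layers of a diffusion model.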
This is an especially egregious example of a common phenomenon, quantization or not. Not all of the results are plausible: water reflections often obey the control image over optics, sea foam and surf often obey the control image over wave dynamics, rocks often obey the control image over gravity. As with ChatGPT and all current language models, the output of these models is something that seems statistically plausible without regard to actual fact, and we do the work of upholding both sides of the conversation in our dialogue.
This is challenging as an artistic project, because the tuning knob for how much the control image affects image synthesis is a real-valued number, not a measure of the legibility of the control image. A starry image of a lake on a hot desert night tends to have more contrast than redwoods, so the conditioning scale that works differs between prompts, whereas the artistic intent is to have equivalent legibility among all clock faces.
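One way to think about the gap between the knob and the goal is to put a number on legibility and search for the scale that reaches it. The sketch below is an assumption, not how Diffusion Local Time is tuned: it uses the absolute correlation between control pixels and output pixels as a stand-in for legibility, and `render` is a hypothetical callable that returns greyscale output pixels for a given conditioning scale.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length pixel lists."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("pixel lists must be non-empty and the same length")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat image carries no trace of the control image
    return cov / math.sqrt(var_x * var_y)


def legibility(control_pixels, output_pixels):
    """Proxy only: 0 = control image invisible, 1 = output is the control image."""
    return abs(pearson(control_pixels, output_pixels))


def tune_scale(render, control_pixels, target, lo=0.0, hi=2.0, iterations=20):
    """Bisect for the conditioning scale that reaches the target legibility.

    Assumes legibility rises with the scale, which is not guaranteed.
    """
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if legibility(control_pixels, render(mid)) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Every call to `render` is a full image generation, so a search like this is better suited to a fast machine than to the Pi itself. Even then, a correlation score is only a crude proxy for whether a person can read the time.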
The images on the bottom come from just 1GB of statistical models.
Handle with care
These models are powerful. The imminent danger is how they can facilitate generating misinformation and erode the idea of consensus reality, which damages our collective ability to fight many of our modern problems.2 An urgent danger is their capacity to supplant paid stock art (using the work of photographers and artists, often those same photographers, who currently rely on the sales of their work to sustain themselves). Another urgent danger is the many biases and problems in the training data. Society is generally unprepared for the effects of this technology.
This artwork tries to grapple with these problems. There is minimal harm in a landscape altered with pillars and cairns and seafoam and extra milky ways. There are few pareidolic artists, and every image that Diffusion Local Time generates starts with a different random seed, based on creation time. The biases of mainstream photography of the American southwest are unfortunately replicated in this work (for example, these landscapes are gorgeous for forest bathing, but people have lived around there for millennia, and these landscapes rarely include people) and counteracting this is an ongoing effort.
The tooling that has made this artwork is powerful, and part of the point of this work is to raise awareness of the potential for its creative use and its malignant abuse. The “local” in Diffusion Local Time, especially on a Raspberry Pi, was a choice to indicate the broad distribution of power, not only mediated by paid internet services, but locally, unmoderated, even at sub-$100 price points: this power is widely available, in such abundance that it is usable for artistic purposes, and we deserve to be appropriately cautious.
Ridiculous, like digital wristwatches, and thinking they are a pretty neat idea
In a world where most people’s primary interaction with a timepiece is through a mobile phone, which defaults to a 3 or 4 digit time display (8:30 instead of 🕣), a numerical display of time is common, but also ridiculous, in exactly the sense of “ape-descended life forms … so amazingly primitive that they still think digital wristwatches are a pretty neat idea”: to use the explanation of Douglas Adams himself, in rejecting an editorial change from digital wristwatches to cellular phones:
there is something inherently ridiculous about digital watches, and not about cellular phones. Now this is obviously a matter of opinion, but I think it’s worth explaining. Digital watches came along at a time that, in other areas, we were trying to find ways of translating purely numeric data into graphic form so that the information leapt easily to the eye. For instance, we noticed that pie charts and bar graphs often told us more about the relationships between things than tables of numbers did. So we worked hard to make our computers capable of translating numbers into graphic displays. At the same time, we each had the world’s most perfect pie chart machines strapped to our wrists, which we could read at a glance, and we suddenly got terribly excited at the idea of translating them back into numeric data, simply because we suddenly had the technology to do it. So digital watches were mere technological toys rather than significant improvements on anything that went before. I don’t happen to think that’s true of cellular comms technology. So that’s why I think that digital watches (which people still do wear) are inherently ridiculous, whereas cell phones are steps along the way to more universal communications. They may seem clumsy and old-fashioned in twenty years time because they will have been replaced by far more sophisticated pieces of technology that can do the job better, but they will not, I think, seem inherently ridiculous. 3
Let me know what you think! leebutterman@gmail.com
By replacing the new default scaled dot product attention with the existing “Sliced Attention Processor”, runtime goes down from 9.5 minutes to 6 minutes. Interestingly, the default attention processor changed during development of this timepiece, which surfaced this regression, on this relatively rare deployment platform for large attention-based models. ↩
If we cannot come together for a problem that is recent, and has impact within days, and has a clearly known inexpensive solution (the US government spent $32B in the three decades before the pandemic through March 2022, about the cost of a low-end phone for each adult in the United States), how will we come together for a problem that is hundreds of years old, impacts systems with huge inertia, and has many unknown solutions that will be extremely expensive for each of us individually? ↩
That quote was from 1992, and in under twenty years there was an iPhone with an app store in roughly its modern shape, while many fewer people (especially as a percentage!) wear a digital wristwatch whose primary function is to keep time. ↩