#clipfft

aiweirdness:

Sea Shanty Surrealism

I’ve been working with an image-generating algorithm by Vadim Epstein called CLIP+FFT, which uses OpenAI’s CLIP algorithm to judge whether images match a given caption, and an FFT algorithm to come up with new images to present to CLIP. Give it any random phrase, and CLIP+FFT will try its best to come up with a matching image. And now there’s a version that will generate images to go with several phrases in a row and then fuse them into a video.
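For a rough sense of what "judging" means here: CLIP turns a caption and an image each into a list of numbers, then scores the match by how closely the two lists point in the same direction (cosine similarity). Here's a minimal sketch of that comparison. The three-number vectors are made up for illustration; real CLIP embeddings have hundreds of dimensions and come from the trained model, which this sketch does not load.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up stand-ins for embeddings (NOT real CLIP output).
caption_vec = [0.9, 0.1, 0.3]
candidate_images = {
    "image_a": [0.8, 0.2, 0.4],   # points roughly the same way as the caption
    "image_b": [-0.5, 0.9, 0.0],  # points somewhere else entirely
}

# Score every candidate against the caption and pick the judge's favorite.
scores = {name: cosine_similarity(caption_vec, vec)
          for name, vec in candidate_images.items()}
best = max(scores, key=scores.get)
print(best)
```

The image generator's whole job is then to keep proposing images that push that score higher.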

Here’s the sea shanty The Wellerman, sung by Nathan Evans, Jonny Stewart, and others, and illustrated by CLIP+FFT.

Now, there are several interesting things going on here, once you get past the sheer AI fever dream horror of it. One thing you’ll notice is that I changed some of the lines from the standard lyrics. CLIP+FFT deals with each line independently, so even if we have been talking about a ship and a whale throughout the song, the AI doesn’t know that in “when down on her a right whale bore”, the “her” refers to a ship. I made similar tweaks in one or two places.
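Here's a minimal sketch of that workaround. Since each line becomes its own prompt with no memory of the lines before it, the only way to add context is to write it into the line itself. The replacement wording below is a hypothetical example rather than the actual edit used for the video, and `build_prompts` is a made-up helper name, not part of the CLIP+FFT notebook.

```python
def build_prompts(lyrics, overrides):
    """Turn each lyric line into a standalone prompt.

    Every line is handled on its own, so any context has to be
    written into the line itself via `overrides`.
    """
    return [overrides.get(line, line) for line in lyrics]

lyrics = [
    "There once was a ship that put to sea",
    "When down on her a right whale bore",
]

# Hypothetical rewrite: spell out what "her" refers to.
overrides = {
    "When down on her a right whale bore":
        "When down on the ship a right whale bore",
}

prompts = build_prompts(lyrics, overrides)
for prompt in prompts:
    print(prompt)
```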

There was nothing I could do about the line “One day, when the tonguing is done”. Trying to be more precise about the whaling sense of “tonguing” would, if anything, have made the image more horrifying.

Having none of the “Wellerman is a ship” context, the AI interprets The Wellerman itself as some kind of eldritch oil well drilling supervillain.

I kind of like what happened to “The winds blew hard, her bow dipped down,” with golden locks of hair and bows everywhere. I mean, I like it in a “oh no this has gone terribly yet fascinatingly wrong” sort of way.

The image for “We’ll take our leave and go” is also interesting, since it illustrates “leave” in so many ways. Sometimes there are cars and suitcases, or people shaking hands. Interestingly, I see hints of European Union flags and British flags in many of them, signs that during training CLIP was learning to associate “leave” with Brexit.

The “bully boys” are hilarious, with classic glowering expressions and mean-kid haircuts. The AI is not used to the early-1900s meaning of “bully = awesome”.

You’ll notice that many of the frames have text, which I find charming, as if the AI is frowning to itself and muttering “tea. tea. Billy. tea.” or “blow. blow.” The less interpretable the phrase is in image form, the more likely the AI is to use text instead.

In fact CLIP treating the word and the object as equivalent has led to an interesting way of fooling its image recognition capabilities:

I also had CLIP+FFT illustrate The Twelve Days of Christmas and this is one of my favorite frames from it: Ten Lords A-Leaping

To see the other illustrated Days of Christmas (including the weirdly human-faced swans), become a supporter of AI Weirdness! Or become a free subscriber to get new AI Weirdness posts in your inbox.

Every visual iteration of “the tonguing” is deeply unsettling. I love it.

lewisandquark:

It's a confusing jumble of Frodo Bagginses, all cloaked, some in pizza delivery caps, many bearing pizzas or boxes, striding out from the ominous openings of many mines.

“Frodo Baggins delivering a pizza through the mines of Moria”

Remember my attempts to get CLIP+BigGAN to generate candy hearts? Here’s what an alternative method, CLIP+FFT, does with the prompt “a candy heart with a message”.

Rather than a single obsessively-scribbled-upon heart, we now have a vast universe of candy hearts, jostling against one another with their messages screaming incomprehensible love at the viewer.

There are thousands of pastel-colored hearts, each illegible, pressed up against each other into a solid cavern that recedes into misty distance.

As before, CLIP is the judge, telling another algorithm whether this collection of pixels looks more like “a candy heart with a message” than that collection of pixels. But this time, the algorithm presenting the images to CLIP isn’t steering through BigGAN, which was trained on a set of human photography. Instead, it’s doing something a lot more like the classic Deep Dream images, changing parts of the image to maximize how much it looks like dogs, or whatever the prompt is supposed to be.
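To make the judge-and-tweak loop a bit more concrete, here's a toy sketch. It rests on several assumptions that don't come from the notebook itself: the "image" is just 8 pixels in a row, the judge is a stand-in that prefers a fixed stripe pattern (real CLIP compares an image to a caption), and the search uses random nudges instead of the gradient-based updates a real implementation would more likely use. What it does show is the general idea of describing an image by the strengths of waves of different frequencies, and letting the judge's score decide which changes to keep.

```python
import math
import random

N = 8  # pixels in our tiny 1-D "image"

def render(amplitudes):
    """Build pixels from cosine-wave amplitudes (a simplified inverse Fourier transform)."""
    return [
        sum(a * math.cos(2 * math.pi * k * n / N) for k, a in enumerate(amplitudes))
        for n in range(N)
    ]

# Stand-in for CLIP: prefers pixels close to a fixed stripe pattern.
# (Real CLIP scores the image against a text caption instead.)
TARGET = [1, 0, 1, 0, 1, 0, 1, 0]

def judge(pixels):
    """Higher is better; 0 means a perfect match to TARGET."""
    return -sum((p - t) ** 2 for p, t in zip(pixels, TARGET))

def optimize(steps=2000, seed=0):
    rng = random.Random(seed)
    amplitudes = [0.0] * (N // 2 + 1)  # one amplitude per frequency
    best = judge(render(amplitudes))
    for _ in range(steps):
        candidate = list(amplitudes)
        k = rng.randrange(len(candidate))
        candidate[k] += rng.gauss(0, 0.1)  # nudge one frequency
        score = judge(render(candidate))
        if score > best:  # keep the change only if the judge prefers it
            amplitudes, best = candidate, score
    return amplitudes, best

start_score = judge(render([0.0] * (N // 2 + 1)))
amplitudes, final_score = optimize()
print(start_score, final_score)
```

Because every nudge changes a whole wave rather than a single pixel, each tweak ripples across the entire image, which may be part of why the results look so tiled and repetitive.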

First panel: Voice: How do you like this painting? Painting is of a single pine tree by a lake with a mountain. Robot box: Meh. Second panel: Voice: How about now? The mountain now has a dog face. Robot box: I'm intrigued. Third panel: Voice: How about now? Sky, lake, and pine tree all have dog faces. Robot box: THIS IS THE BEST PAINTING EVER!

(this cartoon is from my book You Look Like a Thing and I Love You: How AI Works and Why It’s Making the World a Weirder Place - out in paperback on March 23, 2021)

And since CLIP was trained on text and images that appeared together on the internet, it can be the judge of just about anything.

Here’s “A stegosaurus flying a spaceship among lasers”.

You would have to know they're stegosauruses, but they're definitely spiky, and the air is filled pretty solidly with lasers.

And it knows how to judge pop culture figures and even the look of TV shows. Here’s “Godzilla and Paul Hollywood in the Bakeoff tent”

Paul Hollywood from the Great British Bakeoff is unmistakeable, and repeated several times. Godzilla is less distinct, but vaguely Godzilla-shaped and Godzilla-textured.

Note that it not only correctly has the tent as white and pointy-roofed, it even is trying to do the Union Jack bunting. And it’s really sensitive to the prompt, so if you type “Godzilla and Paul Hollywood taking a selfie in the Bakeoff tent” instead, Paul Hollywood breaks out into a grin and cameras appear. (it seems to be less sure what a grinning Godzilla looks like)

Now Godzilla and Paul Hollywood are seen mostly from the torso up. Paul Hollywood's face is in several pieces and Godzilla doesn't really have a face, more of a hulking presence. Also everything has a dewy glow like a 1990s mall photo.

Here’s “Mr Darcy emerges from a lake in a white shirt while his horse looks on”

It could plausibly be Colin Firth from 1995 Pride and Prejudice. The lake and the countryside are there, repeated and tiled upon one another several times. The horse requires much more imagination.

It does less well to my mind when there are fewer clues about what the background should look like. Tell it just to do “Tyrannosaurus Rex” and things get very abstract and smeary, and it even resorts to trying to write “tyrannosaurus” everywhere.

There are some brown shapes that might be tyrannosaurus heads, definitely far too many baleful eyes, and maybe some foliage. Not sure what the swaths of lumpy red are. There's illegible writing everywhere.

“A tyrannosaurus wearing a crinoline hoop skirt on a fashion show runway” looks a bit more realistic. Or maybe that’s just my preference. The trees in the background are a nice touch.

The tyrannosaurus isn't identifiably *in* the dress per se, but there are many tiered floofy white dresses, and strong hints of tyrannosaurus legs and tails, and nice Cretaceous trees in the background. Crowds line the runway.

Here’s a zoomed-in view of one of the best ones: “a library made of bones and skeletons; a library in the style of catacombs”. It doesn’t seem to resort to word graffiti if the prompt suggests a finely textured background, maybe. (This may have been from a newer version of the CLIP+FFT notebook, so that could explain some of the improved quality.)

Image is an intricate tangle of bone-covered bookcases and piles of skulls.

You do need a bit of imagination maybe to figure out what the original prompts were, so I wouldn’t exactly say that CLIP+FFT is as successful at making images to order as the original CLIP+DALL-E (still not released publicly). But having a neural net that will attempt whatever I ask for (and not turn every human into a horror of many-eyed blobs) is still pretty fun.

“The daleks have filled the tardis with llamas and David Tennant is annoyed”

David Tennant is clearly visible (at least the left half of his face), frowning. Daleks and tardises seem to merge into segmented blue cones. The llamas are kind of dalek-shaped furry blobs. There are EXCESSIVE numbers of lasers for some reason.

Read more about CLIP+FFT (built by Vadim Epstein) and try it yourself for free with the colab notebook!

I made a bonus gallery of various characters delivering pizza. Spider-man’s not the only one who’s recognizable with a fresh pie in hand. To see the gallery, and get other bonus content, become an AI Weirdness supporter! Or become a free subscriber to get new AI Weirdness posts in your inbox.