
It's impressive how all of this is quickly picking up steam thanks to the Stable Diffusion model being open source with pretrained weights available. It's like every week there's another breakthrough or two.

I think the main issue here is the computational cost, as - if I understand correctly - you basically have to do training for each concept you want to learn. Are pretrained embeddings available anywhere for common words?



Did OpenAI (Dall-E) and Google (Imagen) know Stable Diffusion was coming?

I'm sure they were looking forward to many months of maintaining highly exclusive access and playing "too dangerous to release" games before SD completely upended the table.


I don't know if they saw it coming or not, but frankly I'm glad it did.

This idea of "technology gatekeeping" sickens me. I'm tired to death of people saying some non-sensical horseshit like, "The technology is too dangerous to be turned over to the hoi polloi!!"

Give me a break... as if someone running StableDiffusion on their home system and creating naked centaur-women out of pictures of Kate Beckinsale and anime waifus out of Ariana Grande photos are going to cause the downfall of the modern era.

StableDiffusion didn't upend the table... StableDiffusion gave the plans to the printing press to every person out there that wants to learn how to make their own print shop... and more power to them all, I say. I've had more fun and learned more about AI models in the past week than I have with AI in the past year, and I've been using img2img to feed my own art into SD to create whole new works that I've been able to touch up in Photoshop and upscale to print resolution.

This is truly the kind of computing revolution that I love to see, and that comes around all too infrequently. The good from this will far, far outweigh any negatives.


Technology gatekeeping is completely antithetical to the hacker spirit.

Hackers built all this technology. There's no way a handful of megacorps are going to take it all for themselves.


Right, like no megacorp ever prevented us from doing whatever we want with our smartphones. Oh wait. (No, Pixels don't count, unless you can write your own TEE, your own sensor hub, and your own wake-up word.)


I'm not saying they can't try, and succeed for a while. But eventually, we will always break free.

Pixel exists (but apparently doesn't count because it's not perfect yet).

Librem exists.

PinePhone exists.

More will exist in the future.


The "Librem is to iPhone as Stable Diffusion is to DALL-E" analogy breaks down when you consider that Librem phone works about 10% as well as an iPhone, whereas SD works 110% as well as DALL-E.


Linux also used to be a "hobby" OS. Now it powers the internet. Things change.


"open source" hardware is never going to work the same way as open source software does. Hardware is fundamentally capital-intensive to produce. Software can be produced (compiled) using hardware that many people have readily available. This is a fundamental, intractable difference.

It's the difference between free knitting patterns and free cardigans.


It took Linux a couple of decades to get to that point. And it had immense business value in having such a massive infrastructure open for everyone.

For hardware, our world is not there yet and won't be for the foreseeable future.


> This idea of "technology gatekeeping" sickens me. I'm tired to death of people saying some non-sensical horseshit like, "The technology is too dangerous to be turned over to the hoi polloi!!"

I think that misrepresents OpenAI's attitude. As I see it, their claim is closer to "let's discuss whether the stable door should be closed before we find out the hard way what makes the horse bolt".

Given how much trouble we already get from the Gell-Mann amnesia effect, and how many people take spirits and horoscopes seriously, it seems entirely plausible to me that some highly realistic centaur picture could be used as a casus belli for a popular uprising that effectively ends a nation.

(Similar rumours abound even without this tech, c.f. Catherine the Great or Malleus Maleficarum etc.; I suspect arbitrary photorealistic pictures make that kind of drama much more likely to occur and to stick harder when it does, but this suspicion is not strongly held).

Edit:

I want to add that my concerns about the tech are less about the general public (most people are basically decent) than about the few percent who hate or fear and who now have a much easier time promoting their views (the possibility having always existed is different from it being cheap), and also about those who don't realise the images are generated to fit the text and instead think it's a search engine of existing images (which appears to be a common view judging by the type of complaint certain artists have on any given demonstration of the tech, though public figures complaining about Google search results without knowing they're personalised is also a thing even for actual search).


Anyone moderately plugged into the AI art scene knew for at least 2 months.

I'm guessing those teams didn't know in that they're AI researchers, and in my own employment at Google, I've been regularly reminded that being a technologist and being someone obsessed with a technology and pursuing it socially are different things.

Even without knowing the precise individuals that'd do it, I knew in February that by August there would be an open source model challenging state of the art back then, if only because given 6 months _some_ open source team would try scaling to a bigger model.

Another thing to point out is that these teams are descendants of open source: Katherine Crowson's open source breakthroughs led to substantial improvements in DallE. Everyone should be saying her name 1000x more often.* She also helped create Stable Diffusion specifically, in substantial ways.

* I think. Maybe I misunderstand the technology dramatically. But I think it's just poorly understood how much she's been involved.


> Another thing to point out is that these teams are descendants of open source: Katherine Crowson's open source breakthroughs led to substantial improvements in DallE. Everyone should be saying her name 1000x more often.* She also helped create Stable Diffusion specifically, in substantial ways.

Not only that, but OpenAI didn't seem to know their CLIP model could be used to generate images (via Advad's CLIP+VQGAN) at all, otherwise they wouldn't have released it. So they did unintentionally start the "AI art" movement even if they didn't release a trained DALLE.


Well, Google's paper showed you don't need CLIP anyway. T5 and other language models can be used regardless.

CLIP isn't the true barrier to entry; the dataset and compute are.


StableDiffusion has an open dataset, was funded by one guy, and apparently took "much less than $600k" to train on AWS (https://twitter.com/EMostaque/status/1563965366061211660).

So it seems there actually aren't many barriers to entry at all. There's certainly a lot of legal questions, but if it's this easy to create your own model then it's hard to enforce anything…


I don't think there are actually major legal concerns. Copyright protects reproduction of a specific image. Looking at an image and producing something in a similar style is not copyright infringement, it's called being an artist. The law on this seems pretty clear.

The UK recently announced plans to make this completely explicit, to remove any remaining doubt: "For text and data mining, we plan to introduce a new copyright and database exception which allows TDM for any purpose. Rights holders will still have safeguards to protect their content, including a requirement for lawful access."

https://www.gov.uk/government/consultations/artificial-intel...


I was wondering about trademark issues with a model that can draw new pictures of Mickey Mouse/Homer Simpson/Hatsune Miku if prompted with their names.


Too dangerous to release, a codeword for selling access at 100% markup.


Bonus points: If dall-e rejects an output image because it thinks the image is inappropriate they won't show it. But they will still charge you for the prompt.


How is this legal? If someone commissions a painting from you, you're not allowed to just say "I'm not giving you the painting, but I'm keeping your money." Why does doing it with a computer make it okay?


Is there even a public keyword list of words you’re not supposed to include?


There is no public list. Someone on reddit compiled a very limited list here: https://old.reddit.com/r/dalle2/comments/wa3jt6/banned_words...

The problem is that they will automatically ban accounts that trigger the filter too much, so people would have to burn a whole lot of accounts to assemble an even remotely-complete list.


Holy shit that list is insane. A lot of those words being banned would limit you enormously if you're trying to create art.


But it never has been about art?


This list seems incomplete. I've had the filter triggered on words like 'thong' as in thong slippers as well.


I wonder if they keep that private to avoid people finding work arounds.


It’s perhaps not a simple list but an AI classifier, perhaps GPT-3 based.


"Nuclear" was apparently confirmed on the list, but I have recently used it to generate cool things around nuclear power. So I suppose it is like you say.


I think the fact that those diffusion models are smaller and more compute-efficient than gigantic GPT models is what makes them easier to use and distribute.

BLOOM is out there, but not that many individuals have something like eight 3090s to host it, and inference is still incredibly slow nevertheless.


BLOOM also doesn’t have GPT-3’s RLHF tuning, so anyone who tries to ask it questions or give it instructions in the manner GPT-3 supports will be disappointed. You have to k-shot prompt it or fine-tune it yourself for it to be useful.
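For anyone unsure what "k-shot prompt" means in practice, here is a minimal sketch. It assumes a plain Q/A text layout; the helper name `build_k_shot_prompt` and the example questions are made up for illustration and nothing here is specific to BLOOM.

```python
def build_k_shot_prompt(examples, query):
    """Format k worked examples, then the unanswered query for the model to complete."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # The final block ends at "A:" so a plain completion model continues the pattern.
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Two worked examples (k = 2) followed by the real question.
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_k_shot_prompt(examples, "What is the capital of Italy?")
print(prompt)
```

The point is that a model without instruction tuning only continues text, so you show it the pattern you want and let it fill in the last answer.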


> I think the main issue here is the computational cost, as - if I understand correctly - you basically have to do training for each concept you want to learn. Are pretrained embeddings available anywhere for common words?

I just read (skimmed) through the paper.

That's in fact the key idea here: that the trained model is untouched. Using the existing, trained model, they use this "inversion" procedure to discover some word that acts as a stable reference for a concept expressed in some images exposed to the model, which the model will understand as a reference to that concept.

There is a pretrained model with those common words, which knows how to do things like, say, "hamburger in the style of Picasso".

Now, without such a model having been trained on the works of some artist, or other images, using a few samples (merely five or so), it's possible to uncover a latent word in the model which refers to the concept that those samples have in common. That word is stable in the sense that you can compose it with other words in prompts, and it really seems to denote the concept in those sample images.

In the paper the researchers consistently refer to such a word with the meta-variable S*, and use it in prompts like, "flying monkey in the style of S*".

What I couldn't spot in the paper is a concrete example of what the S* word actually looks like for given examples. I'm guessing that it's some sort of gibberish. According to the concrete usage instructions, the process produces an embeddings.pt file, which you then upload, allowing you to use the pseudo-word * (asterisk) to refer to the concept.

People have been intuitively experimenting with gibberish words in prompts, discovering some stable behavior that seems to correspond to words that the AI "came up" with by itself (like a child, some have noted). This research seems like a methodical way of discovering those internal words.


Do latent variables “look like” anything at all? Like in a PCA, for example, a factor is some latent heuristic, but does it even have an actual value?


Looking at this some more, I may have a slightly less flawed high-level understanding. There is never actually a concrete word. There is an "embedding" represented as an abstract vector, and that is forcibly associated with a pseudo-word like *. That * just recalls the vector; there is no intermediate gibberish that has a word representation: that vector is the gibberish.
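To make that concrete, here is a toy sketch of the idea as I understand it: the model and the ordinary vocabulary stay frozen, and only one new vector, bound to the pseudo-word *, is optimized. Everything in it is an assumption for illustration: the "model" is a made-up fixed linear map rather than Stable Diffusion, the target features stand in for whatever the sample images have in common, and the names (`W`, `model`, `star`) are hypothetical. The actual method optimizes the vector through the diffusion model's own training loss.

```python
import random

random.seed(0)
EMBED_DIM, FEAT_DIM = 4, 6

# Frozen toy "model": a fixed linear map from a token embedding to image features.
W = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(FEAT_DIM)]
W_before = [row[:] for row in W]  # kept only to show the model is never modified


def model(vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in W]


# Frozen vocabulary of ordinary words (made-up vectors).
vocab = {"dog": [0.9, 0.1, 0.0, 0.2], "cat": [0.1, 0.8, 0.1, 0.0]}

# Stand-in for "what the sample images have in common".
hidden_concept = [0.3, -0.5, 0.7, 0.4]
target = model(hidden_concept)


def loss(vec):
    return sum((p - t) ** 2 for p, t in zip(model(vec), target))


# Only this new vector is trained; W and the existing vocab entries are untouched.
star = [0.0] * EMBED_DIM
initial_loss = loss(star)

# Conservative step size so plain gradient descent stays stable.
lr = 0.5 / sum(w * w for row in W for w in row)
for _ in range(2000):
    residual = [p - t for p, t in zip(model(star), target)]
    grad = [2 * sum(W[i][j] * residual[i] for i in range(FEAT_DIM))
            for j in range(EMBED_DIM)]
    star = [s - lr * g for s, g in zip(star, grad)]

# Bind the learned vector to the pseudo-word. There is no spelled-out word behind it.
vocab["*"] = star
final_loss = loss(star)
print("loss before:", initial_loss, "loss after:", final_loss)
```

After this, a prompt containing * would simply look up `vocab["*"]` the same way it looks up "dog", which matches the description above: the vector is the gibberish.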


I have seen images at openart with prompts that consisted entirely of different types of whitespace. They were haunting images of humanlike shapes. My assumption was that the prompt hit some odd vocabulary that had been trained to some concepts. Is that impossible?


I think that’s a pretty apt comparison. A latent variable (or latent factor in PCA terms) is (basically) a direction in an n-dimensional space, where n is the length of the vector. The direction is correlated with some type of variance in the input data. Oftentimes this represents something that has some useful meaning (“dogness” vs “catness”, for example), but it could also just represent a correlation that has no interpretable meaning.
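A small sketch of the PCA case, assuming made-up 2-D data and a plain power iteration instead of a library PCA routine. It shows that the factor itself is a direction, and that each data point does get an actual value for it: its projection onto that direction.

```python
import math

# Toy 2-D data lying roughly along the line y = 2x (made-up numbers).
points = [(-2.0, -4.1), (-1.0, -1.9), (0.0, 0.1), (1.0, 2.1), (2.0, 3.8)]

n = len(points)
mean = [sum(p[i] for p in points) / n for i in range(2)]
centered = [(p[0] - mean[0], p[1] - mean[1]) for p in points]

# 2x2 sample covariance matrix of the centered data.
cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(2)]
       for i in range(2)]

# Power iteration: repeatedly applying cov converges to the top principal direction.
direction = [1.0, 0.0]
for _ in range(100):
    v = [cov[0][0] * direction[0] + cov[0][1] * direction[1],
         cov[1][0] * direction[0] + cov[1][1] * direction[1]]
    norm = math.hypot(v[0], v[1])
    direction = [v[0] / norm, v[1] / norm]

# The "value" of the latent factor for each point is its projection on the direction.
scores = [c[0] * direction[0] + c[1] * direction[1] for c in centered]

print("direction:", direction)
print("scores:", scores)
```

The direction here has no name attached; whether it means anything ("dogness" or nothing interpretable) is up to whoever looks at what varies along it.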


This is probably a dumb question but if we’re talking about language embeddings, are the latent vectors deterministically out of vocabulary? Is there any possibility of collision with an in-vocab n-gram’s vector?


100% agreed. That's one thing that really bothered me about "openai".


> I think the main issue here is the computational cost, as - if I understand correctly - you basically have to do training for each concept you want to learn. Are pretrained embeddings available anywhere for common words?

The basic SD model should have all the common words covered. This model's goal is to find a new concept that doesn't exist visually or textually in the dataset, for example your own face, or a character you designed yourself. Note that this might not be possible: the corpus of data or the size of the model might not hold enough information to represent certain concepts, or at least to represent them in detail. E.g. if you give it pictures of your dog, the output might not look quite like your dog during generation, even though those details existed in the pictures you gave the model.

If you want personalization that is also highly detailed, you'll have to fine-tune the model itself with your own concepts. Google has detailed how they did their own fine-tuning and called it DreamBooth[1].

[1] https://dreambooth.github.io/


Speaking of computational cost, I wonder how much electricity all this is using.



