LAION datasets are simply indexes to the internet, i.e. lists of URLs to the original images together with the ALT texts found linked to those images. While we downloaded the pictures and calculated CLIP embeddings to compute similarity scores between pictures and texts, we subsequently discarded all the photos. Any researcher using the datasets must reconstruct the image data by downloading the subset they are interested in. For this purpose, we suggest the img2dataset tool.
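(The similarity score mentioned above is just cosine similarity between CLIP image and text embeddings. A minimal sketch of that computation, using toy 2-D vectors as stand-ins for real CLIP embeddings, which are 512- or 768-dimensional floats:)

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors: dot product
    divided by the product of the vector norms, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for an image embedding and a text embedding.
img_emb = np.array([0.6, 0.8])
txt_emb = np.array([0.8, 0.6])
print(round(cosine_similarity(img_emb, txt_emb), 2))  # 0.96
```

(LAION keeps pairs whose score exceeds a threshold; the function name and vectors here are illustrative, not the project's actual code.)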
I love the "*simply*", but doesn't it mean that (depending on country, laws etc., but generally):
1. The LAION group committed possible copyright infringements and even left undeniable evidence that they did - on top of their written testimony (dumping the "stolen goods into the river" does not make the infringement undone, does it?)
2. Any model trained on the "linked" data may commit copyright infringement.
3. As consequence, you using generated images may be liable.
I always wonder how it can possibly be legal at all - considering that, as a human artist, if I were to copy material and remix it without proper permission, I would be liable (again, depending on the situation). But suddenly ML is around the corner and it's all great, and now you can keep remixing the potentially problematic output further - no questions asked!?
I guess there are no precedent cases yet, but why should an automaton/software (and its creators) be judged differently from persons? I don't want to spoil the fun, but what am I missing?
Also disappointed that this dataset did not make sure to only collect unproblematic content, like Creative Commons licenses that allow remixing. It would be a hell of an attribution list, but definitely better than what is presented here.
EDIT: Formatting
EDIT2: I actually followed one of the projects mentioned not the linked repository. Clarified above.
If these AIs were actually just "remixing" and creating collages, then perhaps I would agree with you... but there is no exact pixel data stored here. This is fairly obvious when you consider that Stable Diffusion was trained on 100 terabytes of images yet the actual model file is 4gb.
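(The back-of-the-envelope ratio behind that claim, taking the commenter's figures at face value and using decimal units - the exact numbers are the commenter's, not verified here:)

```python
# Rough compression ratio implied by the comment: ~100 TB of training
# images vs. a ~4 GB model file. At this ratio the model retains on
# average well under one bit per ~3 KB of input, far too little to
# store pixel data per image.
training_bytes = 100e12   # 100 TB (commenter's figure)
model_bytes = 4e9         # 4 GB (commenter's figure)
ratio = training_bytes / model_bytes
print(f"{ratio:,.0f}x")   # 25,000x
```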
Now I'm not saying that nothing created by these AIs should be considered copyright infringement. As a human artist, you are not judged on your process, you are judged on the end results. The same should be done for the works created by these AIs.
Bad cases make bad law - if you argue too hard in the direction of "any copyrighted material in the AI's training set makes it copyrighted" this could lead to, say, "Disney owns any animated movie made by someone who watched a Disney movie".
You can make an AI that doesn't memorize a specific training input; similarly you could probably make one that intentionally memorizes them. Both of these seem useful.
It's not simply a given that using copyright material to train a model is copyright violation.
In my view it isn't. No one image contributes a significant amount, and the process the machine performs is analogous to what a human does when learning.
I strongly (actually very strongly) feel that it is ethical.
I also feel that the act of producing plagiarized content is unethical, immoral and I'd be supportive of new concepts in intellectual property law that make it illegal too.
I'm all for having these models scrutinized for copyright violations (and possibly amending copyright laws), but this comment is nothing but low-effort FUD.
I don't know, but it's quite entertaining when the output occasionally has a corrupted, but recognizable, Getty Images watermark: https://imgur.com/SmibVME
(Prompt: "A horse delivering mail in New York City, 1870")
Legally, it is uncharted territory on many levels. I think there are good arguments to be made that these systems violate the intent behind copyright and trademarks, but not necessarily the laws.