
> the popular public ones are mostly trained on stolen/pirated texts off the internet

You mean like actual literature, textbooks and scientific papers? You can't get them in bulk without pirating. Thank intellectual property laws.

> from social media clouds the companies control

I.e. conversations of real people about matters of real life.

But if it satisfies your elitist, ivory-towerish vision of "healthy information diet" for LLMs, then consider that e.g. Twitter is where, until now, you'd get most updates from the best minds in several scientific fields. Or that besides r/All, the Reddit dataset also contains r/AskHistorians and other subreddits where actual experts answer questions and give first-hand accounts of things.

The actually important bit, though, is that LLM training manages to extract value from both the "bullshit" and whatever you'd call "not bullshit", as the model has to learn to work with natural language just as much as it has to learn hard facts or scientific theories.



Yes, I find the biggest issue in discussing the present state of AI with people outside the field, whether technical or not, is that "machine learning" has only just entered popular understanding: everyone seems ready today to talk about the limits of training a machine learning model on some limited data set X and its inability to extrapolate beyond that set. The difference between "learning the best binary classifier on a labelled training set" and "exploring the set of all possible programs representable by a deep neural network of whatever architecture to find that which best generates all digitally recorded traces of human beings throughout history" is very far from intuitive, even to specialists. I think Ilya's old public discussions of this question are the most insightful for a popular audience, explaining how and why a world model, and not simply a Markov chain, is necessary to solve the seemingly trivial problem of "predicting the next word in a sequence."
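
To make the Markov chain half of that contrast concrete, here is a minimal sketch of a first-order (bigram) next-word predictor. The toy corpus and the function names (`train_bigram`, `predict_next`, `predict_from_context`) are my own, made up for illustration; this is not anything from Ilya's talks and not how an LLM works. The point is only that such a model sees nothing but the previous word, so two very different contexts ending in the same word get the same prediction.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the most frequent follower of `prev`, or None if unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def predict_from_context(counts, context):
    """A first-order Markov chain discards everything but the last word."""
    return predict_next(counts, context[-1])

# Hypothetical toy corpus, chosen only so the counts are easy to check by hand.
corpus = "the cat sat on the mat . the cat ate .".split()
model = train_bigram(corpus)

# Two contexts with different meanings but the same final word.
ctx_a = "the dog chased the".split()
ctx_b = "she put the vase on the".split()
same_prediction = (
    predict_from_context(model, ctx_a) == predict_from_context(model, ctx_b)
)
```

Here `same_prediction` is true by construction: whatever came before the final "the" is invisible to the model. Predicting well in both contexts requires some representation of what the sentence is about, which is the gap the "world model" argument points at.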



