NitpickLawyer 3 days ago

> I'm pretty sure Llama itself trained on a bunch of copyrighted data.

Every good, "SotA" model is trained on copyrighted data. This becomes apparent when models are released with everything public (i.e. including the training data) and they lag significantly behind on every benchmark.

tough 1 day ago

A research team from O'Reilly found that OpenAI trained on copyrighted books

prob got a sub...

https://ssrc-static.s3.us-east-1.amazonaws.com/OpenAI-Traini...