That's not how the scaling laws work. The number of samples required to reach a given quality level reduces exponentially over time. Most researchers use small datasets.
Interesting, because in The New York Times' lawsuit there is a very large block of text repeated verbatim. Page 30: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
How much of a copyrighted work do I have to copy and reproduce/redistribute to violate copyright law? Am I allowed to sell my handheld recording of the last two minutes of Gladiator 2 for $1.99 at the flea market?