OpenAI Training Data to Be Inspected in Authors’ Copyright Cases - May 2024

For the first time, OpenAI will provide access to its training data for review of whether copyrighted works were used to power its technology.

In a Tuesday filing, authors suing the Sam Altman-led firm and OpenAI indicated that they came to terms on protocols for inspection of the information. They’ll seek details related to the incorporation of their works in training datasets, which could be a battleground in the case that may help establish guardrails for the creation of automated chatbots.

The agreement stems from a trio of lawsuits initiated by top authors, including Sarah Silverman, Paul Tremblay and Ta-Nehisi Coates, accusing OpenAI of harvesting mass quantities of books across the web, which were then allegedly used to produce infringing answers by ChatGPT. It comes after the court in July dismissed a claim alleging that the company engaged in unfair business practices by utilizing their works without consent or compensation. Previously, U.S. District Judge Araceli Martínez-Olguín also tossed other claims for negligence, unjust enrichment and vicarious copyright infringement, though the writers’ claim for direct copyright infringement remained untouched.

In other cases, AI companies have denied wholesale copying of works. Rather, they’ve argued that training their models involves development of parameters based on those works to define what things look like and how they should be constructed. OpenAI may advance that defense at a later stage of the authors’ case, as well as arguments that the practice of using published works to train its system constitutes fair use, which provides protection for the use of copyrighted material to make a secondary work as long as it’s transformative.

OpenAI has said that it trains its model on large, publicly available datasets that include copyrighted works. Last year, it pivoted to no longer disclosing those materials in an attempt to maintain an advantage over competitors and sidestep legal liability. While it remains unknown which works were used, the authors pointed to ChatGPT generating summaries and in-depth analyses of the themes in their novels. They claimed that the company downloaded hundreds of thousands of books from shadow library sites to train its AI system.

Under the agreement, the training datasets will be made available at OpenAI’s San Francisco office on a secured computer without internet or network access. Any person who’ll review the information will be required to sign a non-disclosure agreement, sign a visitor’s log and provide identification.

Use of any kind of technology will be severely restricted. No recording devices, including computers, cell phones or cameras, will be allowed into the inspection room, per the joint stipulation. OpenAI may provide limited use of a computer to take notes, with lawyers for the authors copying those notes onto another device under the supervision of representatives for the company at the end of each day. No copies of any portion of the training data will be allowed.

“The Inspecting Party’s counsel and/or experts may take handwritten notes or electronic notes on the provided note-taking computer in scratch files, but may not copy any Training Data itself into any notes,” the filing states.

Lawyers at the Joseph Saveri Law Firm are spearheading the litigation. They also represent authors in identical copyright lawsuits against Meta. In those cases, fact discovery is slated to end on Sept. 30, though a request for an extension has been filed. U.S. District Judge Vince Chhabria at a hearing on Friday questioned whether the attorneys can adequately represent the writers.

“It’s very clear to me from the papers, from the docket and from talking to the magistrate judge that you have brought this case and you have not done your job to advance it,” Chhabria said, according to Politico. “You and your team have barely been litigating the case. That’s obvious. This is not your typical proposed class action. This is an important case. It’s an important societal issue. It’s important for your clients.”

The concern stemmed in part from the lawyers’ failure to conduct any depositions in the case.

“It is sometimes said that timing is everything. Well, it turns out that’s true for bad timing as well,” wrote U.S. Magistrate Judge Thomas Hixson. “Plaintiffs request that the Court allow them to take 35 party depositions, exclusive of third-party depositions, or in the alternative they request a total of 180 hours of deposition testimony. And they made that request 18 days before the current close of fact discovery.”

The judge added, “Since Plaintiffs have taken zero depositions, the 35 party depositions (plus non-party depositions), or alternatively the 180 hours of deposition testimony, would all have to occur in the second half of September, which is obviously impossible.”