Technology

One of the world’s largest AI training datasets is about to get bigger and “substantially better”

Published

January 12, 2024

Komal

Huge corpora of AI training data have been referred to as “the backbone of large language models.” But in 2023, amid a growing outcry over the ethical and legal implications of the datasets that trained the most well-known LLMs, such as OpenAI’s GPT-4 and Meta’s Llama, EleutherAI became a target. The group created one of the world’s largest of these datasets: the Pile, an 825 GB open-source, diverse text corpus.

One of the numerous cases centered on generative AI last year targeted EleutherAI, a grassroots nonprofit research group that started as a loosely organized Discord collective in 2020 and aimed to understand how OpenAI’s new GPT-3 worked. In October, former Arkansas governor Mike Huckabee and other authors filed a lawsuit claiming that their books were taken without permission and added to Books3, a contentious dataset of more than 180,000 titles that was part of the Pile project.

However, EleutherAI is far from finished with its dataset work: it is now updating the Pile in partnership with other institutions, such as the University of Toronto and the Allen Institute for AI, as well as independent researchers. The new Pile dataset won’t be finalized for a few months, Stella Biderman, executive director of EleutherAI and a lead scientist and mathematician at Booz Allen Hamilton, and Aviya Skowron, head of policy and ethics at EleutherAI, said in a joint interview with VentureBeat.

The new Pile is expected to be larger and “substantially better”

According to Biderman, the upcoming LLM training dataset is expected to be larger than the previous one and “substantially better.”

“There’s going to be a lot of new data,” said Biderman. Some, she said, will be data that has not been seen anywhere before and “that we’re working on kind of excavating, which is going to be really exciting.”

The Pile v2 will include more recent data than the original dataset, which was released in December 2020 and used to develop language models such as the Pythia suite and Stability AI’s Stable LM suite. It will also have improved preprocessing. “We had never trained an LLM before we made the Pile,” Biderman explained. “We’ve trained nearly a dozen now, and we know a lot more about how to clean data so that LLMs can use it.”

The new dataset will also include better-quality and more varied data. “For example, we’re going to have a lot more books than the original Pile had and a more diverse representation of non-academic non-fiction domains,” she said.

The original Pile consists of 22 sub-datasets, including Books3 as well as Wikipedia, YouTube subtitles, PubMed Central, arXiv, Stack Exchange, and, oddly enough, Enron emails. Biderman noted that the Pile remains the LLM training dataset best documented by its creators. The goal in creating the Pile was to build a massive new dataset, comprising billions of text passages, that would be comparable in size to the one OpenAI used to train GPT-3.

The Pile was a unique AI training dataset when it was first released

“Back in 2020, the Pile was a very important thing, because there wasn’t anything quite like it,” Biderman said. She explained that at the time there was one publicly available large text corpus, C4, which Google used to train a range of language models.

“But C4 is not nearly as big as the Pile is and it’s also a lot less diverse,” she said. “It’s a really high-quality Common Crawl scrape.”

Instead, EleutherAI aimed to be more selective, identifying the kinds of data and topics it wanted the model to know about.

“That was not really something anyone had ever done before,” she explained. “75%-plus of the Pile was chosen from specific topics or domains, where we wanted the model to know things about it — let’s give it as much meaningful information as we can about the world, about things we care about.”

EleutherAI’s “general position is that model training is fair use” for copyrighted material, according to Skowron. However, they noted that “no large language model on the market is currently not trained on copyrighted data,” and that one of the Pile v2 project’s objectives is to try and overcome some of the problems with copyright and data licensing.

To reflect that work, they detailed the composition of the new Pile dataset. It will include public domain material, such as text that was never within the scope of copyright in the first place (government documents or legal filings, for example Supreme Court opinions); text licensed under Creative Commons; code under open source licenses; text with licenses that explicitly permit redistribution and reuse (some open access scientific articles fall into this category); and smaller datasets for which researchers have the express permission of the rights holders.

Following ChatGPT, criticism of AI training datasets gained traction

The influence of AI training datasets has long been a source of concern. For instance, a 2018 study co-authored by AI researchers Joy Buolamwini and Timnit Gebru found that large image datasets led to racial bias in AI systems. Legal disputes over large image training datasets began around mid-2022, not long after the public realized that well-known text-to-image generators like Midjourney and Stable Diffusion were trained on enormous image datasets that were primarily scraped from the internet.

But after OpenAI’s ChatGPT was released in November 2022, criticism of the datasets used to train LLMs and image generators increased significantly, especially with regard to copyright. Following a wave of lawsuits centered on generative AI from authors, publishers, and artists, the New York Times filed a lawsuit against Microsoft and OpenAI last month. Many people think this case will eventually make its way to the Supreme Court.

However, there have also been more serious and disturbing accusations lately. These include the ease with which deepfake revenge porn can be produced thanks to the large image corpora that trained text-to-image models, and the discovery of thousands of child sexual abuse images in the LAION-5B image dataset, which led to its removal last month.

The debate over AI training data is complex and nuanced

According to Biderman and Skowron, the debate over AI training data is far more complex and nuanced than the media and critics of AI make it out to be.

For example, Biderman said that the methodology used by the people who flagged the LAION content is not legally accessible to the LAION organization, which makes it difficult to safely remove the images. Furthermore, the resources needed to prescreen datasets for this kind of imagery may not be available.

“There seems to be a very big disconnect between the way organizations try to fight this content and what would make their resources useful to people who wanted to screen data sets,” she said.

When it comes to other concerns, such as the impact on creative workers whose work was used to train AI models, “a lot of them are upset and hurt,” said Biderman. “I totally understand where they’re coming from that perspective.” But she pointed out that some creatives uploaded work to the internet under permissive licenses without knowing that, years later, AI training datasets, including Common Crawl, could use the work under those licenses.

“I think a lot of people in the 2010s, if they had a magic eight ball, would have made different licensing decisions,” she said.

However, EleutherAI lacked a magic eight ball as well. Biderman and Skowron agree that at the time the Pile was created, AI training datasets were mostly used for research, where exemptions from licensing and copyright restrictions are fairly broad.

“AI technologies have very recently made a jump from something that would be primarily considered a research product and a scientific artifact to something whose primary purpose was for fabrication,” Biderman said. Google had put some of these models into commercial use in the back end in the past, she explained, but training on “very large, mostly web script data sets, this became a question very recently.”

To be fair, Skowron pointed out, legal experts such as Ben Sobel have been considering AI-related concerns, as well as the legal question of “fair use,” for many years. But even those at OpenAI, “who you’d think would be in the know about the product pipeline,” did not grasp the public, commercial implications of ChatGPT that were coming down the pike, they continued.

Open datasets are safer to use, according to EleutherAI

While it may seem counterintuitive to some, Biderman and Skowron also argue that AI models trained on open datasets like the Pile are safer to use, because visibility into the data is what allows the resulting AI models to be used safely and ethically in a range of scenarios.

“There needs to be much more visibility in order to achieve many policy objectives or ethical ideals that people want,” said Skowron, including thorough documentation of the training at the very minimum. “And for many research questions you need actual access to the data sets, including those that are very much of interest to copyright holders, such as memorization.”

For the time being, Biderman, Skowron, and EleutherAI colleagues keep working on the Pile’s update.

“It’s been a work in progress for about a year and a half and it’s been a meaningful work in progress for about two months — I am optimistic that we will train and release models this year,” said Biderman. “I’m curious to see how big a difference this makes. If I had to guess…it will make a small but meaningful one.”

OpenAI Launches SearchGPT, a Search Engine Driven by AI

Published

July 26, 2024

Archana Suryawanshi

OpenAI is announcing the highly anticipated launch of SearchGPT, an AI-powered search engine that provides real-time access to information on the internet.

The search engine opens with a large text box that asks, “What are you looking for?” Rather than returning a bare list of links, however, SearchGPT tries to organize and make sense of them. In one example from OpenAI, the search engine summarizes its findings on music festivals, then presents short descriptions of the events, each followed by an attribution link.

In another example, it explains when to plant tomatoes before breaking down the different varieties of the plant. Once the results are displayed, you can ask follow-up questions or click the sidebar to open other relevant links.

At present, SearchGPT is merely a “prototype.” According to OpenAI spokesperson Kayla Wood, the service, which is powered by the GPT-4 family of models, will initially be available to only 10,000 test users. Wood said OpenAI is working with third-party partners and using direct content feeds to build its search results. The goal is eventually to integrate the search features directly into ChatGPT.

It’s the beginning of what may become a significant challenge to Google, which has rushed to build AI features into its search engine out of concern that users might flock to rivals that offer such tools first. It also puts OpenAI in more direct competition with Perplexity, a startup that markets itself as an AI “answer” engine. Publishers have recently accused Perplexity of outright copying their work through an AI summary tool.

OpenAI appears to have noticed the backlash and says it is taking a notably different approach. In a blog post, the company emphasized that SearchGPT was developed in collaboration with a number of news partners, including organizations such as the owners of The Wall Street Journal, The Associated Press, and Vox Media, the parent company of The Verge. “News partners gave valuable feedback, and we continue to seek their input,” says Wood.

According to the company, publishers will be able to “manage how they appear in OpenAI search features.” They can opt out of having their content used to train OpenAI’s models and still appear in search results.

According to OpenAI’s blog post, “SearchGPT is designed to help users connect with publishers by prominently citing and linking to them in searches. Responses have clear, in-line, named attribution and links so users know where information is coming from and can quickly engage with even more results in a sidebar with source links.”

Releasing its search engine as a prototype helps OpenAI in several ways. For one, it is possible that SearchGPT will miscredit sources or even plagiarize entire articles, as Perplexity was accused of doing, and a limited release keeps such missteps in front of a small group of testers.

There have been rumblings about this new product for several months now: The Information reported on its development in February, and Bloomberg followed with more reporting in May. Some X users also spotted a new website OpenAI has been working on that referenced the move.

OpenAI has been gradually bringing ChatGPT closer to the real-time web. When GPT-3.5 was released, the model’s knowledge was already months out of date. Last September, OpenAI introduced Browse with Bing, a way for ChatGPT to browse the internet, but it appears far less sophisticated than SearchGPT.

OpenAI’s rapid progress has brought millions of users to ChatGPT, but the company’s costs are mounting. According to a story published in The Information this week, OpenAI’s AI training and inference costs could total $7 billion this year, and the millions of people using ChatGPT’s free version will push compute costs higher still. SearchGPT will be free at launch. As of right now, however, it doesn’t seem to carry any advertisements, so the company will need to find a way to make money from it soon.

Google Abandons Its Plan to Remove Third-Party Cookies from Chrome

Published

July 23, 2024

Archana Suryawanshi

Following years of delays, Google has announced that it is abandoning its plan to remove and replace third-party cookies in its Chrome web browser.

Cookies are small text files that websites place in a user’s browser so that the user can be followed from one website to another. This practice, which makes it possible to track people across many websites in order to target ads, has powered a large portion of the digital advertising ecosystem.
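To make the cross-site tracking described above concrete, here is a minimal Python sketch. It is a toy simulation, not how any real ad network or browser is implemented: the `AdServer` and `Browser` classes, the `uid` cookie name, and the `ads.example` domain are all hypothetical. It only shows how a single identifier, set once by an embedded third party and sent back on later page loads, is enough to link visits across unrelated sites.

```python
import uuid
from http.cookies import SimpleCookie


class AdServer:
    """Hypothetical third-party server whose content is embedded on many sites."""

    def __init__(self):
        self.visits = {}  # tracking ID -> list of sites where that browser was seen

    def handle_request(self, cookie_header, embedding_site):
        """Log the visit; return a Set-Cookie value if the browser is new."""
        jar = SimpleCookie(cookie_header or "")
        set_cookie = None
        if "uid" in jar:
            uid = jar["uid"].value  # returning browser: reuse its ID
        else:
            uid = uuid.uuid4().hex  # new browser: mint an ID
            out = SimpleCookie()
            out["uid"] = uid
            out["uid"]["domain"] = "ads.example"
            out["uid"]["path"] = "/"
            set_cookie = out["uid"].OutputString()
        self.visits.setdefault(uid, []).append(embedding_site)
        return set_cookie


class Browser:
    """Toy browser that stores only the ad server's cookie."""

    def __init__(self):
        self.ad_cookie = None

    def visit(self, site, ad_server):
        # Loading the page also loads the embedded ad, which sends the stored cookie.
        set_cookie = ad_server.handle_request(self.ad_cookie, site)
        if set_cookie:
            uid = SimpleCookie(set_cookie)["uid"].value
            self.ad_cookie = f"uid={uid}"


server = AdServer()
browser = Browser()
for site in ["news.example", "shop.example", "travel.example"]:
    browser.visit(site, server)
print(server.visits)  # one tracking ID mapped to all three sites
```

Because the cookie belongs to the ad server's domain rather than to the sites being visited, blocking third-party cookies in the browser would break this linkage.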

Google said in 2020 that it would stop supporting those cookies by early 2022, once it had worked out how to meet the needs of users, publishers, and advertisers and had developed tools to mitigate workarounds.

To do this, Google launched the “Privacy Sandbox” project, an effort to find a way to protect user privacy while keeping content freely available on the open web.

In January 2021, Google said it was “extremely confident” in the progress of its proposals to replace cookies. One such proposal was “Federated Learning of Cohorts,” which would essentially group people based on similar browsing habits, so that only “cohort IDs,” rather than individual user IDs, would be used to target them.
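The cohort idea can be illustrated with a short Python sketch. This is not Google's implementation: the function name `simhash_cohort`, the 8-bit cohort size, and the example domains are assumptions made for illustration. The sketch uses SimHash, a locality-sensitive hashing technique, so that browsers with largely overlapping histories tend to receive nearby or identical cohort IDs, and only that short ID would need to be exposed for ad targeting.

```python
import hashlib


def simhash_cohort(domains, bits=8):
    """Map a browsing history (an iterable of domain names) to a small cohort ID.

    Each distinct domain votes +1 or -1 on every bit position, based on a
    hash of the domain; the sign of each tally becomes one bit of the ID.
    """
    tallies = [0] * bits
    for domain in set(domains):
        digest = hashlib.sha256(domain.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            tallies[i] += 1 if (h >> i) & 1 else -1
    cohort = 0
    for i, tally in enumerate(tallies):
        if tally > 0:
            cohort |= 1 << i
    return cohort


# Hypothetical browsing histories; the second differs from the first by one site.
alice = ["news.example", "recipes.example", "garden.example", "travel.example"]
bob = ["news.example", "recipes.example", "garden.example", "sports.example"]
print(simhash_cohort(alice), simhash_cohort(bob))
```

With only 8 bits there are at most 256 cohorts, so each ID is shared by many browsers; a real system would also need to guarantee a minimum cohort size, which this sketch does not attempt.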

However, Google extended the deadline in June 2021 to give the digital advertising industry more time to finalize plans for better-targeted ads that respect user privacy. Then, in 2022, the company said that feedback indicated advertisers needed more time to make the switch to Google’s cookie replacement, as some had resisted, arguing that it would have a major negative impact on their businesses.

The company announced in a blog post on Monday that input from regulators and advertisers had informed its latest decision to abandon its plan to remove third-party cookies from its browser.

According to the company, testing revealed that the change would affect publishers, advertisers, and pretty much everyone else involved in online advertising, and would require “significant work by many participants.”

“Instead of deprecating third-party cookies, we would introduce a new experience in Chrome that lets people make an informed choice that applies across their web browsing, and they’d be able to adjust that choice at any time,” said Anthony Chavez, vice president of Privacy Sandbox. “We’re discussing this new path with regulators and will engage with the industry as we roll it out.”

Samsung Galaxy Buds 3 Pro Launch Postponed Because of Problems with Quality Control

Published

July 20, 2024

Archana Suryawanshi

At its Unpacked presentation on July 10, Samsung debuted its newest flagship earbuds, the Galaxy Buds 3 Pro, alongside the Galaxy Z Fold 6, Flip 6, and the Galaxy Watch 7. As with its other products, the company began taking preorders for the earbuds immediately after the event, with retail sales set to begin on July 24. But the Korean giant has had to postpone the release of the Galaxy Buds 3 Pro and delay preorder deliveries because of quality control concerns.

The Galaxy Buds 3 Pro went on sale earlier this week in South Korea, Samsung’s home market, ahead of the rest of the world. However, reports of quality control problems quickly surfaced. These included loose case hinges, earbud joints that did not sit flush, blue dye blotches, and scratches or scuffs on the case cover. The issues appear to be limited to the white Buds 3 Pro; the silver units seem unaffected.

According to a Reddit user, Samsung has sent out an email calling for a halt to sales of the Galaxy Buds 3 Pro. The problems appear to stem from inadequate quality control inspections at Samsung. Numerous user complaints can also be found on its Korean community forum, where one customer claims that the company will tighten quality control and relaunch the earbuds on July 24.

A Samsung spokesperson stated: “There have been reports relating to a limited number of early production Galaxy Buds 3 Pro devices. We are taking this matter very seriously and remain committed to meeting the highest quality standards of our products. We are urgently assessing and enhancing our quality control processes.”

“To ensure all products meet our quality standards, we have temporarily suspended deliveries of Galaxy Buds 3 Pro devices to distribution channels to conduct a full quality control evaluation before shipments to consumers take place. We sincerely apologize for any inconvenience this may cause.”

Korean customers who have already received their Buds 3 Pro and encounter problems should take them to the nearest service center for a replacement.

Possible postponement of the US debut of the Galaxy Buds 3 Pro

Samsung seems to have pushed back the launch date and (some) preorder deliveries of the Galaxy Buds 3 Pro in the US and other markets by one month. If your order is still scheduled for July, inspect your earbuds carefully upon delivery to make sure they are free of quality control issues.

On the company’s US store, the Buds 3 Pro is now listed for delivery in late August, a month after its launch date. Additionally, Best Buy no longer takes preorders for the earbuds, and Amazon no longer lists them for sale.

The regular Galaxy Buds 3 are not affected by the quality control issues and are still scheduled for delivery by July 24, the day of launch. Early Galaxy Buds 3 Pro customers have also reported that it is easy to tear the ear tips when removing them. Samsung’s delay, though, doesn’t seem to be related to that issue.