
Technology

One of the biggest AI training datasets in the world will soon get a “substantially better” and bigger update



Huge corpora of AI training data have been called “the backbone of large language models.” But in 2023, amid a growing outcry over the ethical and legal implications of the datasets that trained the best-known LLMs, such as OpenAI’s GPT-4 and Meta’s Llama, EleutherAI became a target. The organization produced one of the largest such datasets in the world: the Pile, an 825 GB open-sourced corpus of diverse text.

One of the numerous cases centered on generative AI last year involved EleutherAI, a grassroots nonprofit research group that began in 2020 as a loosely organized Discord collective aiming to understand how OpenAI’s then-new GPT-3 worked. In October, former Arkansas governor Mike Huckabee and other authors filed a lawsuit claiming that their books were taken without permission and added to Books3, a contentious dataset of more than 180,000 titles that formed part of the Pile.

EleutherAI, however, is now updating the Pile in partnership with institutions such as the University of Toronto and the Allen Institute for AI, as well as independent researchers, and the work is far from finished. The new Pile dataset won’t be finalized for a few months, said Stella Biderman, executive director of EleutherAI and a lead scientist and mathematician at Booz Allen Hamilton, and Aviya Skowron, EleutherAI’s head of policy and ethics, in a joint interview with VentureBeat.

It is anticipated that the new Pile will be “substantially better” and larger

According to Biderman, the upcoming LLM training dataset is anticipated to surpass the previous one in size and quality, making it “substantially better.”

“There’s going to be a lot of new data,” said Biderman. Some, she said, will be data that has not been seen anywhere before and “that we’re working on kind of excavating, which is going to be really exciting.”

Compared with the original dataset, which was released in December 2020 and used to train language models such as the Pythia suite and Stability AI’s StableLM suite, the Pile v2 contains more recent data. It will also have improved preprocessing. “We had never trained an LLM before we made the Pile,” Biderman explained. “We’ve trained nearly a dozen now, and we know a lot more about how to clean data so that LLMs can use it.”

Better-quality and more varied data will also be included in the new dataset. She stated, “For example, we’re going to have a lot more books than the original Pile had and a more diverse representation of non-academic non-fiction domains.”

In addition to Books3, the original Pile comprises 22 sub-datasets, including Wikipedia, YouTube subtitles, PubMed Central, arXiv, Stack Exchange, and, oddly enough, the Enron emails. Biderman noted that the Pile remains the LLM training dataset with the most thorough documentation from its creators. The goal in creating the Pile was to build a massive new dataset, containing billions of text passages, comparable in size to the one OpenAI used to train GPT-3.
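The mixing of sub-datasets described above can be illustrated with a simple weighted-sampling sketch: each component corpus is assigned a mixing weight, and training examples are drawn in proportion to those weights. The subset names below come from the list above, but the weights are invented for illustration and are not EleutherAI’s actual mixing ratios.

```python
import random

# Hypothetical mixing weights over a few of the Pile's sub-datasets.
# These numbers are illustrative only, not the Pile's real proportions.
SUBSET_WEIGHTS = {
    "wikipedia": 0.15,
    "pubmed_central": 0.30,
    "stack_exchange": 0.10,
    "arxiv": 0.25,
    "enron_emails": 0.20,
}

def sample_subset(rng, weights):
    """Pick the sub-dataset to draw the next training example from."""
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in SUBSET_WEIGHTS}
for _ in range(10_000):
    counts[sample_subset(rng, SUBSET_WEIGHTS)] += 1
# Over many draws, counts track the weights (pubmed_central near 30%).
```

In a real pipeline the weights would also encode how many epochs each sub-corpus is repeated for, which is how the original Pile upweighted high-quality sources.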

When the Pile was first made public, it was a one-of-a-kind AI training dataset

“Back in 2020, the Pile was a very important thing, because there wasn’t anything quite like it,” Biderman remarked. She clarified that at the time, Google was using one publicly accessible huge text corpus, C4, to train a range of language models.

“But C4 is not nearly as big as the Pile is and it’s also a lot less diverse,” she said. “It’s a really high-quality Common Crawl scrape.”

By contrast, EleutherAI aimed to be more selective, defining the kinds of data and subjects it wanted the model to be knowledgeable about.

“That was not really something anyone had ever done before,” she explained. “75%-plus of the Pile was chosen from specific topics or domains, where we wanted the model to know things about it — let’s give it as much meaningful information as we can about the world, about things we care about.”

EleutherAI’s “general position is that model training is fair use” for copyrighted material, according to Skowron. However, they noted that every large language model currently on the market is trained on at least some copyrighted data, and that one of the Pile v2 project’s objectives is to address some of the problems around copyright and data licensing.

To reflect that work, they detailed how the new Pile dataset is being assembled. It will include public domain text; text licensed under Creative Commons; code under open source licenses; text whose licenses explicitly permit redistribution and reuse (some open access scientific articles fall into this category); text that was never within the scope of copyright in the first place, such as government documents and legal filings (for example, opinions of the Supreme Court); and smaller datasets for which the researchers have the express permission of the rights holders.
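The license categories just listed amount to an allowlist over document metadata. A minimal sketch of that kind of filter is below; the field names and license identifiers are assumptions for illustration, not EleutherAI’s actual schema or tooling.

```python
# Hypothetical license-based filter for a Pile v2-style corpus.
# License IDs and the "license" metadata field are illustrative assumptions.

ALLOWED_LICENSES = {
    "public-domain",                       # never in scope of copyright
    "cc0", "cc-by", "cc-by-sa",            # Creative Commons licensed text
    "mit", "apache-2.0", "bsd-3-clause",   # open source code licenses
    "rights-holder-permission",            # express permission from rights holder
}

def filter_by_license(documents):
    """Yield only documents whose license metadata permits redistribution."""
    for doc in documents:
        if doc.get("license") in ALLOWED_LICENSES:
            yield doc

docs = [
    {"text": "Supreme Court opinion ...", "license": "public-domain"},
    {"text": "novel excerpt ...", "license": "all-rights-reserved"},
    {"text": "open source README ...", "license": "mit"},
]
kept = list(filter_by_license(docs))
# The all-rights-reserved document is dropped; the other two pass.
```

In practice this kind of filter is only as good as the license metadata attached to each document, which is itself a major curation effort.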

Following ChatGPT, criticism of AI training datasets gained traction

Concern over the influence of AI training datasets is long-standing. A 2018 study co-authored by AI researchers Joy Buolamwini and Timnit Gebru, for instance, found that big image collections contributed to racial bias in AI systems. Legal disputes over big image training datasets began developing around mid-2022, not long after the public realized that well-known text-to-image generators like Midjourney and Stable Diffusion had been trained on enormous image datasets scraped largely from the internet.

But after OpenAI’s ChatGPT was released in November 2022, criticism of the datasets used to train LLMs and image generators increased significantly, especially with regard to copyright. Following a wave of generative AI lawsuits from authors, publishers, and artists, the New York Times filed a lawsuit against Microsoft and OpenAI last month. Many believe the case will eventually make its way to the Supreme Court.

Lately, however, there have also been graver, more unsettling allegations: the LAION-5B image dataset was taken down last month after thousands of images of child sexual abuse were discovered in it, and the large image corpora that trained text-to-image models have made deepfake revenge porn easy to produce.

The discussion surrounding AI training data is quite intricate and subtle

According to Biderman and Skowron, the discussion surrounding AI training data is significantly more intricate and multifaceted than what the media and opponents of AI portray it as.

For example, Biderman said it is difficult to properly remove the images because the method used by the people who flagged the LAION content is not legally available to the LAION organization, and there may simply not be enough resources to prescreen datasets for this type of imagery.

“There seems to be a very big disconnect between the way organizations try to fight this content and what would make their resources useful to people who wanted to screen data sets,” she said.

When it comes to other concerns, such as the impact on creative workers whose work was used to train AI models, “a lot of them are upset and hurt,” said Biderman. “I totally understand where they’re coming from, from that perspective.” But she pointed out that some creatives uploaded work to the internet under permissive licenses without knowing that, years later, AI training datasets, including Common Crawl, could use the work under those licenses.

“I think a lot of people in the 2010s, if they had a magic eight ball, would have made different licensing decisions,” she said.

However, EleutherAI lacked a magic eight ball as well. Biderman and Skowron agree that when the Pile was created, AI training datasets were mostly used for research, where licensing and copyright exceptions are fairly broad.

“AI technologies have very recently made a jump from something that would be primarily considered a research product and a scientific artifact to something whose primary purpose is commercial,” Biderman said. Google had already put some of these models to commercial use on the back end in the past, she explained, but training on “very large, mostly web-scraped datasets, this became a question very recently.”

To be fair, Skowron pointed out, legal experts such as Ben Sobel had been thinking about AI and the legal question of “fair use” for many years. But even people at OpenAI, “who you’d think would be in the know about the product pipeline,” did not anticipate the public, commercial impact of the ChatGPT that was coming down the pike, they added.

Open datasets are safer to utilize, according to EleutherAI

While it may seem contradictory to some, Biderman and Skowron also claim that AI models trained on open datasets like the Pile are safer to use, because visibility into the data is what permits the resulting AI models to be securely and ethically used in a range of scenarios.

“There needs to be much more visibility in order to achieve many policy objectives or ethical ideals that people want,” said Skowron, including, at a minimum, thorough documentation of the training. “And for many research questions you need actual access to the data sets, including those that are very much of interest to copyright holders, such as memorization.”

For the time being, Biderman, Skowron, and EleutherAI colleagues keep working on the Pile’s update.

“It’s been a work in progress for about a year and a half and it’s been a meaningful work in progress for about two months — I am optimistic that we will train and release models this year,” said Biderman. “I’m curious to see how big a difference this makes. If I had to guess…it will make a small but meaningful one.”


Let Loose Event: The iPad Pro Is Anticipated to Be Apple’s First “AI-Powered Device,” Powered by the Newest M4 Chipset


Apple’s “Let Loose” event is scheduled for May 7 at 7:00 am PT (7:30 pm IST). The tech giant is anticipated to reveal several significant updates during the event, including new OLED iPad Pro models and the first-ever 12.9-inch iPad Air.

Just one week before the event, however, a new report from Bloomberg’s Mark Gurman says the newest M4 chipset may power the upcoming iPad Pro lineup. This is in contrast to plans to release the newest chipset alongside the iMacs, MacBook Pros, and Mac minis later this year. Notably, the current-generation iPad Pro variants run on the M2 chipset. Bringing the M4 chipset to the new Pro lineup implies that Apple is skipping the M3 chipset entirely for the Pro variants.

In addition, a new neural engine in the M4 chipset is expected to unlock new AI capabilities, and the tablet could be positioned as the first truly AI-powered device. The news comes just days after another Gurman report revealed that Apple was once again in talks with OpenAI to bring generative AI capabilities to the iPhone.

Apple’s iPad Pro Plans:

In addition to the newest M4 chipset, Apple is anticipated to bring an OLED panel to the iPad Pro lineup for the first time. The Cupertino, California-based company is expected to release the iPad Pro in two sizes: 13.1-inch and 11-inch.

According to earlier reports, the switch from LCD to OLED panels could shrink the bezels by 10% to 15% compared with previous-generation iPad Pro models. Furthermore, the next iPad Pro models are expected to be thinner by 0.9 mm and 1.5 mm, respectively.

The Schedule for Apple’s Let Loose Event:

According to Gurman, Apple will probably introduce the new iPad Pro, iPad Air, Magic Keyboard, and Apple Pencil at the Let Loose event on May 7. Though Apple is planning small hands-on events for select media members in the US, UK, and Asia, the event isn’t expected to be a big in-person affair like WWDC or an iPhone launch; instead, it is expected to be an online program.



Google Introduces AI Model for Precise Weather Forecasting


Google (NASDAQ: GOOGL) is taking a bigger step into artificial intelligence (AI) with the confirmed release of an AI-based weather forecasting model that can anticipate subtle changes in the weather.

Known as the Scalable Ensemble Envelope Diffusion Sampler (SEEDS), Google’s artificial intelligence (AI) model is remarkably similar to other diffusion models and popular large language models (LLMs).

In a paper published in Science Advances, it is stated that SEEDS is capable of producing ensembles of weather forecasts at a scale that surpasses that of conventional forecasting systems. The artificial intelligence system uses probabilistic diffusion models, which are similar to image and video generators like Midjourney and Stable Diffusion.

The announcement said, “We present SEEDS, [a] new AI technology to accelerate and improve weather forecasts using diffusion models.” “Using SEEDS, the computational cost of creating ensemble forecasts and improving the characterization of uncommon or extreme weather events can be significantly reduced.”

What sets SEEDS apart is Google’s cutting-edge denoising diffusion probabilistic models, which enable it to produce accurate weather forecasts. According to the research paper, SEEDS can generate a large pool of predictions from just one forecast made by a reliable numerical weather prediction system.

When compared to weather prediction systems based on physics, SEEDS predictions show better results based on metrics such as root-mean-square error (RMSE), rank histogram, and continuous ranked probability score (CRPS).
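The continuous ranked probability score mentioned above has a simple closed form for a finite ensemble. Below is a minimal, self-contained sketch of the standard pairwise CRPS estimator; it is illustrative only, not SEEDS’s actual evaluation code.

```python
def ensemble_crps(members, observation):
    """CRPS of an ensemble forecast against a scalar observation.

    Standard pairwise estimator:
        CRPS = mean_i |x_i - y| - 0.5 * mean_{i,j} |x_i - x_j|
    Lower is better; with a single member it reduces to absolute error.
    """
    m = len(members)
    skill = sum(abs(x - observation) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return skill - spread

# A sharp, well-centered ensemble scores lower (better) than a biased,
# overly diffuse one for the same observed value.
obs = 20.0
good = ensemble_crps([19.5, 20.0, 20.5], obs)   # tight around the truth
bad = ensemble_crps([15.0, 25.0, 30.0], obs)    # wide and biased high
assert good < bad
```

The spread term rewards ensembles that honestly represent their own uncertainty, which is why CRPS is a common yardstick for probabilistic forecasts alongside RMSE and rank histograms.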

In addition to producing better results, the report characterizes the model’s computational cost as “negligible” compared with that of traditional models. According to Google Research, SEEDS offers the benefits of scalability while covering extreme events like heat waves better than its competitors.

The report stated, “Specifically, by providing samples of weather states exceeding a given threshold for any user-defined diagnostic, our highly scalable generative approach enables the creation of very large ensembles that can characterize very rare events.”

Using Technology to Protect the Environment

Since artificial intelligence (AI) became widely available, many environmentalists have turned to it to further their efforts to protect the environment. Researchers at Johns Hopkins and the National Oceanic and Atmospheric Administration (NOAA) are using AI models to forecast weather patterns in an effort to mitigate the effects of pollution.

India is traveling down the same route, with its meteorological department eager to use cutting-edge technologies to forecast weather events like flash floods and droughts. Meanwhile, Australia-based nonprofit ClimateForce, in collaboration with NTT Group, says it will employ AI to protect the Daintree rainforest’s ecological equilibrium.



Apple May Introduce AI Hardware for the First Time with the New iPad Pro


With the release of the new iPad Pro, Apple is poised to accelerate its transition towards artificial intelligence (AI) hardware. With the intention of releasing the M4 chip later this year, the company is expediting its upgrades to computer processors. With its new neural engine, this chip should enable more sophisticated AI capabilities.

According to Mark Gurman of Bloomberg, the M4 chip will not only be found in Mac computers but will also be included in the upcoming iPad Pro. It appears that Apple is responding to the recent AI boom in the tech industry by positioning the iPad Pro as its first truly AI-powered device.

Apple will unveil the new iPad Pro ahead of its June Worldwide Developers Conference, freeing that event up for revealing its AI chip strategy. The M4 chip and the new iPad Pros are also expected to take advantage of the AI apps and services coming in iPadOS 18, which is anticipated later this year.

The next Let Loose event is scheduled for May 7 at 7:30 pm IST. The event will be live-streamed on Apple.com and the Apple TV app.

AI is also expected to play a major role in Apple’s A18 chip design for the iPhone 16. It is worth acknowledging that these recent products are not designed solely with artificial intelligence in mind, and the framing may partly be a marketing tactic. According to reports, more sophisticated hardware is on the way: Apple has reportedly developed a home robot as well as a tabletop device whose iPad-like display is moved by a robotic arm.
