

There will soon be a “substantially better” and bigger addition to one of the biggest AI training databases in the world




Huge corpora of AI training data have been referred to as “the backbone of large language models.” But in 2023, amid a growing outcry over the ethical and legal implications of the datasets that trained the most well-known LLMs, such as OpenAI’s GPT-4 and Meta’s Llama, EleutherAI, the group that produced one of the largest of these datasets in the world, became a target. That dataset, the Pile, is an 825 GB open-source corpus of diverse text.

One of the numerous cases centered on generative AI last year involved EleutherAI, a grassroots nonprofit research group that started as a loosely organized Discord collective in 2020 and set out to understand how OpenAI’s new GPT-3 worked. In October, former Arkansas governor Mike Huckabee and other authors filed a lawsuit claiming that their books were taken without permission and added to Books3, a contentious dataset that was part of the Pile project and featured over 180,000 titles.

However, EleutherAI is far from finished with its dataset work. It is now updating the Pile in partnership with other institutions, such as the University of Toronto and the Allen Institute for AI, as well as independent researchers. The new Pile dataset won’t be finalized for a few months, Stella Biderman, executive director of EleutherAI and lead scientist and mathematician at Booz Allen Hamilton, and Aviya Skowron, head of policy and ethics at EleutherAI, said in a joint interview with VentureBeat.

It is anticipated that the new Pile will be “substantially better” and larger

According to Biderman, the upcoming LLM training dataset is anticipated to surpass the previous one in size and quality, making it “substantially better.”

“There’s going to be a lot of new data,” said Biderman. Some, she said, will be data that has not been seen anywhere before and “that we’re working on kind of excavating, which is going to be really exciting.”

The Pile v2 will include more recent data than the original dataset, which was released in December 2020 and used to develop language models such as the Pythia suite and Stability AI’s Stable LM suite. It will also have improved preprocessing. “We had never trained an LLM before we made the Pile,” Biderman explained. “We’ve trained nearly a dozen now, and we know a lot more about how to clean data so that LLMs can use it.”

Better-quality and more varied data will also be included in the new dataset. She stated, “For example, we’re going to have a lot more books than the original Pile had and a more diverse representation of non-academic non-fiction domains.”

The original Pile consists of 22 sub-datasets, including Books3, Wikipedia, YouTube subtitles, PubMed Central, Arxiv, Stack Exchange, and, oddly enough, Enron emails. Biderman noted that the Pile remains the LLM training dataset best documented by its creators. The goal in creating the Pile was to build a massive new dataset with billions of text passages, comparable in size to the one OpenAI used to train GPT-3.

When the Pile was first made public, it was a unique AI training dataset

“Back in 2020, the Pile was a very important thing, because there wasn’t anything quite like it,” Biderman remarked. She clarified that at the time, Google was using one publicly accessible huge text corpus, C4, to train a range of language models.

“But C4 is not nearly as big as the Pile is and it’s also a lot less diverse,” she said. “It’s a really high-quality Common Crawl scrape.”

Rather, EleutherAI aimed to be more discriminating, defining the kinds of data and subjects that it intended the model to be knowledgeable about.

“That was not really something anyone had ever done before,” she explained. “75%-plus of the Pile was chosen from specific topics or domains, where we wanted the model to know things about it — let’s give it as much meaningful information as we can about the world, about things we care about.”

EleutherAI’s “general position is that model training is fair use” for copyrighted material, according to Skowron. However, they noted that “no large language model on the market is currently not trained on copyrighted data,” and that one of the Pile v2 project’s objectives is to try and overcome some of the problems with copyright and data licensing.

To reflect that work, they detailed how the new Pile dataset is being put together. It will include public domain text, including text that was never within the scope of copyright in the first place, such as government documents or legal filings (for example, opinions from the Supreme Court); text licensed under Creative Commons; code under open source licenses; text with licenses that explicitly permit redistribution and reuse (some open access scientific articles fall into this category); and smaller datasets for which researchers have the express permission of the rights holders.

Following ChatGPT, criticism of AI training datasets gained traction.

The influence of AI training datasets has long been a source of concern. For instance, a 2018 study co-authored by AI researchers Joy Buolamwini and Timnit Gebru revealed that large image collections led to racial bias in AI systems. Legal disputes over big image training datasets started to develop around mid-2022, not long after the public realized that well-known text-to-image generators like Midjourney and Stable Diffusion were trained on enormous image datasets that were primarily scraped from the internet.

But since OpenAI’s ChatGPT was released in November 2022, criticism of the datasets used to train LLMs and image generators has increased significantly, especially with regard to copyright issues. Following a wave of lawsuits centered on generative AI from authors, publishers, and artists, the New York Times filed a lawsuit against Microsoft and OpenAI last month. Many people think this case will eventually make its way to the Supreme Court.

However, there have also been more grave and unsettling allegations lately. These include the discovery of thousands of images of child sexual abuse in the LAION-5B image dataset, which led to its removal last month, and the ease with which deepfake revenge porn can be produced thanks to the large image corpora that trained text-to-image models.

The discussion surrounding AI training data is quite intricate and subtle

According to Biderman and Skowron, the discussion surrounding AI training data is significantly more intricate and multifaceted than what the media and opponents of AI portray it as.

For example, Biderman stated that it is difficult to properly remove the images, since the approach employed by the individuals who flagged the LAION content is not legally accessible to the LAION organization. Furthermore, there may not be enough resources available to prescreen datasets for this type of imagery.

“There seems to be a very big disconnect between the way organizations try to fight this content and what would make their resources useful to people who wanted to screen data sets,” she said.

When it comes to other concerns, such as the impact on creative workers whose work was used to train AI models, “a lot of them are upset and hurt,” said Biderman. “I totally understand where they’re coming from that perspective.” But she pointed out that some creatives uploaded work to the internet under permissive licenses without knowing that, years later, AI training datasets, including Common Crawl, could use the work under those licenses.

“I think a lot of people in the 2010s, if they had a magic eight ball, would have made different licensing decisions,” she said.

However, EleutherAI lacked a magic eight ball as well. Biderman and Skowron concur that at the time the Pile was established, AI training datasets were mostly utilized for research, where licensing and copyright exemptions are somewhat extensive.

“AI technologies have very recently made a jump from something that would be primarily considered a research product and a scientific artifact to something whose primary purpose was for fabrication,” Biderman said. Google had put some of these models into commercial use in the back end in the past, she explained, but training on “very large, mostly web script data sets, this became a question very recently.”

To be fair, Skowron pointed out, legal experts such as Ben Sobel have been considering AI-related concerns, as well as the legal question of “fair use,” for many years. But even those at OpenAI, “who you’d think would be in the know about the product pipeline,” did not comprehend the public, commercial implications of ChatGPT before its release, they continued.

Open datasets are safer to utilize, according to EleutherAI

While it may seem contradictory to some, Biderman and Skowron also claim that AI models trained on open datasets like the Pile are safer to use, because visibility into the data is what permits the resulting AI models to be securely and ethically used in a range of scenarios.

“There needs to be much more visibility in order to achieve many policy objectives or ethical ideals that people want,” said Skowron, including thorough documentation of the training at the very minimum. “And for many research questions you need actual access to the data sets, including those that are very much of interest to copyright holders, such as memorization.”

For the time being, Biderman, Skowron, and EleutherAI colleagues keep working on the Pile’s update.

“It’s been a work in progress for about a year and a half and it’s been a meaningful work in progress for about two months — I am optimistic that we will train and release models this year,” said Biderman. “I’m curious to see how big a difference this makes. If I had to guess…it will make a small but meaningful one.”


Biden, Kishida Secure Support from Amazon and Nvidia for $50 Million Joint AI Research Program



As the two countries seek to enhance cooperation around the rapidly advancing technology, President Joe Biden and Japanese Prime Minister Fumio Kishida have enlisted Amazon.com Inc. and Nvidia Corp. to fund a new joint artificial intelligence research program.

A senior US official briefed reporters prior to Wednesday’s official visit at the White House, stating that the $50 million project will be a collaborative effort between Tsukuba University outside of Tokyo and the University of Washington in Seattle. A separate collaborative AI research program between Carnegie Mellon University in Pittsburgh and Tokyo’s Keio University is also being planned by the two nations.

The push for greater research into artificial intelligence comes as the Biden administration is weighing a series of new regulations designed to minimize the risks of AI technology, which has emerged as a key focus for tech companies. The White House announced late last month that federal agencies have until the end of the year to determine how they will assess, test, and monitor the impact of government use of AI technology.

In addition to the university-led projects, Microsoft Corp. announced on Tuesday that it would invest $2.9 billion to expand its cloud computing and artificial intelligence infrastructure in Japan. Brad Smith, the president of Microsoft, met with Kishida on Tuesday. The company released a statement announcing its intention to establish a new AI and robotics lab in Japan.

Kishida, whose country is the second-largest economy in Asia, urged American business executives on Tuesday to invest more in Japan’s developing technologies.

“Your investments will enable Japan’s economic growth — which will also be capital for more investments from Japan to the US,” Kishida said at a roundtable with business leaders in Washington.



OnePlus and OPPO Collaborate with Google to Introduce Gemini Models for Enhanced Smartphone AI



As anticipated, original equipment manufacturers (OEMs) are heavily integrating AI into their products. Google is working with OnePlus, OPPO, and other companies to integrate Gemini models into their smartphones. OnePlus and OPPO intend to introduce the Gemini models on their smartphones later in 2024, becoming the first OEMs to do so, as announced at the Google Cloud Next 24 event. Gemini models are designed to provide users with an enhanced artificial intelligence (AI) experience on their devices.

Customers in China can now create AI content on-the-go with devices like the OnePlus 12 and OPPO Find X7 thanks to OnePlus and OPPO’s Generative AI models.

The AI Eraser tool was recently made available to all OnePlus customers worldwide. This AI-powered tool lets users remove unwanted objects from their photos. For OnePlus and OPPO, AI Eraser is only the beginning.

In the future, the businesses hope to add more AI-powered features like creating original social media content and summarizing news stories and audio.

AndesGPT LLM from OnePlus and OPPO powers AI Eraser. Even though the Samsung Galaxy S24 and Google Pixel 8 series already have this feature, it is still encouraging to see OnePlus and OPPO taking the initiative to include AI capabilities in their products.

With the release of the Gemini models, OnePlus and OPPO devices will be able to provide customers with a more comprehensive and sophisticated AI experience. It is worth remembering that OnePlus and OPPO already use AI in the Trinity Engine, which keeps phones running smoothly, and apply AI and computational mathematics to enhance mobile photography.

More original equipment manufacturers are expected to bring AI capabilities to their products in 2024. This will probably help Google, because OEMs will use Gemini as the foundation upon which to build their features.



Meta Explores AI-Enabled Search Bar on Instagram



Meta is pushing ahead with its attempt to expand the user base for its generative AI-powered products. In addition to testing the chatbot Meta AI with WhatsApp users in countries like India, the company is experimenting with putting Meta AI into the Instagram search bar, both for chatting with the AI and for content discovery.

When you type a query into the search bar, Meta AI initiates a direct message (DM) exchange in which you can ask questions or respond to pre-programmed prompts. Aravind Srinivas, CEO of Perplexity AI, pointed out that the prompt screen’s design is similar to the startup’s search screen.

Plus, it might make it easier for you to find fresh Instagram content. As demonstrated in a user-posted video on Threads, you can search for Reels related to a particular topic by tapping on a prompt such as “Beautiful Maui sunset Reels.”

Additionally, TechCrunch spoke with a few users who had the ability to instruct Meta AI to look for recommendations for Reels.

By using generative AI to surface new content from networks like Instagram, Meta hopes to go beyond text generation.

Meta confirmed its Instagram AI experiment to TechCrunch. But the company didn’t say whether or not it uses generative AI technology for search.

A Meta representative told TechCrunch, “We’re testing a range of our generative AI-powered experiences publicly in a limited capacity. They are under development in varying phases.”

There are plenty of posts online discussing Instagram’s search quality. It is therefore not surprising that Meta would want to enhance search through the use of generative AI.

Furthermore, Meta wants Instagram content to be easier to discover than TikTok’s. Last year, Google unveiled a new Perspectives feature to display results from Reddit and TikTok. Instagram is developing a feature called “Visibility off Instagram” that could allow posts to appear in search engine results, according to reverse engineer Alessandro Paluzzi, who posted the finding on X earlier this week.


