
Technology

There will soon be a “substantially better” and bigger addition to one of the biggest AI training databases in the world


Huge corpora of AI training data have been referred to as “the backbone of large language models.” But in 2023, amid a growing outcry over the ethical and legal implications of the datasets that trained the most well-known LLMs, such as OpenAI’s GPT-4 and Meta’s Llama, EleutherAI—the research group behind the Pile, an 825GB open-source corpus of diverse text and one of the largest such datasets in the world—became a target.

One of the numerous lawsuits centered on generative AI last year involved EleutherAI, a grassroots nonprofit research group that started in 2020 as a loosely organized Discord collective aiming to understand how OpenAI’s new GPT-3 worked. In October, former Arkansas governor Mike Huckabee and other authors filed a lawsuit claiming that their books were taken without permission and added to Books3, a contentious dataset that was part of the Pile project and featured over 180,000 titles.

EleutherAI, however, is far from finished with its dataset work: it is now updating the Pile in partnership with other institutions, including the University of Toronto and the Allen Institute for AI, as well as independent researchers. The new Pile dataset won’t be finalized for a few months, according to Stella Biderman, executive director of EleutherAI and a lead scientist and mathematician at Booz Allen Hamilton, and Aviya Skowron, EleutherAI’s head of policy and ethics, in a joint interview with VentureBeat.

It is anticipated that the new Pile will be “substantially better” and larger

According to Biderman, the upcoming LLM training dataset is anticipated to surpass the previous one in size and quality, making it “substantially better.”

“There’s going to be a lot of new data,” said Biderman. Some, she said, will be data that has not been seen anywhere before and “that we’re working on kind of excavating, which is going to be really exciting.”

Compared with the original dataset, which was released in December 2020 and used to develop language models such as the Pythia suite and Stability AI’s Stable LM suite, the Pile v2 will include more recent data. It will also have improved preprocessing. “We had never trained an LLM before we made the Pile,” Biderman explained. “We’ve now trained nearly a dozen, and we know a lot more about how to clean data so that LLMs can use it.”

Better-quality and more varied data will also be included in the new dataset. She stated, “For example, we’re going to have a lot more books than the original Pile had and a more diverse representation of non-academic non-fiction domains.”

In addition to Books3, the original Pile comprises 22 sub-datasets, including Wikipedia, YouTube subtitles, PubMed Central, ArXiv, Stack Exchange, and, oddly enough, Enron emails. Biderman noted that the Pile remains the LLM training dataset with the most thorough documentation from its creators. The original goal was to build a massive new dataset of billions of text passages, comparable in size to the one OpenAI used to train GPT-3.

When the Pile was first made public, it was a one-of-a-kind AI training dataset

“Back in 2020, the Pile was a very important thing, because there wasn’t anything quite like it,” Biderman remarked. She explained that at the time, Google was training a range of language models on a single large publicly available text corpus, C4.

“But C4 is not nearly as big as the Pile is and it’s also a lot less diverse,” she said. “It’s a really high-quality Common Crawl scrape.”

By contrast, EleutherAI aimed to be more selective, defining the kinds of data and topics it wanted the model to be knowledgeable about.

“That was not really something anyone had ever done before,” she explained. “75%-plus of the Pile was chosen from specific topics or domains, where we wanted the model to know things about it — let’s give it as much meaningful information as we can about the world, about things we care about.”

EleutherAI’s “general position is that model training is fair use” for copyrighted material, according to Skowron. However, they noted that no large language model currently on the market has avoided training on copyrighted data, and that one of the Pile v2 project’s objectives is to address some of the problems around copyright and data licensing.

To reflect that work, they detailed how the new Pile dataset is being assembled. It will include: public-domain text that was never within the scope of copyright in the first place, such as government documents and legal filings (for example, Supreme Court opinions); text licensed under Creative Commons; code under open-source licenses; text whose licenses explicitly permit redistribution and reuse (some open-access scientific articles fall into this category); and smaller datasets for which researchers have the express permission of the rights holders.
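The inclusion rules above amount to a license-based filter over candidate documents. As a rough illustration only, here is a minimal sketch of that kind of filter; the license labels, `Document` fields, and category names are invented for this example and are not Pile v2’s actual schema or pipeline.

```python
# Illustrative sketch: keep only documents whose license falls into one of the
# permitted categories described in the article. All labels are hypothetical.
from dataclasses import dataclass

PERMITTED_LICENSES = {
    "public-domain",              # government documents, legal filings, etc.
    "cc-by",                      # Creative Commons licensed text
    "mit", "apache-2.0",          # code under open-source licenses
    "open-access",                # articles whose license permits redistribution
    "rights-holder-permission",   # smaller datasets with express permission
}

@dataclass
class Document:
    text: str
    license: str

def filter_by_license(docs):
    """Return only the documents whose license category permits inclusion."""
    return [d for d in docs if d.license in PERMITTED_LICENSES]

docs = [
    Document("A Supreme Court opinion...", "public-domain"),
    Document("A scraped novel...", "all-rights-reserved"),
    Document("An open-access paper...", "open-access"),
]
kept = filter_by_license(docs)
print([d.license for d in kept])  # ['public-domain', 'open-access']
```

In practice the hard part is the metadata, not the filter: determining which license actually governs a scraped document is exactly the problem the Pile v2 team describes working through.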

Following ChatGPT, criticism of AI training datasets gained traction

The influence of AI training datasets has long been a source of concern. A 2018 study co-authored by AI researchers Joy Buolamwini and Timnit Gebru, for instance, revealed that large image collections contributed to racial bias in AI systems. Legal disputes over big image training datasets began to develop around mid-2022, not long after the public realized that well-known text-to-image generators like Midjourney and Stable Diffusion had been trained on enormous image datasets scraped largely from the internet.

But after OpenAI’s ChatGPT was released in November 2022, criticism of the datasets used to train LLMs and image generators has increased significantly, especially with regard to copyright issues. Following a wave of lawsuits centered on generative AI from authors, publishers, and artists, the New York Times filed a lawsuit against Microsoft and OpenAI last month. Many people think this case will eventually make its way to the Supreme Court.

However, there have also been graver and more unsettling allegations lately: the LAION-5B image dataset was taken down last month after thousands of images of child sexual abuse were discovered in it, and the large image corpora that trained text-to-image models have made deepfake revenge porn easy to produce.

The discussion surrounding AI training data is quite intricate and subtle

According to Biderman and Skowron, the discussion surrounding AI training data is significantly more intricate and multifaceted than the media and AI critics portray it to be.

For example, Biderman said that the methods used by the people who flagged the LAION content are not legally available to the LAION organization itself, which makes properly removing the images difficult. And the resources to pre-screen datasets for this kind of imagery may simply not exist.

“There seems to be a very big disconnect between the way organizations try to fight this content and what would make their resources useful to people who wanted to screen data sets,” she said.

When it comes to other concerns, such as the impact on creative workers whose work was used to train AI models, “a lot of them are upset and hurt,” said Biderman. “I totally understand where they’re coming from, that perspective.” But she pointed out that some creatives uploaded work to the internet under permissive licenses without knowing that, years later, AI training datasets—including Common Crawl—could use the work under those licenses.

“I think a lot of people in the 2010s, if they had a magic eight ball, would have made different licensing decisions,” she said.

However, EleutherAI lacked a magic eight ball too. Biderman and Skowron agree that when the Pile was created, AI training datasets were mostly used for research, where licensing and copyright exemptions are fairly broad.

“AI technologies have very recently made a jump from something that would be primarily considered a research product and a scientific artifact to something whose primary purpose was for fabrication,” Biderman said. Google had put some of these models into commercial use on the back end in the past, she explained, but training on “very large, mostly web-scraped datasets, this became a question very recently.”

To be fair, Skowron pointed out, legal experts such as Ben Sobel had been thinking about AI-related concerns and the legal question of “fair use” for many years. But even those at OpenAI, “who you’d think would be in the know about the product pipeline,” did not grasp the public, commercial implications of the ChatGPT that was coming down the pike, they added.

Open datasets are safer to utilize, according to EleutherAI

While it may seem contradictory to some, Biderman and Skowron also claim that AI models trained on open datasets like the Pile are safer to use, because visibility into the data is what permits the resulting AI models to be securely and ethically used in a range of scenarios.

“There needs to be much more visibility in order to achieve many policy objectives or ethical ideals that people want,” said Skowron, including thorough documentation of the training at the very minimum. “And for many research questions you need actual access to the data sets, including those that are very much of interest to copyright holders, such as memorization.”

For the time being, Biderman, Skowron, and EleutherAI colleagues keep working on the Pile’s update.

“It’s been a work in progress for about a year and a half and it’s been a meaningful work in progress for about two months — I am optimistic that we will train and release models this year,” said Biderman. “I’m curious to see how big a difference this makes. If I had to guess…it will make a small but meaningful one.”


Kudos Secures $10.2 Million for Its AI-Powered Smart Wallet


The Four Cities Fund, Samsung Next, SV Angel, Precursor Ventures, The Mini Fund, Newtype Ventures, Patron, and The Points Guy creator Brian Kelly all participated in the funding round.

Kudos, an app and browser extension, was founded in 2021 by a team with prior experience at Google, PayPal, and Affirm. It functions as a smart-wallet assistant, suggesting or selecting the best credit card for customers to use when making payments in order to maximize rewards and cash back.
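The core decision such a smart-wallet assistant makes at checkout is simple to state: among a user’s cards, pick the one whose reward rate is highest for the purchase at hand. Below is a minimal sketch of that logic; the card names, categories, and percentage rates are invented for illustration and are not Kudos’s actual data or algorithm.

```python
# Hypothetical sketch of reward-optimizing card selection at checkout.
def best_card(cards, category, amount):
    """Return (card_name, reward_earned) for the card maximizing rewards."""
    def reward(card):
        # Fall back to the card's default rate when the category has no bonus.
        pct = card["rates"].get(category, card["rates"]["default"])
        return amount * pct / 100
    top = max(cards, key=reward)
    return top["name"], reward(top)

cards = [
    {"name": "Card A", "rates": {"dining": 3, "default": 1}},
    {"name": "Card B", "rates": {"groceries": 5, "default": 1}},
]
print(best_card(cards, "dining", 100))     # ('Card A', 3.0)
print(best_card(cards, "groceries", 100))  # ('Card B', 5.0)
```

A real product layers on much more, such as rotating-category schedules, point valuations, and sign-up-bonus tracking, but the maximize-over-cards step is the heart of the recommendation.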

Recently, the company introduced a number of new features: Dream Wallet, which suggests cards to members based on their spending patterns; MariaGPT, an AI-powered card discovery tool with over 3000 cards in its database; and Kudos Boost, which offers personalized rewards across over 15,000 partner brands, such as Walmart and Sephora.

Since its initial fundraising round, Kudos has raised its annualized checkout Gross Merchandise Value to $200 million and expanded to over 200,000 registered users.

It intends to use the additional funds to develop MariaGPT into a comprehensive personal finance assistant, introduce an AI-powered hub offering expenditure optimization insights, and create a gateway that lets users book flights using points.

As consumers juggle budgets, multiple credit cards, and sometimes complex rewards programs, they want to know they’re receiving the best value for their money, according to Tikue Anazodo, CEO of Kudos. “With just one user-friendly app and extension, Kudos streamlines everything.”



Clarivate Unveils AI-Powered Solution for Faster Trademark Monitoring


At the 2024 International Trademark Association Annual Meeting, Clarivate, a premier worldwide source of transformational intelligence, unveiled the first publicly available edition of Trademark Watch Analyzer. The solution combines in-house IP expertise, state-of-the-art AI technology, and Clarivate’s global trademark and case law data to deliver the next generation of trademark protection, enhanced by artificial intelligence (AI) and cloud technology. By automating crucial trademark-monitoring operations and intelligently curating result sets, it aims to deliver quicker and more accurate answers to important business questions.

With Trademark Watch Analyzer, users can access data from over 7 million trademark litigation cases, along with trademark datasets from 191 official trademark registers covering 258 countries and territories. This content is harmonized and connected through AI algorithms that query, link, and mine both databases to provide enhanced insights across supported watch products. By allowing clients to rank findings according to their likelihood of success or opposition, this fundamentally changes the way trademark watch results are presented.

According to Gordon Samson, President of Intellectual Property at Clarivate, “Trademark professionals face challenges including more data, less context, and shorter deadlines as the global business landscape grows more complex.” “Our cutting-edge AI-driven technology saves time, money, and vital resources while empowering clients to confidently monitor their trademarks anywhere in the world with automated alerts and global monitoring. The Trademark Watch Analyzer’s release is the most recent manifestation of our Think forward™ pledge, which links customers to reliable information to guarantee an IP-powered future.”

Clients will be able to work with their results much more easily thanks to Trademark Watch Analyzer’s more user-friendly design and interface. Offering a more unified user experience across the Clarivate product suite, the navigation is built on the same architecture as the Brand Landscape Analyzer, which arrived in 2023.



AI Product Launch and Student Training Planned by Tech Company


Vwakpor Efuetanu, chairman of Elite Global Intelligence Technologies, has announced the release of the company’s AI product, “Elite Global AI.”

The AI product, scheduled for a full launch in June, aims to provide young people with the knowledge and skills they need to prosper in an AI-driven society.

A statement by Efuetanu noted that by fostering accessibility, relatability, and financial empowerment, the brand aims to create meaningful opportunities for individuals to succeed in the evolving landscape of technology.

He stated that the company is ready to invest in artificial intelligence (AI), and that the product will give young people and professionals access to tools that improve their performance and productivity in business, education, and the workplace through self-development programs, AI, and vocational and skills training.

It is intended to completely transform Africa’s innovation and opportunity environments. Elite Global AI’s latest offering, which is dedicated to democratizing AI access and promoting socioeconomic development, is a significant step towards elevating Africa to the forefront of global AI innovation.

Using artificial intelligence to its full potential, this ground-breaking product has the potential to unlock young people’s unrealized potential for inclusive development and economic prosperity while also tackling some of the continent’s most urgent issues, like unemployment and a shortage of skilled labor. Innovative solutions for career development, job skill acquisition, workplace assistant tools, and productivity enhancement are all provided by Elite Global AI’s offering.

“By making AI technology more widely accessible, it will enable African communities to harness the potential of innovation and effect positive change.” He stated that the organization is providing people with the knowledge and skills necessary to prosper in an AI-driven society through training programs, workshops, and educational initiatives.

Speaking about the launch, Elite Global AI’s Brand and Strategy Director, Emmanuel Agida, claimed that the product’s cutting-edge capabilities, moral foundation, and cooperative spirit may change people’s lives, strengthen communities, and open up new avenues for advancement and prosperity.

“We are very thrilled to launch our latest artificial intelligence product, marking a noteworthy achievement in our endeavor to promote socioeconomic development throughout Africa and democratize access to AI technology.”

“Over 10,000 young children in Nigeria are to receive AI literacy training from the group in the upcoming weeks and months. The company has already trained over 1,000 students, and in order to realize this ambition, it will collaborate with educational institutions, governmental bodies, and other significant stakeholders,” he continued.

