On Monday, Gizmodo found that the search giant has updated its privacy policy to disclose that various AI services, such as Bard and Cloud AI, can be trained on public data the company pulls from the web.
“Our privacy policy has long been transparent that Google uses publicly available information from the open web to train language models for services like Google Translate,” Google spokesperson Christa Muldoon told The Verge. “This latest update clarifies that newer services like Bard are also included. We incorporate privacy principles and safeguards into the development of our AI technologies, in line with our AI Principles.”
As of its July 1st, 2023 update, Google’s privacy policy now states that “Google uses the information to improve our services and to develop new products, features and technologies that benefit our users and the public” and that the company may “use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”
The policy’s change history shows that the update mostly adds clarity about which services will be trained on the collected data. For example, the document now says that information may be used for “AI models” rather than “language models,” which gives Google latitude to train other kinds of AI systems, beyond LLMs, on your public data. Even then, the note is tucked behind an embedded link for “publicly accessible sources” in the policy, which you have to click to expand the relevant section.
The updated policy specifies that “publicly available information” is used to train Google’s AI products, but it doesn’t say how (or if) the company will prevent copyrighted materials from being included in that data pool. Many publicly accessible websites have policies that prohibit data collection or web scraping for the purpose of training large language models and other AI tools. It will be interesting to see how this approach squares with global regulations like the GDPR, which protect people against their data being used without consent.
The combination of these laws and growing competition in the market is forcing the makers of popular generative AI systems such as OpenAI’s GPT-4 to be more careful about where they source their training data and whether it includes social media posts or the copyrighted works of human artists and authors.
Whether the doctrine of fair use applies to this kind of use currently sits in a legal gray area. The uncertainty has sparked various lawsuits and pushed lawmakers in some countries to introduce stricter laws better equipped to regulate how AI companies collect and use their training data. It also raises questions about how that data is processed to ensure it doesn’t contribute to dangerous failures within AI systems, with the people tasked with sorting through these huge piles of training data often subjected to long hours and harsh working conditions.
Gannett, the largest newspaper publisher in the United States, is suing Google and its parent company, Alphabet, claiming that advances in AI technology have helped the search giant hold a monopoly over the digital advertising market. Products like Google’s AI-powered search beta have also been called “plagiarism engines” and criticized for starving websites of traffic.
Meanwhile, Twitter and Reddit — two social platforms that hold vast amounts of public information — have recently taken drastic measures to try to prevent other companies from freely harvesting their data. Both platforms’ API changes and rate limits were met with backlash from their communities, as the anti-scraping measures degraded the core user experience on Twitter and Reddit alike.