The data used to train an AI model are critical to the success of the model. If data sources are limited or data quality is poor, the model will be negatively impacted. We have already seen reports of how poor quality data produces nonsense results (e.g., Google search results suggesting that humans should consume one small rock each day to acquire vitamins and minerals). AI software is designed, built, trained, and tested by humans. It will never be free of bias, mistakes, or errors. If we intend to use AI as a tool rather than a toy, we must verify the reliability and validity of the results. We must consider how results are generated. We must consider the data underlying the results. We must have criteria for accepting or rejecting results. We must move beyond the hype of AI and remember that it is simply software and not in any way magical.
Lawrence Rich’s Post
More Relevant Posts
-
🎯 Are We Running Out of Fuel for AI Innovation? 🚨 Imagine this: The data that's powering the AI revolution is drying up. That's right. A recent study reported by The New York Times finds that data for AI training is disappearing at an alarming rate. 🤔 Confused? Let me break it down for you. Every time you interact with AI, whether it's a chatbot or a recommendation engine, it's like fueling a rocket with new data. But now, data owners are pulling back, wary of their information being used without benefiting them. Here's what's at stake: 📉 Limited growth and innovation in AI technologies. ⚠️ Reduced quality and effectiveness of generative AI systems. 🤝 Growing tensions between AI developers and data owners. 🚀 The takeaway? The future of AI depends on how we navigate this data dilemma. 💡 Want more insights? Check out the full article here: [The New York Times](https://lnkd.in/gudgEQ3y) 👇 What are your thoughts on the data drought in AI? Comment below! 🌟 --- Want to boost your business with tailored AI solutions that make the most of available data? Reach out to us at A-Tech AI. Let's transform your operations together. 🌐 [www.atechds.com] 📞 07 4410 9526
Data for A.I. Training Is Disappearing Fast, Study Shows - The New York Times
https://www.nytimes.com
-
HAS AI PEAKED? My opinion: going out of fashion “New research from the Data Provenance Initiative has found a dramatic drop in content made available to the collections used to build artificial intelligence.” “The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.” “For years, A.I. developers were able to gather data fairly easily. But the generative A.I. boom of the past few years has led to tensions with the owners of that data — many of whom have misgivings about being used as A.I. training fodder, or at least want to be paid for it.” “… there’s also a lesson here for big A.I. companies, who have treated the internet as an all-you-can-eat data buffet for years, without giving the owners of that data much of value in return.” #ai #artificialintelligence #dataprovenance #training https://lnkd.in/ev2-cNad
The Data That Powers A.I. Is Disappearing Fast
https://www.nytimes.com
-
"The Data That Powers A.I. Is Disappearing Fast" This morning, I came across a very interesting read in the New York Times. The article highlights a significant shift in the landscape of AI model training. Over the past year, many of the most important web sources have restricted the use of their data, as revealed by a study from the Data Provenance Initiative, an MIT-led research group. The study examined 14,000 web domains included in three commonly used AI training datasets (C4, RefinedWeb, and Dolma) and discovered an "emerging crisis in consent." Publishers and online platforms are increasingly taking steps to prevent their data from being harvested. The researchers estimate that 5% of all data, and 25% of data from the highest-quality sources, has been restricted. These developments have profound implications. For new entrants, they raise questions about maintaining healthy competition. For AI developers, the cost of training models is likely to increase as accessible data dwindles. Moreover, this shift affects regulatory and research governance, necessitating a balance between innovation and ethical data use. This growing data scarcity poses significant challenges for AI systems, which rely heavily on high-quality data to function effectively. As publishers and platforms tighten control over their content, AI companies face difficulties in maintaining the data flow necessary for model training and improvement. Smaller AI startups and academic researchers are particularly affected, as they may not be able to afford to license data directly from publishers. As we navigate this evolving landscape, it is crucial to develop new tools and practices that respect data ownership while fostering AI advancements. The study underscores the importance of ethical considerations in data usage and the need for a sustainable approach to AI development.
https://lnkd.in/dszZvnA7 #AI #MachineLearning #DataPrivacy #EthicalAI #Research #Technology #Innovation #DataScience #Regulation #Competition
The Data That Powers A.I. Is Disappearing Fast
https://www.nytimes.com
-
The study by the Data Provenance Initiative found that many websites are now restricting data access, with 5% of all data and 25% of high-quality data sources being limited. This impacts not just AI companies but also researchers and others who rely on such data. Some publishers are blocking automated data collection or taking legal action to stop unauthorized use. This reflects the growing concerns about how data is being used for training AI models.
The Data That Powers A.I. Is Disappearing Fast
https://www.nytimes.com
-
This article in the NEWYORKTIMES.COM about data powering #aimodels caught my eye (after reading many articles about the #uselections). As the article states, the limitation of data for AI model training poses significant challenges for enterprises, primarily due to the direct impact on the quality and reliability of AI outputs. AI models rely heavily on vast and diverse datasets to learn and improve. The reduction of available high-quality data, especially from the most reliable sources, means that AI models may not receive the comprehensive training necessary to perform optimally. For enterprises, this can lead to less accurate predictions, insights, and automation capabilities, ultimately hindering decision-making processes and operational efficiency. Furthermore, without access to robust datasets, the development of advanced AI applications that require nuanced understanding and complex pattern recognition could be stifled, affecting innovation and competitiveness. My colleague Anthony C. at Protegrity is leading our research and development project to strengthen data protection for enterprises concerned with #datasecurity.
The Data That Powers A.I. Is Disappearing Fast
https://www.nytimes.com
-
Here’s another interesting one… a recent New York Times article by Kevin Roose highlights a growing challenge in AI development: the rapid disappearance of training data. Many websites are now restricting access to their content for AI training purposes. In fact, up to 25% of high-quality data sources have become inaccessible due to removal of consent for use by key websites and data providers. This 'consent crisis' could significantly impact AI development, especially for smaller companies and researchers. The trend reflects growing tensions between AI companies and content creators. Some major tech companies are striking deals with publishers, but this could further disadvantage smaller players. This situation raises critical questions about the future of AI development, data rights, and the balance between innovation and content protection. It’s unclear how this will play out in the long run, but it’s sure to complicate the development of new foundation models until we have a clear path forward. #AI #DataScience #TechEthics #AIDevelopment #DigitalRights #MachineLearning #TechInnovation #ContentCreation https://lnkd.in/euwn_Jvk
The Data That Powers A.I. Is Disappearing Fast
https://www.nytimes.com
-
Will big corps' Generative #ai reach a plateau and fragment into localized / specialized models? The following article from The New York Times points to one of the predicted trajectories for Generative AI technology: more and more of the data that can be used for AI training is moving out of reach. As a result, the uncredited use of data for AI training by the big tech corps, which enabled the development of ChatGPT, Gemini, etc., will soon slow down as data owners build their fortresses against the ever-data-hungry AI entities and create their own monetizable models. Companies like Figma are already providing opt-out options to give users more control over data usage for AI training (and yes, many argue that it should have been opt-out as default and opt-in as a choice). I believe this standard of data ownership and consent will soon become the norm for all SaaS products. What effects can we expect as less and less quality data is available for AI training? 📉 Degradation of model quality and accuracy (especially as AI increasingly learns from AI-generated content/data, creating a downward quality spiral) 🌿 A fragmented market landscape with numerous products built on a mix of authentic and scavenged data 👍 The much-feared massive replacement of human jobs by AI might not come as early as we anticipate now. Specialized AI products will fill the gaps, but cannot completely cover human roles 🎡 Scalable solutions for building localized, domain-specific, or brand-specific AI models are going to attract much more attention. #ai #data #privacy #innovation
The Data That Powers A.I. Is Disappearing Fast
https://www.nytimes.com
-
Worth a read if you are doing anything in AI. An article in The New York Times about how "widespread data restrictions may pose a threat to A.I. companies, which need a steady supply of high-quality data to keep their models fresh and up-to-date." In other words, companies are shutting down access to their information to crawlers that gather this data for AI purposes. Something to pay attention to. #ai #tech #data #developer #engineers #startup #ceo #techpr
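One way a site can signal this to crawlers is a robots.txt file. The post does not name a mechanism, so treat this as an illustrative assumption rather than what the article describes. Below is a minimal Python sketch using the standard library's urllib.robotparser; the robots.txt content, the crawler name "ExampleAIBot", and the URL are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher; not taken from any real site.
# It blocks one (made-up) AI crawler and allows every other user agent.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://publisher.example/articles/1"
    print("ExampleAIBot allowed:", allowed("ExampleAIBot", page))
    print("SomeBrowser allowed:", allowed("SomeBrowser", page))
```

Note that robots.txt is only a request to well-behaved crawlers; it does not technically prevent access.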
The Data That Powers A.I. Is Disappearing Fast
https://www.nytimes.com
-
For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models. Now, that data is drying up. Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group. The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. #ai #artificialintelligence #data https://lnkd.in/gghnbVgV
The Data That Powers A.I. Is Disappearing Fast
https://www.nytimes.com
-
Data challenges for leveraging AI: a simple diagram you can use to explain the problems! When ChatGPT burst onto the scene, everyone scrambled to ride the AI wave. Companies and applications spun up. Yet if you search the internet, the real-world impact has been limited. Many LLM use cases have been around the periphery. True success stories have been very limited. I have been participating in a number of discussions about the use of #ai in #financialservices, and one thing stands out to me: most applications have been around non-core parts of the business. No, it’s not because of regulations and ethics. It’s simply because of data availability. This is not unique. Surveys (sources in the comments) say fewer than one-third of executives have their data ready for AI. Hence AI has been applied to summarisation, content creation and delivery: important, but not the core parts. It’s not a lack of data. The major problems are the following, which the diagram tries to explain.
1️⃣ Scattered: data exists in too many systems, so it’s hard to give AI access to the relevant parts without engineering work.
2️⃣ Inconsistent: the same data has different values in different places, so no one is really sure which can be trusted to expose to customers.
3️⃣ Unjoinable: there are no common IDs across systems, so you can’t join different data sources.
4️⃣ Accessibility: many real-time applications need instant data, yet this data is not available via APIs.
5️⃣ Undefined: much data has no documented definition or origin, so if a field says "revenue" you don’t know what it really captures.
So I say: before we do an AI revolution, we do a data revolution. Now. That’s the unlocker. Want to discuss strategies to do it? Ping me or comment.
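To make a few of these problems concrete, here is a minimal Python sketch of a readiness check across two systems. It is only an illustration: the record layout, the field names (customer_id, revenue), and the helper readiness_report are hypothetical and not taken from the post or its diagram. It counts records with no ID, lists IDs that appear in only one system (unjoinable), and lists shared IDs whose values disagree (inconsistent).

```python
# Hypothetical records from two systems; field names and values are made up.
CRM = [
    {"customer_id": "C1", "revenue": 100},
    {"customer_id": "C2", "revenue": 250},
    {"customer_id": None, "revenue": 80},   # record with no usable ID
]
BILLING = [
    {"customer_id": "C1", "revenue": 100},
    {"customer_id": "C2", "revenue": 300},  # disagrees with CRM
    {"customer_id": "C9", "revenue": 40},   # exists only in billing
]


def readiness_report(left, right, key, field):
    """Summarise join and consistency problems between two record lists."""
    left_by_key = {r[key]: r for r in left if r.get(key) is not None}
    right_by_key = {r[key]: r for r in right if r.get(key) is not None}
    shared = left_by_key.keys() & right_by_key.keys()
    return {
        # Records that cannot be joined at all because the key is missing.
        "missing_key": sum(1 for r in left + right if r.get(key) is None),
        # Keys present in exactly one of the two systems.
        "unjoinable": sorted(left_by_key.keys() ^ right_by_key.keys()),
        # Keys present in both systems but with different values for `field`.
        "inconsistent": sorted(
            k for k in shared if left_by_key[k][field] != right_by_key[k][field]
        ),
    }


if __name__ == "__main__":
    print(readiness_report(CRM, BILLING, key="customer_id", field="revenue"))
```

A real check would run against the actual systems and a documented data dictionary; the point here is only that each problem can be measured before any AI work starts.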