The Fire Sale: How AI Labs Are Harvesting Dead Startups' Data to Build the Next Generation of Agents
When a startup dies, its code goes stale, its employees scatter, and its users move on. But the data — the behavioral logs, the health records, the location traces, the interaction patterns — lives on. And in 2026, that data has buyers.
AI labs racing to build autonomous agents have discovered a rich, underpriced source of training material: the corpses of failed startups. Through bankruptcy auctions, quiet asset purchases, and IP licensing deals, companies building the next wave of AI systems are acquiring datasets that were collected under privacy policies written for an entirely different purpose. The users who generated this data never consented to having it train an AI agent. In many cases, the company that promised to protect their information no longer exists.
The scale of this practice is difficult to measure precisely — most deals are private, and the acquirers have little incentive to publicize them. But the available evidence, drawn from bankruptcy filings, M&A disclosures, and regulatory actions, points to a growing market with limited oversight.
The Acquisition Landscape
The AI industry's consolidation wave has been staggering. In 2025 alone, 33 major acquisitions in the AI and data space totaled over $157 billion in disclosed value, breaking historic records [1]. While many of these were strategic buys of operational companies, a significant and less visible tier of deals involves acquiring the data assets of companies that have shut down or are in the process of winding down.
OpenAI has been among the most active acquirers. The company completed eight acquisitions in 2025 and six more in the first quarter of 2026 alone [2]. Its January 2026 purchase of Torch, a healthcare startup building a "unified medical memory" that consolidated patient data across vendors and formats, reportedly cost $60 million by CNBC's account and $100 million by TechCrunch's [3][4]. Torch had not yet reached significant commercial scale, suggesting the acquisition was primarily about its data assets and data infrastructure rather than its revenue.
OpenAI's 2024 acquisition of Rockset, a real-time analytics database company, followed a similar pattern — acquiring data-handling infrastructure that could feed its training pipelines [2]. IBM's $11 billion Confluent acquisition in December 2025, combined with its earlier $6.4 billion HashiCorp purchase and its acquisition of Seek AI, reflected a broader industry bet that real-time data infrastructure is essential for agentic AI [1].
The dynamic is straightforward. As TechCrunch reported in mid-2025, data startups struggling to raise capital increasingly view acquisition as preferable to winding down or taking on debt [5]. The buyers know this and price accordingly.
The Legal Machinery of Data Transfer
The legal mechanisms enabling these transfers vary, but they share a common feature: they often allow buyers to sidestep the privacy policies under which data was originally collected.
Bankruptcy proceedings are the most visible pathway. When a company files for Chapter 7 or Chapter 11, its data assets become part of the bankruptcy estate, subject to sale alongside office furniture and domain names. The Near Intelligence case set an important precedent. Near, a location data broker claiming to hold data on 1.6 billion people across 44 countries, filed for bankruptcy in 2023 [6]. The company had licensed location data to an anti-abortion group that used it to target visitors to roughly 600 Planned Parenthood clinics across 48 states [6]. Senator Ron Wyden intervened, urging the FTC to ensure the data was destroyed rather than auctioned off. A bankruptcy court order eventually restricted the use, sale, and transfer of location data from sensitive locations and required any purchasing company to establish a "sensitive location data program" [6][7].
But that intervention was exceptional. Near's own privacy policy had explicitly listed "prospective buyers of our business" as parties with whom personal data would be shared — a clause that is standard boilerplate in startup privacy policies [6].
Asset sales outside bankruptcy are harder to track. When a startup winds down without filing for bankruptcy, its founders or investors may sell data assets privately. These transactions rarely trigger regulatory review. The FTC blocked a personal data transfer during XY Magazine's 2010 bankruptcy, but that case remains an outlier [6].
IP and licensing transfers represent a third pathway. When the health AI startup Cydoc shut down in 2025 after seven years of operation, it licensed its codebase and intellectual property — including its trained models and underlying data pipelines — to another startup [8]. These deals are common and almost entirely unregulated.
Whose Data Gets Sold
The data most commercially attractive for agent training comes from domains where AI systems need to model complex human behavior: healthcare, education, financial services, and gig-economy platforms.
Healthcare data commands a premium. OpenAI's acquisition of Torch specifically targeted patient health records consolidated from multiple vendors [3][4]. The health AI startup Olive, which sold revenue-cycle automation tools, broke itself up as it wound down, selling components to Waystar and Humata Health [8]. Each of these transactions transferred patient-adjacent data to new owners with new business models.
Edtech platforms represent another high-value category. The FTC's 2023 enforcement action against Edmodo — an education technology platform — required algorithmic disgorgement (the destruction of AI models trained on improperly collected data) after the company mishandled children's information [9]. But Edmodo's case only came to the FTC's attention because of the scale of the violation. Smaller edtech shutdowns proceed without scrutiny.
Consumer behavioral data from gig-economy apps, fitness trackers, and productivity tools is also in demand. These datasets capture the kind of task-oriented, multi-step interaction patterns that are directly useful for training AI agents to perform real-world tasks — scheduling, purchasing, navigating, and completing workflows.
There is no comprehensive public accounting of how many defunct startups have had their user data acquired by AI labs over the past three years. The deals are private, the acquirers are not required to disclose them, and the bankrupt entities no longer exist to answer questions.
Pricing the Dead
Estimates of the global AI training dataset market's 2025 value range from roughly $3.2 billion to $3.6 billion, with projections of about $4.4 billion in 2026 and, depending on the forecaster, $16.3 billion by 2033 or $23.2 billion by 2034 [10][11]. Within this market, failed startups' datasets occupy a distinct pricing tier.
Precise per-record pricing is rarely disclosed, but several indicators establish a range. AI startups that document data ownership and quality command valuation premiums of 15% to 35% [12]. OpenAI paid roughly $60-100 million for Torch, a company with minimal revenue but significant patient data infrastructure [3][4]. Databricks acquired MosaicML in 2023 primarily for its data-handling capabilities [1].
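The arithmetic behind these projections is easy to check. Here is a minimal sketch in Python, using the endpoint figures from the two market reports cited above; the baseline valuation used for the data-quality premium is a purely hypothetical number, not drawn from any source:

```python
# Back-of-the-envelope math on the market figures cited above [10][11][12].
# The baseline valuation below is hypothetical, not from any source.

def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1

# Grand View Research: $3.2B (2025) -> $16.3B (2033) [10]
print(f"implied growth, 2033 forecast: {cagr(3.2, 16.3, 8):.1%}/yr")    # ~22.6%

# Fortune Business Insights: $3.59B (2025) -> $23.18B (2034) [11]
print(f"implied growth, 2034 forecast: {cagr(3.59, 23.18, 9):.1%}/yr")  # ~23.0%

# The documented 15%-35% premium for well-documented data assets [12],
# applied to a hypothetical $50M startup valuation.
baseline = 50e6
print(f"premium range: ${baseline * 1.15 / 1e6:.0f}M to ${baseline * 1.35 / 1e6:.0f}M")
```

Despite their different endpoints, both forecasts imply compound growth of roughly 22-23% per year, and that growth rate is the backdrop against which distressed data assets are being priced.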
The comparison to synthetic data is central to the pricing question. Generating equivalent training data synthetically is technically possible but expensive at scale, and the resulting data has known limitations.
The Synthetic Alternative and Its Limits
Research on synthetic data for AI training has exploded. According to OpenAlex data, academic publications on the topic grew from roughly 9,400 in 2022 to nearly 44,900 in 2025, almost a fivefold increase [13].
MIT research has found that synthetically trained models can achieve 95-98% of the accuracy of models trained on real data, and in some narrow cases perform better [14]. But the gap matters most in exactly the domains where acquired startup data is most valuable: healthcare, financial services, and complex multi-step tasks.
Synthetic data struggles with what researchers call "distribution mismatch" — it reproduces known patterns but misses the subtle signals, edge cases, and unexpected behaviors that real user data captures [15]. For AI agents that need to handle ambiguous, real-world situations, this gap can be the difference between a useful tool and a liability.
AI labs argue — with some justification — that training on real behavioral data produces agents that handle edge cases more reliably. A healthcare agent trained on real patient interaction logs will encounter the full spectrum of how people describe symptoms, misunderstand instructions, and deviate from expected workflows. Synthetic data, by definition, only includes the deviations its generators anticipated.
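The mismatch is easy to demonstrate on toy data. The following sketch is a deliberately simplified illustration, not a claim about any lab's pipeline: all data is simulated, and a linear classifier stands in for an agent. The synthetic training set covers only the two behavior clusters its generator anticipated, while the "real" data contains an edge-case mode it did not.

```python
# Toy demonstration of distribution mismatch [15]. Everything here is
# simulated; the numbers illustrate the pattern, not any real system.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def synthetic(n):
    """What the generator anticipated: two well-separated clusters."""
    X0 = rng.normal([-3.0, 0.0], 0.7, size=(n // 2, 2))  # class 0
    X1 = rng.normal([+3.0, 0.0], 0.7, size=(n // 2, 2))  # class 1
    return np.vstack([X0, X1]), np.repeat([0, 1], n // 2)

def real(n, edge_frac=0.15):
    """Reality: the same clusters plus an unanticipated class-0 mode."""
    X, y = synthetic(n)
    k = int((n // 2) * edge_frac)
    idx = rng.choice(np.flatnonzero(y == 0), size=k, replace=False)
    X[idx] = rng.normal([+3.0, 6.0], 0.7, size=(k, 2))   # the edge cases
    return X, y

X_syn, y_syn = synthetic(10_000)
X_real, y_real = real(10_000)
X_test, y_test = real(10_000)

syn_model = LogisticRegression().fit(X_syn, y_syn)
real_model = LogisticRegression().fit(X_real, y_real)

print("synthetic-trained, tested on real:", round(syn_model.score(X_test, y_test), 3))
print("real-trained, tested on real:     ", round(real_model.score(X_test, y_test), 3))
# The synthetic-trained model misclassifies essentially the entire
# edge-case mode (~7.5% of the test set); the real-trained model does not.
```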
However, there is limited peer-reviewed evidence directly comparing agents trained on acquired real-world data against those trained on high-quality synthetic data. The claim that real data is necessary for agent safety remains largely anecdotal.
Enforcement: A Tool Exists but Is Rarely Used
The FTC has developed a potent enforcement mechanism — "algorithmic disgorgement," which requires companies to destroy not just improperly acquired data but also any models or algorithms derived from it [9][16].
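Complying with such an order is itself a tracing problem: the company must identify, and destroy, every artifact derived from the tainted data. The sketch below is a hypothetical illustration of that tracing logic, not a description of any agency's or lab's actual tooling; every name in it is invented.

```python
# Hypothetical data-lineage registry illustrating the scope of an
# algorithmic disgorgement order. All names are invented examples.
from collections import defaultdict

lineage = defaultdict(set)  # artifact -> artifacts derived from it

def register(parent: str, child: str) -> None:
    lineage[parent].add(child)

def disgorgement_scope(tainted: str) -> set:
    """Everything derived, directly or transitively, from tainted data."""
    scope, frontier = set(), {tainted}
    while frontier:
        node = frontier.pop()
        for child in lineage.get(node, ()):
            if child not in scope:
                scope.add(child)
                frontier.add(child)
    return scope

# A chain of the kind this article describes: a defunct startup's logs
# feed a base model, which is fine-tuned into an agent, then distilled.
register("defunct_startup_logs", "base_model_v1")
register("base_model_v1", "agent_v1")
register("agent_v1", "agent_v1_distilled")

print(disgorgement_scope("defunct_startup_logs"))
# -> base_model_v1, agent_v1, agent_v1_distilled (set order varies)
```

The transitive sweep is what makes the remedy so potent: the FTC's orders have covered models and algorithms developed even in part from improperly obtained data [9][19].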
Since the Cambridge Analytica case in 2019, the FTC has ordered algorithmic disgorgement in at least seven cases: Everalbum (2021) for misusing facial recognition photos, WW International (2022) for collecting children's data through its Kurbo app, Amazon's Ring, Edmodo, and Rite Aid (2023), and Avast (2024) [9][16][17]. In July 2023, the FTC's Division of Privacy and Identity Protection called algorithmic disgorgement a "significant part" of its AI enforcement strategy [16].
But these cases share a common thread: they target companies that collected data improperly in the first place. The FTC has issued guidance stating that companies cannot retroactively change their terms of service to allow AI training without meaningfully informing users [18]. It has warned that acquiring data and using it beyond its original purpose may be deceptive [19].
What the FTC has not done is bring a case specifically against an AI lab that acquired user data from a bankrupt or defunct startup and used it for agent training. The legal theory is available — the original privacy policy creates obligations that should survive a change of ownership — but the enforcement action has not materialized. Under the current administration, the FTC's appetite for aggressive AI enforcement appears diminished compared to the Biden era.
State attorneys general and the Consumer Financial Protection Bureau have proposed rules to crack down on data brokers, including stronger consent requirements [7]. But these efforts are at early stages and do not specifically address the startup-to-AI-lab pipeline.
The Transatlantic Divide
The regulatory landscape varies sharply by jurisdiction, creating competitive asymmetries.
In the EU, the GDPR has historically required explicit consent or a clear legal basis for processing personal data — including for AI training. But the European Commission's proposed Digital Omnibus regulation, introduced in late 2025, would add a new Article 88c to the GDPR creating an explicit "legitimate interest" basis for processing personal data to develop and train AI systems [20][21]. Companies would need to apply enhanced safeguards and provide an unconditional right for users to opt out.
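In operational terms, an unconditional opt-out would mean screening training corpora against an objection registry before every run. A minimal sketch of what that screening step could look like, assuming a record schema and a registry that the proposal itself does not specify:

```python
# Minimal sketch of pre-training opt-out screening under the proposed
# Article 88c regime. The record format and the registry are assumptions
# made for illustration; the Digital Omnibus specifies no such schema.
def exclude_objectors(records, objection_registry):
    """Yield only records whose subjects have not objected to AI training."""
    for record in records:
        if record["subject_id"] not in objection_registry:
            yield record

records = [
    {"subject_id": "u-001", "event": "symptom_log"},
    {"subject_id": "u-002", "event": "symptom_log"},
]
objectors = {"u-002"}  # hypothetical opt-out registry

training_batch = list(exclude_objectors(records, objectors))
assert [r["subject_id"] for r in training_batch] == ["u-001"]
```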
The European Data Protection Board and European Data Protection Supervisor have pushed back, arguing that the opt-out mechanism is "insufficient for data subjects whose information has already been collected" [22]. Civil society groups including Amnesty International have warned that the proposed reforms, which also narrow the definition of personal data, would effectively allow large technology companies to harvest more personal data for AI training [23].
China operates under a different framework. Its Personal Information Protection Law (PIPL), similar to GDPR, requires user consent and limits data use [24]. But China's regulatory approach emphasizes state control alongside innovation promotion, with its New Generation Artificial Intelligence Development Plan targeting global AI leadership by 2030 [24]. In practice, Chinese AI companies operate under data-sharing arrangements with state entities that have no equivalent in Western jurisdictions.
The competitive implication is real but difficult to quantify. EU-based AI companies face stricter constraints on acquiring and using personal data for training, even with the proposed Digital Omnibus loosening. U.S. companies operate in a more permissive environment where enforcement is case-by-case rather than regime-wide. Chinese companies benefit from state-facilitated data access. Whether these differences translate into measurable capability gaps in the resulting AI agents remains an open question, as no comparative study has been published.
What Happens After Training
The post-training lifecycle of acquired data is governed almost entirely by private contracts, and the terms vary widely.
Best practices, as outlined by legal analysts, call for contracts that require data encryption in transit and at rest, deletion after processing, and prohibitions on resale or sharing with third parties without consent [25][26]. In practice, enforcement of these provisions depends on the buyer's goodwill — the original data subjects have no contractual standing to enforce terms in an acquisition agreement they were never party to.
Under GDPR, if consent is withdrawn, contracts can trigger removal from downstream datasets within a defined timeframe [25]. But this mechanism assumes the data subject knows who currently holds their data — an assumption that becomes increasingly unrealistic as data passes through bankruptcy sales, licensing arrangements, and corporate acquisitions.
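Mechanically, honoring a withdrawal across that chain would require a recorded chain of custody that survives each transfer. The sketch below is hypothetical; no such standard exists today, and every name in it is invented for illustration.

```python
# Hypothetical chain-of-custody model for propagating an erasure request
# across successive data holders. No such standard exists; all names
# here are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Custodian:
    name: str
    records: set = field(default_factory=set)       # record IDs held
    downstream: list = field(default_factory=list)  # parties data was shared with

def propagate_erasure(origin: Custodian, record_id: str) -> list:
    """Walk the custody chain, delete one subject's record everywhere,
    and return an audit log of who deleted what."""
    log, queue, seen = [], [origin], set()
    while queue:
        holder = queue.pop()
        if holder.name in seen:
            continue
        seen.add(holder.name)
        if record_id in holder.records:
            holder.records.discard(record_id)
            log.append((holder.name, record_id))
        queue.extend(holder.downstream)
    return log

# The chain this article describes: startup -> bankruptcy buyer -> AI lab.
lab = Custodian("ai_lab", {"user-123"})
buyer = Custodian("bankruptcy_buyer", {"user-123"}, [lab])
startup = Custodian("defunct_startup", {"user-123"}, [buyer])

print(propagate_erasure(startup, "user-123"))
# Erasure reaches the lab only because each transfer was recorded;
# in a bankruptcy fire sale, it typically is not.
```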
No major AI lab has publicly committed to deleting acquired training data after model training is complete. The standard industry position is that once data is used for training, the "knowledge" is embedded in model weights and cannot be practically extracted or deleted — an argument the FTC has implicitly rejected through its algorithmic disgorgement orders.
The Core Tension
The practice of acquiring failed startups' data for AI agent training sits at the intersection of two legitimate but conflicting interests.
On one side: AI labs argue that real-world behavioral data produces better, safer agents. They point to the limitations of synthetic data, the competitive pressure from international rivals with fewer constraints, and the practical reality that data from defunct companies will otherwise simply be destroyed or sit unused. From this perspective, repurposing the data creates value where none would otherwise exist.
On the other side: privacy advocates, regulators, and the users themselves argue that consent is not transferable. A person who signed up for a health tracking app in 2021 did not agree to have their behavioral data train an AI agent in 2026. The legal mechanisms that enable these transfers — bankruptcy asset sales, boilerplate privacy clauses, IP licensing — were designed for an era when the most concerning use of transferred data was targeted advertising, not training systems that model human behavior.
The gap between these positions is not primarily technical or economic. It is a question of whether consent, once given for a specific purpose, can be repurposed indefinitely through corporate transactions that the consenting individual never anticipated and has no power to contest.
That question has not been answered by any court, any regulator, or any legislature. Until it is, the fire sales will continue.
Sources (26)
- [1] AI Company M&A: How 2025 Deals Shape 2026 Market (index.dev)
  In 2025, 33 major acquisitions totaling $157 billion or more in disclosed value broke historic records for consolidation activity in the AI and data space.
- [2] Data: OpenAI Has Already Done Nearly As Many M&A Deals In 2026 As It Did All of Last Year (news.crunchbase.com)
  OpenAI has made six acquisitions in 2026, nearly as many as the eight it completed in 2025.
- [3] OpenAI acquires health-care technology startup Torch for $60 million, source says (cnbc.com)
  OpenAI purchased Torch for roughly $60 million; Torch was building a 'unified medical memory' for AI that aimed to bring patient health data into one place.
- [4] OpenAI buys tiny health records startup Torch for, reportedly, $100M (techcrunch.com)
  OpenAI buys Torch, a small health records startup, for a reported $100 million to build out ChatGPT Health capabilities.
- [5] AI is forcing the data industry to consolidate — but that's not the whole story (techcrunch.com)
  Data startups struggling to raise capital increasingly view an exit as better than winding down or loading up on debt.
- [6] What Happens to Your Sensitive Data When a Data Broker Goes Bankrupt? (themarkup.org)
  Near Intelligence, claiming data on 1.6B people across 44 countries, filed for bankruptcy in 2023. Senator Wyden urged the FTC to ensure location data was destroyed rather than sold off.
- [7] Federal Regulators Limit Location Brokers from Selling Your Whereabouts: 2024 in Review (eff.org)
  Federal enforcement against location data brokers that track and sell users' whereabouts through smartphone apps occurred throughout 2024.
- [8] Why I Shut Down My Bootstrapped Health AI Startup After 7 Years: A Founder's Postmortem (glassboxmedicine.com)
  Cydoc, a health AI startup, operated from April 2018 to August 2025 before shutting down and licensing its codebase and IP to another startup.
- [9] Algorithmic Disgorgement: An Increasingly Important Part of the FTC's Remedial Arsenal (mintz.com)
  The Biden-era FTC regularly deployed model deletion as a remedy in orders against Everalbum, Weight Watchers, Ring, Edmodo, Rite Aid, and Avast.
- [10] AI Training Dataset Market Size, Share | Industry Report 2033 (grandviewresearch.com)
  The global AI training dataset market size was estimated at USD 3,195.1 million in 2025 and is projected to reach USD 16,320 million by 2033.
- [11] AI Training Dataset Market Size, Share | Global Report [2034] (fortunebusinessinsights.com)
  The global AI training dataset market size was valued at $3.59 billion in 2025 and is projected to grow to $23.18 billion by 2034.
- [12] AI Startup Valuation Multiples: 10x–50x Range (2026) (qubit.capital)
  Startups that document data ownership and quality can see valuation boosts of 15%-35%.
- [13] OpenAlex: Research Publications on Synthetic Data AI Training (openalex.org)
  149,094 total papers published on synthetic data AI training; 44,864 in 2025 alone, up from 9,352 in 2022.
- [14] In machine learning, synthetic data can offer real performance improvements (news.mit.edu)
  MIT research demonstrates that synthetically trained models can achieve 95-98% of real-data accuracy, and in some cases perform even better.
- [15] Synthetic Data vs Real Data: When to Use Each for ML Training (2026) (labelyourdata.com)
  The primary risk of synthetic data is distribution mismatch — when it doesn't fully reflect real-world complexity, models may perform well in testing but fail in production.
- [16] The FTC's biggest AI enforcement tool? Forcing companies to delete their algorithms (cyberscoop.com)
  In July 2023, the FTC's associate director of the Division of Privacy and Identity Protection stated that algorithmic disgorgement is a 'significant part' of the FTC's enforcement strategy.
- [17] The Crackdown Commences: The FTC's Case Against Rite Aid's Deployment of AI-Based Technology (arnoldporter.com)
  The FTC's December 2023 settlement with Rite Aid was the Commission's first use of its Section 5 unfairness authority against discriminatory use of AI.
- [18] AI (and other) Companies: Quietly Changing Your Terms of Service Could Be Unfair or Deceptive (ftc.gov)
  The FTC considers it potentially unfair or deceptive for a company to adopt more permissive data practices through retroactive amendment of terms of service or privacy policy.
- [19] AI Companies: Uphold Your Privacy and Confidentiality Commitments (ftc.gov)
  The FTC has required businesses that unlawfully obtain consumer data to delete products, including models and algorithms developed using unlawfully obtained data.
- [20] EU Digital Omnibus amendments to GDPR to facilitate AI training miss the mark (iapp.org)
  The single most consequential change in the Digital Omnibus is proposed Article 88c, which would create an explicit legitimate interest legal basis for processing personal data to develop and train AI models.
- [21] GDPR AI Amendments 2026: 5 Critical Changes in the EU Digital Omnibus (blog.imseankim.com)
  Companies can rely on a 'legitimate interest' to process personal data for training AI systems, subject to safeguards and an unconditional right to object.
- [22] EU Regulators Issue Opinion on Revisions of GDPR and Other Data Laws (insideprivacy.com)
  The EDPB and EDPS indicated the opt-out mechanism was 'insufficient for data subjects whose information has already been collected.'
- [23] How EU proposals to 'simplify' tech laws will roll back our rights in order to feed AI (amnesty.org)
  Civil society warns that the proposed reforms will weaken protections under law and potentially allow Big Tech to harvest more personal data for training AI systems.
- [24] AI Dilemma: Regulation in China, EU & US - Comparative Analysis (pernot-leplay.com)
  China's PIPL is similar to GDPR in requiring consent and limiting data use, but China's approach emphasizes state control alongside innovation promotion.
- [25] Navigating AI Vendor Contracts: Protecting Your Data and IP Amidst AI Training Concerns (richtfirm.com)
  Contracts should require data encryption, deletion after processing, and prohibitions on sale or sharing of customer data with third parties without consent.
- [26] Understanding Training Data in Contracts with AI Vendors (contractnerds.com)
  AI vendors are making a grab for training data but customers need to be prepared to draft appropriate ownership and usage terms for training data.