    Researchers suggest OpenAI trained AI models on paywalled O’Reilly books

    By Earth & Beyond | April 2, 2025 | 4 Mins Read

    OpenAI has been accused by many parties of training its AI on copyrighted content without permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on non-public books it didn’t license to train more sophisticated AI models.

    AI models are essentially complex prediction engines. Trained on a lot of data — books, movies, TV shows, and so on — they learn patterns and novel ways to extrapolate from a simple prompt. When a model “writes” an essay on a Greek tragedy or “draws” Ghibli-style images, it’s simply pulling from its vast knowledge to approximate. It isn’t arriving at anything new.

    While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That’s likely because training on purely synthetic data comes with risks, like worsening a model’s performance.

    The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O’Reilly Media. (O’Reilly is the CEO of O’Reilly Media.)

    In ChatGPT, GPT-4o is the default model. O’Reilly doesn’t have a licensing agreement with OpenAI, the paper says.

    “GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content … compared to OpenAI’s earlier model GPT-3.5 Turbo,” wrote the co-authors of the paper. “In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples.”

    The paper used a method called DE-COP, first introduced in an academic study in 2024, designed to detect copyrighted content in language models’ training data. A type of “membership inference attack,” the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, that suggests the model might have prior knowledge of the text from its training data.
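
To make the idea concrete, here is a minimal sketch of a DE-COP-style quiz, not the paper's actual code. It assumes each excerpt comes with a few AI-generated paraphrases and that the model under test is wrapped in a hypothetical `pick_option` callable that returns the index of the option it believes is the verbatim text. The names `build_quiz`, `guess_rate`, and `oracle` are illustrative only.

```python
import random
from typing import Callable, Sequence


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> tuple[list[str], int]:
    """Shuffle the verbatim excerpt among its paraphrases.

    Returns the shuffled options and the index of the verbatim excerpt.
    Assumes no paraphrase is identical to the original.
    """
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def guess_rate(items: Sequence[tuple[str, Sequence[str]]],
               pick_option: Callable[[list[str]], int],
               seed: int = 0) -> float:
    """Fraction of quizzes in which the model picks the verbatim excerpt."""
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if pick_option(options) == answer:
            correct += 1
    return correct / len(items)


# Toy data: (verbatim excerpt, paraphrased versions). Purely illustrative.
items = [
    ("The quick brown fox jumps.", ["A fast brown fox leaps.", "A speedy fox hops."]),
    ("Memory is allocated lazily.", ["Allocation of memory is deferred.", "Memory gets reserved late."]),
]

# Stand-in for a model that has memorized every excerpt (hypothetical).
memorized = {original for original, _ in items}


def oracle(options: list[str]) -> int:
    return next(i for i, text in enumerate(options) if text in memorized)


print(guess_rate(items, oracle))
```

With two paraphrases per excerpt, a model with no prior exposure would be expected to land near the one-in-three chance level, while a rate well above that hints at familiarity with the text.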

    The co-authors of the paper — O’Reilly, Strauss, and AI researcher Sruly Rosenblat — say that they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models’ knowledge of O’Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O’Reilly books to estimate the probability that a particular excerpt had been included in a model’s training dataset.

    According to the results of the paper, GPT-4o “recognized” far more paywalled O’Reilly book content than OpenAI’s older models, specifically GPT-3.5 Turbo. That’s even after accounting for potential confounding factors, the authors said, like improvements in newer models’ ability to figure out whether text was human-authored.
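
One way to picture the before-and-after comparison: books published after a model's training cutoff cannot have been in its training data, so the model's score on them can serve as a baseline for its general skill at spotting human-written text. The sketch below is an illustrative assumption rather than the paper's statistical procedure; `QuizResult`, `rate`, and `recognition_gap` are hypothetical names, and the data is made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence


@dataclass
class QuizResult:
    published: date   # publication date of the book the excerpt came from
    correct: bool     # did the model pick the verbatim excerpt?


def rate(results: Sequence[QuizResult]) -> float:
    """Share of correct picks; NaN when there are no results."""
    if not results:
        return float("nan")
    return sum(r.correct for r in results) / len(results)


def recognition_gap(results: Sequence[QuizResult], cutoff: date) -> float:
    """Guess rate on pre-cutoff books minus guess rate on post-cutoff books."""
    before = [r for r in results if r.published < cutoff]
    after = [r for r in results if r.published >= cutoff]
    return rate(before) - rate(after)


# Made-up results for illustration only.
cutoff = date(2023, 10, 1)
results = [
    QuizResult(date(2021, 5, 1), True),
    QuizResult(date(2022, 3, 1), True),
    QuizResult(date(2022, 9, 1), True),
    QuizResult(date(2023, 1, 1), False),
    QuizResult(date(2024, 2, 1), True),
    QuizResult(date(2024, 4, 1), False),
    QuizResult(date(2024, 6, 1), False),
    QuizResult(date(2024, 8, 1), False),
]

print(recognition_gap(results, cutoff))
```

A gap near zero would suggest the model is simply good at telling human text from paraphrase, whereas a large positive gap is the kind of signal that points toward prior exposure.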

    “GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O’Reilly books published prior to its training cutoff date,” wrote the co-authors.

    It isn’t a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn’t foolproof and that OpenAI might’ve collected the paywalled book excerpts from users copying and pasting them into ChatGPT.

    Muddying the waters further, the co-authors didn’t evaluate OpenAI’s most recent collection of models, which includes GPT-4.5 and “reasoning” models such as o3-mini and o1. It’s possible that these models weren’t trained on paywalled O’Reilly book data or were trained on a lesser amount than GPT-4o.

    That being said, it’s no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models’ outputs. That’s a trend across the broader industry: AI companies recruiting experts in domains like science and physics to effectively have these experts feed their knowledge into AI systems.

    It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms — albeit imperfect ones — that allow copyright owners to flag content they’d prefer the company not use for training purposes.

    Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in U.S. courts, the O’Reilly paper isn’t the most flattering look.

    OpenAI didn’t respond to a request for comment.
