It's difficult to know what, exactly, the accord between Stack Overflow and OpenAI is. Per the meta announcement:

anything specific promised about how this will work here could change

but that's about technical details. Presumably, the legal side of things is already drawn up: that's what I'd like to know. (Or if it's not, and this is just a declaration of intent? I'd like to know that, too.) We have hints, but so far they're contradictory.

  • Rosie says that “Having credit attributed is a non-negotiable for us” and “making sure attribution is happening (in a license-compliant way) is a commitment we require and have received from our partners”.

    • Unless they've made substantial theoretical progress and are keeping it very secret, OpenAI does not have the technology to train models on our work while retaining attribution.
    • By implication, the company will not let OpenAI train their transformer models on Stack Exchange contributions, and OpenAI has committed not to.
  • But the press release says:

    OpenAI will utilize Stack Overflow’s OverflowAPI product and collaborate with Stack Overflow to improve model performance for developers who use their products. This integration will help OpenAI improve its AI models using enhanced content and feedback from the Stack Overflow community and provide attribution to the Stack Overflow community within ChatGPT to foster deeper engagement with content.

    • Does "improve its AI models" mean using our contributions as training data?
    • Unlike Wikipedia, Stack Overflow contributions cannot be attributed merely to "Stack Overflow contributors" (the usual terms of the CC BY-SA 4.0 license apply), but it sounds like that's what the press release says they'll do. Contradiction.
  • Is this side of the partnership perhaps restricted to glorified search results, à la chat oneboxes?

    OpenAI will also surface validated technical knowledge from Stack Overflow directly into ChatGPT,

    suggests they might be planning that. If that's all, then this whole fuss has been over very little.

The ambiguity helps nobody. I don't need $numbers, but I want terms. It's our work you're selling, it's our communities you're hurting. We have a right to know.

  • What is Stack Overflow providing OpenAI?
  • What are OpenAI's obligations with regard to the provided material?
    • Specifically, is Stack Exchange licensing or planning to license subscriber content to OpenAI under anything other than CC-BY-SA?
  • What are the actual words of that agreement?

If we get an answer from a lawyer, please be kind: lawyers are not usually responsible for strategic decision-making, and shooting the messenger is a bad strategy (also, it's mean). If a bigwig answers, please consider the tactical value of punching up here and now, given the context. I don't think anyone needs a reminder to be nice to CMs.

  • 20
    Can I suggest adding "Is StackExchange planning to license subscriber content to OpenAI under anything other than CC-BY-SA?" There is some ambiguous wording in the ToS that suggests this might be possible. It would explain what OpenAI is getting out of the deal (since they have all subscriber content already under CC-BY-SA).
    – Peter
    Commented May 9, 2024 at 17:14
  • 2
    "We have a right to know." Morally, or socially, sure, but technically, I don't think so. (Which is another way the community can potentially be dealt a disastrous blow when it comes to the mutual trust and respect that has been eroding for a while now.)
    – Joachim
    Commented May 10, 2024 at 6:33
  • 4
    The best case scenario is they are selling access to an API a chatbot can use to pull in live information linking back to the network. Ideally then that money goes back into supporting public Q&A and ancillary development as well as hires. That said there's no guarantee that a cc compliant agreement means it will be fully honoured or that the revenue would be wisely reinvested into the community.
    Commented May 12, 2024 at 9:50
  • 3
    It is unfortunate that while this appears to have already been done, there is no real recourse for what happens when these training sets simply "clean" off the attribution as noise; as is already being done. EleutherAI even lists Stack Exchange as part of its source material, and they do note that their training material has data removed such as attribution; as is done for all content gleaned from crawling. Crawling used to be illegal and content scraped would have a DMCA removal order. Clearly that is no longer the case, and scraping is now being institutionalized.
    – Travis J
    Commented May 30, 2024 at 21:14

1 Answer


I understand that there are a lot of questions about our partnership with OpenAI and what this means in terms of attribution. While I can’t answer every bullet point or all of the questions on the main Meta Post announcement, I want to be clear about our attribution agreement with any Overflow API partners.

In our agreements with our partners, we have stated that the data they use is governed by the Creative Commons license.

The precise Creative Commons license that governs each piece of content depends on the date the content was published or edited (see Creative Commons Licensing UI and Data Updates - Meta Stack Exchange):

  • Content contributed before 2011-04-08 (UTC) is distributed under the terms of CC BY-SA 2.5.
  • Content contributed from 2011-04-08 up to but not including 2018-05-02 (UTC) is distributed under the terms of CC BY-SA 3.0.
  • Content contributed on or after 2018-05-02 (UTC) is distributed under the terms of CC BY-SA 4.0.
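As a rough illustration of the date rules above, the mapping can be sketched in a few lines of Python. The function name `license_for` and the constant names are hypothetical, and the sketch assumes each boundary falls at 00:00 UTC on the stated date; it is not taken from the Overflow API or any Stack Exchange code.

```python
from datetime import datetime, timezone

# Cut-over dates from the list above, assumed to start at 00:00 UTC.
CC_BY_SA_3_START = datetime(2011, 4, 8, tzinfo=timezone.utc)
CC_BY_SA_4_START = datetime(2018, 5, 2, tzinfo=timezone.utc)


def license_for(contributed_at: datetime) -> str:
    """Return the CC BY-SA version for a contribution timestamp (hypothetical helper)."""
    if contributed_at.tzinfo is None:
        # Refuse naive timestamps: the boundaries are defined in UTC.
        raise ValueError("timestamp must be timezone-aware")
    ts = contributed_at.astimezone(timezone.utc)
    if ts < CC_BY_SA_3_START:
        return "CC BY-SA 2.5"
    if ts < CC_BY_SA_4_START:
        return "CC BY-SA 3.0"
    return "CC BY-SA 4.0"
```

Because an edit can change which license applies, a consumer would presumably call something like this once per revision rather than once per post.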

The dataset API partners have access to contains license information for every Post, Question, Answer, Revision, Comment, and Question_Timeline (type=revision). The history of a post will always indicate which license applies to that post.

This means that we have supplied the information they need to comply with the Creative Commons requirement to provide “appropriate credit”. Creative Commons defines “appropriate credit” as “the name of the creator and attribution parties, a copyright notice, a license notice, a disclaimer notice, and a link to the material. CC licenses prior to Version 4.0 also require you to provide the title of the material if supplied, and may have other slight differences.”
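Purely as an illustration, a credit line carrying some of the fields named in that definition might be assembled as below. The `Credit` class, the `credit_line` function, and the output format are all hypothetical; this is a sketch of one possible shape, not a description of how any partner attributes content or of what the license requires in a given medium.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Credit:
    """Hypothetical record holding a few of the attribution fields."""
    creator: str
    license_name: str
    url: str
    title: Optional[str] = None  # included whenever it was supplied


def credit_line(credit: Credit) -> str:
    """Join the available fields into a single human-readable credit line."""
    parts = []
    if credit.title:
        parts.append(f'"{credit.title}"')
    parts.append(f"by {credit.creator}")
    parts.append(f"licensed under {credit.license_name}")
    parts.append(credit.url)
    return ", ".join(parts)
```

Including the title whenever it is present covers the pre-4.0 case without needing to branch on the license version.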

We are not solutioning (aka, directing or telling) how our partners meet these requirements for attribution and appropriate credit.

I hope this provides a bit more clarity.

  • 5
    this is good news. though note that this does not fully answer the question(s).
    – starball
    Commented May 14, 2024 at 18:32
  • 43
    So effectively, access to packaged licensed content was sold to OpenAI/Google under the promise that they'd respect the license, but no instructions or requirements past that are being enforced to ensure the license is being respected. Is this something the company will enforce for us? Or is the community on its own against these giants to ensure they're following the license you are selling our content under?
    – Kevin B
    Commented May 14, 2024 at 18:38
  • 7
    This misrepresents what the CC BY-SA license says. Section 3(a)(2) of the license allows the entity exercising the rights to "satisfy the conditions" of the Attribution clause "in any reasonable manner based on the medium, means, and context...". This gives a lot of flexibility. On top of this, because of 3(a)(1)(A)(v), a simple URI or hyperlink to a SE network question is sufficient. There's actually no need to credit each individual who wrote a post. So your definition of "appropriate credit" (which is never used in the CC license) doesn't match the reality of what the license requires.
    Commented May 14, 2024 at 18:42
  • 10
    Are there procedures for yanking revisions when moderators redact them? That only happens rarely, but it's very important most times it happens.
    – wizzwizz4
    Commented May 14, 2024 at 18:52
  • 12
    @ThomasOwens I don't think Rosie is misrepresenting anything: this is an adequate summary of the §3(a)(1)(A) obligations. §3(a)(2) says that a hyperlink may be a reasonable way to satisfy those obligations, but doesn't replace those obligations. (If, for example, they're using deleted material, where the hyperlink will point nowhere, a hyperlink is clearly not reasonable attribution.) §3(a)(1)(A)(i) says “in any reasonable manner requested by the Licensor”: by posting on Stack Exchange, I think the Stack Exchange definition of "appropriate credit" (which dates way back) can be assumed.
    – wizzwizz4
    Commented May 14, 2024 at 18:57
  • 4
    @wizzwizz4 That is incorrect. SE can't set additional terms on top of CC BY-SA, as specified in Section 2(a)(5)(C). The user of the content is responsible for determining how to meet the requirements of Section 3(a). Terms like "reasonable" are very vague and would likely require negotiation. However, such negotiation would need to be between the author of the post and the recipient of the licensed content - SE does not obtain permission to act on our behalf.
    Commented May 14, 2024 at 19:05
  • 23
    Thanks for this answer. But is this good news? As I'd understand it, this says the data OpenAI is receiving comes with the same licence as the public dump. ...Which OpenAI is already using without proper attribution. So why should we assume that they will honour the licence any more when the data comes through the API? It seems evident that the AI companies are banking on a court decision that training models is fair use, and copyright can be treated with blatant disregard.
    Commented May 14, 2024 at 19:24
  • 7
    @NoDataDumpNoContribution Yeah. I haven't seen or been able to get ChatGPT to attribute anything. A core function of Gemini is finding supporting material on the Internet for claims, but there are questions about if that supporting material was part of the input material and if it's really the right thing to attribute. There are also open questions about attributing the content used to train models. So plenty of open questions, but as far as I'm concerned, this is a non-answer since it misrepresents what CC BY-SA says, what SE can do, and the state of attribution in GenAI in general.
    Commented May 14, 2024 at 22:07
  • 4
    "We are not solutioning (aka, directing or telling) how our partners meet these requirements for attribution and appropriate credit." This does lead to the question of what happens if the partners disregard the attribution. The nature of the feed/API allows people to do the right thing, but what happens if they choose not to?
    Commented May 14, 2024 at 23:16
  • 4
    To be fair, supervising how these large corporations are gaining their AI training material isn't a responsibility of SO. That's a matter for governments, and in case there are suspicions about license violation/piracy, a matter for the police. Which I suppose would be the FBI in this case, given that SO is based out of New York. So in case anyone has actual evidence of OpenAI or other corporations stealing licensed material in order to profit from it, report it to the relevant police authority.
    – Lundin
    Commented May 15, 2024 at 7:55
  • 8
    So, exactly... since you call them partners but at the same time say "We are not solutioning (aka, directing or telling) how our partners meet these requirements for attribution and appropriate credit.", what exactly makes them "partners" instead of users? Why would they pay for a dump that they can easily get for free like they already did before (since it is easy to demonstrate that the biggest LLMs on the market include network posts in their training material)? And more importantly, what about this user claim?
    Commented May 15, 2024 at 11:35
  • 7
    Well, that's nice legalese for saying "we don't really care". Of course SO knows that all LLMs, including OpenAI's, are technically and on principle incapable of meeting the requirement of attribution. Looking the other way and subjecting all members of SO to a glaring misuse of their content is just not the right thing to do, ethically and legally. Not only does it destroy trust, it also hurts the community badly.
    – miraculixx
    Commented May 15, 2024 at 12:23
  • 14
    It does feel like a bit of a cop-out to say you're "not solutioning (aka, directing or telling) how our partners meet these requirements for attribution and appropriate credit". This entire deal was inked and no one even bothered to talk to them about HOW they would potentially navigate the CC licensing of all the content covered in the agreement? It seems hard to believe that proper, license-appropriate use of the content never came up.
    Commented May 15, 2024 at 20:29
  • 9
    @Rosie Will the company commit, in clear and binding terms, to pursue legal action should the license not be sufficiently honored? Or, if you take the stance of a lot of these comments that it's our responsibility to meet OpenAI in court, will the company commit to support community members in their suit, either by providing legal support or by filing generous amicus briefs? Anything? Any commitment at all to enforcement?
    – samuei
    Commented May 17, 2024 at 13:19
  • 5
    @samuei: It is not legal for the company to do that, unless you transfer the copyright to them or give them an exclusive license (which CC-BY-SA isn't). It is not possible, under US law, to transfer a naked "right to sue" for copyright infringement, according to the Ninth Circuit. If you, personally, are not prepared to sue to defend your copyright, then your copyright functionally doesn't exist.
    – Kevin
    Commented May 19, 2024 at 23:07
