
It is now proven that copilot returns code from codebases with non-permissive licenses [1].

I'm curious - what are the legal implications of this going forward? I've so many questions.

1. Will Microsoft ever face lawsuits for these license violations?

2. If so, who/how? Class-action?

3. Will copilot be forced to open-source in the future? Under which license? Some open source licenses are incompatible with others, but copilot uses code from probably every OSS license conceived.

4. If Microsoft faces no justice, will we start seeing more OSS license violations? Will Google start using AGPL-licensed code?

[1] https://news.ycombinator.com/item?id=27710287 | Copilot regurgitating Quake code



That regurgitated code exists on Github under an MIT license: https://github.com/jethrodaniel/fast_inv_sqrt

"jethrodaniel" does not appear to have the copyright to offer that license, but it's hard for Github to determine that in general, so I doubt they would be liable for the error.


Even if it's somehow available under an MIT license (which is questionable on the part of jethrodaniel), there's still infringement. MIT isn't public domain, it still has

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

Replicating it without complying with those terms is still infringement.


this. People are being willfully blind here, like cult members looking dead-eyed at their leader and chanting "This is great" as they drink the kool-aid.

And from Microsoft no less, once outcast for mass poisoning.


> but it's hard for Github to determine that in general, so I doubt they would be liable for the error.

Please insert that meme, "That's not how that works. That's not how any of this works!"

The legal system is permission based, not forgiveness or "I didn't know" based.


Actually the legal system is evidence based. Microsoft has evidence that the code they are producing is licensed under MIT as far as they can reasonably know. There's no definitive way to know who actually owns the original copyright. I could grant permission to use my repo, but maybe I got that code from someone else, who then got it from someone else and so on and so forth. It's a similar situation with stolen goods: if you unknowingly purchase stolen goods you usually cannot be charged for theft as long as there aren't obvious signs that it's stolen, such as the goods being priced far below market value.


Microsoft has evidence that the code they are reproducing is MIT licensed, so are they intentionally violating that license or does this AI thing include the license and attribution in every snippet it generates?


Major aspects of copyright infringement are strict liability, like a lot of civil actions around damages. It doesn't matter if you thought it was OK, there's still a damaged party that needs compensation according to the law. At best you'll simply avoid the criminal and punitive penalties.


Exactly, that's why Pornhub hasn't had any liability issues arising from where its content comes from either. It's just too darned hard to tell.


No, PornHub doesn't have liability in a lot of cases because of 17 U.S.C. § 512, but has still had to deal with liability in general, which is why they nuked some 80% of their library not backed by verified individuals a while back.

https://www.law.cornell.edu/uscode/text/17/512

A huge part of 17 U.S.C. § 512 is the DMCA takedown process, mainly in § 512(c)(3). Does Microsoft even have the ability to truly remove training data from the model? Or do they have to retrain on each DMCA takedown?


I personally don't want to have to upload proof of identity to GitHub and a signed document swearing that I own the copyright to all the code I upload to GitHub, or proof that I coded it. We need to be careful what we wish for.


Excerpt from the MIT license:

> THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.


If they had a reasonable basis for believing they had a license they're in the clear. "I didn't know" might not be enough but "I had good reasons to think otherwise" is.


> If they had a reasonable basis for believing they had a license they're in the clear.

False.

If they committed copyright infringement, even if they genuinely believed they weren't, they are not in the clear. They still owe damages.


Can I have a citation?



I’m not a lawyer but my understanding is that these are torts, so all you have to prove is that Microsoft has liability. I think this would be easy to prove due to the way neural networks work, since it’s just a way of performing a search.

Since it’s a tort I don’t think you have to prove they should have known it would return copyrighted code; the fact that it does is enough to have liability.


That doesn't stop youtube from blasting people away over copyright issues?

On youtube, video uploads are a cost center, whereas on github, code is a profit center


IANAL. My understanding is that the general legal precedent in the US is that a) datamining text has no copyright implications (in the same way that reading a book has no copyright implications) and b) it is not a copyright violation to use a small amount of copyrighted material provided the context is sufficiently transformative. This might seem silly or unfair to you, but that is the current legal reality.

But even ignoring that, everybody uploading code to GitHub has given GitHub the right to analyze that code as per the GitHub ToS. This is the same mechanism by which you can't upload code to GitHub with a license that says "nobody is allowed to display this code on the internet" and then sue GitHub.


I can't imagine a scenario in which any lawyer would consider granting Github the right to "analyze" code anywhere close to granting Github the right to spit out that same code verbatim without your copyright notice (even if laundered by AI).


Here's Kate Downing, an IP lawyer specializing in software licensing:

> According to Downing, the answer depends to a certain extent on where that code is hosted. If it’s on GitHub, there very clearly would not be copyright infringement.

> “If you look at the GitHub Terms of Service, no matter what license you use, you give GitHub the right to host your code and to use your code to improve their products and features,” Downing says. “So with respect to code that’s already on GitHub, I think the answer to the question of copyright infringement is fairly straightforward.”

Downing cautions that Copilot output consisting of large chunks of code complete with comments is more questionable to use, but that for the most part it looks above board.

https://fossa.com/blog/analyzing-legal-implications-github-c...

Here's an English lawyer on the same topic...

> The licence is broadly worded, and I'm confident that there is scope for argument, but if it turns out that Github does not require a licence for its activities then, in respect of the code hosted on Github, I suspect it could make a reasonable case that the mandatory licence grant in its terms covers this as against the uploader.

https://decoded.legal/blog/2021/06/github-copilot-initial-th...


To me, regardless of whether it is technically legal, it certainly doesn’t feel right. Furthermore, contracts rely on people understanding what they are agreeing to, and I don’t think many developers would agree to letting their code be used outside the terms of the license they uploaded it under.

I am very surprised there hasn’t been a legal challenge to it.


What, exactly, is there to challenge?

I don’t think “I’m sorry your honor, I didn’t understand what I was signing” has ever been a valid reason in a courtroom, just as “I’m sorry, I didn’t know I was committing a crime” is not a valid defense.


Courts interpret the intended and understood meaning of contracts and terms all the time. Research the term "meeting of the minds" and case law around it.

When the terms were written, it's exceedingly unlikely that they intended it or anyone understood it to be blanket permission to allow a trained AI to copy code for others and no user would have interpreted it that way. Microsoft/Github can't necessarily unilaterally increase the intended range without making it clear in the terms.

If it got to a court case, and both sides could afford it, it could be a lengthy one.

(This comment is not legal advice. I am not a lawyer.)


How does "[allowing] a trained AI to copy code" change the interpretation of the ToS?

By uploading your code, you give Github a non-exclusive license to use it to improve their services. Copilot is such a service. Just because it's an AI and it provides others code does not somehow invalidate the license you gave.


Again, research "meeting of the minds". It's a standard legal term directly relevant to all contracts and terms. Also, "transparency" is another important one.

Many online services have very wide terms around what they can do with your data, which most people who bother to read them interpret as being what is required for them to handle the service for you without breaking copyright law. In that context, being able to use and analyse your data to improve their services could be another catch-all that lets them do specific performance optimisation on their backend.

One party instead deciding they've got blanket permission to do whatever they like with your work, including selling it to others, may well not hold up in court.

Contracts aren't programs and one party tricking the other rarely works out in court - courts world-wide tend to rule against trickery and deception.


> “If you look at the GitHub Terms of Service, no matter what license you use, you give GitHub the right to host your code and to use your code to improve their products and features,” Downing says. “So with respect to code that’s already on GitHub, I think the answer to the question of copyright infringement is fairly straightforward.”

That's assuming that all code on GitHub is uploaded in good faith by the copyright owner, which is not always going to be the case.


Many repositories on Github were put there by people that do not own the copyright and never agreed to GitHub's Terms of Service.

Linux, for example, does not require copyright assignment. The original contributor of a change owns the copyright for that code and may have never used Github.


There's also one more question:

5. Even if it is illegal, is it actually bad? No one can possibly sell code snippets, the transaction costs are many orders of magnitude greater than any reasonable price. In my opinion, at least in this case the benefits massively outweigh the costs and the law should not apply here.


I really, REALLY like the idea of Copilot. I think it is a glance at what the future of AI can bring to improve programming. I understand where all the litigation and "uneasiness" is coming from, both from commercial and open-source projects.

I've not installed or used it for the same reason (don't want to use AGPL or GPLd code by accident, and don't want my closed source code to be used accidentally as well), but the thought of Copilot being "killed" due to litigation/copyright/licensing issues is sad.

For me, it's kind of like when MP3 first appeared: Sharing music in Napster or downloading MP3s from Geocities was just amazing. The idea of having such things at your fingertips. Even though I understood the issue the authors had with the unpaid distribution of their music... still, the idea of "what could be..." made it amazing.

I guess Microsoft could be a bit forward thinking, and implement the "Spotify" model in code: Pay OpenSource developers (whoever owns the repo, or whoever made a commit?) a small amount whenever their code gets used through Copilot.

I'm super excited by what "Copilot"-related services will look like in 10 years. And I really really hope that the technology/idea doesn't get killed by litigation.


Microsoft could have trained this on their own code and there would be no issue. The problem is that instead of doing that, they knew full well the approach would reproduce the code and decided they would rather breach the GPL than expose their own code. I bet Microsoft has more than enough lines to train an AI, so there was a clear choice to breach other people's licenses in preference.


Huh... These comments have given me an idea: MS needs to be forced to train a model to compensate (pay) code authors and codebases based on snippet suggestions given by their tool: the Spotify model replacing Napster!


See: Who Owns the Future by Jaron Lanier


The comment you replied to gave you that idea nearly word for word..


huh you're right! sorry I haven't been sleeping well lately :)


Some people won’t let you use their copyrighted work no matter how much you pay, that’s reasonable.

By all means allow repos to opt in, although if it’s licensed under something like GPL there’s no way to convert it to non gpl without permission from every contributor. I for one am not interested in Microsoft or anyone else paying me to close my code.

Allowing people to pay $xxx to copy my copyrighted work without my agreement is simple piracy.

Either they get international agreement to drop copyright as a concept, or they obey the law.


> Some people won’t let you use their copyrighted work no matter how much you pay, that’s reasonable.

Is it really? From a consequentialist utilitarian perspective?


Yes, otherwise you’re saying rich people can break the law with no consequences.

The alternative would be to completely remove copyright, which would be ok.


Then the law should change. Saying "it's illegal but it's good/harmless" is a terrible stance.


Seems like an eminently reasonable stance, and exactly the stance you would take to get the law changed.


Fair. I had read "and the law should not apply" as "so we ignore it", not "so we change it".


Of course it's bad. No one who put up their work as open source wants some huge company taking it and selling it to get even more competitive advantage and influence in the world. And that's without mentioning the people who put that into their license pretty much explicitly. Taking GPL code and getting away with it is a failure of our justice system, and that can't be made right by throwing pennies at developers.


In this case it seems obvious to me that the huge time savings for thousands of developers outweigh the fact that the original writers are offended.

A world with copiloted snippets seems like a better one to live in than a world without.


Is there any leaked Microsoft code on GitHub? Someone should check if Copilot regurgitates that as well, then see how Microsoft reacts when someone slaps an AGPL license on that…


> Is there any leaked Microsoft code on GitHub?

There seems to be. Google 'windows nt source code leak github':

https://www.google.com/search?q=windows+nt+source+code+leak+...

First search results:

Windows NT 4.0:

> https://github.com/lianthony/NT4.0

> https://github.com/ZoloZiak/WinNT4

Windows XP:

> https://github.com/tongzx/nt5src

> https://github.com/onein528/NT5.1


It seems like Microsoft could be in the clear on the basis of it being essentially "search". But it also seems like anyone who uses it runs a high risk of getting infected with copyright-violating code.

My question is, if it isn't a copyright infringement issue to use copilot in its current form right now, why not just claim copilot was used whenever accused of copyright infringement henceforth?


> why not just claim copilot was used whenever accused of copyright infringement henceforth?

Without speaking to the particulars of copilot, this situation where laws seem toothless because of the ease of plausible deniability is actually fairly common. And in many such cases, the law is not as toothless as it seems, because

1. Getting multiple people to stick to a script under oath is difficult and dangerous.

2. Criminals frequently send each other messages like

A: "lol I just crimed, hope nobody figures it out."

B: "lol just say you used copilot".

A: "lolol yeah fuck the law"

Obviously this only gets the worst criminals, but there seems to be lots and lots of them.


Microsoft is trying to legally position Copilot like StackOverflow. It is possible to post copyright-infringing code on SO even though their TOS requires a CC BY-SA 4.0 grant to the company and its users.

https://stackoverflow.com/legal/terms-of-service#licensing


> It is now proven that copilot returns code from codebases with non-permissive licenses [1].

That same Quake example from last year is repeated every single time.

Aside from the fact that GitHub has since added a protection for this, that this example gets repeated time and time again instead of a list of examples leads me to believe this is not (and was not) a common occurrence.


1) Most likely

2) TBD

3) Not likely. Worst case a judgement will go against them, they'll effectively pay a fine and then they'll retrain it on a more restricted set of source code.

4) OSS has a pretty tragic history re: enforcement. It wins nearly every skirmish but has no interest in the war so from a big picture standpoint, it loses due to apathy.


You don't think a mountain of MSFT lawyers in every state, plus partner law firms around the world, has thought about this? Do you practice law or are you speculating based on emotions?


MSFT tried very hard to sue Linux into oblivion. Buying SCO, then claiming they owned all of Linux. http://www.groklaw.net/

I trust MSFT to screw everyone over.


You're making stuff up. MSFT never bought SCO.

https://en.wikipedia.org/wiki/List_of_mergers_and_acquisitio...


Not sure where your confidence came from but a Google of “sco Microsoft” reveals:

By the mid-1980s Microsoft had gotten out of the Unix business, except for its ownership stake in SCO.[20]

https://en.m.wikipedia.org/wiki/History_of_Microsoft


No, SCO was founded in 2002, from Caldera, which was a Linux distributor [0]. How could Microsoft in the 1980s own a company that wasn’t founded until 2002?

They later filed for bankruptcy in 2007.

[0] https://en.m.wikipedia.org/wiki/SCO_Group


Close, but you have the wrong incarnation of SCO

SCO was founded in 1979 by Larry Michels and his son Doug Michels.

https://en.wikipedia.org/wiki/Santa_Cruz_Operation


Not sure where yours is coming from, if we look at [20] it makes no such claim.

https://web.archive.org/web/20061105100939/http://www.inform...


https://en.wikipedia.org/wiki/Santa_Cruz_Operation

They were a partner with Microsoft maintaining a version of Xenix


That doesn't imply ownership and the article [20] that you pointed out doesn't make a specific claim. All good, but MSFT never fully owned or operated SCO at any level is the point I'm trying to make.


You aren't really saying anything at all. SCO would never have existed without Microsoft and Microsoft had a very significant stake in their business and gave it direction.


Owning stock, on its own, is not the same as buying a company


If you own a controlling percentage, then yes, it is. That is how you buy/control a publicly traded company.

You can buy 100% of shares and take it private, but that's overkill for what Microsoft wanted.


Hence why I said “on its own”


lmao


Big meh. That quake code was MIT.


A) Public Quake is GPL. Just because someone else dumped it in an MIT library doesn't change that.

B) MIT still requires attribution to not infringe.



