HexesofVexes ,

I mean, here is a thought: if an AI tool uses Creative Commons data, then its derivatives fall under Creative Commons. I.e. stop charging for AI tools and people will stop complaining.

drunkpostdisaster ,

This shit scares me. It will become so easy to rewrite history from here. Just delete anything you don't like and have an AI rewrite it into whatever you want. Entire threads rewritten; a company can go back and have your entire post history changed in ways that might be legally compromising.

baseless_discourse , (edited )

This is a violation of GDPR, no?

EDIT: user-created content is not directly protected under GDPR, only personally identifiable data is protected under GDPR.

lemmyreader OP ,

Dunno. GDPR is a Europe-only thing, and isn't it only about how your private data (like name, IP address, phone number) is handled?

AccountMaker ,

Right, I think it only covers personal information: companies can only collect what they need to run their service, users can request to see their data etc. I don't think it applies to comments and posts.

TachyonTele ,

How so?

baseless_discourse ,

Users should have the right to delete their data stored by the company.

flux ,

Would that kind of provision allow me to have my code removed from a git repository history, if that git repository is hosted by a company?

baseless_discourse ,

I am not a lawyer, but I believe in general, yes.

Git is not even that complicated, as all the history is stored in the .git folder within the repo. Unless there is some convoluted structure built on top, they would only need to move the repo folder to a trash disk waiting to be formatted.
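As a minimal sketch of that point (assuming `git` is installed and on your PATH): every past revision lives in the .git directory, so once that directory is gone, so is the history.

```python
# Minimal sketch: a repo's entire history lives in .git, so removing that
# folder removes every past revision. Assumes `git` is available on PATH.
import subprocess, shutil, tempfile, os

repo = tempfile.mkdtemp()
subprocess.run(["git", "init"], cwd=repo, check=True)

# Commit a file containing some (hypothetical) user data.
with open(os.path.join(repo, "comment.txt"), "w") as f:
    f.write("a user comment that later needs erasing\n")
subprocess.run(["git", "add", "comment.txt"], cwd=repo, check=True)
subprocess.run(["git", "-c", "user.email=x@example.com", "-c", "user.name=x",
                "commit", "-m", "add comment"], cwd=repo, check=True)

# History is queryable while .git exists...
print(subprocess.run(["git", "log", "--oneline"], cwd=repo,
                     capture_output=True, text=True).stdout)

# ...and gone once the .git folder is removed.
shutil.rmtree(os.path.join(repo, ".git"))
print(subprocess.run(["git", "log"], cwd=repo,
                     capture_output=True, text=True).stderr)  # "not a git repository"
```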

That being said, GDPR is somewhat poorly enforced at the moment, unfortunately. I don't know if you can sue the company and expect some result within a couple of years.

refalo ,

No, because user-generated content is not protected.

interdimensionalmeme ,

As long as you didn't give away those rights by signing a CLA or a copyleft license.
Never sign a CLA unless you're fully compensated.

WldFyre ,

Doesn't that just mean the data would have to be anonymized?

baseless_discourse ,

I am not an expert or a lawyer, but I believe users actually hold the right to completely erase personal data:

The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay

https://gdpr.eu/right-to-be-forgotten/

Note the word "erasure" as opposed to "anonymize"

WldFyre ,

I don't think that addresses my point. Is my opinion on the new Star Wars movies that I post online or some lines of code I suggest "personal data"? I thought personal data had a specific definition under GDPR

nefonous ,

You're totally right, the content of your posts is not considered personal data (because it isn't).
It's more about profiling data that can be connected back to your actual person.

Spaenny ,
@Spaenny@discuss.tchncs.de avatar

Technically, they could retain posts from users if they are irreversibly anonymized. However, ensuring with 100% certainty that none of your posts ever contained any personal data that could lead to the identification of you as an individual is challenging. The safest option is therefore to also delete your posts.

baseless_discourse ,

I think you are right, user-generated content doesn't seem to be protected. This is surprising to me, as users should hold the rights to their content, which in my mind should enjoy stronger protection than personal data.

TachyonTele ,

That only applies to personal data.

refalo ,

How does GDPR get away with not defining what a website is when referring to them directly in the law? Like, what counts: only HTML? HTTP? FTP? Gopher?

fluxc0 ,

This feels a little iffy to me. It rings of what happened with Reddit.

darkphotonstudio ,

I think people would have less issues with AI training if it was non-profit and for the common good. And there are open source AI projects, many in fact. But yeah, these deals by companies like this are sleazy.

NeatNit ,

OpenAI was literally that until it wasn't

darkphotonstudio ,

I don't think OpenAI actually released any FOSS code, did they?

skullgiver ,
@skullgiver@popplesburger.hilciferous.nl avatar

Up until GPT-3 they were quite open. When GPTs became good, they started claiming sharing the models would be risky, that there were ethical problems, and that they would safekeep the technology. I believe they were even sued by one of their investors for not sticking to their open mission at some point.

The source code they would provide would be pretty useless to most people anyway, unless you have a couple million laying around to spend on GPUs.

Plenty of AI companies do what OpenAI did, without ever sharing any models or writing any papers. We only hear about the open stuff. We see tons of open source AI stuff on Github that's all mostly based on research by either Google or OpenAI. All the Llama stuff exists only because Facebook shared their model (accidentally). All of this stuff is mostly open, even if it's not FOSS.

Compare that to what companies are doing internally. You bet data brokers and other shady shits are sucking up as much data as they can get their hands on to train their own, specialised AI, free from the burdens of "as an LLM I can't do that".

delirious_owl ,
@delirious_owl@discuss.online avatar

This isn't really comparable to reddit, since users can just send a request to SO for all the content. Reddit locking down the API meant we lost access to our content.

FenrirIII ,
@FenrirIII@lemmy.world avatar

If you get something for free, you are the product

delirious_owl ,
@delirious_owl@discuss.online avatar

Like AI doesn't know how to use the Wayback Machine?

mhzawadi ,
@mhzawadi@lemmy.horwood.cloud avatar

Why delete the answer? Why not edit it so that a human can see the answer, but for AI it's a load of nonsense?

chicken ,

There's no way that would work either, they can just store the full edit history and auto-curate as needed.
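A minimal sketch of that auto-curation idea (the revision data and the heuristic below are made up for illustration; a real scraper would pull revision history from the dumps or the site):

```python
# Minimal sketch: if a scraper keeps every revision of an answer, a later
# "nonsense" protest edit is trivial to discard with a simple heuristic.
from dataclasses import dataclass

@dataclass
class Revision:
    timestamp: int
    body: str

def curate(revisions: list[Revision]) -> str:
    """Pick the revision most likely to be the original, useful answer.

    Heuristic only: prefer the longest body, breaking ties by earliest edit,
    on the assumption that sabotage edits shorten or garble the text.
    """
    return max(revisions, key=lambda r: (len(r.body), -r.timestamp)).body

history = [
    Revision(1, "Use functools.lru_cache to memoize the function."),
    Revision(2, "asdf asdf asdf"),  # later protest edit
]
print(curate(history))  # the pre-sabotage revision survives
```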

skullgiver ,
@skullgiver@popplesburger.hilciferous.nl avatar

If that were to happen, I assume companies would just grab an older copy of the dumps from before people started editing their stuff because of the AI bullshit.

SO would ban everyone sabotaging their business plans and things would move on like normal, like what happened to Reddit.

zovits ,

Editing any content to reduce its quality is considered vandalism and gets reverted on SO.

gjoel ,

People did that. Stack overflow reverted the change.

mhzawadi ,
@mhzawadi@lemmy.horwood.cloud avatar

So we need to upvote wrong answers only?

hagar ,

StackOverflow: *grabs money on monetizing massive amounts of user-contributed content without consulting or compensating the users in any way*

Users: *try to delete it all to prevent it*

StackOverflow: *your contributions belong to the community, you can't do that*

Pretty fucked-up laws. A lot of lawsuits going on right now against AI companies for similar issues. In this case, StackOverflow is entitled to be compensated for its partnership, and because the answers are all CC BY-SA 3.0, no one can complain. Now, that SA? Whatever.

9point6 ,

That SA part needs to be tested in court against the AI models themselves

A lot of this shittiness would probably go away if there was a risk that ingesting certain content would mean you need to release the actual model to the public.

hagar ,

Yeah, but their assumption seems to be that you don't? Neither attribution nor share-alike, not even full-on all-rights-reserved copyright, is being respected. Anything public goes, and if questions are asked it's "fair use". If the user retains CC BY-SA over their content, why does giving a bunch of money to StackOverflow entitle OpenAI to use it all under whatever terms they settled on? Boggles me.

Now, say, Reddit Terms of Service state clearly that by submitting content you are giving them the right to "a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness (...) in all media formats and channels now known or later developed anywhere in the world." Speaks volumes on why alternatives (like Lemmy) to these platforms matter.

skullgiver ,
@skullgiver@popplesburger.hilciferous.nl avatar

The funny thing about Lemmy is that the entire Fediverse is basically running a massive copyright violation ring with current copyright law. The license bit every web company has in their terms exists because Facebook wouldn't have the right to show your holiday pictures to your grandma otherwise. The pictures are your property, and just because you uploaded them doesn't mean Facebook has the right to redistribute them. Cropping off the top and bottom to fit it into the timeline? That's a derivative work, they'd need to ask permission or negotiate a license to show that!

The Fediverse runs without any such clauses and just presumes nobody cares about copyright. Which they don't, because the whole thing is based on forwarding all data to everyone.

Nobody is going to sue a Lemmy server for sending their comment to someone else, because there's no money behind any of the servers. Companies like Facebook need to get their shit together, though, because they have large pools of investor money that any shithead with a good lawyer can try to claim, and that's why they have legal disclaimers.

hagar ,

That's interesting. I was looking up "Lemmy Terms of Service" for comparison after getting that quote from the Reddit ToS and could not find anything for Lemmy.ml. Now, after you mentioned it, I looked on my Mastodon instance: nothing either, just a privacy policy. That is indeed kinda weird. Some instances do have their own ToS, though. At least something stating a sublicense for distribution should be there to protect people running instances in locations where it's relevant.

skullgiver ,
@skullgiver@popplesburger.hilciferous.nl avatar

The thing with many of these services is that they're not run by companies with a legal presence, just by some guy(s) who do it for fun. For many laws, personal projects are considered differently compared to business/organisational endeavours.

It's the same thing with personal blogs lacking a privacy policy: the probability of the thing becoming an actual problem in the real world is so abysmally low that nobody bothers, and that's probably okay.

During the first wave of some troll uploading child abuse to various Fediverse servers (mostly Lemmy), a lot of server operators got a rough wake-up call, because suddenly they had content on their servers that could land them in prison. There has been an effort to combat this abuse for larger servers, but I don't think most Lemmy servers run on the Nvidia hardware that's strong enough to support the live CSAM detection code that was developed.

hedgehog ,

The funny thing about Lemmy is that the entire Fediverse is basically running a massive copyright violation ring with current copyright law.

Is it, though?

When someone posts a comment to Lemmy, they do so willingly, with the intent for it to be posted and federated. If they change their mind, they can delete it. If they delete it and it remains up somewhere, they can submit a DMCA request; likewise if someone else posts their copyrighted content.

Copyright infringement is the use of works protected by copyright without permission for their use. When you submit a post or a comment, your permission to display it and for it to be federated is implied, because that is how Lemmy works. A license also conveys permission, but that’s not the only way permission can be conveyed.

skullgiver ,
@skullgiver@popplesburger.hilciferous.nl avatar

The idea that someone does this willingly implies that the user knows the implications of their choice, which most of the Fediverse doesn't seem to do (see: people asking questions like "how do I delete comments on a server I've been defederated from", or surprised after finding out that their likes/boosts are inherently public).

If the implied license was enough, Facebook and all the other companies wouldn't put these disclaimers in their terms of service. This isn't true in every jurisdiction, but it does apply to many important ones.

I agree that international copyright law should work like you imply, but on the other hand, this is exactly why Creative Commons was created: stuff posted on the internet can be downloaded just fine, but rehosting it is not allowed by default.

This is also why I appreciate the people who put those Creative Commons licenses on their comments; they're effectively useless against AI, which is what they seem to be trying to combat, but they do provide rights that would otherwise be unavailable.

Just like with privacy laws and data hosting laws, I don't think the fediverse cares. I think the fediverse is full of a sort of wilful ignorance about internet law, mostly because the Fediverse is just a bunch of enthusiastic nerds. No Fediverse server (except for Threads, maybe) has a Data Protection Officer, even though sites like lemmy.world would legally be required to have one if anyone cared about the law; very little Fediverse software seems to provide DMCA links by default, and I don't think any server is complying with the Chinese, Russian, and European "only store citizens' data on locally hosted servers" laws at all.

hedgehog ,

The idea that someone does this willingly implies that the user knows the implications of their choice, which most of the Fediverse doesn't seem to do

The terms of service for lemmy.world, which you must agree to upon sign-up, make reference to federating. If you don’t know what that means, it’s your responsibility to look it up and understand it. I assume other instances have similar sign-up processes. The source code to Lemmy is also available, meaning that a full understanding is available to anyone willing to take the time to read through the code, unlike with most social media companies.

What sorts of implications of the choice to post to Lemmy do you think that people don’t understand, that people who post to Facebook do understand?

If the implied license was enough, Facebook and all the other companies wouldn't put these disclaimers in their terms of service.

It’s not an implied license. It’s implied permission. And if you post content to a website that’s hosting and displaying such content, it’s obvious what’s about to happen with it. Please try telling a judge that you didn’t understand what you were doing, sued without first trying to delete or file a DMCA notice, and see if that judge sides with you.

Many companies have lengthy terms of service with a ton of CYA legalese that does nothing. Even so, an explicit license to your content in the terms of service does do something - but that doesn’t mean that you’re infringing copyright without it. If my artist friend asks me to take her art piece to a copy shop and to get a hundred prints made for her, I’m not infringing copyright then, either, nor is the copy shop. If I did that without permission, on the other hand, I would be. If her lawyer got wind of this and filed a suit against me without checking with her and I showed the judge the text saying “Hey hedgehog, could you do me a favor and…,” what do you think he’d say?

Besides, Facebook does things that Lemmy instances don’t do. Facebook’s codebase isn’t open, and they’d like to reserve the ability to do different things with the content you submit. Facebook wants to be able to do non-obvious things with your content. Facebook is incorporated in California and has a value in the hundreds of billions, but Lemmy instances are located all over the world and I doubt any have a value even in the millions.

skullgiver ,
@skullgiver@popplesburger.hilciferous.nl avatar

AI companies are hoping for a ruling that says content generated from a model trained on content is not a derivative work. So far, the Sarah Silverman lawsuit seems to be going that way, at least; the claimants were set back because they've been asked to prove the connection between AI output and their specific inputs.

If this does become jurisprudence or law in one or more countries, licenses don't mean jack. You can put the AGPL on your stuff and AI could suck it up into their model and use it for whatever they want, and you couldn't do anything about it.

The AI training sets for all common models contain copyrighted works like entire books, movies, and websites. Don't forget that most websites don't even have a license, and that unlicensed work is as illegal to replicate as any book or movie normally would be, including internet comments. If AI data sets need to comply with copyright, all current AI will need to be retrained (except maybe for that image AI by that stock photo company, which is exclusively trained on licensed work).

hagar ,

the claimants were set back because they’ve been asked to prove the connection between AI output and their specific inputs

I mean, how do you do that for a closed-source model with secretive training data? As far as I know, OpenAI has admitted to using large amounts of copyrighted content, numberless books, newspaper material, all on the basis of fair use claims. Guess it would take a government entity actively going after them at this point.

skullgiver ,
@skullgiver@popplesburger.hilciferous.nl avatar

The training data set isn't the problem. The data set for many open models is actually not hard to find, and it's quite obvious that works by the artists were included in the data set. In this case, the lawsuit was about the Stable Diffusion dataset, and I believe that's just freely available (though you may need to scrape and download the linked images yourself).

For research purposes, this was never a problem: scientific research is exempted from many limitations of copyright. This led to an interesting problem with OpenAI and the other AI companies: they took their research models, the output of research, and turned them into a business.

The way things are going, I expect the law to end up like this: datasets can contain copyrighted work as long as they're only distributed for research purposes, AI models are derivative works, but the output of AI models is not a derivative work, and therefore the output AI companies generate is exempt from copyright. It's definitely not what I want to happen, but the legal arguments that I thought would kill this interpretation don't seem to hold water in court.

Of course, courts only apply law as it is written right now. At any point in time, governments can alter their copyright laws to kill or clear AI models. On the one hand, copyright lobbyists have a huge impact on governance, as much as big oil it seems, but on the other hand, banning AI will just hand an economic advantage to countries that don't care about copyright. The EU has set up AI rules, which I appreciate as an EU citizen, but I cannot deny that this will inevitably lead to a worse environment to do business in compared to places like the USA and China.

hagar ,

Thank you for sharing. Your perspective broadens mine, but I feel a lot more negative about the whole "must benefit business" side of things. It is fruitless to hold any entity whatsoever accountable when a whole worldwide economy is in a free-for-all nuke-waving doom-embracing realpolitik vibe.

Frankly, not sure what would be worse: economic collapse and the consequences to the people, or economic prosperity and... the consequences to the people. Long term, and coming from a country that is not exactly thriving in the grand scheme of things, I guess I'd take the former.

skullgiver ,
@skullgiver@popplesburger.hilciferous.nl avatar

It's a tough balance, for sure. I don't want AI companies to exist in the form they currently do, but we're not getting the genie back into the bottle. Whether the economic hit is worth the freedom and creative rights that I think citizens deserve is a matter of democratic choice. It's impossible to ignore that in China or Russia, where citizens don't have much of a choice, artistic rights and people's wellbeing aren't even part of the equation. Other countries will need a response when companies from those countries start doing work more efficiently. I myself have been using Bing AI more and more as AI bullcrap floods every page of every search engine, fighting AI with AI so to speak.

I saw this whole ordeal coming the moment ChatGPT came out, and I had the foolish hope that legislators would've done something by now. The EU's AI Act will apply from March next year, but it doesn't seem to solve the copyright problem at all. Or rather, it seems to accept the current copyright problem, as the EU's summary put it:

Generative AI, like ChatGPT, will not be classified as high-risk, but will have to comply with transparency requirements and EU copyright law:

  • Disclosing that the content was generated by AI
  • Designing the model to prevent it from generating illegal content
  • Publishing summaries of copyrighted data used for training

The EU seems to have chosen to focus on combating the immediate threat of AI abuse, but seems to be very tolerant of AI copyright infringement. I can only presume this is to make sure "innovation" doesn't get impeded too much.

I'll take this into account during the EU vote that's about to happen soon, but I'm afraid it's too late. I wish we could go back and stop AI before it started, but this stuff has happened and now the world is a little bit better and worse.

bitfucker ,

Yep. Can't wait to overfit an LLM on a lot of copyrighted work and release it into the public domain. Let's see if OpenAI gets pushback from copyright owners down the road.

i_am_not_a_robot ,

Why now? Other people have been profiting off of your Stack Overflow answers for years. This is nothing new.

wuphysics87 ,

Those answers were given in good faith under the presumption that they would be read and used by another person. Not used to train something to remove the interactions which motivated the answer in the first place.

jsomae ,

Can you elaborate on what you mean by "remove the interactions which motivated the answer in the first place"? I'm not sure I follow.

Piemanding ,

People like being social and having discourse online. Probably what brought you here in the first place.

forgotmylastusername ,

The internet had a social contract. The reason people put effort into brain dumping good posts is because the internet was a global collaborative knowledge base for everybody.

Of course there were always capitalists who sought to privatize and profit from resources. The source materials were generally part of the big giant digital continuum of knowledge. For the parts that weren't, there were anarchists who sought to free that knowledge for anyone who wanted to access it.

AI is bringing about the end of all this as platforms are locking down everything. Old boards and forums had already been shuttering for years as social media was centralizing everything around a few platforms. Now those few platforms are being swallowed up by AI where the collective knowledge of humanity is being put behind paywalls. People no longer want to work directly for the profit of private companies.

Capitalists can only see dollar signs. They care not for the geological-epoch-scale forces of nature required to form petroleum. All that matters is whether it can all be sold, and how quickly. Nor do they care for the environmental damage they cause. In the same way, the AI data miners do not care for the digital ecological disaster they are causing.

Moreover, it's a thought-terminating cliche when someone says, "<thing> existed before, so why's it suddenly a problem?". It seems to be yet another trick out of the bag of rhetorical ones that wipes the slate of discourse clean, as if all the arguments against it suddenly need to be explained as if none of them had any validity. Not only that, but the OPs are often seemingly disingenuously naive. It provides the OP with a blank slate to continually "just ask questions", where every response is "but why?", which forces their interlocutors to keep elaborating in excruciating detail to the point where they give up trying to explain minutiae. Thus the OP can conclude by default that they were correct, that it's not a problem after all, because they declare nobody has provided them with answers to their satisfaction.

mbirth ,

Currently, all answers are properly attributed. But once OpenAI has trained and sells a "hackerman" persona, do you really think it will answer people's questions with "This answer was contributed by i_am_not_a_robot", or will it just sell this as its own answer?

Taleya ,

As a tech, I'm fucking howling, because 99% of answers to any given question are already bullshit that ranges from useless to dangerous.

"The machine" can't tell the difference and it's going to be considered authoritative in its blithe stupidity. Hoover up SO all you want, you're just gonna aggregate it with bullshit and poison your own well anyway.

haui_lemmy ,
@haui_lemmy@lemmy.giftedmc.com avatar

Simple answer: people vs corporations. A dev or homelabber getting help from you is very different from a company making billions just by mass shoveling your knowledge to the highest bidder.

The reason we need this as a fediverse service is that everyone can take in this knowledge and one corp doesn't have the ability to sell it. That's where the worth comes from: someone holding the key to it.

i_am_not_a_robot ,

That's not what I mean. When you contribute content to Stack Exchange, it is licensed CC BY-SA. There are websites that scrape this content and rehost it, or at least there used to be. I've had a problem before where all the search results were unanswered Stack Overflow posts or copies of those posts on different sites. Maybe similar to Reddit they restricted access to the data so they could sell it to AI companies.

haui_lemmy ,
@haui_lemmy@lemmy.giftedmc.com avatar

Maybe similar to Reddit they restricted access to the data so they could sell it to AI companies.

Which would be a way to circumvent CC BY-SA.

skullgiver ,
@skullgiver@popplesburger.hilciferous.nl avatar

Nobody smart scrapes them, they provide full dumps for you to download: https://data.stackexchange.com/
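For the curious, a minimal sketch of consuming a dump instead of scraping. It assumes you have already downloaded a site's Posts.xml; the attribute names and the PostTypeId=2 convention for answers reflect the format the dumps have historically used, so treat them as assumptions to verify against your copy.

```python
# Minimal sketch of reading a Stack Exchange data dump (Posts.xml) rather
# than scraping the site. Attribute names (Id, PostTypeId, Score, Body)
# follow the historical dump schema.
import xml.etree.ElementTree as ET

def iter_answers(posts_xml_path: str, min_score: int = 1):
    """Yield (id, score, body) for answers at or above min_score."""
    # iterparse streams the file, so multi-gigabyte dumps don't need to fit in RAM.
    for _, row in ET.iterparse(posts_xml_path, events=("end",)):
        if row.tag == "row" and row.get("PostTypeId") == "2":  # 2 = answer
            score = int(row.get("Score", "0"))
            if score >= min_score:
                yield row.get("Id"), score, row.get("Body")
        row.clear()  # free memory as we go

for post_id, score, body in iter_answers("Posts.xml"):
    print(post_id, score, len(body or ""))
```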

Modern_medicine_isnt ,

So what is the stack overflow replacement?

Weslee ,
cupcakezealot ,
@cupcakezealot@lemmy.blahaj.zone avatar

That would be great if they federated and implemented ActivityPub/ATProto!

cupcakezealot ,
@cupcakezealot@lemmy.blahaj.zone avatar

let's all go back to experts exchange

dukatos ,

Expert sex change?

jubilationtcornpone ,

Data Rule Numero Uno:

Garbage in, garbage out.

Have fun training your LLM on a big steaming pile of hot garbage. That's 80% of Stack Overflow's content.

harrys_balzac ,

Mostly "this has been answered in another thread" and "why don't you Google it" comments in my experience.

DarkDarkHouse ,
@DarkDarkHouse@lemmy.sdf.org avatar

Can’t wait until the top answer to every Google search is “just google it”

LostXOR ,

The other 20% is mostly high quality however, and I'm sure they'd filter out the heavily downvoted crud.
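A minimal sketch of what that filtering could look like (the posts and the cutoff are hypothetical, purely to illustrate dropping net-downvoted content before training):

```python
# Minimal sketch: drop the heavily downvoted crud before training.
posts = [
    {"id": 1, "score": 42, "body": "Accepted, well-explained answer"},
    {"id": 2, "score": -7, "body": "Confidently wrong answer"},
    {"id": 3, "score": 0,  "body": "Unreviewed answer"},
]

SCORE_CUTOFF = 0  # hypothetical threshold: keep anything not net-downvoted

training_set = [p for p in posts if p["score"] >= SCORE_CUTOFF]
print([p["id"] for p in training_set])  # [1, 3]
```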

mnemonicmonkeys ,

You say that as if the garbage gets downvoted

mnemonicmonkeys ,

One time I went on there to figure out an issue in Arduino. The answer one guy gave was "I don't know how to do this in Arduino, here's how you do this in Java". Not only did the mods prevent any other answers from being posted, I tried the guy's suggestion in Java and it didn't even work.

helenslunch ,

Would be a shame if someone used ChatGPT to generate bad answers and a short script to resubmit them back to Stackoverflow. So awful.

zovits ,

SO has mechanisms in place to filter out AI-generated content.

helenslunch ,

I don't believe that.

zovits ,
helenslunch ,

This says nothing about filtering mechanisms

zovits ,

Ah, I think I got the source of misunderstanding: these mechanisms are not automated, but implemented as moderation guidelines and rules.

henfredemars ,

I feel like this content craze is going to evaporate soon because all the new content from here forward is sure to be polluted by LLM output already. AI is fast becoming a snake eating its own tail.

That reminds me. I should go update my licenses to spit in the face of AI training companies.

Sibbo ,

Does GDPR apply to stackoverflow? Since my data there probably does not identify me as a person?

delirious_owl ,
@delirious_owl@discuss.online avatar

You can delete your data, but I don't think it magically makes derivative works disappear. It's licensed SA. This is good.
