In case you didn’t know, you can’t reliably train an AI on content generated by another AI, because it causes distortion that reduces the quality of the output. This phenomenon is known as model collapse. It is also very difficult to separate AI text from human text in a database.

So if you were to start using AI to generate comments and posts on Reddit, Reddit’s database would be less useful for training AI, and the company wouldn’t be able to sell it for that purpose.

  • elrik@lemmy.world · 12 up, 1 down · 10 months ago

    No, because the upvote ratio on posts and comments will be used to signal higher quality content.

    It would take considerable effort and coordination to generate low-quality content, give it an upvote history that isn’t obviously suspicious, and do that for enough content that it actually matters to the training.

    Even if you could accomplish that, you can’t backdate this activity, so they could simply filter out posts and comments after a recent date and still have an enormous amount of data to train on.

    • nodsocket@lemmy.world (OP) · 3 up, 1 down · edited · 10 months ago

      Upvoted content is not higher quality. An AI trained only on the top posts of Reddit would be very funny though.

      They could filter posts by time, but that prevents any further data from being used, which still limits the value of Reddit to buyers. Even all of pre-AI Reddit is probably too small to be useful indefinitely.

      • elrik@lemmy.world · 2 up · 10 months ago

        If the goal of training is to produce output that users “like” or engage with, then yes, upvoted content is higher quality. The definition of quality here will certainly depend on their goals.

        My point is that a bunch of spammed content intended to poison AI training is unlikely to gather upvotes, so it could easily be filtered out if they’re also okay with discarding some human-generated content that was not upvoted.
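
For anyone wondering what the two filters discussed in this thread (a date cutoff plus an upvote threshold) might look like in practice, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Post` fields, the `keep_for_training` helper, the cutoff date, and the thresholds are all hypothetical, and nothing here reflects how Reddit or any buyer actually prepares training data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Post:
    """Hypothetical record for one post or comment."""
    text: str
    created: datetime
    upvotes: int
    downvotes: int


def keep_for_training(post, cutoff, min_votes=5, min_ratio=0.8):
    """Return True if the post passes both filters from the thread.

    cutoff, min_votes and min_ratio are arbitrary illustrative values.
    """
    # Date filter: activity cannot be backdated, so drop anything
    # created at or after the cutoff.
    if post.created >= cutoff:
        return False
    # Vote filter: require enough votes to judge, then a high upvote ratio.
    total = post.upvotes + post.downvotes
    if total < min_votes:
        return False
    return post.upvotes / total >= min_ratio


# Arbitrary cutoff chosen only for this example.
CUTOFF = datetime(2023, 1, 1, tzinfo=timezone.utc)

posts = [
    Post("old, well liked", datetime(2019, 5, 1, tzinfo=timezone.utc), 12, 1),
    Post("old, ignored", datetime(2019, 5, 2, tzinfo=timezone.utc), 1, 0),
    Post("recent, well liked", datetime(2024, 1, 15, tzinfo=timezone.utc), 40, 2),
    Post("old, disliked", datetime(2020, 3, 9, tzinfo=timezone.utc), 3, 9),
]

kept = [p.text for p in posts if keep_for_training(p, CUTOFF)]
print(kept)
```

Note that the sketch also shows both objections raised above: the ignored but human-written post is thrown away along with the spam, and nothing created after the cutoff can ever be used.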