Kagi LLM appears to have been trained on unlicensed content

quinncom

Kagi is using a large language model which appears to have been trained on content that may not have been used with permission. I found this example, where the text generated by the LLM matches exactly a sentence in a copyrighted book.

Steps to reproduce:

Search for “Layman suggests focusing on a handful of important amino acids”.
The Kagi LLM returns this text (screenshot 1, below):

... but for those of us with day jobs, Layman suggests focusing on a handful of important amino acids, such as leucine, lycine, and methionine.

This text is identical (including the misspelling of “lycine”) to that found on page 260 of Outlive: The Science and Art of Longevity by Peter Attia, MD (see screenshot 2, below).

Expected behavior:
LLM should not be trained on unlicensed content?

Debug info:
Edge/macOS/International/US-EAST

Screenshots:

Screenshot 1: search result.
Screenshot 2: copyrighted book page showing identical contents.

This is either a bug, or a prompt for a serious discussion. 🫤

Vlad

quinncom That is not the output of an LLM, but an answer surfaced from a search snippet (it is a verbatim copy from the snippet). As you know, search snippets do not qualify for copyright issues.

LLMs usually do not copy content verbatim. Kagi also does not use LLMs automatically. You need to request one on demand by clicking the Quick Answer link.

quinncom

OK, now I see: it's a snippet extracted from the HTML of the first linked result.

I was confused by the help text behind the (i) button (“Kagi uses language models to answer your questions.”). Perhaps this help text could be updated to avoid this confusion? Mention that the answer comes from one of the pages linked in the results below?

Thanks!

Vlad

quinncom A language model (not an LLM) was used to answer the question, using the provided snippet. That does not mean it was trained on copyrighted material.