Duplicate Entries in Kagi Assistant Sources

roach

Bug: If multiple searches are performed for a query, the sources for the Assistant might contain duplicates.
Caveat: Since I cannot see the sources and their text snippets directly, I am relying on the model's judgment here. Three searches were performed in the example below, and the most-duplicated source appeared three times, so I assume the bug is caused by the multiple searches. I haven't tested this further.
Link to sample thread: https://kagi.com/assistant/221f5d62-addd-4310-ac52-d19698bdff5b
Model: Qwen 3 235b with Thinking

Detailed Description

In the thread above, three searches are performed for the initial query:

Searching with Kagi:
Deep Think Gemini 2.5 Pro
Gemini 2.5 Pro features
Gemini 2.5 Pro capabilities

When asked how many sources were provided, the model begins to question, within its thinking block, whether some of the sources are actually distinct:

Wait, some of these might be duplicates. For example, indices 17 and 19 from TechTarget seem similar but have different URLs. Let me check:

Index 17: https://www.techtarget.com/whatis/feature/Google-Gemini-25-Pro-explained-Everything-you-need-to-know
Index 19: https://www.techtarget.com/WhatIs/feature/Google-Gemini-25-Pro-explained-Everything-you-need-to-know

The URLs are slightly different (whatis vs WhatIs in the path), but the content might be the same. However, they are separate entries, so they might be different articles. Similarly, others may have similar URLs but different paths.

I then asked the model directly whether the URLs were exactly the same, which it confirmed.

Finally, I asked the model whether the provided text snippets were also the same. According to the model, at least some of the duplicate URLs also have identical (or nearly identical) text snippets.

Since the Assistant's answer quality often hinges on the availability of good sources, it would be desirable to deduplicate the text snippets so that additional snippets can move up into the top 30 search results.
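
To illustrate the kind of deduplication I have in mind, here is a minimal Python sketch. I don't know how Kagi actually represents or merges search results, so everything here is an assumption: the dict shape with a "url" key, the function names, and the idea of lowercasing the URL path (paths are technically case-sensitive, so treating whatis and WhatIs as equal is only a heuristic).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Build a comparison key for a URL.

    Assumption: scheme, host, and path are compared case-insensitively,
    and fragments and trailing slashes are ignored. Lowercasing the path
    is a heuristic, not a general rule.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/").lower()
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def merge_and_dedupe(result_lists, limit=30):
    """Merge results from several searches, keeping the first hit per URL.

    Because duplicates are dropped before truncating to `limit`,
    lower-ranked unique results move up into the freed slots.
    """
    seen = set()
    merged = []
    for results in result_lists:
        for result in results:
            key = normalize_url(result["url"])  # hypothetical result shape
            if key in seen:
                continue
            seen.add(key)
            merged.append(result)
    return merged[:limit]
```

With this key, the two TechTarget URLs quoted above would collapse into a single entry.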

In addition, it might be a good idea to look not only for exact duplicates but also for semantic duplicates: snippets that are worded differently but carry the same informational content. These could be detected with a measure such as cosine similarity, if embeddings for the text snippets already exist.
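
As a rough sketch of the semantic variant, assuming each snippet already comes with an embedding vector (a plain list of floats here): the function names and the 0.95 threshold are placeholders of mine, not anything Kagi is known to use, and a real threshold would need tuning.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_semantic_duplicates(snippets, embeddings, threshold=0.95):
    """Keep a snippet only if it is not too similar to any snippet already kept.

    Snippets are visited in rank order, so the higher-ranked copy survives.
    The threshold is an assumed value for illustration.
    """
    kept_indices = []
    for i, emb in enumerate(embeddings):
        if all(
            cosine_similarity(emb, embeddings[j]) < threshold
            for j in kept_indices
        ):
            kept_indices.append(i)
    return [snippets[i] for i in kept_indices]
```

This greedy pass compares each snippet against everything kept so far, which is quadratic in the worst case but should be negligible for a list of around 30 sources.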