Have Kagi index the content of all the books on the internet

carl

Just another thought: There are a large number of "links" to books and publications online right now in Wikipedia, in the references on all articles. How that should be interpreted and connected by a search engine I'll leave to the professionals. Aside from this, the books themselves usually contain references when they are nonfiction. Again, difficult for a computer to process, but there might be a way.

Edit: More: Weighting book chapter titles higher than content could be a way to improve search results from books. When a chapter title matches your query, the chapter content is probably relevant and of high quality.
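To make the chapter-title idea a bit more concrete, here is a minimal sketch of field-weighted scoring. Everything in it is an assumption for illustration: the `Chapter` type, the weights (a title hit counts five times a body hit) and the plain term counting. A real engine would use something more refined, and I have no idea how Kagi ranks internally.

```python
from dataclasses import dataclass


@dataclass
class Chapter:
    title: str
    body: str


# Hypothetical weights: a hit in the chapter title counts five times
# as much as a hit in the chapter text.
TITLE_WEIGHT = 5.0
BODY_WEIGHT = 1.0


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on anything that is not a letter or digit."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def score(chapter: Chapter, query: str) -> float:
    """Sum the weighted number of occurrences of each query term."""
    title_terms = tokenize(chapter.title)
    body_terms = tokenize(chapter.body)
    total = 0.0
    for term in tokenize(query):
        total += TITLE_WEIGHT * title_terms.count(term)
        total += BODY_WEIGHT * body_terms.count(term)
    return total


def rank(chapters: list[Chapter], query: str) -> list[Chapter]:
    """Return the chapters ordered from best to worst match."""
    return sorted(chapters, key=lambda c: score(c, query), reverse=True)
```

With these weights, a chapter titled "Lentil Soup" outranks a chapter that only mentions lentil soup a few times in passing, which is the behaviour described above.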

Vlad

carl The main problem remains relevance.

Suppose you search for the phrase "blue widgets". How would Kagi know to surface a book from Anna's Archive vs a website? Where and how to rank it?

carl

Vlad

To start with, "Books" could be a separate tab in Kagi along with "Images", "Videos", "Podcasts" etc.

Soon enough, the search engine could start showing book snippets among normal search results, just as images and videos etc are already mixed in there. The inner workings of a search engine are unfamiliar to me, so I can unfortunately not give a more technical answer.

But as I see it, e-books could be surfaced in the same way as any website. You have paragraphs and headings. You don't have any hard links online to content in e-books yet. Once search engines and web browsers start working better with e-books, people will become better at linking to that content, just as people right now are linking to timestamps and chapters in YouTube videos.

Even without anybody linking to the content in an e-book, a search engine should be able to index and surface the content when it's relevant to the user. I have several websites that nobody else links to, with a ton of real people visiting, that are very well surfaced by Google, Kagi and Bing. I always assumed page rank was dead, since the proof is in the pudding.

Let me give an example for my vision of better e-book integration:

If I search for "South! book", today Kagi shows these results:

1. Wikipedia
2. Goodreads
3. Amazon
4. "Interesting finds"
5-8. Irrelevant results
9. The book in a "pirate" repository.

In my opinion the first result should be for example this:
https://standardebooks.org/ebooks/ernest-shackleton/south/downloads/ernest-shackleton_south.epub

Or at least very high among the results. So that could be a start: surfacing e-books when people search for the book title. But the optimal thing would be for the search engine to treat the content of e-books like the content of any website when indexing and ranking.

carl

Vlad

I'd like to return to this question. It seems that it is actually doable to build something like PageRank for books and PDFs. Wikipedia articles are extensively sourced, and the key is ISBNs. Many citations on Wikipedia include the ISBN, and shadow libraries such as Anna's Archive let you search by that ISBN and find the book or paper. The references on Wikipedia even have page numbers, but it seems to be a pain to get this to work right.
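As a rough illustration of the ISBN idea, the sketch below pulls ISBNs out of raw article wikitext and counts how many articles cite each one, which could serve as a crude citation signal. It assumes you already have the wikitext as strings and that citations use the `isbn` parameter of the citation templates; the function names are made up for this example, and the check digit is not validated.

```python
import re
from collections import Counter

# Matches the isbn parameter inside a citation template,
# for example "| isbn = 978-0-306-40615-7".
ISBN_PARAM = re.compile(
    r"\|\s*isbn\s*=\s*([0-9Xx][0-9Xx\- ]{8,16}[0-9Xx])",
    re.IGNORECASE,
)


def normalize_isbn(raw: str) -> str:
    """Drop hyphens and spaces so one book always maps to one key."""
    return re.sub(r"[^0-9Xx]", "", raw).upper()


def extract_isbns(wikitext: str) -> list[str]:
    """Return every ISBN-10 or ISBN-13 found in citation templates."""
    found = []
    for match in ISBN_PARAM.finditer(wikitext):
        isbn = normalize_isbn(match.group(1))
        if len(isbn) in (10, 13):
            found.append(isbn)
    return found


def citation_counts(articles: dict[str, str]) -> Counter:
    """Count how many distinct articles cite each ISBN."""
    counts = Counter()
    for wikitext in articles.values():
        # A set, so an article citing the same book twice counts once.
        counts.update(set(extract_isbns(wikitext)))
    return counts
```

Books cited by many articles would then float to the top of `citation_counts(...).most_common()`. This is only a count of incoming citations, not a full PageRank, which would also weigh the importance of the citing articles.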

Could there be a way to make the search engine understand Wikipedia references where no ISBN is given? For a person it is trivial to find the books referenced by the article, but can a machine do that automatically?

The way I see it, there's really no difference between a URL and an ISBN that makes one better than the other for building an index. And maybe I'm dreaming, but wouldn't it be nice to be able to enter isbn:// in the browser just in the same way you enter https:// ? What's keeping that from happening? There are services that will parse those numbers.

What if the search engine and web browser could have links to content like this: ISBN://123456789/chapterY/headingX
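Here is a small sketch of how such a link could be taken apart. To be clear, the `isbn://` scheme and the chapter/heading path layout are my own hypothetical format from the line above, not an existing standard, and `BookLink` and `parse_book_link` are names invented for this example. The only real rule used is the ISBN-13 check digit, so the placeholder number above would be rejected.

```python
from dataclasses import dataclass
from urllib.parse import unquote, urlsplit


@dataclass
class BookLink:
    isbn: str
    path: list[str]


def is_valid_isbn13(isbn: str) -> bool:
    """Verify the ISBN-13 check digit: weights alternate between 1 and 3."""
    if len(isbn) != 13 or not isbn.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(isbn))
    return total % 10 == 0


def parse_book_link(link: str) -> BookLink:
    """Split a hypothetical isbn:// link into the ISBN and its location path."""
    parts = urlsplit(link)
    if parts.scheme.lower() != "isbn":
        raise ValueError("not an isbn:// link")
    isbn = parts.netloc.replace("-", "")
    if not is_valid_isbn13(isbn):
        raise ValueError("invalid ISBN-13: " + parts.netloc)
    # Everything after the ISBN is treated as chapter/heading segments.
    path = [unquote(segment) for segment in parts.path.split("/") if segment]
    return BookLink(isbn=isbn, path=path)
```

A browser or search engine would still need a resolver that maps the ISBN to an actual file, and a convention for what the path segments mean inside the book.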

The W3C recently updated their recommendation for EPUB on this page:
https://www.w3.org/TR/epub-33/

It is a bit too technical for me. I tried to understand whether they already have a standard for linking to locations within books, but I couldn't make sense of the docs.

Alright, let's take a look at what options are available today to search within books:

1. Google
You have to specifically use the "Books" tab for it to surface any books. The results are excellent, with book name, author name, publishing year and a snippet of the relevant text, with page number. Clicking the result takes you to a book preview hosted by Google, where you might or might not be allowed to see the content. The book preview is presented pretty much like a PDF, with good performance and bad accessibility. You have links to try to find the book in a library or to buy it online. When you exit the preview, Google shows you a splendid page about the book and author.

2. Z-Library
They offer fulltext search of pretty bad quality. It is evident that they have indexed the content of a ton of their books (if not all), but their search feature is set by default to surface "Most Popular" results instead of "Best match". However, changing that doesn't improve results much. Results are presented with a thumbnail, title, author, relevant text snippets, year, language, and a link to the book (or PDF) presented very nicely on Z-Library, where you can also download it.

3. JSTOR
Limited content, I couldn't find anything relevant using their fulltext search, but it seems to be presented nicely.

4. Open Library (Internet Archive)
Quick and pretty good fulltext search, books are presented with links to read or to borrow. However, most material is very old, since they focus on stuff out of copyright. They boast 4 million books to search within.

And these are all resources to search within books that I could find. The prominent shadow libraries Library Genesis and Anna's Archive will only search for book titles, not the content.

In the Orion feedback forum, I've suggested a feature for the browser to open ebooks, but there is also the option of going the Google route and having Kagi open some kind of web view with the book inside. I assume it will take some time before browsers in general start treating ebooks nicely.

For people looking for content in books today, the best option is to use Google's book search and preview, and then acquire the book in some way using the ISBN with a third party, such as a bookstore, library, or shadow library. Can Kagi beat Google?

E-books have been somewhat embraced by the tech giants. Google doesn't want anything to do with ePub files, but has indexed books and made an excellent search feature. Apple has an ePub reader included on macOS and iOS devices. Click an ePub in Safari on your iPhone and it opens without an issue. Microsoft used to display ePub files natively in Microsoft Edge, until they switched to Chromium. I think I don't need to mention Amazon's involvement in e-books 😉

I know Kagi's main base right now is programmers, and that the content they need is usually not found in books. But there are many different sectors of knowledge workers who would be much better served by having search results from books and PDFs than from web pages: engineers, medical professionals, historians, journalists, and all kinds of people who do research professionally or for fun.

Anna's Archive provides torrents of their whole library, which can be useful for making an index. Unfortunately I wouldn't be able to download them, since it requires a ton of storage.

I think it would be possible for Kagi to massively improve the availability of the world's knowledge, without too much work.

Vlad

carl Having Books as a separate tab could indeed solve this problem.

carl

A good place to start experimenting with crawling the content of e-books could be the Standard Ebooks library, which offers high-quality files of classic books that are out of copyright:
https://standardebooks.org/bulk-downloads
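For anyone who wants to try this, here is a minimal sketch of pulling headings and paragraphs out of an EPUB using only the Python standard library. It relies on the fact that an EPUB is a ZIP archive containing XHTML files. The rest is simplification on my part: it reads files in name order instead of the reading order defined in the package document, it will also pick up navigation pages such as the table of contents, and `SectionParser` and `extract_sections` are names I made up.

```python
import zipfile
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionParser(HTMLParser):
    """Collect (heading, [paragraphs]) pairs from one XHTML document."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._tag = None  # heading or paragraph tag currently open
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag == "p":
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tags such as <em> inside a paragraph
        text = " ".join("".join(self._buf).split())
        if tag in HEADINGS:
            self.sections.append((text, []))
        elif text:
            if not self.sections:
                self.sections.append(("", []))  # text before any heading
            self.sections[-1][1].append(text)
        self._tag = None


def extract_sections(epub_file):
    """Return all sections found in the XHTML files of an EPUB archive."""
    sections = []
    with zipfile.ZipFile(epub_file) as book:
        for name in sorted(book.namelist()):
            if name.endswith((".xhtml", ".html")):
                parser = SectionParser()
                parser.feed(book.read(name).decode("utf-8"))
                parser.close()
                sections.extend(parser.sections)
    return sections
```

The output is the same kind of heading-plus-paragraph structure a crawler gets from a web page, which is the point: once unpacked, a book does not look very different from a website.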

As for the shadow libraries, I notice that Anna's Archive links straight to the .ePub files without going through IPFS (InterPlanetary File System), which should mean they can be easier to crawl, right?

If Kagi would present some snippets from indexed books in results, what would be the best way to actually link to the book? Should it be just linking to the source ePub, like any other online resource? Most books are on shadow libraries, and I think these source URLs change quite frequently. Should the link be the ISBN number and leave the rest to the user to figure out? Should the link be a hash like in IPFS, which seems to soon be better integrated with browsers and such?
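One way to picture the hash option: the index stores a record per book with the ISBN, a hash of the file, and whatever mirror URLs are currently known, so the link survives when a URL changes. The sketch below uses a plain SHA-256 digest as the identifier. That is an assumption for illustration only; real IPFS content identifiers are built differently, and `BookRecord` and `make_content_id` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field


def make_content_id(data: bytes) -> str:
    """Derive an identifier from the file contents, not from its location."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


@dataclass
class BookRecord:
    isbn: str
    content_id: str
    # Mirror URLs can come and go without invalidating the record.
    mirrors: list[str] = field(default_factory=list)

    def verify(self, data: bytes) -> bool:
        """Check that a downloaded file is the one that was indexed."""
        return make_content_id(data) == self.content_id
```

The search result could then link to the record, and the client picks any mirror that still works and verifies the download against the hash.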

carl

I apologize for bumping my thread, but I seriously think there is an untapped gold mine in crawling and indexing e-books. We're all here because we wanted something better than the spammy SEO web promoted by Google. With AI writing, it's only going to get worse. The one example that always pops up when discussing this is how difficult it is to find a simple recipe online when you want to cook something. Almost every person has experienced this, so I'd like to show an example here:

First result on Kagi when searching for "lentil soup":
https://www.recipetineats.com/lentil-soup/

Below is a screenshot from an e-book full of soup recipes:

The complete E-Book is smaller than the average size of tracking javascripts loaded to see a single recipe on a modern website.

Vlad

carl Btw I suggest using Kagi's "Recipes" lens for finding recipes.

carl

Speaking of the devil... Just some time after I posted the above, Anna's Archive published a blog post detailing their new standard for dealing with their e-book datasets, which, if I understand correctly, will make it easier for bulk processors to index the content (they mention Elasticsearch):

https://annas-blog.org/annas-archive-containers.html

At 420 terabytes in size, the library should be manageable for Kagi to index, right?

carl

Here is my humble suggestion on how book search results could be presented to the user, in much the same style as the podcast search results. If Kagi could hook into the fulltext search of book content that Google has, that could be a stepping stone until there comes a day when Kagi or somebody else makes a more complete index. Google fulltext search inside books is of fine quality as it is, usually surfacing good answers.

The buttons under each result would use the ISBN to find the book in different services:

Google Books for their snippets or full books
Amazon for reviews and purchase
Anna's Archive for metadata and full books (AA includes the WorldCat index for finding libraries with the books)
Goodreads for reviews and social reading

In my suggestion, book titles that are in the Standard Ebooks project are specially highlighted. This open source project collects and offers public domain books for download, in the process formatting the books and fixing typography so that the books are of the highest quality for e-readers. The books from this project could be a good starting point for Kagi to experiment with indexing books, since they are free and public domain.

So what do you say, fellas?

kagi-book-search.pdf

321kB

CrunchyFritos

This is a great idea. Thanks carl for the passion and for pushing this. I will think about ways to offer constructive feedback, but clearly having some sort of book search or ranking or lens or AI helper or something would do a lot to help "humanize the internet." Books are such a rich source of high quality information and will probably survive the upcoming onslaught of AI-generated SEO fluff longer than much of the rest of the web.

carl

CrunchyFritos

Thank you! I think we're not far away from starting to integrate books better online. I really hope Kagi will be part of that.

smartbeach

archive.org also has a lot of books, and they have a search: https://archive.org/details/books

Maybe an integration with them would make sense too, if possible?

slamor

It's been a year and still no change?

carl

More sources for books and documents: https://www.gutenberg.org and https://www.hathitrust.org
