A large portion of mankind's most important and most unique content has been written in books, long before the internet, radio or TV existed, and throughout the existence of all these media to this very day. I think many of us use Kagi because machine-generated content took over Google's search results.

Right now there is a vast treasure of e-books on websites such as Library Genesis and Anna's Archive, the latter boasting more than 14 million books and 86 million papers. Does Kagi have any plans for indexing the content of the books in these libraries, so that people can find the information contained in them?

The search results would only need to let the user know from which book the snippet comes, and then it is up to the user herself to purchase the book, borrow it from a library, or obtain it in other ways.

I cannot think of a better way to humanize the web.

  • Vlad replied to this.

    I think the idea is great but your title is misleading. It seems to assume that Kagi already does this. As a result, this thread might not really get the traction you are looking for.

    • carl replied to this.

      blosh Good point, I will try to change my title. Edit: I hope the new title is better.

        carl changed the title to "Have Kagi index the content of all the books on the internet".

          carl Interesting idea, thanks. Can you give a few use cases for searches that would be much improved if we had this? Are you perhaps looking for a way to "talk" to these books instead (with an AI)?

          • carl replied to this.

            It might be worth checking if Google allows their book scans to be usable by corporate entities. Replacement Documents would be nice to look into supporting, as they specialize in manuals and guides for old software. This latter source would be good for giving the AI some training when it comes to recognizing different fonts, text in pictures, and so forth.

            Definitely want Kagi AI to be able to look at online books and give responses about them. There is Gamefaqs, which has text guides for many games, sometimes with multiple people making separate observations on the subject matter. Having Kagi AI look at each guide for a game would be handy.

            Mhm. Come to think of it, maybe a collaboration with Fantasy Grounds to offer Kagi as an aide might be good to pursue someday? Fantasy Grounds is virtual tabletop software, used to play roleplaying games. They sell digital books filled with rules and lore, which in turn are usable within the software. Having Kagi AI act as a search engine for those owned books within Fantasy Grounds would be very useful, and would spread the word about Kagi.

            Vlad Not really. I'm looking for books being treated as first-class search results and be judged by the relevance of their content in the same way as a web page.

            Practically I think the best presentation would be a snippet of the relevant text, with the book and author name, maybe chapter if possible.

            As for an example, let's take the easiest, a search query relating to the Bible: "The parable of the talents". The Kagi results are fine, but I think the actual quote with a reference to the book is the most desired result. Now it is second after Wikipedia.

            Another example query: "fur taxes in Kamchatka". The Kagi results are okay, with a relevant Wikipedia article, and the second result seems to be adequate as well. The third result is spam and the rest of the results seem OK. I think a snippet or quote from a book describing "fur taxes in Kamchatka" should be somewhere in the search results, with a reference to which book. Why not?

            For a lot of historical accounts, the web contains only references and derivative work, when the books have been there for a long time.

            Another query example: "Benjamin Franklin's first newspaper": I'd like to see a snippet from his own account about this from his autobiography in the search results.

            Another query example: "sinking of the endeavour" - Here again I'd like to have a snippet or reference to the first-hand account from any of the several books that were published by the people who saw it.

            Not to mention manuals, guides and instructions for everything under the sun. How about a tour guide to your favorite place from the year 1975? You will not find anything like that on the web, since that is info that needs to stay always current.

            Knowledge has travelled from spoken to letters to books to web and now with YouTube it is going back to spoken again. Indexing books and YouTube is to me as important as indexing the web. Past, present and future in a way.

            The web is not as deep as we like to think, because much of it is just repeating and reiteration of things published on other web pages. And this I write now before AI generated content is taking over. I think results could become vastly improved in almost all accounts by including books.

              carl

              I'm looking for books being treated as first-class search results and be judged by the relevance of their content in the same way as a web page.

              This is impossible to do as the main factor for relevance is link graph, and nobody links to a book page or even a book. When two books contain the sentences "Sun is yellow" and "Sun is orange", you have no way of knowing which one is more relevant, or whether this result should even be surfaced for the user query "what color is the sun".

              Instead I would imagine it would have to work something more like this:
              https://books.google.com/talktobooks/

              And I can imagine us having the same thing for podcasts, where we would attempt to transcribe all existing podcasts too.

              Can we list all sources where we would get books from?

                Vlad This is impossible to do as the main factor for relevance is link graph, and nobody links to a book page or even a book.

                Are you sure this is how it works still? I have launched several successful websites that are ranked high on Google and other search engines just on their content. Nobody has ever linked to them, but they rank high and get many visitors.

                Even so, maybe it would be possible to then have a separate search tab for books, that only looks at the content. Like search engines did before Google introduced Pagerank.

                As for sources, Anna's Archive combines all the big ebook libraries, so I think it is a good place to start. Since the .epub format is already normal text, maybe it could be a start to index only them and ignore PDFs for the while?

                E-Books are very simple files (basically XHTML and CSS), and I think they should be better integrated with not only search engines, but with browsers also. In the beginning of the WWW, browsers would only display HTML text. Then came images, then came video, then PDF - but we still don't have books? Can we change that? We could play Quake in the web browser in 2008...
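To illustrate how little is needed to get at the text, here is a minimal sketch using only the Python standard library. It treats an .epub as what it is (a zip archive containing XHTML files) and pulls out the visible text per file. The helper names `epub_text` and `make_demo_epub` and the demo content are made up for illustration. The sketch also ignores the reading order declared in the package document, so a real indexer would need more than this.

```python
import io
import zipfile
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from an (X)HTML document, skipping <style>/<script>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <style>/<script> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def epub_text(epub_file):
    """Return {member_name: plain_text} for every (X)HTML file in the archive."""
    texts = {}
    with zipfile.ZipFile(epub_file) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = TextExtractor()
                parser.feed(zf.read(name).decode("utf-8", errors="replace"))
                parser.close()
                texts[name] = " ".join(parser.parts)
    return texts


def make_demo_epub():
    """Build a tiny in-memory EPUB-like zip (hypothetical content, demo only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr(
            "text/chapter-1.xhtml",
            "<html><head><style>p{margin:0}</style></head>"
            "<body><h1>Lentil Soup</h1><p>Simmer the lentils.</p></body></html>",
        )
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name, text in epub_text(make_demo_epub()).items():
        print(name, "->", text)
```

`epub_text` accepts a file path as well as a file-like object, so pointing it at a downloaded .epub should work the same way as the in-memory demo.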

                  Just another thought: There is a large number of "links" to books and publications online right now in Wikipedia, in the references on all articles. How that should be interpreted and connected by a search engine I'll leave to the professionals. Aside from this, the books themselves usually contain references when they are non-fiction. Again, difficult to process for a computer, but there might be a way.

                  Edit: More: Weighing book chapter titles higher than content could be a way to improve search results from books. When a chapter title matches with your query, the chapter content should probably be relevant and of high quality.
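A rough sketch of what weighting chapter titles higher could look like, using content only and no link graph. This is not how Kagi or any real engine ranks. The function names, the `title_weight` of 5.0 and the sample chapters are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_chapter(query, title, body, title_weight=5.0):
    """Score one chapter for a query.

    Every query-term hit in the title adds title_weight (an arbitrary,
    assumed value). The body contributes the fraction of its words that
    are query terms, so long chapters are not favoured just for length.
    """
    terms = set(tokenize(query))
    title_counts = Counter(tokenize(title))
    body_tokens = tokenize(body)
    body_counts = Counter(body_tokens)
    title_hits = sum(title_counts[t] for t in terms)
    body_hits = sum(body_counts[t] for t in terms)
    body_rate = body_hits / len(body_tokens) if body_tokens else 0.0
    return title_weight * title_hits + body_rate


def rank(query, chapters):
    """chapters: list of (title, body) tuples. Returns titles, best match first."""
    ordered = sorted(
        chapters,
        key=lambda c: score_chapter(query, c[0], c[1]),
        reverse=True,
    )
    return [title for title, _ in ordered]


if __name__ == "__main__":
    # Hypothetical chapters, made up for the demo.
    chapters = [
        ("Rivers of Siberia", "Fur taxes are mentioned here once, in Kamchatka."),
        ("Cooking", "Lentil soup."),
        ("Fur taxes in Kamchatka", "The yasak was collected in sable pelts."),
    ]
    print(rank("fur taxes Kamchatka", chapters))
```

With these made-up chapters, the one whose title matches the query ranks first even though its body never repeats the query words, which is the effect described above.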

                  • Vlad replied to this.
                    21 days later

                    carl The main problem remains relevance.

                    Suppose you search for the phrase "blue widgets". How would Kagi know to surface a book from Anna's Archive vs. a website? Where and how to rank it?

                      Vlad

                      To start everything, "Books" could be a separate tab in Kagi along with "Images", "Videos", "Podcasts" etc

                      Soon enough, the search engine could start showing book snippets among normal search results, just as images and videos etc are already mixed in there. The inner workings of a search engine are unfamiliar to me, so I can unfortunately not give a more technical answer.

                      But as I see it, E-Books could be surfaced in the same way as any web site. You have paragraphs and headings. You don't have any hard links online to content in e-books yet. Once search engines and web browsers start working better with e-books, people will become better at linking to that content. Just as people right now are linking to timestamps and chapters in YouTube videos.

                      Even without anybody linking to the content in an e-book, a search engine should be able to index and surface the content when it's relevant to the user. I have several websites that nobody else links to, with a ton of real people visiting, that are very well surfaced by Google, Kagi and Bing. I always assumed page rank was dead, since the proof is in the pudding.

                      Let me give an example for my vision of better e-book integration:

                      If I search for "South! book", today Kagi shows these results:

                      1. Wikipedia
                      2. Goodreads
                      3. Amazon
                      4. "Interesting finds"
                      5-8. Irrelevant results
                      9. The book in a "pirate" repository.

                      In my opinion the first result should be for example this:
                      https://standardebooks.org/ebooks/ernest-shackleton/south/downloads/ernest-shackleton_south.epub

                      Or at least very high among the results. So that could be a start: Surfacing e-books when people search for the book title. But the optimal thing being the search engine treating the content of E-Books like the content on any website when indexing and ranking.

                        A good place to start experimenting with crawling the content of E-Books could be the Standard Ebooks library, which offers high-quality files of classic books that are out of copyright:
                        https://standardebooks.org/bulk-downloads

                        As for the shadow libraries, I notice that Anna's Archive links straight to the .epub files without jumping through IPFS (InterPlanetary File System), which should mean they can be easier to crawl, right?

                        If Kagi would present some snippets from indexed books in results, what would be the best way to actually link to the book? Should it be just linking to the source ePub, like any other online resource? Most books are on shadow libraries, and I think these source URLs change quite frequently. Should the link be the ISBN number and leave the rest to the user to figure out? Should the link be a hash like in IPFS, which seems to soon be better integrated with browsers and such?

                          I apologize for bumping my thread, but I seriously think there is an untapped gold mine in crawling and indexing e-books. We're all here because we wanted something better than the spammy SEO-web promoted by Google. With AI writing, it's only going to get worse. The one example that always pops up when discussing this is how difficult it is to find a simple recipe online when you want to cook something. Almost everyone has experienced this, so I'd like to show an example here:

                          First result on Kagi when searching for "lentil soup":
                          https://www.recipetineats.com/lentil-soup/

                          Below a screenshot from an E-Book full of soup recipes:

                          The complete E-Book is smaller than the average size of tracking javascripts loaded to see a single recipe on a modern website.

                            carl Btw I suggest using Kagi's "Recipes" lens for finding recipes.

                              a month later

                              Vlad

                              I'd like to return to this question. It seems that it is actually doable to build something like PageRank for books and PDFs. Wikipedia articles are extensively sourced, and the key is ISBN numbers. Many citations on Wikipedia reference the ISBN number, and shadow libraries such as Anna's Archive let you search by that ISBN number and find the book or paper. The references on Wikipedia even have page numbers, but it seems to be a pain to get this to work right.
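As a very rough sketch of that idea: Wikipedia's citation templates carry an `isbn=` field, so one could pull those out of the article source, check the ISBN checksum, normalise everything to ISBN-13, and count how many articles cite each book. That count is only a crude stand-in for a link graph, not PageRank itself. The function names and the sample wikitext below are assumptions for illustration, and real wikitext is messier than this regex expects.

```python
import re
from collections import Counter

# Matches "|isbn=..." or "| isbn = ..." inside a citation template.
ISBN_FIELD = re.compile(r"\|\s*isbn\s*=\s*([0-9Xx][0-9Xx \-]{8,16})", re.IGNORECASE)


def is_valid_isbn13(digits):
    """ISBN-13 check: weights alternate 1,3 and the total must be divisible by 10."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def is_valid_isbn10(chars):
    """ISBN-10 check: weights run 10..1 and the total must be divisible by 11."""
    if len(chars) != 10 or not chars[:9].isdigit() or chars[9] not in "0123456789X":
        return False
    values = [int(c) for c in chars[:9]] + [10 if chars[9] == "X" else int(chars[9])]
    return sum(v * w for v, w in zip(values, range(10, 0, -1))) % 11 == 0


def to_isbn13(raw):
    """Normalise a cited ISBN to 13 digits, or return None if the checksum fails."""
    chars = re.sub(r"[\s\-]", "", raw).upper()
    if is_valid_isbn13(chars):
        return chars
    if is_valid_isbn10(chars):
        core = "978" + chars[:9]
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return None


def citation_counts(articles):
    """articles: {title: wikitext}. Counts how many articles cite each ISBN."""
    counts = Counter()
    for wikitext in articles.values():
        cited = {to_isbn13(m) for m in ISBN_FIELD.findall(wikitext)}
        cited.discard(None)  # drop citations whose checksum did not validate
        counts.update(cited)
    return counts


if __name__ == "__main__":
    # Made-up article snippets; the ISBN is a commonly used example number.
    articles = {
        "Article A": "{{cite book |title=X |isbn=978-0-306-40615-7 |page=12}}",
        "Article B": "{{cite book |title=X |isbn = 0-306-40615-2}}",
        "Article C": "{{cite book |title=Y |isbn=978-0-306-40615-0}}",  # bad checksum
    }
    print(citation_counts(articles))
```

Each article counts once per book no matter how often it cites it, and the ISBN-10 and ISBN-13 forms of the same book end up under one key.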

                              Could there be a way to make the search engine understand Wikipedia references where no ISBN is given? For a person it is trivial to find the books referenced by the article, but can a machine do that automatically?

                              The way I see it, there's really no difference between a URL and an ISBN that makes one better than the other for building an index. And maybe I'm dreaming, but wouldn't it be nice to be able to enter isbn:// in the browser just in the same way you enter https:// ? What's keeping that from happening? There are services that will parse those numbers.

                              What if the search engine and web browser could have links to content like this: ISBN://123456789/chapterY/headingX
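Just to make the link format above concrete, here is a small sketch of how such a link could be taken apart with Python's standard URL parser. The `isbn://` scheme is purely hypothetical (as far as I know the only existing thing in this direction is the `urn:isbn:` namespace, which identifies a book but not a place inside it), and `parse_book_link` is a made-up helper name.

```python
from urllib.parse import unquote, urlsplit


def parse_book_link(link):
    """Split a hypothetical isbn:// link into the ISBN and a path inside the book.

    Assumed format: isbn://<isbn>/<segment>/<segment>/...
    """
    parts = urlsplit(link)
    # urlsplit lowercases the scheme, so "ISBN://" and "isbn://" both pass.
    if parts.scheme != "isbn":
        raise ValueError("not an isbn:// link: " + link)
    isbn = parts.netloc.replace("-", "")
    segments = [unquote(s) for s in parts.path.split("/") if s]
    return {"isbn": isbn, "path": segments}


if __name__ == "__main__":
    print(parse_book_link("ISBN://123456789/chapterY/headingX"))
```

A search engine or reader would still need a rule for mapping segments like `chapterY` and `headingX` onto the actual files and headings inside an e-book, which is the hard part this sketch leaves open.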

                              The W3 recently updated their recommendations for epub on this page:
                              https://www.w3.org/TR/epub-33/

                              It is a bit too technical for me. I tried to understand whether they already have a standard for linking to places within books, but I couldn't make sense of the docs.

                              Alright, let's take a look at what options are available today to search within books:

                              1. Google
                              You have to specifically use the "Books" tab for it to surface any books. The results are excellent, with book name, author name, publishing year and a snippet of the relevant text, with page number. Clicking the result takes you to a book preview hosted by Google, where you might or might not be allowed to see the content. The book preview is presented pretty much like a PDF, with good performance and bad accessibility. You have links to try to find the book in a library or try to buy it online. When you exit the preview, Google shows you a splendid page about the book and author.

                              2. Z-Library
                              They offer fulltext search of pretty bad quality. It is evident that they have indexed the content of a ton of their books (if not all), but their search feature is set by default to surface "Most Popular" results instead of "Best match". However, changing that doesn't improve results much. Results are presented with a thumbnail, title, author, relevant text snippets, year, language, and a link to the book (or PDF) presented very nicely on Z-Library, where you can also download it.

                              3. JSTOR
                              Limited content, I couldn't find anything relevant using their fulltext search, but it seems to be presented nicely.

                              4. Open Library (Internet Archive)
                              Quick and pretty good fulltext search, books are presented with links to read or to borrow. However, most material is very old, since they focus on stuff out of copyright. They boast 4 million books to search within.

                              And these are all resources to search within books that I could find. The prominent shadow libraries Library Genesis and Anna's Archive will only search for book titles, not the content.

                              In the Orion feedback forum, I've suggested as a feature for the browser to open ebooks, but there is also the option of going the Google route and have Kagi open some kind of web-view with the book inside. I assume it will take some time before browsers in general will start treating ebooks nicely.

                              For people looking for content in books today, the best option is to use Google's book search and preview, and then acquire the book in some way using the ISBN with a third party, such as a bookstore, library, or shadow library. Can Kagi beat Google?

                              E-Books have been somewhat embraced by the tech giants. Google doesn't want anything to do with ePub files, but has indexed books and made an excellent search feature. Apple has an ePub reader included on macOS and iOS devices. Click an ePub in Safari on your iPhone and it opens without an issue. Microsoft used to view ePub files natively in Microsoft Edge, until they switched to Chromium. I think I don't need to mention Amazon's involvement in e-books 😉

                              I know Kagi's main base right now is programmers, and that the content they need is usually not found in books. But there are many different sectors of knowledge workers who would be much better served by having search results from books and PDFs than from web pages. Engineers, medical professionals, historians, journalists, and all kinds of people who do research professionally or for fun.

                              Anna's Archive provides torrents with their whole library, which can be useful for making an index. Unfortunately I wouldn't be able to download them, since it requires a ton of storage.

                              I think it would be possible for Kagi to massively improve the availability of the world's knowledge, without too much work.

                              Speaking of the Devil... Just some time after I posted the above, Anna's Archive published a blog post detailing their new standard for dealing with their e-book datasets, which, if I understand correctly, will make it easier for bulk processors to index the content (they mention Elasticsearch):

                              https://annas-blog.org/annas-archive-containers.html

                              At 420 terabytes in size, the library should be manageable for Kagi to index, right?

                              3 months later

                              Here is my humble suggestion on how book search results could be presented to the user, in much the same style as the podcast search results. If Kagi could hook into the fulltext search of book content that Google has, that could be a stepping stone until there comes a day when Kagi or somebody else makes a more complete index. Google fulltext search inside books is of fine quality as it is, usually surfacing good answers.

                              The buttons under each result would use the ISBN to find the book in different services:

                              • Google Books for their snippets or full books
                              • Amazon for reviews and purchase
                              • Anna's Archive for metadata and full books (AA includes the WorldCat index for finding libraries with the books)
                              • Goodreads for reviews and social reading

                              In my suggestion, book titles that are in the Standard Ebooks project are specially highlighted. This open source project collects and offers public domain books for download, in the process formatting the books and fixing typography so that the books are of the highest quality for e-readers. The books from this project could be a good starting point for Kagi to experiment with indexing books, since they are free and public domain.

                              So what do you say, fellas?

                              kagi-book-search.pdf
                              321kB