[Corpora-List] approximate total number of published texts and what percentage has been digitized so far ...

Albretch Mueller lbrtchx at gmail.com
Sat Dec 5 12:26:19 CET 2020


https://archive.org/post/1111547/approximately-total-number-of-published-texts-and-which-percentage-has-been-so-far-digitized

I have heard statements from Google, Microsoft, and others about how much text is apparently accessible over the Internet (mostly expressed in gigabytes or petabytes).

https://www.worldwidewebsize.com/

https://searchengineland.com/googles-search-indexes-hits-130-trillion-pages-documents-263378

There are also estimates of the total number of books ever published: 129,864,880, which is not a large number at all.

Can anyone here answer the following questions, or point me to a reliable source for such information?

* is there a registry of the titles and other publication metadata for those books, per language?

* which of those books have actually been read by people, as a society, over generations?

* what number or percentage of those books has been digitized?

* of the books that have been digitized, which have been converted to searchable text?

When you try to get the text version of many of the PDF files at archive.org, you realize that the PDFs are mostly image-based and that the conversion relied on some Tesseract-like automation, so the quality of the text is not that good (to say the least).
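
To put a rough number on "not that good", here is a minimal sketch of a noise heuristic for extracted text. It is only an illustration under my own assumptions: the function name `ocr_noise_ratio` and the rule for what counts as a "clean" token are made up for this example, and it does not reflect how archive.org or Tesseract measure quality.

```python
import re
import string

# Assumption: a "clean" token is either a plain word (letters, with optional
# inner apostrophes or hyphens) or a number (digits, with optional . or ,).
_WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def ocr_noise_ratio(text):
    """Return the fraction of tokens that do not look like words or numbers."""
    # Split on whitespace and drop punctuation at the edges of each token.
    tokens = [t.strip(string.punctuation) for t in text.split()]
    # Tokens made only of punctuation become empty and are ignored here.
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    noisy = sum(
        1 for t in tokens
        if not (_WORD.fullmatch(t) or _NUMBER.fullmatch(t))
    )
    return noisy / len(tokens)


if __name__ == "__main__":
    # Hypothetical snippets, not taken from any real scan.
    print(ocr_noise_ratio("The quick brown fox."))
    print(ocr_noise_ratio("Th3 qu1ck br0wn f0x"))
```

A higher ratio suggests more OCR debris (digits inside words, stray symbols). It cannot detect misrecognized words that still look like words, so it is probably best read as a lower bound on the damage.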

thanks, lbrtchx


