Internet Search Engines





The chapters "Dictionary building process" and "Documents fingerprint extraction and compression" describe a process that is reliable only when applied to very large documents, such as books, manuals, or detailed technical documentation. A fingerprint built and compressed with that process is not meaningful when the document is small. Most documents on the Internet are small HTML documents, so that process cannot be used to build a new generation of Internet search engines. We need an alternative way to assign a context fingerprint to documents, one that is reliable for small documents as well as large ones.

The choice has been to fix the granularity of context to 64 categories, in order to be compatible with the maximum dimension of the ZISC vector, and to decide in advance the meaning (context or argument) of each category. The fingerprint of a document is now a 64-element vector, computed by counting the words in the document that belong to each of the 64 specialized vocabularies. This process is described in fig.1 and fig.2: the output is exactly the same as that obtained with the previous process. The example output table in the chapter "Documents fingerprint extraction and compression" is also the output obtained with the new process. All subsequent processes are exactly the same as those described in this document. The 3-D space in which we navigate still embeds a 64-dimensional clustering, as in the previous process, but the meaning of each embedded dimension is now clearer.

The main objection to this approach could be the limited granularity of the context/argument clustering. In practice, however, 64 arguments are enough to build a useful space clustering, even when applied to a global argument space without any kind of constraint, as the Internet is. Nevertheless, specialized Internet search engines (for example, ones dedicated to technical documentation) can improve the reliability and usefulness of this approach.
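The counting step described above can be illustrated with a minimal Python sketch. It is not the original implementation: the function names (build_word_index, fingerprint), the simple tokenization rule, and the two tiny sample vocabularies are all assumptions made for illustration. The only things taken from the text are the fixed size of 64 categories and the idea of counting, for each category, the document words that belong to its vocabulary.

```python
import re

NUM_CATEGORIES = 64  # fixed granularity, chosen to match the ZISC vector size


def build_word_index(vocabularies):
    """Map each word to the category indices whose vocabulary contains it."""
    if len(vocabularies) != NUM_CATEGORIES:
        raise ValueError("exactly 64 vocabularies are expected")
    index = {}
    for category, words in enumerate(vocabularies):
        for word in words:
            index.setdefault(word.lower(), []).append(category)
    return index


def fingerprint(text, word_index):
    """Return a 64-element list: element i counts the words of category i."""
    vector = [0] * NUM_CATEGORIES
    # Assumed tokenization: lowercase runs of letters and digits.
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        for category in word_index.get(token, ()):
            vector[category] += 1
    return vector


# Hypothetical vocabularies: only two of the 64 categories are populated here.
vocabularies = [set() for _ in range(NUM_CATEGORIES)]
vocabularies[0] = {"neuron", "network", "vector"}
vocabularies[1] = {"search", "engine", "index"}

word_index = build_word_index(vocabularies)
fp = fingerprint("A search engine can index a vector; the search is fast.", word_index)
print(fp[:4])
```

Because the vector length never depends on the document, a short HTML page and a long manual both produce a fingerprint of the same shape. A real system would also need to strip HTML markup before counting and might scale the counts to the value range accepted by the hardware; neither step is shown here.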

fig.1

fig.2
 
