Documents fingerprint
extraction and compression

In order to extract from a document a "fingerprint" that identify it as arguments content we must build an histogram of the content for any word of the dictionary: the fingerprint of the document will be a vector with the dimension of the dictionary. This process will be quickly performed reading any word of the document and identifying its byte-vector with an Identify() operation on the ZISC loaded with the synaptic memory representing the dictionary. This procedure is shown in fig.2.
 
 


fig.2

Now we have a N-dimensional vector for any document of the collection, where N is the dimension of the dictionary and is really too large to be used for any successive identification (N =~ 10000). It is required a compression of the vector which doesn't affect similitude between couples of vectors, or more precisely, which doesn't affects the euclidian distances between any possible couple of vectors.
In the WEBSOM work(Kohonen) a compression based on vertically normally distributed random bits matrix was used in order to compress the dictionary vector(see fig.3).
We want use the power of ZISC to perform the compression task. We have  10000 components vectors and we need to have equivalent(*) <=64 components vectors(**). The compression is performed using a matrix (with 64 randomly distributed bits "1" in each row) that builds a 64 components vector as result of an AND operation with the original vector: this vector is then recognized by ZISC previously trained with fixed random patterns and the matching class is one component of the new compressed vector. This step must be repeated with different boolean vectors for any component of the new compressed vector (fig.4/5). This compression is more reliable than the first one:
one of the reasons can be intuitive as explained in fig.6/7.
 
 

IN 0
IN 1
IN 2
IN 3
IN 4
IN 5
IN 6
IN 7
IN 8
IN M
OUT 0
0
1
0
0
0
1
0
0
1
0
OUT 1
1
0
0
0
0
1
1
0
0
0
OUT 2
0
0
1
0
1
0
0
1
0
0
OUT 3
1
0
0
0
1
0
0
0
1
0
OUT N
0
1
0
1
0
0
0
0
0
1

fig.3 RANDOM BITS MATRIX MULTIPLICATION
(FIXED NUMBER OF "1" IN EACH ROW)

compression m-n: out[n] = sum(0-M){in[m] * BIT[n][m]}



 
 
 

IN 0
IN 1
IN 2
IN 3
IN 4
IN 5
IN 6
IN 7
IN 8
IN M
OUT 0
0
1
0
0
0
1
0
0
1
0
OUT 1
1
0
0
0
0
1
1
0
0
0
OUT 2
0
0
1
0
1
0
0
1
0
0
OUT 3
1
0
0
0
1
0
0
0
1
0
OUT N
0
1
0
1
0
0
0
0
0
1

fig.4 RANDOM BITS MATRIX RECOGNITION
(FIXED NUMBER(=64) OF "1" IN EACH ROW)

compression m-n: out[n] = Class(Vector(in[m] & BIT[n][m]))
 
 


fig.5
 


fig.6
 


fig.7



The new vectors conserve the similitude property better than the random bits matrix processed vectors. This property is more true if the documents contain an high number of words of the dictionary: the system behaves better working on database of large documents that little ones, as well as any other text-analysis algorithm.
The test to verify the similitude-safe property is following explained:

- a 10000 components vector is chosen

- it is compared using a normal program(distance calculation(****)) with all the other vectors and all the vectors having distance < D are inserted in a list.
This process is very long and could be performed quickly using an appropriate algorithm on the ZISC but because of it is used only one time for test our choice was to use a simple program.

- the equivalent 50 components vector is quickly compared (using ZISC) with all the other vectors and the vectors having distance < D are inserted in a list.
The equivalence of the two lists should means the 100% similitude property safe, while we found a value 94%. This result is fully acceptable for our purpose. This step need some more explanations: the ZISC must be trained with MIF=MAF=D with the single target vector and after recognition using Identify(vector) is performed for any other vector. The vectors identified (distance < D from prototype) are inserted in the list.

- this test should be repeated with more target vectors and finally the average of the results should be the final result. Also after this step the average result has been in a range from 93% to 97%.
 
 

NOTES:
(*) equivalence = same normalized distance between couples of vectors: we do not use euclidian distance but the more simple sum of the differences of all the vector components, in order to have compatibility with the behavior of ZISC.

(**) the final vector should be recognized by ZISC (64 components byte wide comparison ) in one single recognition operation.

(***) these are 256 64-components patterns randomly generated.

(****) it is not really euclidian distance but the normalized sum of the distance for any component:
this is made in order to have compatibility with the distance calculated by ZISC.
 
 



Examining the output of this process

The output of this process is a list of records composed of [URL][VECTOR][MAINWORD] related to the examined documents. The URL is Uniform Resource Locator for the specified document while VECTOR and MAINWORD are respectively the fingerprint and the most used word (MUW) of the document self.

EXAMPLE OF OUTPUT TABLE:
 

URL
VECTOR 
MUW
http://domainX.dir1.docZ.html [230][234][200]...[010] transistor
http://domainY.dir1.docW.html [240][134][002]...[030] optical
http://domainZ.dir1.docY.html [130][034][005]...[050]  car
http://domainZ.dir2.docD.html [030][114][135]...[120] network
.......... .......... ..........

 
 
 
 
 

LEONARD HOME PAGE