In order to extract from a document
a "fingerprint" that identify it as arguments content we must build an
histogram of the content for any word of the dictionary: the fingerprint
of the document will be a vector with the dimension of the dictionary.
This process will be quickly performed reading any word of the document
and identifying its byte-vector with an Identify() operation on the ZISC
loaded with the synaptic memory representing the dictionary. This procedure
is shown in fig.2.
fig.2
Now we have a N-dimensional vector
for any document of the collection, where N is the dimension of the dictionary
and is really too large to be used for any successive identification (N
=~ 10000). It is required a compression of the vector which doesn't affect
similitude between couples of vectors, or more precisely, which doesn't
affects the euclidian distances between any possible couple of vectors.
In the WEBSOM work(Kohonen) a compression
based on vertically normally distributed random bits matrix was used in
order to compress the dictionary vector(see fig.3).
We want use the power of ZISC to
perform the compression task. We have 10000 components vectors and
we need to have equivalent(*) <=64 components vectors(**). The compression
is performed using a matrix (with 64 randomly distributed bits "1" in each
row) that builds a 64 components vector as result of an AND operation with
the original vector: this vector is then recognized by ZISC previously
trained with fixed random patterns and the matching class is one component
of the new compressed vector. This step must be repeated with different
boolean vectors for any component of the new compressed vector (fig.4/5).
This compression is more reliable than the first one:
one of the reasons can be intuitive
as explained in fig.6/7.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
fig.3 RANDOM BITS MATRIX MULTIPLICATION
(FIXED NUMBER OF "1" IN EACH
ROW)
compression m-n: out[n] = sum(0-M){in[m] * BIT[n][m]}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
fig.4 RANDOM BITS MATRIX RECOGNITION
(FIXED NUMBER(=64) OF "1" IN
EACH ROW)
compression m-n: out[n] = Class(Vector(in[m]
& BIT[n][m]))
fig.5
fig.6
fig.7
The new vectors conserve the similitude
property better than the random bits matrix processed vectors. This property
is more true if the documents contain an high number of words of the dictionary:
the system behaves better working on database of large documents that little
ones, as well as any other text-analysis algorithm.
The test to verify the similitude-safe
property is following explained:
- a 10000 components vector is chosen
- it is compared using a normal program(distance
calculation(****)) with all the other vectors and all the vectors having
distance < D are inserted in a list.
This process is very long and could
be performed quickly using an appropriate algorithm on the ZISC but because
of it is used only one time for test our choice was to use a simple program.
- the equivalent 50 components vector
is quickly compared (using ZISC) with all the other vectors and the vectors
having distance < D are inserted in a list.
The equivalence of the two lists
should means the 100% similitude property safe, while we found a value
94%. This result is fully acceptable for our purpose. This step need some
more explanations: the ZISC must be trained with MIF=MAF=D with the single
target vector and after recognition using Identify(vector) is performed
for any other vector. The vectors identified (distance < D from prototype)
are inserted in the list.
- this test should be repeated with
more target vectors and finally the average of the results should be the
final result. Also after this step the average result has been in a range
from 93% to 97%.
NOTES:
(*) equivalence = same normalized
distance between couples of vectors: we do not use euclidian distance but
the more simple sum of the differences of all the vector components, in
order to have compatibility with the behavior of ZISC.
(**) the final vector should be recognized by ZISC (64 components byte wide comparison ) in one single recognition operation.
(***) these are 256 64-components patterns randomly generated.
(****) it is not really euclidian
distance but the normalized sum of the distance for any component:
this is made in order to have compatibility
with the distance calculated by ZISC.
Examining the output of this process
The output of this process is a list of records composed of [URL][VECTOR][MAINWORD] related to the examined documents. The URL is Uniform Resource Locator for the specified document while VECTOR and MAINWORD are respectively the fingerprint and the most used word (MUW) of the document self.
EXAMPLE OF OUTPUT TABLE:
|
|
|
|
| http://domainX.dir1.docZ.html | [230][234][200]...[010] | transistor |
| http://domainY.dir1.docW.html | [240][134][002]...[030] | optical |
| http://domainZ.dir1.docY.html | [130][034][005]...[050] | car |
| http://domainZ.dir2.docD.html | [030][114][135]...[120] | network |
| .......... | .......... | .......... |