The ZISC is used here to build the dictionary containing all the words
that are meaningful for argument discrimination. This procedure replaces
a complex hash algorithm: the ZISC compares the current word with all the
stored words in parallel, without any hashing, and the counter associated
with the category of the firing neuron is then incremented. To make this
possible we must set MIF=0 and MAF=0 (MInimum actual influence Field and
MAximum actual influence Field), because we need a perfect match between
pattern and prototype (the words must be equal, not merely similar).
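As a software analogy of the perfect-match requirement, the sketch below
treats each stored word as a fixed-length byte vector and accepts a match
only at distance zero. It is sequential, whereas the ZISC compares all
prototypes in parallel. The helper names (encode, l1_distance,
exact_match), the 16-byte vector size and the choice of a Manhattan (L1)
distance are assumptions made for illustration, not the ZISC interface.

```python
def encode(word, size=16):
    """Encode a word as a zero-padded byte vector (hypothetical layout)."""
    data = word.encode("ascii", errors="replace")[:size]
    return list(data) + [0] * (size - len(data))


def l1_distance(a, b):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def exact_match(pattern, prototypes):
    """Return the index of the prototype at distance 0, or None.

    Accepting only distance 0 mirrors the zero influence field:
    similar words do not fire.
    """
    for index, prototype in enumerate(prototypes):
        if l1_distance(pattern, prototype) == 0:
            return index
    return None


prototypes = [encode(w) for w in ("neuron", "vector", "memory")]
hit = exact_match(encode("vector"), prototypes)    # identical word
miss = exact_match(encode("vectors"), prototypes)  # similar, not equal
```

Here "vector" is found at index 1, while "vectors" is rejected because
it is only similar to a stored word.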
Two counters are associated with each neuron: the first counts the number
of times the word has been found across all documents, while the second
counts the number of documents in which the word has been found. Based on
this information we can finally decide which words are meaningful for
discriminating between arguments. These counters are external to the ZISC
and are implemented as vectors in the program. The procedure flow is
shown in fig. 1.
fig. 1
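The counting step described above can be sketched in software as follows.
This is a minimal illustration under stated assumptions: a Python dict
stands in for the parallel lookup performed by the ZISC, the function name
build_dictionary is hypothetical, and documents are assumed to be already
split into words.

```python
def build_dictionary(documents):
    """Learn every distinct word and maintain the two external counters.

    documents: list of documents, each a list of words.
    Returns (words, tcnt, dcnt), all indexed by class, where the class
    is the sequential index given to a word when it is first learned.
    """
    class_of = {}  # word -> class; stands in for the ZISC lookup
    words = []     # class -> word (the stored prototypes)
    tcnt = []      # class -> total occurrences over all documents
    dcnt = []      # class -> number of documents containing the word
    for document in documents:
        seen = set()  # classes already counted for this document
        for word in document:
            cls = class_of.get(word)
            if cls is None:
                # No neuron fired: commit a new one with the next index.
                cls = len(words)
                class_of[word] = cls
                words.append(word)
                tcnt.append(0)
                dcnt.append(0)
            tcnt[cls] += 1
            if cls not in seen:
                seen.add(cls)
                dcnt[cls] += 1
    return words, tcnt, dcnt


docs = [["zisc", "neuron", "zisc"], ["neuron", "map"], ["zisc"]]
words, tcnt, dcnt = build_dictionary(docs)
```

With these three toy documents, "zisc" gets a total count of 3 and a
document count of 2, because it appears twice in the first document.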
Once the words have been stored in the ZISC memory, the dictionary must
be optimized by removing the words that are not meaningful enough for
argument discrimination.
Two rules are used to perform this operation:
ARGUMENT_DISCRIMINATION_MEANINGFUL(WORD(CLASS)) = TRUE
if both of the following conditions hold:
a) (MIN < TCNT[CLASS] < MAX) = TRUE
EXPLANATION:
If TCNT < MIN the word is used too few times to be associated with a
context.
If TCNT > MAX the word is probabilistically evaluated as a common-use
word and thus has no context discrimination value.
b) (MIN < DCNT[CLASS] < MAX) = TRUE
EXPLANATION:
MIN = total_number_of_documents_analyzed / ( G * W )
The context granularity of the database is forced to G; thus, if the
number of documents containing at least one occurrence of the word is
lower than W% of the mean size of one category, the word is considered
not valid for context (category) discrimination.
MAX = total_number_of_documents_analyzed * C / G
If the number of documents containing at least one occurrence of the
word exceeds the mean size of C categories, the word is considered not
valid for context (category) discrimination.
where
TCNT = TOTAL_COUNTER
DCNT = DOCUMENT_COUNTER
CLASS = THE CLASS ASSOCIATED WITH THE SPECIFIC WORD
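The two rules can be sketched as a filter that produces one boolean flag
per class. The document bounds follow the MIN and MAX formulas exactly as
written, so W is used as a divisor. The text gives no formula for the
bounds on TCNT, so t_min and t_max below are hypothetical values chosen
only for the example, as are the function names and the sample counters.

```python
def document_bounds(total_docs, G, W, C):
    """Bounds on DCNT, computed from the formulas as written."""
    d_min = total_docs / (G * W)
    d_max = total_docs * C / G
    return d_min, d_max


def meaningful_flags(tcnt, dcnt, t_min, t_max, d_min, d_max):
    """Flag a class TRUE only if both counters lie strictly in range."""
    return [
        (t_min < t < t_max) and (d_min < d < d_max)
        for t, d in zip(tcnt, dcnt)
    ]


d_min, d_max = document_bounds(total_docs=1000, G=10, W=5, C=2)

# Hypothetical counters for three words: rare, useful, very common.
tcnt = [5, 150, 9000]
dcnt = [3, 60, 950]
flags = meaningful_flags(tcnt, dcnt, t_min=10, t_max=5000,
                         d_min=d_min, d_max=d_max)
```

With 1000 documents, G=10, W=5 and C=2 the document bounds are 20 and
200, so only the second word is kept.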
When the boolean vector of meaningful words is complete, the ZISC memory
can be optimized by removing the unused prototypes (those associated
with a false flag). This is possible because
CLASS = NEURON_INDEX
since the class assigned to each newly committed neuron during learning
is the sequential index of the word. We have built an algorithm that
shifts the ZISC memory down wherever it is needed (from that position to
the top) by the required size, correcting the associated classes. In
this case we accessed the ZISC in SR (Save/Restore) mode.
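The shift-down step can be illustrated on a plain list that stands in for
the neuron memory read and rewritten in SR mode. This is only a sketch:
the (prototype, class) tuple layout and the name compact_in_place are
assumptions, and no ZISC register access is shown.

```python
def compact_in_place(neurons, flags):
    """Remove flagged-out neurons, shifting the survivors down.

    neurons: list of (prototype, class) pairs with class == list index.
    flags:   one boolean per neuron; False marks a prototype to remove.
    Each kept neuron is rewritten with its new index as class, so
    CLASS = NEURON_INDEX still holds afterwards.
    Returns the number of neurons kept.
    """
    write = 0
    for read, keep in enumerate(flags):
        if keep:
            prototype, _old_class = neurons[read]
            neurons[write] = (prototype, write)  # corrected class
            write += 1
    del neurons[write:]  # drop the now-unused tail
    return write


neurons = [("the", 0), ("zisc", 1), ("of", 2), ("neuron", 3)]
kept = compact_in_place(neurons, [False, True, False, True])
```

After the call the list holds ("zisc", 0) and ("neuron", 1). The same
flags must be applied to the external counter vectors if they are still
needed, so that they stay aligned with the new classes.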
Now we can save the synaptic memory of the ZISC to a file and use it to
recognize words from documents.
The number of neurons used (equal to the number of valid words) is then
the dimension of the vector that will be used as the "fingerprint" of
each document.
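A possible shape for such a fingerprint is sketched below. The text fixes
only the dimension of the vector (one component per valid word); storing
the occurrence count in each component is an assumption made here, as is
the function name fingerprint. A dict again stands in for the exact-match
lookup done by the ZISC.

```python
def fingerprint(document_words, dictionary):
    """Build a document vector with one component per dictionary word.

    dictionary: list of valid words, indexed by class.
    Each component counts the occurrences of that word (assumed
    encoding); words outside the dictionary are ignored.
    """
    index = {word: i for i, word in enumerate(dictionary)}
    vector = [0] * len(dictionary)
    for word in document_words:
        i = index.get(word)
        if i is not None:
            vector[i] += 1
    return vector


dictionary = ["zisc", "neuron"]
vector = fingerprint(["zisc", "the", "zisc", "map"], dictionary)
```

In this example the document yields a two-component vector in which
"zisc" is counted twice and "neuron" does not occur.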
DICTIONARY CONTEXT LINEARITY
If the documents are already clustered by argument or source, and the
clusters are logically ordered, the dictionary will probably show some
degree of context linearity. This property can considerably improve the
conditions under which the document vectors are compressed when the
document map is built.