Dictionary building process

The ZISC is used here to build the dictionary containing all the words that are meaningful for discriminating between topics (arguments). This procedure replaces a complex hash algorithm: the ZISC compares the current word with all the stored words in parallel, without any hashing, and the counter associated with the category of the firing neuron is then incremented. To make this possible we must set MIF=0 and MAF=0 (the minimum and maximum influence fields), because we need a perfect match between pattern and prototype (words must be equal, not merely similar).
Two counters are associated with each neuron: the first counts the number of times the word has been found across all documents, while the second counts the number of documents in which the word has been found. Based on this information we can finally decide which words are meaningful for discriminating between topics. These counters are external to the ZISC and are implemented as vectors in the program. The flow of the procedure is shown in fig. 1.
 
 


fig. 1
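
The counting procedure described above can be sketched in software. This is only an illustrative emulation, not the program used with the ZISC: a Python list stands in for the neuron memory, a linear scan stands in for the parallel comparison, and the names build_dictionary, prototypes, tcnt and dcnt are assumptions introduced for the example.

```python
def build_dictionary(documents):
    """Count words over already tokenized documents.

    Returns (prototypes, tcnt, dcnt). Index i plays the role of the
    class (neuron index) of prototypes[i].
    """
    prototypes = []  # stored words, one per committed "neuron"
    tcnt = []        # total occurrences of each word
    dcnt = []        # number of documents containing each word

    for words in documents:
        seen_in_doc = set()  # classes already counted for this document
        for word in words:
            # Exact match only: the software counterpart of MIF = MAF = 0.
            # The linear scan replaces the ZISC's parallel comparison.
            cls = next((i for i, p in enumerate(prototypes) if p == word), None)
            if cls is None:
                # No neuron fired: commit a new one with the next index.
                cls = len(prototypes)
                prototypes.append(word)
                tcnt.append(0)
                dcnt.append(0)
            tcnt[cls] += 1
            if cls not in seen_in_doc:
                dcnt[cls] += 1
                seen_in_doc.add(cls)
    return prototypes, tcnt, dcnt


# Two tiny made-up documents, already split into words.
docs = [["neural", "network", "neural"], ["network", "chip"]]
prototypes, tcnt, dcnt = build_dictionary(docs)
```

With these two documents, "neural" gets a total count of 2 but a document count of 1, while "network" gets 2 for both counters.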

Once the words have been stored in the memory of the ZISC(s), we need to optimize the dictionary by removing the words that are not meaningful enough for topic discrimination.
Two rules are used to perform this operation:

ARGUMENT_DISCRIMINATION_MEANINGFUL(WORD(CLASS)) = TRUE
if both of the following conditions hold

a)   (MIN < TCNT[CLASS] < MAX) = TRUE

EXPLANATION:
If TCNT < MIN, the word occurs too few times to be associated with a context.
If TCNT > MAX, the word is probabilistically evaluated as a common-use word and thus has no context discrimination value.

b)   (MIN < DCNT[CLASS] < MAX) = TRUE

EXPLANATION:
MIN = total_number_of_documents_analyzed / ( G * W )
The context granularity of the database is forced to G, so the mean size of one category is total_number_of_documents_analyzed / G. If the number of documents containing at least one occurrence of the word falls below the share of that mean size fixed by W, the word is considered not valid for context (category) discrimination.
MAX = total_number_of_documents_analyzed * C / G
If the number of documents containing at least one occurrence of the word exceeds the size of C categories, the word is considered not valid for context (category) discrimination.

where

TCNT = TOTAL_COUNTER
DCNT = DOCUMENT_COUNTER
CLASS = THE CLASS ASSOCIATED WITH THE SPECIFIC WORD
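
The two rules above can be sketched as a small filter. This is a minimal sketch under assumptions: the function name meaningful_flags is hypothetical, the numeric values are made up, and the TCNT bounds are passed in as parameters (tcnt_min, tcnt_max) because the text gives formulas only for the DCNT bounds.

```python
def meaningful_flags(tcnt, dcnt, n_docs, G, W, C, tcnt_min, tcnt_max):
    """Return one boolean per class: True if the word passes both rules."""
    # Bounds for rule b), taken from the formulas in the text.
    dcnt_min = n_docs / (G * W)
    dcnt_max = n_docs * C / G
    return [
        (tcnt_min < t < tcnt_max) and (dcnt_min < d < dcnt_max)
        for t, d in zip(tcnt, dcnt)
    ]


# Made-up example: 1000 documents, G = 10, W = 5, C = 3,
# which gives DCNT bounds of 20 and 300 documents.
tcnt = [5, 400, 5000]
dcnt = [3, 120, 900]
flags = meaningful_flags(tcnt, dcnt, n_docs=1000, G=10, W=5, C=3,
                         tcnt_min=10, tcnt_max=2000)
```

Here only the second word is kept: the first is too rare and the third is too common under both rules.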
 
 
 

When the boolean vector of meaningful words is complete, the ZISC memory can be optimized by removing the unused prototypes (those associated with a false flag). We can do this knowing that

CLASS = NEURON_INDEX

because, during learning, each newly committed neuron is assigned as its CLASS the sequential index of the word. We built an algorithm that shifts the ZISC memory down at any position where it is needed (from that position to the top) by the required size, correcting the associated classes. In this case we accessed the ZISC in SR (Save/Restore) mode.
Now we can save the synaptic memory of the ZISC to a file and use it to recognize words in documents.
The number of neurons used (equal to the number of valid words) is then the dimension of the vector that will be used as the "fingerprint" of a document.
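
The compaction step can be illustrated as follows. This sketch works on a plain Python list rather than on the ZISC through SR mode, and it builds a new list instead of shifting memory in place. The name compact_dictionary and the sample words are assumptions made for the example.

```python
def compact_dictionary(prototypes, flags):
    """Drop prototypes flagged False and renumber the surviving classes.

    Returns (kept, old_to_new), where old_to_new maps each surviving
    old class (neuron index) to its new index, so CLASS = NEURON_INDEX
    still holds after compaction.
    """
    kept = []
    old_to_new = {}
    for old_cls, (word, keep) in enumerate(zip(prototypes, flags)):
        if keep:
            old_to_new[old_cls] = len(kept)  # new class = new position
            kept.append(word)
    return kept, old_to_new


words = ["the", "neural", "xyzzy", "network"]
flags = [False, True, False, True]
kept, old_to_new = compact_dictionary(words, flags)
fingerprint_size = len(kept)  # dimension of the document "fingerprint"
```

In this example two prototypes survive, so the fingerprint vector would have two components.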
 
 

DICTIONARY CONTEXT LINEARITY

If the documents are already clustered by topic or source, and the clusters are logically ordered, the dictionary is likely to show some degree of context linearity. This property can significantly improve the conditions for document vector compression when the document map is built.
 
 
