Level_0 clustering
Given a table with records of the form
Document_path|Vector|Word, we must group the records into clusters based on
the distance between their vectors (which we have sometimes called
"document fingerprints").
We want to perform this process as
fast as possible, using the recognition power of a hardware implementation
of a Radial Basis Function network.
The ZISC neural chip (Zero
Instruction Set Computer) was chosen for its easy-to-understand behavior
and programming interface and, above all, for its "unlimited" expandability
at no cost in performance.
The clustering process is a
hybrid situation in which aspects of supervised and unsupervised learning
work together. Learning must be performed without knowing a priori the
classes associated with the patterns, but at the same time a class must be
associated with every cluster. Our choice is to use as the class a serial
number linked with the commitment of a new prototype neuron in the Radial
Basis Function neural network. MIF (Minimum Influence Field) and MAF (Maximum
Influence Field) should be selected according to the relations:
MIF = f ( DB_SIZE, NN_MAX_SIZE )
MAF = f ( DB_SIZE, NN_MAX_SIZE )
When a new prototype neuron is committed
because its pattern does not fall within the influence field of any
existing prototype, it is given MAF directly as its influence field.
The MIF threshold becomes meaningful when
the number of documents inside a specific cluster exceeds a fixed
threshold MCS (Maximum Cluster Size), which corresponds to the measure of
clustering roughness.
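The commitment rule described above can be sketched in software as follows. This is only a minimal illustration, not the ZISC programming interface: the names Prototype, Level0Clusterer and learn are hypothetical, the L1 (Manhattan) distance is an assumption, and the reduction of influence fields toward MIF when a cluster exceeds MCS is left out.

```python
from dataclasses import dataclass


@dataclass
class Prototype:
    center: list       # the pattern stored when the neuron was committed
    influence: int     # influence field (radius) of the neuron
    cluster_id: int    # serial number assigned at commitment time


def l1_distance(a, b):
    # Manhattan distance; assumed here, the text does not name the norm.
    return sum(abs(x - y) for x, y in zip(a, b))


class Level0Clusterer:
    """Software stand-in for the RBF network (hypothetical helper)."""

    def __init__(self, maf):
        self.maf = maf
        self.prototypes = []

    def learn(self, vector):
        """Return the cluster number for vector, committing a neuron if needed."""
        best = None
        for proto in self.prototypes:
            d = l1_distance(vector, proto.center)
            # The prototype fires only if the pattern is inside its field.
            if d < proto.influence and (best is None or d < best[0]):
                best = (d, proto)
        if best is not None:
            return best[1].cluster_id
        # No prototype fired: commit a new neuron with MAF as influence
        # field and the next serial number as its class.
        proto = Prototype(list(vector), self.maf, len(self.prototypes))
        self.prototypes.append(proto)
        return proto.cluster_id
```

With maf=10, for instance, the patterns [0, 0, 0] and [1, 2, 3] fall in the same cluster (distance 6), while [50, 50, 50] commits a second neuron.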
When a pattern is learned, both the
URL (Uniform Resource Locator) of the document and its most used word (MUW)
are stored in a database record whose key is simply the number
of the cluster. Note that this is a one-to-many relation, because
a cluster can contain many URLs with their associated MUWs.
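The one-to-many relation can be pictured with a small in-memory sketch. The function store_document and the sample URLs and words are hypothetical and only illustrate the record layout; the actual database engine is not specified here.

```python
from collections import defaultdict

# Stand-in for DB_CLUSTER_0: cluster number -> list of (URL, MUW) pairs.
db_cluster_0 = defaultdict(list)


def store_document(db, cluster_id, url, muw):
    """Append one (URL, MUW) pair under the key of its cluster."""
    db[cluster_id].append((url, muw))


# Hypothetical documents that were learned into the same cluster.
store_document(db_cluster_0, 23403, "http://example.org/a.html", "neural")
store_document(db_cluster_0, 23403, "http://example.org/b.html", "chip")

for url, muw in db_cluster_0[23403]:
    print(23403, url, muw)
```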
TABLE 1: example of records for the key CLUSTER = 23403 in DB_CLUSTER_0
Clustering process from documents to level_0 clusters
Level_N clustering
The upper clustering levels are needed
to provide a hierarchical structure for navigating the database.
The number of clustering levels
is a function of the database size and of the degree of roughness that best
supports meaningful navigation. Starting from Level_1, a cluster of Level_n
does not contain URLs but the numbers of Level_(n-1) clusters. A MUW list
is associated with each Level_(n-1) cluster.
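As an illustration of how such a hierarchy could be navigated, the sketch below walks from a Level_n cluster down to the URLs stored at Level_0. The layout of the levels list, the function collect_urls and the sample data are assumptions made for this example only.

```python
def collect_urls(levels, level, cluster_id):
    """Return all URLs reachable from one cluster of the given level.

    levels[0] maps a cluster number to a list of (URL, MUW) pairs;
    levels[n], n > 0, maps a cluster number to a list of
    (Level_(n-1) cluster number, MUW list) pairs.
    """
    if level == 0:
        return [url for url, _muw in levels[0].get(cluster_id, [])]
    urls = []
    for child_id, _muw_list in levels[level].get(cluster_id, []):
        urls.extend(collect_urls(levels, level - 1, child_id))
    return urls


# Hypothetical two-level database.
levels = [
    {1: [("http://example.org/a.html", "neural"),
         ("http://example.org/b.html", "chip")],
     2: [("http://example.org/c.html", "cluster")]},
    {10: [(1, ["neural", "chip"]), (2, ["cluster"])]},
]

print(collect_urls(levels, 1, 10))
```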
To enable 3D navigation,
we need to add to this record some information representing the x-y-z
position of the cluster in a 3D space. We obtain it with a
recognition operation on the vector divided into 3 sub-vectors, using a ZISC
trained with predefined RANDOM patterns:
X = CLASS(k = 0 - m STEP 3) { V[k] }
Y = CLASS(k = 1 - m STEP 3) { V[k] }
Z = CLASS(k = 2 - m STEP 3) { V[k] }
The picture
at the bottom of this chapter illustrates the x-y-z calculation process.
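The x-y-z calculation can also be sketched in software. This is a rough emulation under stated assumptions, not the ZISC interface: recognition is replaced by a nearest-prototype search with L1 distance, the same set of random prototypes is used for all three axes, components are taken to be in the range 0-255, and the vector length is assumed to be a multiple of 3. All function names are hypothetical.

```python
import random


def l1_distance(a, b):
    # Manhattan distance; assumed, the text does not name the norm.
    return sum(abs(x - y) for x, y in zip(a, b))


def make_random_prototypes(count, length, seed):
    """Build the predefined RANDOM patterns; prototype i stands for class i."""
    rng = random.Random(seed)
    return [[rng.randrange(256) for _ in range(length)] for _ in range(count)]


def classify(sub_vector, prototypes):
    """Return the class (index) of the prototype closest to sub_vector."""
    return min(range(len(prototypes)),
               key=lambda i: l1_distance(sub_vector, prototypes[i]))


def xyz_position(vector, prototypes):
    """X, Y, Z = CLASS of the components taken from offset 0, 1, 2 with step 3."""
    return tuple(classify(vector[offset::3], prototypes) for offset in range(3))


# A 12-component fingerprint gives three sub-vectors of 4 components each.
prototypes = make_random_prototypes(count=16, length=4, seed=0)
fingerprint = [12, 200, 45, 7, 180, 60, 30, 220, 90, 5, 210, 75]
print(xyz_position(fingerprint, prototypes))
```

Each coordinate is a class number between 0 and count - 1, so the number of random prototypes fixes the resolution of the 3D grid.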
[120][030]...[240]
TABLE 2: example of records for the key CLUSTER = 2366 in DB_CLUSTER_n
Clustering process from level n-1 to level n clusters
X-Y-Z calculation process