Warning
This documentation is work-in-progress and unorganized.
These functions cut hierarchical clusterings into flat clusterings or find the roots of the forest formed by a cut by providing the flat cluster ids of each observation.
fcluster(Z, t[, criterion, depth, R, monocrit]) | Forms flat clusters from the hierarchical clustering defined by |
fclusterdata(X, t[, criterion, metric, ...]) | Cluster observation data using a given metric. |
leaders(Z, T) | (L, M) = leaders(Z, T): |
These are routines for agglomerative clustering.
linkage(y[, method, metric]) | Performs hierarchical/agglomerative clustering on the condensed distance matrix y. |
single(y) | Performs single/min/nearest linkage on the condensed distance matrix y. |
complete(y) | Performs complete complete/max/farthest point linkage on the condensed distance matrix y. |
average(y) | Performs average/UPGMA linkage on the condensed distance matrix |
weighted(y) | Performs weighted/WPGMA linkage on the condensed distance matrix |
centroid(y) | Performs centroid/UPGMC linkage. See linkage for more |
median(y) | Performs median/WPGMC linkage. See linkage for more |
ward(y) | Performs Ward’s linkage on a condensed or redundant distance |
These routines compute statistics on hierarchies.
cophenet(Z[, Y]) | Calculates the cophenetic distances between each observation in |
from_mlab_linkage(Z) | Converts a linkage matrix generated by MATLAB(TM) to a new |
inconsistent(Z[, d]) | Calculates inconsistency statistics on a linkage. |
maxinconsts(Z, R) | Returns the maximum inconsistency coefficient for each non-singleton cluster and its descendents. |
maxdists(Z) | Returns the maximum distance between any cluster for each non-singleton cluster. |
maxRstat(Z, R, i) | Returns the maximum statistic for each non-singleton cluster and its descendents. |
to_mlab_linkage(Z) | Converts a linkage matrix Z generated by the linkage function |
Routines for visualizing flat clusters.
dendrogram(Z[, p, truncate_mode, ...]) | Plots the hiearchical clustering defined by the linkage Z as a |
These are data structures and routines for representing hierarchies as tree objects.
ClusterNode(id[, left, right, dist, count]) | A tree node class for representing a cluster. |
leaves_list(Z) | Returns a list of leaf node ids (corresponding to observation vector index) as they appear in the tree from left to right. |
to_tree(Z[, rd]) | Converts a hierarchical clustering encoded in the matrix Z (by |
These are predicates for checking the validity of linkage and inconsistency matrices as well as for checking isomorphism of two flat cluster assignments.
is_valid_im(R[, warning, throw, name]) | Returns True if the inconsistency matrix passed is valid. It must |
is_valid_linkage(Z[, warning, throw, name]) | Checks the validity of a linkage matrix. |
is_isomorphic(T1, T2) | Determines if two different cluster assignments T1 and |
is_monotonic(Z) | Returns True if the linkage passed is monotonic. The linkage |
correspond(Z, Y) | Checks if a linkage matrix Z and condensed distance matrix |
num_obs_linkage(Z) | Returns the number of original observations of the linkage matrix passed. |
[Sta07] | “Statistics toolbox.” API Reference Documentation. The MathWorks. http://www.mathworks.com/access/helpdesk/help/toolbox/stats/. Accessed October 1, 2007. |
[Mti07] | “Hierarchical clustering.” API Reference Documentation. The Wolfram Research, Inc. http://reference.wolfram.com/mathematica/HierarchicalClustering/tutorial/HierarchicalClustering.html. Accessed October 1, 2007. |
[Gow69] | Gower, JC and Ross, GJS. “Minimum Spanning Trees and Single Linkage Cluster Analysis.” Applied Statistics. 18(1): pp. 54–64. 1969. |
[War63] | Ward Jr, JH. “Hierarchical grouping to optimize an objective function.” Journal of the American Statistical Association. 58(301): pp. 236–44. 1963. |
[Joh66] | Johnson, SC. “Hierarchical clustering schemes.” Psychometrika. 32(2): pp. 241–54. 1966. |
[Sne62] | Sneath, PH and Sokal, RR. “Numerical taxonomy.” Nature. 193: pp. 855–60. 1962. |
[Bat95] | Batagelj, V. “Comparing resemblance measures.” Journal of Classification. 12: pp. 73–90. 1995. |
[Sok58] | Sokal, RR and Michener, CD. “A statistical method for evaluating systematic relationships.” Scientific Bulletins. 38(22): pp. 1409–38. 1958. |
[Ede79] | Edelbrock, C. “Mixture model tests of hierarchical clustering algorithms: the problem of classifying everybody.” Multivariate Behavioral Research. 14: pp. 367–84. 1979. |
[Jai88] | Jain, A., and Dubes, R., “Algorithms for Clustering Data.” Prentice-Hall. Englewood Cliffs, NJ. 1988. |
[Fis36] | Fisher, RA “The use of multiple measurements in taxonomic problems.” Annals of Eugenics, 7(2): 179-188. 1936 |
Copyright (C) Damian Eads, 2007-2008. New BSD License.
A tree node class for representing a cluster. Leaf nodes correspond to original observations, while non-leaf nodes correspond to non-singleton clusters.
The to_tree function converts a matrix returned by the linkage function into an easy-to-use tree representation.
Seealso : |
|
---|
Methods
get_count() | The number of leaf nodes (original observations) belonging to the cluster node nd. |
get_id() | The identifier of the target node. |
get_left() | Returns a reference to the left child tree object. |
get_right() | Returns a reference to the right child tree object. |
is_leaf() | Returns True iff the target node is a leaf. |
pre_order([func]) | Performs preorder traversal without recursive function calls. |
The number of leaf nodes (original observations) belonging to the cluster node nd. If the target node is a leaf, 1 is returned.
Returns : |
|
---|
The identifier of the target node. For 0 \leq i < n, i corresponds to original observation i. For n \leq i < 2n-1, i corresponds to non-singleton cluster formed at iteration i-n.
Returns : |
|
---|
Returns a reference to the left child tree object. If the node is a leaf, None is returned.
Returns : |
|
---|
Returns a reference to the right child tree object. If the node is a leaf, None is returned.
Returns : |
|
---|
Returns True iff the target node is a leaf.
Returns : |
|
---|
Performs preorder traversal without recursive function calls. When a leaf node is first encountered, func is called with the leaf node as its argument, and its result is appended to the list.
For example, the statement:
ids = root.pre_order(lambda x: x.id)
returns a list of the node ids corresponding to the leaf nodes of the tree as they appear from left to right.
Parameters : |
|
---|---|
Returns : |
|
Performs average/UPGMA linkage on the condensed distance matrix y. See linkage for more information on the return structure and algorithm.
Parameters : |
|
---|---|
Returns : |
|
Seealso : |
|
Performs centroid/UPGMC linkage. See linkage for more information on the return structure and algorithm.
The following are common calling conventions:
Z = centroid(y)
Performs centroid/UPGMC linkage on the condensed distance matrix y. See linkage for more information on the return structure and algorithm.
Z = centroid(X)
Performs centroid/UPGMC linkage on the observation matrix X using Euclidean distance as the distance metric. See linkage for more information on the return structure and algorithm.
Parameters : |
|
---|---|
Returns : |
|
Seealso : |
|
Performs complete complete/max/farthest point linkage on the condensed distance matrix y. See linkage for more information on the return structure and algorithm.
Parameters : |
|
---|---|
Returns : |
|
Calculates the cophenetic distances between each observation in the hierarchical clustering defined by the linkage Z.
Suppose p and q are original observations in disjoint clusters s and t, respectively and s and t are joined by a direct parent cluster u. The cophenetic distance between observations i and j is simply the distance between clusters s and t.
Parameters : |
|
---|---|
Returns : | (c, {d}) - c : ndarray
|
Checks if a linkage matrix Z and condensed distance matrix Y could possibly correspond to one another.
They must have the same number of original observations for the check to succeed.
This function is useful as a sanity check in algorithms that make extensive use of linkage and distance matrices that must correspond to the same set of original observations.
Arguments : |
|
---|---|
Returns : |
|
dendrogram. The dendrogram illustrates how each cluster is composed by drawing a U-shaped link between a non-singleton cluster and its children. The height of the top of the U-link is the distance between its children clusters. It is also the cophenetic distance between original observations in the two children clusters. It is expected that the distances in Z[:,2] be monotonic, otherwise crossings appear in the dendrogram.
Arguments: |
|
---|
color_threshold : double For brevity, let t be the color_threshold. Colors all the descendent links below a cluster node k the same color if k is the first node below the cut threshold t. All links connecting nodes with distances greater than or equal to the threshold are colored blue. If t is less than or equal to zero, all nodes are colored blue. If color_threshold is None or ‘default’, corresponding with MATLAB(TM) behavior, the threshold is set to 0.7*max(Z[:,2]).
get_leaves : bool Includes a list R['leaves']=H in the result dictionary. For each i, H[i] == j, cluster node j appears in the i th position in the left-to-right traversal of the leaves, where j < 2n-1 and i < n.
orientation : string The direction to plot the dendrogram, which can be any of the following strings
- ‘top’: plots the root at the top, and plot descendent
links going downwards. (default).
- ‘bottom’: plots the root at the bottom, and plot descendent
links going upwards.
- ‘left’: plots the root at the left, and plot descendent
links going right.
- ‘right’: plots the root at the right, and plot descendent
links going left.
labels : ndarray By default labels is None so the index of the original observation is used to label the leaf nodes.
Otherwise, this is an n -sized list (or tuple). The labels[i] value is the text to put under the i th leaf node only if it corresponds to an original observation and not a non-singleton cluster.
count_sort : string/bool For each node n, the order (visually, from left-to-right) n’s two descendent links are plotted is determined by this parameter, which can be any of the following values:
- False: nothing is done.
- ‘ascending’/True: the child with the minimum number of
original objects in its cluster is plotted first.
- ‘descendent’: the child with the maximum number of
original objects in its cluster is plotted first.
Note distance_sort and count_sort cannot both be True.
distance_sort : string/bool For each node n, the order (visually, from left-to-right) n’s two descendent links are plotted is determined by this parameter, which can be any of the following values:
- False: nothing is done.
- ‘ascending’/True: the child with the minimum distance
between its direct descendents is plotted first.
- ‘descending’: the child with the maximum distance
between its direct descendents is plotted first.
Note distance_sort and count_sort cannot both be True.
show_leaf_counts : bool
When True, leaf nodes representing k>1 original observation are labeled with the number of observations they contain in parentheses.
no_plot : bool When True, the final rendering is not performed. This is useful if only the data structures computed for the rendering are needed or if matplotlib is not available.
no_labels : bool When True, no labels appear next to the leaf nodes in the rendering of the dendrogram.
leaf_label_rotation : double
Specifies the angle (in degrees) to rotate the leaf labels. When unspecified, the rotation based on the number of nodes in the dendrogram. (Default=0)
leaf_font_size : int Specifies the font size (in points) of the leaf labels. When unspecified, the size based on the number of nodes in the dendrogram.
leaf_label_func : lambda or function
When leaf_label_func is a callable function, for each leaf with cluster index k < 2n-1. The function is expected to return a string with the label for the leaf.
Indices k < n correspond to original observations while indices k \geq n correspond to non-singleton clusters.
For example, to label singletons with their node id and non-singletons with their id, count, and inconsistency coefficient, simply do:
# First define the leaf label function. def llf(id): if id < n: return str(id) else: return '[%d %d %1.2f]' % (id, count, R[n-id,3]) # The text for the leaf nodes is going to be big so force # a rotation of 90 degrees. dendrogram(Z, leaf_label_func=llf, leaf_rotation=90)show_contracted : bool When True the heights of non-singleton nodes contracted into a leaf node are plotted as crosses along the link connecting that leaf node. This really is only useful when truncation is used (see truncate_mode parameter).
link_color_func : lambda/function When a callable function, link_color_function is called with each non-singleton id corresponding to each U-shaped link it will paint. The function is expected to return the color to paint the link, encoded as a matplotlib color string code.
For example:
dendrogram(Z, link_color_func=lambda k: colors[k])colors the direct links below each untruncated non-singleton node k using colors[k].
Returns: |
|
---|
Forms flat clusters from the hierarchical clustering defined by the linkage matrix Z.
Parameters : | Z : ndarray
t : float
criterion : str, optional
depth : int, optional
R : ndarray, optional
monocrit : ndarray, optional
|
---|---|
Returns : | fcluster : ndarray
|
Cluster observation data using a given metric.
Clusters the original observations in the n-by-m data matrix X (n observations in m dimensions), using the euclidean distance metric to calculate distances between original observations, performs hierarchical clustering using the single linkage algorithm, and forms flat clusters using the inconsistency method with t as the cut-off threshold.
A one-dimensional array T of length n is returned. T[i] is the index of the flat cluster to which the original observation i belongs.
Parameters : | X : ndarray
t : float
criterion : str, optional
method : str, optional
metric : str, optional
t : double, optional
depth : int, optional
R : ndarray, optional
|
---|---|
Returns : | T : ndarray
|
Notes
This function is similar to the MATLAB function clusterdata.
Converts a linkage matrix generated by MATLAB(TM) to a new linkage matrix compatible with this module. The conversion does two things:
- the indices are converted from 1..N to 0..(N-1) form, and
- a fourth column Z[:,3] is added where Z[i,3] is represents the number of original observations (leaves) in the non-singleton cluster i.
This function is useful when loading in linkages from legacy data files generated by MATLAB.
Arguments : |
|
---|---|
Returns : |
|
Calculates inconsistency statistics on a linkage.
Note: This function behaves similarly to the MATLAB(TM) inconsistent function.
Parameters : |
|
---|---|
Returns : |
|
Determines if two different cluster assignments T1 and T2 are equivalent.
Arguments : |
|
---|
Returns True if the linkage passed is monotonic. The linkage is monotonic if for every cluster s and t joined, the distance between them is no less than the distance between any previously joined clusters.
Arguments : |
|
---|---|
Returns : |
|
Returns True if the inconsistency matrix passed is valid. It must be a n by 4 numpy array of doubles. The standard deviations R[:,1] must be nonnegative. The link counts R[:,2] must be positive and no greater than n-1.
Arguments : |
|
---|---|
Returns : |
|
Checks the validity of a linkage matrix. A linkage matrix is valid if it is a two dimensional nd-array (type double) with n rows and 4 columns. The first two columns must contain indices between 0 and 2n-1. For a given row i, 0 \leq \mathtt{Z[i,0]} \leq i+n-1 and 0 \leq Z[i,1] \leq i+n-1 (i.e. a cluster cannot join another cluster unless the cluster being joined has been generated.)
Arguments : |
|
---|---|
Returns : |
|
(L, M) = leaders(Z, T):
Returns the root nodes in a hierarchical clustering corresponding to a cut defined by a flat cluster assignment vector T. See the fcluster function for more information on the format of T.
For each flat cluster j of the k flat clusters represented in the n-sized flat cluster assignment vector T, this function finds the lowest cluster node i in the linkage tree Z such that:
- leaf descendents belong only to flat cluster j (i.e. T[p]==j for all p in S(i) where S(i) is the set of leaf ids of leaf nodes descendent with cluster node i)
- there does not exist a leaf that is not descendent with i that also belongs to cluster j (i.e. T[q]!=j for all q not in S(i)). If this condition is violated, T is not a valid cluster assignment vector, and an exception will be thrown.
Arguments : |
|
---|---|
Returns : | (L, M)
|
Returns a list of leaf node ids (corresponding to observation vector index) as they appear in the tree from left to right. Z is a linkage matrix.
Arguments : |
|
---|---|
Returns : |
|
condensed distance matrix y. y must be a {n \choose 2} sized vector where n is the number of original observations paired in the distance matrix. The behavior of this function is very similar to the MATLAB(TM) linkage function.
A 4 by (n-1) matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
The following linkage methods are used to compute the distance d(s, t) between two clusters s and t. The algorithm begins with a forest of clusters that have yet to be used in the hierarchy being formed. When two clusters s and t from this forest are combined into a single cluster u, s and t are removed from the forest, and u is added to the forest. When only one cluster remains in the forest, the algorithm stops, and this cluster becomes the root.
A distance matrix is maintained at each iteration. The d[i,j] entry corresponds to the distance between cluster i and j in the original forest.
At each iteration, the algorithm must update the distance matrix to reflect the distance of the newly formed cluster u with the remaining clusters in the forest.
Suppose there are |u| original observations u[0], \ldots, u[|u|-1] in cluster u and |v| original objects v[0], \ldots, v[|v|-1] in cluster v. Recall s and t are combined to form cluster u. Let v be any remaining cluster in the forest that is not u.
The following are methods for calculating the distance between the newly formed cluster u and each v.
method=’single’ assigns
d(u,v) = \min(dist(u[i],v[j]))
for all points i in cluster u and j in cluster v. This is also known as the Nearest Point Algorithm.
method=’complete’ assigns
d(u, v) = \max(dist(u[i],v[j]))
for all points i in cluster u and j in cluster v. This is also known by the Farthest Point Algorithm or Voor Hees Algorithm.
method=’average’ assigns
d(u,v) = \sum_{ij} \frac{d(u[i], v[j])} {(|u|*|v|)}
for all points i and j where |u| and |v| are the cardinalities of clusters u and v, respectively. This is also called the UPGMA algorithm. This is called UPGMA.
method=’weighted’ assigns
d(u,v) = (dist(s,v) + dist(t,v))/2
where cluster u was formed with cluster s and t and v is a remaining cluster in the forest. (also called WPGMA)
method=’centroid’ assigns
dist(s,t) = ||c_s-c_t||_2
where c_s and c_t are the centroids of clusters s and t, respectively. When two clusters s and t are combined into a new cluster u, the new centroid is computed over all the original objects in clusters s and t. The distance then becomes the Euclidean distance between the centroid of u and the centroid of a remaining cluster v in the forest. This is also known as the UPGMC algorithm.
method=’median’ assigns math:d(s,t) like the centroid method. When two clusters s and t are combined into a new cluster u, the average of centroids s and t give the new centroid u. This is also known as the WPGMC algorithm.
method=’ward’ uses the Ward variance minimization algorithm. The new entry d(u,v) is computed as follows,
d(u,v) = \sqrt{\frac{|v|+|s|} {T}d(v,s)^2 + \frac{|v|+|t|} {T}d(v,t)^2 + \frac{|v|} {T}d(s,t)^2}
where u is the newly joined cluster consisting of clusters s and t, v is an unused cluster in the forest, T=|v|+|s|+|t|, and |*| is the cardinality of its argument. This is also known as the incremental algorithm.
Warning: When the minimum distance pair in the forest is chosen, there may be two or more pairs with the same minimum distance. This implementation may chose a different minimum than the MATLAB(TM) version.
Parameters: |
|
---|
Returns : |
|
---|
Returns the maximum statistic for each non-singleton cluster and its descendents.
Arguments : |
|
---|---|
Returns : |
|
Returns the maximum distance between any cluster for each non-singleton cluster.
Arguments : |
|
---|---|
Returns : |
|
Returns the maximum inconsistency coefficient for each non-singleton cluster and its descendents.
Arguments : |
|
---|---|
Returns : |
|
Performs median/WPGMC linkage. See linkage for more information on the return structure and algorithm.
The following are common calling conventions:
Z = median(y)
Performs median/WPGMC linkage on the condensed distance matrix y. See linkage for more information on the return structure and algorithm.
Z = median(X)
Performs median/WPGMC linkage on the observation matrix X using Euclidean distance as the distance metric. See linkage for more information on the return structure and algorithm.
Parameters : |
|
---|---|
Returns : |
|
Seealso : |
|
Returns the number of original observations of the linkage matrix passed.
Arguments : |
|
---|---|
Returns : |
|
Changes the list of matplotlib color codes to use when coloring links with the dendrogram color_threshold feature.
Arguments : |
the color codes is the order in which the colors are cycled through when color thresholding in the dendrogram. |
---|
Performs single/min/nearest linkage on the condensed distance matrix y. See linkage for more information on the return structure and algorithm.
Parameters : |
|
---|---|
Returns : |
|
Seealso : |
|
Converts a linkage matrix Z generated by the linkage function of this module to a MATLAB(TM) compatible one. The return linkage matrix has the last column removed and the cluster indices are converted to 1..N indexing.
Arguments : |
|
---|---|
Returns : |
|
Converts a hierarchical clustering encoded in the matrix Z (by linkage) into an easy-to-use tree object. The reference r to the root ClusterNode object is returned.
Each ClusterNode object has a left, right, dist, id, and count attribute. The left and right attributes point to ClusterNode objects that were combined to generate the cluster. If both are None then the ClusterNode object is a leaf node, its count must be 1, and its distance is meaningless but set to 0.
Note: This function is provided for the convenience of the library user. ClusterNodes are not used as input to any of the functions in this library.
Parameters : |
|
---|---|
Returns : |
|
Performs Ward’s linkage on a condensed or redundant distance matrix. See linkage for more information on the return structure and algorithm.
The following are common calling conventions:
Parameters : |
|
---|---|
Returns : |
|
Seealso : |
|
Performs weighted/WPGMA linkage on the condensed distance matrix y. See linkage for more information on the return structure and algorithm.
Parameters : |
|
---|---|
Returns : |
|
Seealso : |
|