SIngle-LInkage CONnectivity clustering ====================================== A Python library for a fast approximation of `single-linkage clustering `_ with given eclidean distance or cosine similarity threshold. Supports both dense arrays (numpy) and sparse matrices (scipy). Also offers simple cluster visualisation with matplotlib. This module grew into a separate library strating as an data exploratory project in `CZ.NIC labs `_ to cluster captured honeypot records. Github: `CZ-NIC/silicon-clustering `_ PyPI: `silicon-clustering `_ Author: Tomáš Gavenčiak, gavento@ucw.cz .. toctree:: :maxdepth: 2 :caption: Contents: :hidden: api Usage ----- :: import silicon, numpy # use a fixed seed to get the same data rnd = numpy.random.RandomState(42) # create some data data = rnd.uniform(-1.0, 1.0 ,(1000, 3)) # create and run the clustering instance c = silicon.CosineClustering(data, rnd=rnd, sim_threshold=0.995) c.run() print(c.clusters()) import matplotlib.pyplot as plt # plot the data overview c.plot() plt.show() # plot the largest cluster c.clusters_by_size()[0].plot() plt.show() .. image:: usage-all.png :width: 51 % .. image:: usage-cluster.png :width: 100 % Algorithm --------- Given a matrix, the algorithm computes row clusters with the condition that any two rows with cosine similarity above a given threshold (resp. with euclidean distance below a threshold) belong to the same cluster. This corresponds to connected components of the graph over the rows where two rows are connected if similar (or close) enough. All the distance computation and vector operations are done by numpy and scipy to speed up the computation and save some memory. Cosine similarity +++++++++++++++++ The cosine similarity computation is transformed to an instance of the euclidean distance by normalizing the row vector lengths and computing the threshold distance: .. image:: cosine.png Euclidean distance ++++++++++++++++++ Rather than computing the distance of all row pairs, the algorithm projects the rows into *k*-dimensional grid of cubes ("cells") of side *d* - the threshold distance. The projection basis is selected as the *k* most principal PCA vectors. The projection has the property that only the rows in adjacent cells may be in distance less than *d* and we only check all pairs of these. For sparse point configurations, this is already fast enough, but with uneven point distribution and dense areas, this can still be slow. A further optimization is a nibble procedure done before dividing the points into cells: Randomly sample a point and compute the distance to all others, joining the clusters of close rows. If this hits a dense area, most of the points there are clustered in one scan. Most points clustered in this way are excluded from the all-pair computation, but some are included to allow the nibbled clusters to be extended with other (non-nibbled) clusters if appropriate. The rule is to return 10% rows per cell, at least 100 (if possible) and at most 1000. See :py:meth:`silicon.CosineClustering.compute_cell_mask` for details. Excluding most of the nibbled rows from pairwise distance computation may lead to missing some clustering links (e.g. between a point with just a single close point that has is on the border of a nibbled sphere), but our experiments have shown this to be unlikely -- in dense areas, there are usually many alternative links joining the clusters, and in sparse cells, all the rows are considered. Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search`