Massive Clustering

This research aims to build a platform for clustering massive data sets, either in batch mode or on streaming data in real time. The work also includes a visualisation tool that represents the graph together with the categories and communities aggregated by the algorithm. An immediate application of the platform is the analysis of major social networks. We also analyse the connectivity of the network, defined either by relations and interactions between nodes or by content-driven similarity between nodes, that is, according to communities or groups of interest.

Massive content is clustered with the LSH technique (Locality Sensitive Hashing), which applies a cascade of three processes. The first component splits the data into sequences of contiguous elements (k-shingles). The second component builds an index of the k-shingles (called the signature), which is a very compact representation of the original data. Finally, LSH cuts the signature into bands and assigns elements to the same cluster when partial collisions occur between corresponding bands, that is, when at least one band matches.
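The three-stage cascade can be illustrated with a minimal Python sketch. It is not the platform's implementation: the function names, the character-level shingles, the use of MinHash to build the signature, and the parameter values (k = 5, 32 hash functions, 8 bands) are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Stage 1: the set of k-character shingles (contiguous substrings)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle, salted with the seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=32):
    """Stage 2: compact signature (assumed here to be MinHash).

    Each entry is the minimum hash of the set under one salted hash function.
    The shingle set must be non-empty (text of at least k characters).
    """
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=8):
    """Stage 3: cut signatures into bands; items whose signatures agree on
    at least one whole band land in the same bucket and become a pair."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical toy documents, used only to exercise the sketch.
docs = {
    "a": "the quick brown fox jumps",
    "b": "the quick brown fox jumps",
    "c": "zzzzzzzz yyyyyyyy xxxxxxx",
}
signatures = {name: minhash_signature(shingles(text)) for name, text in docs.items()}
pairs = lsh_candidate_pairs(signatures)
```

In this sketch, more rows per band make a collision require closer agreement, while more bands give similar items more chances to collide; the balance between the two sets the effective similarity threshold.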

Algorithms such as the Cluster Affiliation Model for Big Networks (BigCLAM) perform the clustering starting from the interactions between nodes. Like LSH, these algorithms are scalable, with complexity that is linear in the size of the data.
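The text does not describe how BigCLAM is fitted. As a hedged illustration, the sketch below follows the published BigCLAM formulation, in which each node u carries a non-negative affiliation vector F_u and an edge (u, v) has probability 1 - exp(-F_u · F_v). It computes the log-likelihood gradient for one node. The function name, the data layout (dictionaries of lists) and the clamping constant are assumptions. Caching the sum of all affiliation vectors lets the non-neighbour term be obtained by subtraction, so the cost per node depends only on its number of neighbours, which is one way the near-linear scaling arises.

```python
import math


def node_gradient(u, F, neighbors, sum_F):
    """Gradient of the BigCLAM log-likelihood with respect to F[u].

    F         : dict node -> list of non-negative community affiliations
    neighbors : dict node -> iterable of adjacent nodes
    sum_F     : component-wise sum of all vectors in F (cached by the caller)
    """
    k = len(F[u])
    grad = [0.0] * k
    nbr_sum = [0.0] * k
    for v in neighbors[u]:
        dot = sum(a * b for a, b in zip(F[u], F[v]))
        dot = max(dot, 1e-10)  # assumption: clamp to avoid division by zero
        weight = math.exp(-dot) / (1.0 - math.exp(-dot))
        for c in range(k):
            grad[c] += F[v][c] * weight  # contribution of observed edges
            nbr_sum[c] += F[v][c]
    for c in range(k):
        # Sum over non-neighbours, obtained by subtraction from the cached total.
        grad[c] -= sum_F[c] - F[u][c] - nbr_sum[c]
    return grad


# Hypothetical toy graph: three nodes, one community, a single edge 0-1.
F = {0: [1.0], 1: [1.0], 2: [1.0]}
neighbors = {0: [1], 1: [0], 2: []}
sum_F = [sum(vec[c] for vec in F.values()) for c in range(1)]
g0 = node_gradient(0, F, neighbors, sum_F)
g2 = node_gradient(2, F, neighbors, sum_F)
```

A full fit would repeat a projected gradient step (F[u] moved along the gradient, then clipped at zero) over all nodes, updating `sum_F` after each change.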

LSH is implemented on a cluster of machines using the MapReduce distributed programming model, on Hadoop and Spark. Once the clustering is complete, hundreds of thousands of nodes, together with all their network interactions, can be displayed. Thanks to the scalability of the clustering algorithm, whether LSH or BigCLAM, one can navigate the network, entering and exploring in depth the different communities and sub-communities.
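How the banding step fits the MapReduce model can be sketched in plain Python, without Hadoop or Spark. This is an in-memory simulation under stated assumptions, not the deployed job: `map_bands`, `shuffle`, `reduce_bucket` and `run_job` are hypothetical names, and the signatures are assumed to be precomputed. The mapper emits one key per band, the shuffle groups items by key, and the reducer keeps the buckets that hold more than one item.

```python
from collections import defaultdict


def map_bands(doc_id, signature, bands):
    """Map: emit ((band index, band contents), doc_id) for every band."""
    rows = len(signature) // bands
    for b in range(bands):
        yield (b, tuple(signature[b * rows:(b + 1) * rows])), doc_id


def shuffle(pairs):
    """Shuffle: group values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_bucket(key, doc_ids):
    """Reduce: a bucket with several items is a candidate cluster."""
    if len(doc_ids) > 1:
        yield tuple(sorted(doc_ids))


def run_job(signatures, bands):
    """Run map, shuffle and reduce sequentially in one process."""
    emitted = (
        kv
        for doc_id, sig in signatures.items()
        for kv in map_bands(doc_id, sig, bands)
    )
    clusters = set()
    for key, doc_ids in shuffle(emitted).items():
        clusters.update(reduce_bucket(key, doc_ids))
    return clusters


# Hypothetical precomputed signatures of length 4, cut into 2 bands.
toy_signatures = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
clusters = run_job(toy_signatures, bands=2)
```

Because each band key is processed independently, the map and reduce phases can be spread over many machines; on a real cluster the shuffle would be carried out by the framework rather than by a dictionary.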