Clustering in Data Mining
- Clustering is the process of grouping a set of abstract objects into classes of similar objects. A cluster of data objects can be treated as one group.
- In cluster analysis, we first partition the data set into groups based on data similarity and then assign labels to the groups.
Data Mining Cluster Analysis
Let's understand this with an example. Suppose we are a marketing manager with a tempting new product to sell. We are sure that the product will bring enormous profit, as long as it is sold to the right people. So, how can we tell who in our company's huge customer base is best suited for the product?
- A good clustering algorithm aims for:
- High intra-cluster similarity: the data objects inside a cluster are similar to one another.
- Low inter-cluster similarity: the data objects in one cluster are dissimilar to the objects in other clusters.
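The two aims above can be made concrete by comparing average distances, where a small distance stands in for high similarity. The following is a minimal sketch on hypothetical toy data; the function names and the choice of Euclidean distance are assumptions for illustration, not part of any particular clustering library.

```python
from itertools import combinations
from math import dist


def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [dist(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(pairs) / len(pairs)


def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [dist(a, b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1 for b in c2]
    return sum(pairs) / len(pairs)


# Hypothetical toy data: two groups of 2-D points.
clusters = [
    [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
]

intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)

# A good clustering keeps points in the same cluster close together
# and points in different clusters far apart.
print(intra < inter)
```

For a good clustering the intra-cluster average should be clearly smaller than the inter-cluster average, as it is for this toy data.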
What is a Cluster?
- A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object outside it.
- A connected region of a multidimensional space with a comparatively high density of objects.
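The first, distance-based definition above can be checked directly on small data. The sketch below assumes Euclidean distance and a hypothetical helper name, `is_well_separated`; it compares the largest distance inside the candidate cluster with the smallest distance from the cluster to any outside object.

```python
from math import dist


def is_well_separated(cluster, others):
    """True if every pair inside `cluster` is closer than any inside/outside pair."""
    max_inside = max(dist(a, b) for a in cluster for b in cluster)
    min_outside = min(dist(a, b) for a in cluster for b in others)
    return max_inside < min_outside


# Hypothetical toy data.
tight_group = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
far_points = [(5.0, 5.0), (6.0, 5.0)]
print(is_well_separated(tight_group, far_points))   # True

# Here the outside point is closer to (4, 0) than (4, 0) is to (0, 0).
stretched_group = [(0.0, 0.0), (4.0, 0.0)]
near_point = [(5.0, 0.0)]
print(is_well_separated(stretched_group, near_point))   # False
```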
What is Clustering in Data Mining?
- The method of grouping a set of abstract objects into classes of similar objects.
- The method of partitioning a set of data or objects into meaningful subclasses called clusters.
- The data objects of a cluster can be treated as one group.
Applications of Cluster Analysis
- It helps classify documents on the web for information discovery.
- Cluster analysis is used in data analysis, market research, pattern recognition, and image processing.
- It can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.
- It is also used in outlier detection applications, such as detecting credit card fraud.
- It helps marketers discover distinct groups in their customer base and characterize them by purchasing patterns.
Why is clustering used in Data Mining?
- An advanced algorithm may give the best results on one type of data set but fail or perform poorly on other kinds of data. A clustering method for data mining should therefore meet the requirements below.
Scalability
- Scalability in clustering implies that as we increase the number of data objects, the time to perform clustering should grow approximately in line with the complexity order of the algorithm.
- For example, K-means clustering takes time roughly linear in n, the number of data objects, when the number of clusters and the number of iterations are held fixed. If we increase the number of data objects tenfold, the time taken to cluster them should also increase approximately ten times. If the growth is much worse than linear, the implementation is likely doing unnecessary work.
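The proportional growth described above can be checked without a stopwatch by counting distance computations. The sketch below is an assumption-laden illustration, not a full K-means: it runs only one assignment step with a fixed number of centroids (k = 3), on hypothetical generated points, and counts how many distances are computed.

```python
from math import dist


def assignment_step(points, centroids):
    """Assign each point to its nearest centroid and count distance computations."""
    labels, n_distances = [], 0
    for p in points:
        best, best_d = 0, float("inf")
        for i, c in enumerate(centroids):
            d = dist(p, c)
            n_distances += 1
            if d < best_d:
                best, best_d = i, d
        labels.append(best)
    return labels, n_distances


# Hypothetical setup: k = 3 fixed centroids, two data sets differing 10x in size.
centroids = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
small = [(float(i % 25), float(i % 7)) for i in range(1_000)]
large = [(float(i % 25), float(i % 7)) for i in range(10_000)]

_, cost_small = assignment_step(small, centroids)
_, cost_large = assignment_step(large, centroids)

print(cost_small, cost_large, cost_large / cost_small)   # 3000 30000 10.0
```

Each point is compared with every centroid exactly once, so the work per step is n times k; with k fixed, ten times more points means ten times more distance computations.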
Interpretability
- The outcomes of clustering should be interpretable, comprehensible, and usable.
Discovery of clusters with arbitrary shape
- The algorithm should be able to find clusters of arbitrary shape. It should not be limited to distance measures that tend to find only small spherical clusters.
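To illustrate a non-spherical case, the sketch below builds two concentric rings and groups points by linking any two points that lie within a small radius of each other. This neighborhood-linking approach is a deliberately simplified stand-in for density-based methods (it has no noise handling), and the names `ring`, `link_clusters`, and the value of `eps` are assumptions chosen for this toy data.

```python
from math import cos, sin, pi, dist


def ring(radius, n):
    """n evenly spaced points on a circle of the given radius around the origin."""
    return [(radius * cos(2 * pi * i / n), radius * sin(2 * pi * i / n))
            for i in range(n)]


def link_clusters(points, eps):
    """Give the same label to points connected by a chain of hops of length <= eps."""
    labels = [None] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j, q in enumerate(points):
                if labels[j] is None and dist(points[i], q) <= eps:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


inner = ring(1.0, 20)   # neighboring points are about 0.31 apart
outer = ring(5.0, 60)   # neighboring points are about 0.52 apart
labels = link_clusters(inner + outer, eps=0.8)

print(set(labels[:20]), set(labels[20:]))   # {0} {1}
```

Both rings share the same center, so assigning points to the nearest centroid could not tell them apart; linking nearby points recovers each ring as its own cluster because the gap between the rings (4.0) is larger than `eps`.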
Ability to deal with different types of attributes
- The algorithm should be applicable to any kind of data, such as interval-based (numeric), binary, and categorical data.
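One common way to compare records with mixed attribute types is to compute a per-attribute dissimilarity between 0 and 1 and average the results, in the spirit of Gower's distance. The sketch below is a simplified illustration under that assumption; the record fields (`age`, `owns_card`, `region`) and the assumed age range are hypothetical.

```python
def mixed_distance(a, b, numeric_ranges):
    """Average per-attribute dissimilarity in [0, 1] for mixed-type records.

    Numeric attributes: absolute difference divided by the attribute's range.
    Binary and categorical attributes: 0 if equal, 1 otherwise.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


# Hypothetical customer records with numeric, binary, and categorical fields.
alice = {"age": 30, "owns_card": True, "region": "north"}
bob = {"age": 40, "owns_card": True, "region": "south"}
ranges = {"age": 50}   # assumed (max - min) of age across the data set

d = mixed_distance(alice, bob, ranges)
print(round(d, 3))
```

Here the age difference contributes 10/50, the matching binary flag contributes 0, and the differing region contributes 1, so the average is about 0.4.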
Ability to deal with Noisy Data
- Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such data and may produce poor-quality clusters.
High Dimensionality
- The algorithm should be able to handle not only low-dimensional data but also high-dimensional data spaces.