K-Means Clustering and its Real Use Case in the Security Domain
K-means Clustering Algorithm
K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity.
The k-means algorithm expects tabular data, where rows represent the observations that you want to cluster, and the columns represent attributes of the observations. The n attributes in each row represent a point in n-dimensional space. The Euclidean distance between these points represents the similarity of the corresponding observations. The algorithm groups observations with similar attribute values (the points corresponding to these observations are closer together).
The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroid.
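To make this objective concrete, here is a minimal NumPy sketch (the points and centroids are made-up toy values) that assigns each point to its nearest centroid and computes the sum of squared Euclidean distances that K-Means tries to minimize:

```python
import numpy as np

# Made-up toy points and centroids, just to illustrate the objective.
points = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.2, 1.9], [8.5, 8.7]])

# Assign each point to its nearest centroid (Euclidean distance).
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)

# The K-Means objective: sum of squared distances to the assigned centroids.
objective = ((points - centroids[labels]) ** 2).sum()
print(labels, objective)
```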
How does the K-Means Clustering Algorithm Work?
The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids (these can be points other than the ones in the input dataset).
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Compute the center of gravity (mean) of the points in each cluster and place a new centroid there.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid.
Step-6: If any reassignment occurs, go back to step 4; otherwise, finish.
Step-7: The model is ready.
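These steps translate almost line-for-line into code. Below is a bare-bones NumPy sketch (the function and variable names are my own, not from any library) that picks K random centroids, assigns points to the nearest centroid, recomputes each centroid as the mean of its cluster, and stops when no assignment changes:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Bare-bones K-Means mirroring the steps above (illustrative sketch only)."""
    rng = np.random.default_rng(seed)

    # Step 2: pick K random points from the dataset as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    labels = None
    for _ in range(max_iters):
        # Steps 3 and 5: assign every point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)

        # Step 6: stop when no point changes its cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)

    return labels, centroids

# Toy usage: two well-separated blobs, K=2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```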
Let’s understand the above steps by considering the visual plots:
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:
- Let’s take the number of clusters K = 2, i.e., we will try to group this dataset into two different clusters.
- We need to choose K random points or centroids to form the clusters. These points can be either points from the dataset or any other points. Here, we select the two points shown below as the initial centroids, which are not part of our dataset. Consider the below image:
- Now we will assign each data point of the scatter plot to its closest centroid. We compute this using the Euclidean distance between each point and each centroid. To visualize the assignment, we draw the median line between the two centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let’s color them blue and yellow for clear visualization.
- Since the initial centroids were chosen at random, we refine them by choosing new centroids. To choose the new centroids, we compute the center of gravity (the mean) of the points in each cluster, which gives the new centroids shown below:
- Next, we reassign each data point to its new closest centroid. For this, we repeat the same process of drawing a median line between the centroids. The median line will look like the below image:
From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line, so these three points will be reassigned to new clusters.
Since reassignment has taken place, we go back to step 4, which is finding new centroids.
- We repeat the process by finding the center of gravity of the points in each cluster, so the new centroids will be as shown in the below image:
- With the new centroids, we again draw the median line and reassign the data points. The resulting image will be:
- We can see in the above image that no data points change sides of the line, i.e., no reassignment occurs, which means our model has converged. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
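In practice, we rarely code these iterations by hand. Assuming scikit-learn is available, the two-cluster walkthrough above can be reproduced in a few lines (the M1/M2 values here are made-up stand-ins for the scatter plot):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for the M1/M2 scatter plot above (made-up values).
X = np.array([[1.0, 1.2], [1.5, 2.0], [2.0, 1.8],
              [8.0, 8.5], [8.5, 9.0], [9.0, 8.2]])

# K = 2, exactly as in the walkthrough.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assignment of each point
print(model.cluster_centers_)  # final centroids after convergence
```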
Challenges with the K-Means Clustering Algorithm
One of the common challenges we face while working with K-Means is that the clusters come in different sizes. Let’s say we have the below points:
The leftmost and rightmost clusters are smaller than the central cluster. Now, if we apply k-means clustering on these points, the results will be something like this:
Another challenge with k-means arises when the densities of the original points are different. Let’s say these are the original points:
Here, the points in the red cluster are spread out whereas the points in the remaining clusters are closely packed together. Now, if we apply k-means on these points, we will get clusters like this:
We can see that the compact points have been assigned to a single cluster, whereas the points that were loosely spread out but belonged to the same original cluster have been split across different clusters. This is not ideal, so what can we do about it?
One of the solutions is to use a higher number of clusters. So, in all the above scenarios, instead of using 3 clusters, we can choose a bigger number; perhaps setting k = 10 would lead to more meaningful clusters. One simple way to compare candidate values of k is shown in the sketch below.
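One rough way to compare candidate values of k (this elbow-style check is my addition, not part of the original walkthrough) is to run K-Means for several values of k and watch how the objective, exposed as `inertia_` in scikit-learn, drops:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with uneven cluster sizes, loosely matching the scenario above.
X, _ = make_blobs(n_samples=[50, 300, 50], random_state=0)

# Try a range of k values and watch how the K-Means objective (inertia) falls;
# the "elbow" where the drop flattens out is a reasonable choice of k.
for k in range(2, 11):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k={k:2d}  inertia={inertia:.1f}")
```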
Remember how we randomly initialize the centroids in k-means clustering? Well, this is also potentially problematic because we might get different clusters every time. So, to solve this problem of random initialization, there is an algorithm called K-Means++ that can be used to choose the initial values, or the initial cluster centroids, for K-Means. It specifies a procedure to initialize the cluster centers before moving forward with the standard k-means clustering algorithm.
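In scikit-learn, k-means++ is selected through the `init` parameter; the comparison below against plain random initialization is only an illustrative sketch on synthetic blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three blobs (toy data only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# init="k-means++" spreads the initial centroids out instead of picking them
# purely at random; n_init repeats the whole run and keeps the best result.
kmpp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X)

# For comparison: plain random initialization, run only once.
krand = KMeans(n_clusters=3, init="random", n_init=1, random_state=42).fit(X)

print("k-means++ inertia:", kmpp.inertia_)
print("random init inertia:", krand.inertia_)
```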
K-Means Clustering Use Case in Security Domain
This use case is about hotspots and the use of k-means clustering for crime pattern detection. The approach first identifies the significant attributes in the database. Unlike other approaches, it then assigns weights to the attributes in the dataset: the most important attribute (e.g., type of crime) is given the highest priority (weight) compared to the other attributes, and this feature of the research paper is used in our project. Records with missing values are kept aside as test cases.
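One simple way to realize this attribute-weighting idea (a sketch under my own assumptions, not the exact method of the paper) is to standardize the columns and then scale each one by its weight before clustering, so that heavily weighted attributes dominate the Euclidean distance. The columns, codes, and weights below are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical crime records: [crime_type_code, hour_of_day, victim_age]
# (crime type is naively coded as a number purely for this toy example).
X = np.array([[1, 23, 34], [1, 22, 40], [3, 10, 25],
              [3, 11, 30], [2, 2, 51], [2, 3, 47]], dtype=float)

# Give the most important attribute (crime type) the largest weight.
weights = np.array([3.0, 1.0, 0.5])

# Standardize each column, then scale it by its weight so that high-weight
# attributes contribute more to the Euclidean distance used by K-Means.
X_weighted = StandardScaler().fit_transform(X) * weights

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_weighted)
print(labels)
```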
The objective was the development of a spatiotemporal crime prediction model based on geographical information systems coupled with spatial statistical methods. Here, clustering analyses are used to identify hot spots. Cluster analysis aims to collect data into groups using several algorithms: K-means, nearest-neighbour hierarchical (Nnh), spatiotemporal analysis of crime (STAC), fuzzy, ISODATA, and geographical analysis machine (GAM) clustering techniques. The clusters produced by STAC include more homogeneous areas than the other methods. STAC is not restricted to including all the observations, hence it is able to indicate denser crime areas than the other methods. This is important in crime prevention for allocating resources effectively.
Here, "cluster" has a special meaning: it refers to a particular group of crimes, i.e., a large number of crimes in a given region. These clusters can be represented using a geo plot of the crimes drawn on a map of the police jurisdiction.
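A hotspot geo plot of this kind can be sketched by clustering incident coordinates and color-coding each cluster; the latitude/longitude values below are invented placeholders, not real crime data, and the plot is a plain scatter rather than an actual jurisdiction map:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Invented incident coordinates (latitude, longitude) around three hotspots.
rng = np.random.default_rng(0)
coords = np.vstack([
    rng.normal([28.61, 77.21], 0.02, (40, 2)),
    rng.normal([28.70, 77.10], 0.02, (40, 2)),
    rng.normal([28.55, 77.30], 0.02, (40, 2)),
])

# Cluster the incidents; each cluster is treated as one hotspot.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coords)

# Color-code the incidents by cluster and mark the hotspot centroids.
plt.scatter(coords[:, 1], coords[:, 0], c=model.labels_, s=10)
plt.scatter(model.cluster_centers_[:, 1], model.cluster_centers_[:, 0],
            marker="x", c="red", s=100)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Crime hotspots (toy data)")
plt.show()
```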
A. Crime Data Mining for Indian Police Information System
This is all about India’s crime analysis system. It gives ways to enhance the currently existing system in the Indian Police System, called the Crime Criminal Information System (CCIS). It suggests dividing the database according to the respective states, using classification, to make the data easier to analyze. In our project, we subdivide the data into different types of crime, allowing the user to get information about those crimes easily (e.g., the percentage of a particular crime in a particular year, or the hotspot of that particular crime).
B. An Optimal KD Model for Crime Pattern Detection Based on Semantic Link Analysis-A Data Mining Tool
The system finds the critical path of serial killers who strike again and again, and determines links between the locations of their past crimes, their travel records, background history, etc. These findings increase the chance of catching such repeat offenders. The idea is to integrate information from various crime incidents and multiple sources, and to discover regular patterns in the structure, organization, operation, and information of criminal databases. If a particular criminal follows a pattern of paths to commit consecutive crimes, then this serial killer’s next crime location can be predicted from the observed pattern. For example, in DHOOM 2, the last crime location of the criminal played by HRITHIK ROSHAN was predicted from the pattern formed by his previous crime locations.
C. Evolving Data Mining Algorithms on the Prevailing Crime Trend — An Intelligent Crime Prediction Model, by A. Malathi and Dr. S. Santosh Baboo
The crime data is divided by days of the week to observe the spatiotemporal distribution of crime. A classification algorithm is then applied to the clustered results to predict future crime patterns. This enables us to build a model that predicts the next records using the previous year’s data, as sketched below.
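Here is a minimal cluster-then-classify sketch under assumed, hypothetical features (day of week, hour of day, area code): K-Means groups last year's records into patterns, and a decision tree then predicts the pattern of a new record. None of the data or feature names come from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Hypothetical historical records: [day_of_week, hour_of_day, area_code]
rng = np.random.default_rng(0)
history = np.column_stack([
    rng.integers(0, 7, 300),    # day of week (0-6)
    rng.integers(0, 24, 300),   # hour of day
    rng.integers(0, 10, 300),   # area code
]).astype(float)

# Step 1: cluster last year's records into crime patterns with K-Means.
patterns = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(history)

# Step 2: train a classifier to predict the pattern of an unseen record.
clf = DecisionTreeClassifier(random_state=0).fit(history, patterns)

# Predict the likely crime pattern for a new (hypothetical) incident.
new_record = np.array([[5, 22, 3]], dtype=float)
print(clf.predict(new_record))
```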
RESULTS OF CRIME PATTERN ANALYSIS
The different clusters, or crime patterns, are color-coded. For each group, the legend provides the total number of crime incidents included in the group, along with the significant attributes that characterize the group.
Thank you for reading…..
This task has been done under the mentorship of Mr. Vimal Daga Sir in the Summer Program 2021.
#worldrecordholder #training #internship #makingindiafutureready #summer #summertraining #python #machinelearning #docker #rightmentor #deepknowledge #linuxworld #vimaldaga #righteducation