
Agricultural Journal

ISSN: Online 1994-4616
ISSN: Print 1816-9155

Overview of Clustering Techniques in Agriculture Data Mining

L. Jeyasimman, S.S. Baskar and L. Arockiam
Page: 222-225 | Received 21 Sep 2022, Published online: 21 Sep 2022


Abstract

Retrieving information from databases is a significant present-day issue, and meeting the demand for information to support decision making is a challenging one. Different techniques have been developed to address this problem; one of them is data clustering. In this study, data clustering methods are discussed along with two traditional approaches and their algorithms. Some applications of data clustering, such as its use in data mining and in similarity searching in medical image databases, are also discussed. The same techniques can be applied to agricultural problems.


INTRODUCTION

Clustering is one of the methods of data mining. It is the process of grouping a set of physical or abstract objects into classes of similar objects (Arabie and Hubert, 1996). This study reviews the different comprehensive clustering techniques. Clustering groups data into sets of items that are similar to one another; a cluster is characterized by greater similarity within the group than between groups. In forming clusters there is a possibility of losing fine detail, but clustering achieves simplification. Clustering is a process of unsupervised learning that yields a data concept; it is otherwise described as unsupervised learning of a hidden data concept. Though many techniques exist in data mining, the clustering techniques for large databases are reviewed here.

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data, using pattern recognition technologies as well as statistical and mathematical techniques (Tellaeche et al., 2008). It is a knowledge discovery process of extracting previously unknown, actionable information from very large databases (Tellaeche et al., 2008). Clustering is one of the 1st steps in data mining analysis. It identifies groups of related records that can be used as a starting point for exploring further relationships. This technique supports the development of population segmentation models such as demographic-based customer segmentation. Additional analyses using standard analytical and other data mining techniques can determine the characteristics of these segments with respect to some desired outcome. For example, the buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign. Similarly, a company selling a variety of products may need to know which products are selling extensively and which are lagging. If the system clusters the products with low sales, only that cluster has to be examined rather than comparing the sales figures of all products. This is how clustering facilitates the mining process.

Clustering genesis: Many references address clustering techniques; surveys include Jain et al. (1999), Fasulo (1999), Kolatch (2001), Klise and McKenna (2006) and Ghosh (2002). Clustering techniques have close relationships with other disciplines and have long been used in statistics (Arabie and Hubert, 1996). In agricultural science, data mining clustering techniques are found in grading apples before marketing (Leemans and Destain, 2004), detecting weeds in precision agriculture (Tellaeche et al., 2008) and monitoring water quality changes (Klise and McKenna, 2006).

Clustering techniques are widely used for data compression in image processing, where the approach is known as vector quantization (Gersho and Gray, 1992). This study briefly surveys clustering techniques across different areas. Clustering in data mining became routine with major developments in information retrieval and text mining (Steinbach et al., 2000; Dhillon et al., 2001). These techniques have been used in spatial database applications such as GIS and astronomical data (Ester et al., 2000), in sequence and heterogeneous data analysis (Cadez et al., 2001) and in web applications (Heer and Chi, 2001; Foss et al., 2001). References for these techniques can also be found in DNA analysis in computational biology (Ben-Dor and Yakhini, 1999).

CLUSTERING METHODS

There are many clustering methods available and each of them may give a different grouping of a dataset. The choice of a particular method will depend on the type of output desired, the known performance of a method with particular types of data, the hardware and software facilities available and the size of the dataset. In general, clustering methods may be divided into two categories based on the cluster structure they produce. The non-hierarchical methods divide a dataset of N objects into M clusters, with or without overlap.

These methods are sometimes divided into partitioning methods, in which the classes are mutually exclusive, and the less common clumping methods, in which overlap is allowed. Each object is a member of the cluster with which it is most similar; however, a threshold of similarity has to be defined. The hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains. Hierarchical methods can be further divided into agglomerative and divisive methods. In agglomerative methods, the hierarchy is built up in a series of N-1 agglomerations, or fusions of pairs of objects, beginning with the un-clustered dataset.

The less common divisive methods begin with all objects in a single cluster and, at each of N-1 steps, divide one cluster into two smaller clusters, until each object resides in its own cluster.

PARTITIONING METHODS

The partitioning methods generally result in a set of M clusters, with each object belonging to one cluster. Each cluster may be represented by a centroid or cluster representative: some sort of summary description of all the objects contained in the cluster. The precise form of this description will depend on the type of object being clustered. Where real-valued data are available, the arithmetic mean of the attribute vectors for all objects within a cluster provides an appropriate representative; alternative types of centroid may be required in other cases, e.g., a cluster of documents can be represented by a list of those keywords that occur in some minimum number of documents within the cluster. If the number of clusters is large, the centroids can be further clustered to produce a hierarchy within the dataset.
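As a minimal sketch (with made-up data), the arithmetic-mean centroid described above can be computed as follows:

```python
def centroid(vectors):
    """Return the component-wise arithmetic mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# A cluster of three real-valued attribute vectors; its representative
# is simply the mean of each attribute.
cluster = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(centroid(cluster))  # -> [3.0, 4.0]
```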

Single pass: A very simple partition method, the Single pass method creates a partitioned dataset as follows:

1. Make the 1st object the centroid for the 1st cluster.
2. For the next object, calculate its similarity S with each existing cluster centroid, using some similarity coefficient.
3. If the highest calculated S is greater than some specified threshold value, add the object to the corresponding cluster and re-determine the centroid; otherwise, use the object to initiate a new cluster.
4. If any objects remain to be clustered, return to step 2.

As its name implies, this method requires only one pass through the dataset; the time requirements are typically of order O(NlogN) for order O(logN) clusters. This makes it a very efficient clustering method for a serial processor. A disadvantage is that the resulting clusters are not independent of the order in which the objects are processed, with the 1st clusters formed usually being larger than those created later in the clustering run.
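The steps above can be sketched in a few lines of Python. This is an illustrative implementation under assumed choices (cosine similarity as the coefficient, the mean as the centroid, toy data), not the authors' code:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def single_pass(objects, threshold):
    """One pass over the data; the partition depends on presentation order."""
    clusters = []  # each cluster: {"centroid": [...], "members": [...]}
    for obj in objects:
        best, best_sim = None, -1.0
        for c in clusters:
            s = cosine_sim(obj, c["centroid"])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim > threshold:
            best["members"].append(obj)
            # Re-determine the centroid as the mean of the members.
            best["centroid"] = [sum(col) / len(best["members"])
                                for col in zip(*best["members"])]
        else:
            clusters.append({"centroid": list(obj), "members": [obj]})
    return clusters

data = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
print(len(single_pass(data, threshold=0.95)))  # -> 2
```

Reordering `data` can change which clusters form first, illustrating the order-dependence noted above.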

HIERARCHICAL AGGLOMERATIVE METHODS

The hierarchical agglomerative clustering methods are the most commonly used. A hierarchical agglomerative classification is constructed by finding the two closest objects and merging them into a cluster, then finding and merging the next two closest points, where a point is either an individual object or a cluster of objects. Individual methods are characterized by the definition used to identify the closest pair of points and by the means used to describe the new cluster when two clusters are merged. Two general approaches to implementing this algorithm, the stored matrix and stored data approaches, are discussed here. In the stored matrix approach, an N*N matrix containing all pairwise distance values is first created and updated as new clusters are formed. This approach has at least an O(N^2) time requirement, rising to O(N^3) if a simple serial scan of the dissimilarity matrix is used to identify the points that need to be fused in each agglomeration, a serious limitation for large N. The stored data approach requires the recalculation of pairwise dissimilarity values for each of the N-1 agglomerations; the O(N) space requirement is therefore achieved at the expense of an O(N^3) time requirement.
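A minimal sketch of the agglomeration loop, using the simple serial scan described above (hence the O(N^3) behaviour) and assuming Euclidean distance with single-link merging on toy data:

```python
def agglomerate(points, target_k):
    """Naive agglomeration: repeatedly merge the two closest clusters
    (single-link distance) until target_k clusters remain."""
    clusters = [[p] for p in points]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def cluster_dist(c1, c2):  # single-link: distance of the closest pair
        return min(dist(p, q) for p in c1 for q in c2)

    while len(clusters) > target_k:
        best = None
        # Serial scan over all cluster pairs to find the closest two.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(sorted(len(c) for c in agglomerate(pts, 2)))  # -> [2, 2]
```

A practical implementation would maintain the stored matrix incrementally (e.g., via the Lance-Williams update) rather than rescanning all pairs.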

SINGLE LINK METHOD (SLINK)

The single link method is probably the best known of the hierarchical methods and operates by joining, at each step, the two most similar objects that are not yet in the same cluster. The name single link thus refers to the joining of pairs of clusters by the single shortest link between them.

COMPLETE LINK METHOD (CLINK)

The Complete link method is similar to the Single link method except that it uses the least similar pair between two clusters to determine the inter-cluster similarity (so that every cluster member is more like the furthest member of its own cluster than the furthest item in any other cluster). This method is characterized by small, tightly bound clusters.

GROUP AVERAGE METHOD

The group average method relies on the average value of the pairwise similarities between two clusters, rather than the maximum or minimum similarity as with the single link or complete link methods. Since all objects in a cluster contribute to the inter-cluster similarity, each object is, on average, more like every other member of its own cluster than the objects in any other cluster.
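The three linkage criteria just described differ only in how they combine the pairwise distances between two clusters. A minimal sketch, using a made-up one-dimensional distance for illustration:

```python
def single_link(c1, c2, d):
    """Distance of the closest pair between the clusters."""
    return min(d(p, q) for p in c1 for q in c2)

def complete_link(c1, c2, d):
    """Distance of the furthest pair between the clusters."""
    return max(d(p, q) for p in c1 for q in c2)

def group_average(c1, c2, d):
    """Average distance over all pairs between the clusters."""
    return sum(d(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

dist = lambda p, q: abs(p - q)
a, b = [1, 2], [4, 8]
# Pairwise distances are 3, 7, 2, 6.
print(single_link(a, b, dist))    # -> 2
print(complete_link(a, b, dist))  # -> 7
print(group_average(a, b, dist))  # -> 4.5
```

Because complete link uses the least similar pair, it tends to produce the small, tightly bound clusters noted above, while single link can chain loosely connected points together.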

TEXT BASED DOCUMENTS

In text-based documents, clusters may be formed by treating as similar those documents that share keywords occurring at least a minimum number of times in a document. When a query for a particular word arrives, instead of checking the entire database, only the cluster that has that word in its keyword list is scanned and the result is returned. The order of the documents in the result depends on the number of times the keyword appears in each document.
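A minimal sketch of this keyword-based scheme, with invented toy documents: clusters are keyed by keyword, and a query scans only the matching cluster, ranking members by term frequency as described above.

```python
docs = {
    "d1": "wheat yield wheat soil",
    "d2": "soil moisture wheat",
    "d3": "tumor scan brain",
}

# Group documents by keyword (effectively a simple inverted index).
clusters = {}
for name, text in docs.items():
    for word in set(text.split()):
        clusters.setdefault(word, []).append(name)

def query(word):
    """Scan only the matching cluster; rank by occurrences of the word."""
    members = clusters.get(word, [])
    return sorted(members,
                  key=lambda d: docs[d].split().count(word),
                  reverse=True)

print(query("wheat"))  # -> ['d1', 'd2']  (d1 contains "wheat" twice)
```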

CATEGORIZATION OF CLUSTERING ALGORITHMS

Algorithms are the key step in realizing these techniques. Many clustering algorithms currently exist and more are still evolving, but in general there is no single straight or canonical clustering algorithm. The main categories are:

Hierarchical methods:

Agglomerative algorithms
Divisive algorithms

Partitioning methods:

Relocation algorithms
Probabilistic clustering
K-medoids methods
K-means methods

Density-based algorithms:

Density-based connectivity clustering
Density functions clustering

Grid-based methods

Methods based on co-occurrence of categorical data

Constraint-based clustering

Clustering algorithms used in machine learning:

Gradient descent and artificial neural networks
Evolutionary methods

Scalable clustering algorithms

Algorithms for high dimensional data:

Subspace clustering
Projection techniques
Co-clustering techniques

Normally, clustering techniques are broadly divided into hierarchical and partitioning. Hierarchical clustering is further subdivided into agglomerative and divisive. The ground rules of hierarchical clustering include the Lance-Williams formula and the idea of conceptual clustering; classic algorithms such as SLINK, COBWEB, CURE and CHAMELEON fall under hierarchical clustering. While hierarchical algorithms build clusters gradually, partitioning algorithms learn clusters directly: they try to discover clusters by iteratively relocating points between subsets, or try to identify clusters as areas highly populated with data. Algorithms of the 1st kind are surveyed in the section on partitioning relocation methods. They are further categorized into probabilistic clustering (the EM framework; algorithms SNOB, AUTOCLASS, MCLUST), k-medoids methods (algorithms PAM, CLARA, CLARANS and its extensions) and k-means methods (different schemes for initialization, optimization, harmonic means and extensions). Such methods concentrate on how well points fit into their clusters and tend to build clusters of proper convex shapes.
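To illustrate the iterative-relocation idea behind the k-means family mentioned above, here is a minimal plain k-means sketch (random initialization, Euclidean distance, toy data); it is an illustration of the general scheme, not any of the cited algorithms:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its group."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            groups[nearest].append(p)
        # Recompute centroids; keep the old one if a group emptied out.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return groups

pts = [(0, 0), (0, 1), (9, 9), (9, 10)]
print(sorted(len(g) for g in kmeans(pts, 2)))  # -> [2, 2]
```

The relocation step is what distinguishes this family from hierarchical methods: points can move between clusters on every iteration, and the resulting clusters are convex regions around the centroids.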

APPLICATIONS

Data clustering has an immense number of applications in every field of life. One has to cluster many things on the basis of similarity, either consciously or unconsciously, so the history of data clustering is as old as the history of mankind. In computing, too, data clustering has its own value, especially in the field of information retrieval, where clustering plays an important role. To detect many diseases such as tumors, scanned pictures or X-rays are compared with existing ones and the dissimilarities are recognized. The medical field keeps clusters of images of different parts of the body; for example, images of CT scans of the brain are kept in one cluster. To arrange things further, the images in which the right side of the brain is damaged are kept in one cluster.

Hierarchical clustering is used here. The stored images have already been analyzed and a record is associated with each image; in this form, a large database of images is maintained using hierarchical clustering. When a new query image arrives, the particular cluster to which it belongs is first recognized; then, by similarity matching with a healthy image of that specific cluster, the main damaged or diseased portion is identified. The image is then sent to that specific cluster and matched with all the images in that particular cluster. The image with which the query image has the most similarities is retrieved, and the record associated with that image is also associated with the query image. This means that the disease of the query image has now been detected.

Using this technique and sufficiently precise methods of pattern matching, diseases such as very fine tumors can also be detected. By using clustering, an enormous amount of the time spent finding the exact match in the database is saved.

CONCLUSION

In this study, the basic concepts of clustering and clustering techniques are presented. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. These techniques are used in many areas such as marketing, agriculture, biology and medicine. This study finds that clustering has become a highly active research area in data mining.

How to cite this article:

L. Jeyasimman, S.S. Baskar and L. Arockiam. Overview of Clustering Techniques in Agriculture Data Mining.
DOI: https://doi.org/10.36478/aj.2011.222.225
URL: https://www.makhillpublications.co/view-article/1816-9155/aj.2011.222.225