Tag archive: JAVA

K Means clustering and Histogram equalization source code

Because there was some interest in the source code from two earlier posts (post 1, post 2), I’ve posted it here.

It’s also available through the download page.

Please note that the code is at least 4 years old and that there are far better ways to solve some of these things. I therefore suggest using it for study purposes only. Comments and improvements are very much appreciated.


K-means clustering Java code found!

My blog post on K-means clustering has the highest number of views, so people are apparently interested in it. Sadly, I lost the source code of the K-means action a while ago. Last week I needed an external hard disk to back up some files, and there was already some content on the disk: quite a few pieces of code, including the K-means code. Although it is fairly simple code operating on (if I remember correctly) 8-bit greyscale images, it might give some insight into how to do this.

Continue reading


Frequent Itemset Mining implementation in JAVA

Huge datasets, often containing important operational knowledge, defy standard data analysis methods. Traditional data analysis methods do not easily scale from analyzing megabytes of data to analyzing terabytes or petabytes of data, nor from analyzing low dimensional data to analyzing very high dimensional data. Furthermore, results may become difficult or almost impossible for the end user to interpret because of their size and complexity. These are some of the problems that novel data mining methods try to solve.
Frequent Itemset Mining focuses on deriving association rules which can then be used to classify new incoming data. The classic example is the shopping cart example with the ‘myth’: people who buy diapers also buy beer. If the rule D(iapers) => B(eer) has a high confidence, it will show up in the output of the algorithm. For a retailer this is useful to know, so that it can adjust its shop accordingly (for example by putting the diapers close to the beer) and its promotions (if you buy an extra pack of diapers, you get a 50% discount on beer).
There is a large variety of Frequent Itemset Mining algorithms available on the internet. Because none of them was of direct use to me, I’ve made an implementation myself. It has its known downsides (see below), but it hopefully provides a start for people who want to do more with FIM.
This Frequent Itemset Mining implementation includes 2 algorithms for the actual mining: the Apriori algorithm and the FP-Tree (FP-Growth) algorithm. Note that the packages for these two algorithms were not written by me (see the library references below).
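To make the confidence of a rule such as D => B a bit more concrete, here is a minimal sketch of how support and confidence could be computed over a list of shopping baskets. The class name, the method names and the toy baskets are made up for illustration and are not taken from the downloadable code or from the Apriori and FP-Growth packages.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, only meant to illustrate support and confidence.
class RuleConfidence {

    // Fraction of transactions that contain every item of the given itemset.
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        if (transactions.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (Set<String> transaction : transactions) {
            if (transaction.containsAll(itemset)) {
                hits++;
            }
        }
        return (double) hits / transactions.size();
    }

    // confidence(A => B) = support(A together with B) / support(A)
    static double confidence(List<Set<String>> transactions,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<String>(antecedent);
        both.addAll(consequent);
        double supportA = support(transactions, antecedent);
        return supportA == 0.0 ? 0.0 : support(transactions, both) / supportA;
    }

    public static void main(String[] args) {
        // Toy data: 3 of the 4 baskets contain diapers, 2 of those also contain beer.
        List<Set<String>> baskets = Arrays.<Set<String>>asList(
            new HashSet<String>(Arrays.asList("diapers", "beer", "milk")),
            new HashSet<String>(Arrays.asList("diapers", "beer")),
            new HashSet<String>(Arrays.asList("diapers", "bread")),
            new HashSet<String>(Arrays.asList("beer", "chips")));
        Set<String> d = new HashSet<String>(Arrays.asList("diapers"));
        Set<String> b = new HashSet<String>(Arrays.asList("beer"));
        System.out.println("support(D) = " + support(baskets, d));
        System.out.println("confidence(D => B) = " + confidence(baskets, d, b));
    }
}
```

In this toy data two out of the three diaper baskets also contain beer, so the confidence of D => B is 2/3. A mining algorithm would only report the rule if that value is above a chosen threshold.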

Input of the program / code:
A comma-separated “.data” file.

Known downsides of the program / code:
> The user is asked for two classes when performing the scanning algorithm. In many cases a data file contains more than two classes. To handle such files the code only needs a slight adjustment: the algorithm has to determine the number of distinct classes itself and perform the partition scans on all of them.
> File rewriting is currently necessary for the algorithm to work. The speed could be improved considerably if this were no longer needed. For that, the code of these algorithms needs to be rewritten to handle lists directly.
> The hash function in the program is pretty basic and probably not sufficient for large datasets. This could be solved by implementing a stronger hash function (or by using JAVA’s built-in hash function).
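As an illustration of the last point, this is roughly what relying on JAVA’s own hashing could look like: a HashMap keyed by a Set of items uses Set.hashCode() and Set.equals(), so no hand-written hash function is needed. This is only a sketch under my own assumptions. The ItemsetCounter class and its methods are hypothetical and do not appear in the downloadable code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical counter for candidate itemsets, not part of the original program.
class ItemsetCounter {

    // HashMap hashes its keys with Set.hashCode(), which does not depend on
    // item order, so {a, b} and {b, a} end up in the same entry.
    private final Map<Set<String>, Integer> counts = new HashMap<Set<String>, Integer>();

    // Registers one occurrence of the given itemset.
    void add(String... items) {
        Set<String> key = new HashSet<String>(Arrays.asList(items));
        Integer old = counts.get(key);
        counts.put(key, old == null ? 1 : old + 1);
    }

    // Number of times the given itemset was registered (0 if never).
    int count(String... items) {
        Integer c = counts.get(new HashSet<String>(Arrays.asList(items)));
        return c == null ? 0 : c;
    }

    public static void main(String[] args) {
        ItemsetCounter counter = new ItemsetCounter();
        counter.add("diapers", "beer");
        counter.add("beer", "diapers"); // same itemset, different order
        counter.add("milk");
        System.out.println(counter.count("diapers", "beer"));
    }
}
```

Whether this is fast enough for very large datasets would still have to be measured, but it removes the custom hash function as a source of collisions and bugs.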
Overall, I’ve tried to keep all functions as generic as possible so that further extension is possible. From that point of view the code is fairly extensible, and slightly adjusting some functions (the hash function and the scanning parameters) would make it even more so. Any CSV file can be read, and the program reacts appropriately to the incoming data: if the data is correct and can be handled, the algorithm starts working on this input and produces a result file.
I was also surprised to see how many warnings my Eclipse environment generated for the source code I used. Most of them can be solved directly, but some need more time.

Download
The JAR file, source and test input file can be found in the download section.

Find more on Frequent Itemset Mining: http://www.google.nl/search?source=ig&hl=nl&rlz=&q=Frequent+Itemset+Mining&btnG=Google+zoeken
More on Association Rule learning: http://www.google.nl/search?hl=nl&q=Association+Rule+Learning&btnG=Zoeken

Library references:
CSVReader: OpenCSV (http://opencsv.sourceforge.net)
Hash function: http://www.cs.usfca.edu/galles/cs245/hash.java.html
Apriori algorithm: http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_prog1.html
FPGrowth algorithm: http://www.csc.liv.ac.uk/~frans/KDD/Software/FPGrowth/fpGrowth.html


K-means clustering implementation in JAVA

Details about K-Means Clustering on images:

Before the algorithm starts, the user needs to set the number of greyvalues (bins). The resulting image will contain exactly that number of greyvalues.
With that number of bins (called ‘k’) the algorithm clusters the greyvalues of the image into k clusters, and once the algorithm has terminated, every cluster has its own greyvalue.
Before starting the algorithm, you should also decide:

  • How to define the ‘startingmeans’ of the clusters before the first iteration.
  • What the stopping criteria are.

The Algorithm:

In short this is what the algorithm is supposed to do:
Initialize (so set k, set ‘startingmeans’, set stopping criteria)
Loop while termination condition isn’t met (

  • For each pixel: assign the pixel to the class whose center (the mean) is closest to the pixel value.
  • For each class: recalculate the mean of the class based on the pixels belonging to that class.

)
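The body of this loop could look roughly like the sketch below for 8-bit greyvalues. It is not the code from my implementation (which works with bounds, as described further on) but a plain nearest-mean version. The class name KMeansStep and the arrays it works on are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical sketch of one K-means iteration on grey values.
class KMeansStep {

    // Index of the class whose mean is closest to the given grey value.
    static int nearestClass(double[] means, int greyValue) {
        int best = 0;
        for (int c = 1; c < means.length; c++) {
            if (Math.abs(greyValue - means[c]) < Math.abs(greyValue - means[best])) {
                best = c;
            }
        }
        return best;
    }

    // One pass of the loop body. Updates 'labels' and 'means' in place and
    // returns how many pixels ended up in a different class than before.
    static int iterate(int[] pixels, int[] labels, double[] means) {
        long[] sum = new long[means.length];
        int[] count = new int[means.length];
        int changed = 0;
        // Step 1: assign every pixel to the class with the nearest mean.
        for (int i = 0; i < pixels.length; i++) {
            int c = nearestClass(means, pixels[i]);
            if (c != labels[i]) {
                changed++;
                labels[i] = c;
            }
            sum[c] += pixels[i];
            count[c]++;
        }
        // Step 2: recalculate the mean of every non-empty class.
        for (int c = 0; c < means.length; c++) {
            if (count[c] > 0) {
                means[c] = (double) sum[c] / count[c];
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        int[] pixels = {10, 12, 11, 200, 205, 210};
        int[] labels = new int[pixels.length];
        Arrays.fill(labels, -1);       // -1 means: not assigned yet
        double[] means = {0.0, 255.0}; // k = 2 starting means
        int changed = iterate(pixels, labels, means);
        System.out.println(changed + " pixels changed, means: " + Arrays.toString(means));
    }
}
```

The returned number of changed pixels is what a termination condition can look at after each pass.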

My implementation:

The user can set k (which is fairly easy).
I’ve implemented 3 ways to choose the ‘startingmeans’ so far:

  1. i mod k class: The pixel at index i is assigned to the class i modulo k.
  2. Distribute mean table over color space: Given the chosen k, the means are picked so that they are spread evenly over the complete color space of the image.
  3. Random: Just as it says. Given a k, k random mean values are chosen.
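The three options above can be sketched as follows. Several details are assumptions on my side and not confirmed by the original code: that the color space is the 8-bit range 0..255, that option 1 takes the average of each i mod k class as its starting mean, and that option 2 uses the centers of k equally wide bins. The class and method names are hypothetical as well.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of the three ways to pick starting means (8-bit grey values assumed).
class StartingMeans {

    // 1. i mod k: pixel i starts in class i % k. The starting mean of a class
    //    is assumed to be the average of the pixels assigned to it.
    static double[] byIndexModulo(int[] pixels, int k) {
        long[] sum = new long[k];
        int[] count = new int[k];
        for (int i = 0; i < pixels.length; i++) {
            sum[i % k] += pixels[i];
            count[i % k]++;
        }
        double[] means = new double[k];
        for (int c = 0; c < k; c++) {
            means[c] = count[c] == 0 ? 0.0 : (double) sum[c] / count[c];
        }
        return means;
    }

    // 2. Evenly spread: the center of each of k equally wide bins over 0..255.
    static double[] evenlySpread(int k) {
        double[] means = new double[k];
        double width = 256.0 / k;
        for (int c = 0; c < k; c++) {
            means[c] = (c + 0.5) * width;
        }
        return means;
    }

    // 3. Random: k random grey values in 0..255, sorted so that the classes
    //    are ordered from dark to light.
    static double[] random(int k, Random rng) {
        double[] means = new double[k];
        for (int c = 0; c < k; c++) {
            means[c] = rng.nextInt(256);
        }
        Arrays.sort(means);
        return means;
    }

    public static void main(String[] args) {
        int[] pixels = {0, 10, 20, 30};
        System.out.println(Arrays.toString(byIndexModulo(pixels, 2)));
        System.out.println(Arrays.toString(evenlySpread(2)));
        System.out.println(Arrays.toString(random(2, new Random())));
    }
}
```

The random option can produce two equal means, which leaves one class empty. A real implementation would probably want to redraw in that case.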

The termination criteria are currently not exposed to the user and are set as follows:
Terminate when fewer than n pixels change class after a recalculation of the means.
I’ve set n to 300, which is pretty small for images larger than 512 by 512 pixels. In addition, the algorithm is terminated if more than j iterations are needed to reach a stable result (stable meaning that fewer than n pixels change class after a recalculation of the means). My j is currently set to 50. Most of the time the algorithm terminates because fewer than 300 pixels have changed class.
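The two stopping criteria can be combined in one small loop. The sketch below only assumes that a single iteration reports how many pixels changed class. The Termination class, its constants and the IntSupplier parameter are illustrative names, not the ones from the original code.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of the two stopping criteria described above.
class Termination {

    static final int MIN_CHANGED_PIXELS = 300; // n: stop when fewer pixels than this change class
    static final int MAX_ITERATIONS = 50;      // j: never run more iterations than this

    // 'iteration' performs one assign-and-recalculate pass and returns the
    // number of pixels that changed class. Returns the number of passes run.
    static int runUntilStable(IntSupplier iteration) {
        int iterations = 0;
        while (iterations < MAX_ITERATIONS) {
            int changed = iteration.getAsInt();
            iterations++;
            if (changed < MIN_CHANGED_PIXELS) {
                break; // stable enough: fewer than n pixels changed class
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        // Simulated numbers of changed pixels for three consecutive passes.
        final int[] changedPerPass = {5000, 1200, 299};
        final int[] pass = {0};
        int used = runUntilStable(() -> changedPerPass[pass[0]++]);
        System.out.println("Stopped after " + used + " iterations");
    }
}
```

Since 300 pixels is a tiny fraction of a large image, expressing n as a percentage of the number of pixels might be a more robust choice.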

Now that we’ve seen how the parameters of the algorithm are set, let’s look at how I’ve implemented the algorithm in code and at the decisions I made along the way.

I’ve divided the code over 3 classes: one that builds the JDialog which asks the user how the algorithm should be initialized, one with the actual algorithm, and a cluster class.

Because the class with the JDialog is not that interesting, we’ll focus on the other two classes.

The ClusterClass is pretty simple: it only holds a mean, an upper bound and a lower bound.
I chose to let this class hold the bounds for the following reason. At the initialization of the algorithm, k classes are created (and put into an ArrayList). You could let each class hold the pixels that belong to it, but if your k grows and the image is big, the complete image will be in memory twice: once as the original image and once spread over the cluster classes. By holding only the bounds of each class instead, I can determine which class a pixel value belongs to in the same for-loop that inspects the pixel values.

As mentioned earlier, a pixel belongs to the class whose mean is closest to its pixel value. Because my ClusterClasses hold an upper and a lower bound, a pixel value has to lie between those bounds to be part of that class. The bounds are found by looking up the neighbouring means: the nearest lower mean for the lower bound and the nearest higher mean for the upper bound. Each bound is then simply the average of the class mean and that neighbouring mean (a value below this midpoint is closer to the lower mean, a value above it is closer to the higher mean).
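The bound calculation can be sketched like this. The field and method names are my own guesses for the example and not necessarily those of the real ClusterClass. I also assume that the classes are sorted by ascending mean, that the grey values are 8-bit, and that a lower bound is inclusive while an upper bound is exclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of cluster classes that only store a mean and two bounds.
class ClusterBounds {

    static final class ClusterClass {
        double mean;
        double lowerBound; // inclusive
        double upperBound; // exclusive

        ClusterClass(double mean) {
            this.mean = mean;
        }
    }

    // Expects the clusters to be sorted by ascending mean. Each bound is the
    // midpoint between two neighbouring means. The outer bounds cover 0..255.
    static void calculateBounds(List<ClusterClass> clusters) {
        for (int c = 0; c < clusters.size(); c++) {
            ClusterClass current = clusters.get(c);
            current.lowerBound = (c == 0)
                ? 0.0
                : (clusters.get(c - 1).mean + current.mean) / 2.0;
            current.upperBound = (c == clusters.size() - 1)
                ? 256.0
                : (current.mean + clusters.get(c + 1).mean) / 2.0;
        }
    }

    // Index of the class whose bounds contain the grey value, or -1 if none.
    static int classOf(List<ClusterClass> clusters, int greyValue) {
        for (int c = 0; c < clusters.size(); c++) {
            ClusterClass cluster = clusters.get(c);
            if (greyValue >= cluster.lowerBound && greyValue < cluster.upperBound) {
                return c;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<ClusterClass> clusters = new ArrayList<ClusterClass>();
        clusters.add(new ClusterClass(50));
        clusters.add(new ClusterClass(100));
        clusters.add(new ClusterClass(200));
        calculateBounds(clusters); // bounds become 0..75, 75..150 and 150..256
        System.out.println(classOf(clusters, 74) + " " + classOf(clusters, 75));
    }
}
```

With means 50, 100 and 200 the boundaries lie at 75 and 150, so greyvalue 74 falls in the first class and 75 in the second.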
After every pixel has been assigned to a class (in my case: once it can be determined to which class it belongs), the means of the classes can be recalculated by summing all pixel values belonging to a cluster class and dividing this sum by the number of pixels in that class.
After the recalculation of the means, the upper and lower bounds need to be recalculated as well.
After this iteration, the termination condition is checked. If the condition isn’t met, another iteration follows. If the condition is met, the clusters are final and the greyvalues of the image can be recalculated.

And now briefly, in JAVA-like pseudocode:

public KMeansAction:

initialize
calculateBounds
while (not_terminated) do:

  • recalculateMeans
  • recalculateBounds
  • checkTermination

processImage

private void processImage:

// This works for 8-bit greyscale images
// It calculates the greyvalues that will occur in the resulting image
delta = 255 / (k - 1)
for every pixel do:
    for every class do:
        if the pixel belongs to that class then
            // set the greyvalue of that pixel to the index of the class in the list times delta
            greyvalue = classindex * delta
            // then set the rgbvalue of that pixel to the greyvalue
            newImage.setRGB(pixel location, greyvalue)
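
For readers who prefer real code over pseudocode, the processImage step could be written with java.awt.image.BufferedImage roughly as below. This is a sketch and not the original source. The class name, the upperBounds array (the exclusive upper bound of each class, in ascending order) and the use of a floating point delta with rounding are assumptions for the example.

```java
import java.awt.image.BufferedImage;

// Hypothetical sketch of the processImage step for 8-bit greyscale images.
class ProcessImageSketch {

    // upperBounds[c] is the exclusive upper bound of class c, in ascending
    // order. Values above the last bound fall into the last class.
    static BufferedImage processImage(BufferedImage source, double[] upperBounds) {
        int k = upperBounds.length;
        // Spread the k output grey values evenly over 0..255.
        // With k == 1 every pixel simply becomes 0.
        double delta = k > 1 ? 255.0 / (k - 1) : 0.0;
        BufferedImage result = new BufferedImage(
            source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                // For a grey pixel R, G and B are equal, so the blue byte is the grey value.
                int grey = source.getRGB(x, y) & 0xFF;
                int classIndex = 0;
                while (classIndex < k - 1 && grey >= upperBounds[classIndex]) {
                    classIndex++;
                }
                int out = (int) Math.round(classIndex * delta);
                // Write the same value to the red, green and blue channel.
                result.setRGB(x, y, (out << 16) | (out << 8) | out);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        image.setRGB(0, 0, 0x0A0A0A); // grey value 10
        image.setRGB(1, 0, 0xC8C8C8); // grey value 200
        BufferedImage clustered = processImage(image, new double[] {100.0, 256.0});
        System.out.println((clustered.getRGB(0, 0) & 0xFF) + " " + (clustered.getRGB(1, 0) & 0xFF));
    }
}
```

Using a floating point delta avoids the rounding loss of an integer division: with k = 3, an integer 255 / 2 gives 127, so the brightest class would end up at 254 instead of 255.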

NOTE (August 7, 2009): I’ve found the source code and put it in this blogpost.
