Project: Cliparts Data mining
Data mining using C4.5 to distinguish cliparts from photos
- Tech Stack
- Weka, C#, C4.5
- The challenge
- Clipart images are composed of drawings. Given a data set of 1164 images, the goal was to distinguish clipart from photos as accurately as possible.
- Weka the flightless bird
- Weka is a flightless bird species of the rail family. Just kidding; although that is true, Weka is also a Java application with a collection of machine learning algorithms for data mining tasks. This tool was used to create decision trees with the C4.5 algorithm. The resulting decision tree was later implemented in a C# program. To create these trees, input parameters were needed. These included the number of different colors, the number of identical colors, the number of transparent colors, etc. Other input parameters were the number of peaks above a threshold in the histogram of the image and the number of pixels of the same color next to each other. In the end, the resulting tree (see above) was very simple. It looks at the pixels on the corners and sides of the image. If some of these colors fall within a tight range, they are treated as equal. This method worked pretty well for the given data set. Its obvious weakness is that a photo with a border is classified as clipart as well.
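To make the input parameters and the border check a bit more concrete, here is a minimal sketch. It is written in Python for brevity, not in the C# of the original program, and it is not the actual tree produced by Weka. Everything in it is an assumption made for illustration: the function names, the image representation (a list of rows of `(r, g, b, a)` tuples), the `tolerance` value, the choice to count histogram bins above a threshold as "peaks", and the rule that all sampled border pixels must be close together.

```python
from collections import Counter

# Assumed representation: an image is a list of rows,
# and each pixel is an (r, g, b, a) tuple with values 0..255.

def count_distinct_colors(image):
    """Number of different colors in the image."""
    return len({px for row in image for px in row})

def count_transparent_pixels(image):
    """Number of fully transparent pixels (alpha == 0); an assumed definition."""
    return sum(1 for row in image for px in row if px[3] == 0)

def longest_same_color_run(image):
    """Longest horizontal run of identical neighbouring pixels."""
    best = 0
    for row in image:
        run, prev = 0, None
        for px in row:
            run = run + 1 if px == prev else 1
            prev = px
            best = max(best, run)
    return best

def histogram_peaks(image, threshold):
    """Number of gray-level bins with more than `threshold` pixels.

    Simplification: a bin above the threshold stands in for a "peak".
    """
    hist = Counter((r + g + b) // 3 for row in image for (r, g, b, _a) in row)
    return sum(1 for n in hist.values() if n > threshold)

def border_samples(image):
    """Pixels at the four corners and the midpoints of the four sides."""
    h, w = len(image), len(image[0])
    points = [(0, 0), (0, w - 1), (h - 1, 0), (h - 1, w - 1),
              (0, w // 2), (h - 1, w // 2), (h // 2, 0), (h // 2, w - 1)]
    return [image[y][x] for y, x in points]

def looks_like_clipart(image, tolerance=8):
    """Hypothetical rule: clipart if the border samples are nearly one color."""
    samples = border_samples(image)
    return all(
        max(px[c] for px in samples) - min(px[c] for px in samples) <= tolerance
        for c in range(3)  # compare R, G and B; alpha is ignored here
    )

# Tiny synthetic examples: a flat white image with a red blob, and a gradient.
clip = [[(255, 255, 255, 255)] * 4 for _ in range(4)]
clip[1][1] = clip[1][2] = (255, 0, 0, 255)
photo = [[(x * 60, y * 60, (x + y) * 30, 255) for x in range(4)] for y in range(4)]

print(looks_like_clipart(clip), looks_like_clipart(photo))
```

Values like these would be written out as one row of attributes per image and handed to Weka to train the tree. The border rule also shows why a photo with a plain border slips through: its sampled pixels all fall in the same tight range.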