Data Mining is a course I took this past semester and really enjoyed. Today I want to talk a little about what it actually is and how it's done. Data mining brings together subjects like machine learning, statistical analysis and basic programming in order to analyze data and extract information from what is called a data set. A data set can come from a spreadsheet, a database, an ASCII file or really any source of data. When data from several such sources is gathered into a single, consistent form, the result is often called a data warehouse.
It can even contain data that has been typed in from hard copies or handwritten files. Someone gathers all the data needed about something, like records of different types of glass, or people's blood samples, and turns it into a data warehouse. The data then goes through filtering and preprocessing, which removes useless or noisy data. Useless data is essentially records with missing or misplaced details, duplicates, and in some cases outliers. These are detected using various statistical methods and measures. The cleaned data is then processed by an algorithm, which produces a "model". After that, the model is evaluated to measure its accuracy and to determine whether adjustments are needed, or even whether another algorithm should be used. There are many algorithms and measures for both model creation and evaluation.
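To make the filtering step a bit more concrete, here is a minimal sketch in plain Python. Everything in it is my own assumption for illustration: the `preprocess` function, the made-up `people` records, and the choice of the common 1.5 x IQR rule for spotting outliers. Real tools offer many other methods, and whether an outlier should be dropped at all depends on the data.

```python
from statistics import quantiles

def preprocess(rows, column, k=1.5):
    """Drop incomplete rows, exact duplicates, and IQR outliers in `column`."""
    # 1. Keep only rows where every field has a value.
    complete = [r for r in rows if all(v is not None for v in r.values())]

    # 2. Drop exact duplicates, keeping the first occurrence of each row.
    seen, unique = set(), []
    for r in complete:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3. Drop rows whose value lies more than k * IQR outside the quartiles.
    q1, _, q3 = quantiles([r[column] for r in unique], n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [r for r in unique if low <= r[column] <= high]

# Hypothetical records (income in thousands), invented for this example.
people = [
    {"name": "A", "income": 40},
    {"name": "B", "income": 42},
    {"name": "B", "income": 42},    # duplicate row
    {"name": "C", "income": None},  # missing value
    {"name": "D", "income": 45},
    {"name": "E", "income": 47},
    {"name": "F", "income": 50},
    {"name": "G", "income": 1000},  # suspiciously large value
]

cleaned = preprocess(people, "income")
print([r["name"] for r in cleaned])
```

The order of the steps matters a little here: duplicates and missing values are removed first so they don't distort the quartiles used for the outlier check.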
The "model" created is essentially the information we need. It can take many forms, such as a set of rules or a decision tree. A model can be built using many different methods, which usually take a lot of time and calculation to do by hand. There are computer programs such as RapidMiner, WEKA and KNIME, which each come in handy in different ways. These programs can take data sets, help the user filter and preprocess them, and eventually create the models.
But creating a model is not the whole job. You also need to analyze your data set by visualizing it, studying it and trying different measures in order to find the right algorithm. Once an algorithm has produced a model, evaluation measures are used to determine how correct the model's deductions are. Each type of data mining has its own set of measures.
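As an illustration of what an evaluation measure looks like, here is a small sketch that computes three common ones (accuracy, precision and recall) for a yes/no prediction. The `evaluate` function and the two label lists are made up for this example; they are not the output of any particular program.

```python
def evaluate(y_true, y_pred, positive):
    """Compare known answers with a model's guesses for one 'positive' label."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1  # correctly flagged
        elif guess == positive:
            fp += 1  # flagged, but should not have been
        elif truth == positive:
            fn += 1  # missed
        else:
            tn += 1  # correctly left alone
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical known answers and model guesses.
actual    = ["yes", "no", "no",  "yes", "no", "yes", "no", "no"]
predicted = ["yes", "no", "yes", "yes", "no", "no",  "no", "no"]

scores = evaluate(actual, predicted, positive="yes")
print(scores)
```

Accuracy alone can be misleading when one answer is much rarer than the other, which is one reason there are so many different measures.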
Data mining can be done in many ways, such as classification, regression, clustering, etc., each of which suits different kinds of data sets and questions.
Classification takes a data set that has many columns, called attributes, and one or more columns it is trying to predict, called classes. For instance, let's say you have information on a certain number of people's income, relationship status, bank account balances, etc., and what you want to know is whether they cheated on their taxes or not. What you need is a data set that is already classified, which means the cheated-on-taxes column is filled in for every record. A classification algorithm then learns from this data to build a model, the model is evaluated, and it can then be used to classify new records.
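Here is a rough sketch of the tax example using k-nearest neighbours, which is just one of many classification algorithms and one I picked because it fits in a few lines. The numbers, the two attributes (income and account balance, both in thousands) and the `knn_predict` function are all invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Guess the class of x by majority vote among its k closest training rows."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Already-classified records: (income, balance) -> cheated on taxes?
train_X = [(30, 5), (35, 8), (40, 6), (120, 1), (130, 2), (125, 0)]
train_y = ["no", "no", "no", "yes", "yes", "yes"]

# Records held back to check the model; their true answers are known too.
test_X = [(33, 7), (128, 1)]
test_y = ["no", "yes"]

predictions = [knn_predict(train_X, train_y, x) for x in test_X]
correct = sum(p == t for p, t in zip(predictions, test_y))
print(predictions, f"{correct}/{len(test_y)} correct")
```

Holding back a few classified records for testing is what lets you evaluate the model: it never saw them during training, so its guesses on them are a fair check.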
Clustering, on the other hand, has no classes. Instead it groups the data into "clusters". For instance, let's assume your data set is a collection of documents and your attributes are all keywords. Based on those keywords, a clustering algorithm might split the documents into 3 clusters, which you could then recognize and name as, for instance, Legal Documents, Letters, and News Articles. Each algorithm tries to form the clusters so that the items inside each cluster are as similar as possible in terms of their attributes, while the clusters themselves are as different from one another as possible.
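To show the idea, here is a bare-bones sketch of k-means, one well-known clustering algorithm. The documents are represented by made-up counts of three keywords ("court", "dear", "reported"), and I fix the starting centroids by hand so the run is repeatable; real implementations usually pick them randomly.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, then move the centroids, until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid where it was
        if updated == centroids:
            break  # nothing moved, so the clusters are stable
        centroids = updated
    return labels, centroids

# Hypothetical keyword counts per document: ("court", "dear", "reported").
docs = [(9, 0, 1), (8, 1, 0), (0, 7, 1), (1, 9, 0), (0, 1, 8), (1, 0, 9)]

labels, centers = kmeans(docs, centroids=[docs[0], docs[2], docs[4]])
print(labels)
```

Note that the algorithm only returns cluster numbers. Deciding that cluster 0 looks like legal documents and cluster 1 looks like letters is up to whoever inspects the result.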
I'll write another article in the future about data mining using programs and machine learning.