Data Mining Parameters - Data Science
Data mining covers everything related to data, from the collection of raw data through exploratory data analysis (EDA) to the preparation of input for an AI algorithm. Many parameters are used to describe data. Some of those discussed here are impurity indices, measures of central tendency, eigenvalues and eigenvectors, and PCA in classification.
1 Entropy
Entropy is a measure of impurity, disorder, or uncertainty in a set of examples; that is, it indicates how mixed the class labels in our data are. In decision trees, the goal is to tidy the data. Entropy controls how a decision tree decides to split the data: it affects where the tree draws its boundaries so that each resulting partition contains, as far as possible, objects of a single class. The entropy of a dataset S is computed using the following equation:
$$ H(S) = -\sum_{x \in X} p(x) \log_2 p(x) $$
Where,
S = The current dataset for which entropy is being calculated
X = Set of classes in S
p(x) = The proportion of elements of S that belong to class x
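As a rough illustration, entropy can be computed from a list of class labels. This is a minimal sketch that assumes the standard Shannon definition with a base-2 logarithm, $H(S) = -\sum_{x} p(x) \log_2 p(x)$; the function name `entropy` and the sample labels are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    # p(x): fraction of the examples in S that belong to class x
    probs = [count / n for count in Counter(labels).values()]
    # Sum the per-class terms -p(x) * log2(p(x))
    return sum(-p * math.log2(p) for p in probs)


mixed = ["yes", "yes", "no", "no"]    # evenly mixed two-class node
pure = ["yes", "yes", "yes", "yes"]   # node with a single class
print(entropy(mixed), entropy(pure))
```

A pure node has entropy 0, and an evenly mixed two-class node has entropy 1 bit, which is the maximum for two classes.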
2 Gini
Impurity measures such as entropy and the Gini index tend to favor attributes that have a large number of distinct values. Using the same notation as for entropy, the Gini index is computed using the following equation:
$$ G(S) = 1-\sum_{x \in X} p(x)^2 $$
Where,
S = The current dataset for which the Gini index is being calculated
X = Set of classes in S
p(x) = The proportion of elements of S that belong to class x
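The Gini formula above can be sketched in a few lines of Python. The function name `gini` and the sample labels are illustrative assumptions, not part of the original text.

```python
from collections import Counter


def gini(labels):
    """Gini index of a non-empty list of class labels."""
    n = len(labels)
    # 1 minus the sum of squared class proportions p(x)^2
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini(["yes", "yes", "no", "no"]))    # evenly mixed two-class node
print(gini(["yes", "yes", "yes", "yes"]))  # pure node
```

For two classes the index ranges from 0 for a pure node to 0.5 for an even split.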
3 Classification Error
Classification error is a measure of impurity at a node. For a node t it is defined as,
$$ Error(t) = 1 - \max_i P(i|t) $$
where $P(i|t)$ is the fraction of records at node t that belong to class i. The classification error at a node ranges from a minimum of 0, when all records belong to one class, to a maximum of $1 - 1/n_c$, when records are equally distributed among all $n_c$ classes.
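A small sketch of the classification error at a node, assuming the node is represented simply as a list of the class labels of its records (the name `classification_error` is hypothetical):

```python
from collections import Counter


def classification_error(labels):
    """Classification error 1 - max_i P(i|t) for the records at one node."""
    n = len(labels)
    # The majority class determines the largest P(i|t)
    majority = max(Counter(labels).values())
    return 1.0 - majority / n


# Three classes, equally distributed: the error reaches its maximum 1 - 1/n_c
print(classification_error(["a", "b", "c"]))
# All records in one class: the error is at its minimum
print(classification_error(["a", "a", "a"]))
```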
4 Covariance Matrix
Variance measures the variation of a single random variable (like the height of a person in a population), whereas covariance measures how much two random variables vary together (like the height and the weight of a person in a population). The covariance matrix collects the pairwise covariances in a square matrix given by $C_{i,j} = \sigma(x_i, x_j)$, where $C \in \mathbb{R}^{d \times d}$ and d is the dimension, or number of random variables, of the data (e.g. the number of features like height, width, weight, etc.). For n observations $X_i$ with mean $\overline{X}$, the covariance matrix can also be expressed as:
$$ C = \frac{1}{n-1} \sum_{i=1} ^n (X_i-\overline{X} )(X_i-\overline{X} )^T $$
The covariance matrix for two dimensions is given by,
$$ \begin{pmatrix} \sigma(x,x) & \sigma(x,y) \\ \sigma(y,x) & \sigma(y,y) \end{pmatrix} $$
The covariance matrix is symmetric since $$ \sigma(x_i, x_j) = \sigma(x_j, x_i) $$.
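The sample covariance matrix can be sketched in plain Python as follows. The function name and the toy data are assumptions for illustration; each row is one observation and each column one feature.

```python
def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of a list of observations."""
    n = len(rows)
    d = len(rows[0])
    # Mean of each feature (column)
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for j in range(d):
        for k in range(d):
            # sigma(x_j, x_k): average product of deviations from the means
            cov[j][k] = sum(
                (row[j] - means[j]) * (row[k] - means[k]) for row in rows
            ) / (n - 1)
    return cov


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # second feature is twice the first
C = covariance_matrix(data)
print(C)
```

The diagonal holds the variances of the individual features, and the off-diagonal entries are equal because the matrix is symmetric.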
5 Eigenvalue and Eigenvector
In linear algebra, an eigenvector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it. The corresponding eigenvalue is the factor by which the eigenvector is scaled. In matrix form:
$$ Av = \lambda v $$
In this equation A is an n-by-n matrix, v is a non-zero n-by-1 vector and $ \lambda $ is a scalar (which may be either real or complex). Any value of $ \lambda $ for which this equation has a solution is known as an eigenvalue of the matrix A. It is sometimes also called the characteristic value. The vector v that corresponds to this value is called an eigenvector. The eigenvalue problem can be written as
$$ Av - \lambda v = 0 $$
$$ Av - \lambda I v = 0 $$
$$ (A - \lambda I) v = 0 $$
If v is non-zero, this equation will only have a solution if
$$ |A - \lambda I| = 0 $$
This equation is called the characteristic equation of A, and is an nth order polynomial in $\lambda$ with n roots. These roots are called the eigenvalues of A. In general the roots may be repeated, but we will only deal with the case of n distinct roots. For each eigenvalue, there will be an eigenvector for which the eigenvalue equation is true.
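For a 2-by-2 matrix the characteristic equation is the quadratic $\lambda^2 - \mathrm{tr}(A)\lambda + \det(A) = 0$, so the eigenpairs can be found by hand. The sketch below assumes a matrix [[a, b], [c, d]] with real eigenvalues and b != 0; the helper name `eigen_2x2` is hypothetical.

```python
import math


def eigen_2x2(a, b, c, d):
    """Real eigenpairs of [[a, b], [c, d]] via the characteristic equation.

    Assumes b != 0 and a non-negative discriminant (real eigenvalues).
    """
    if b == 0:
        raise ValueError("this sketch assumes b != 0")
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    eigenvalues = [(trace + root) / 2, (trace - root) / 2]
    # v = (b, lambda - a) satisfies (A - lambda I) v = 0 when b != 0
    return [(lam, (b, lam - a)) for lam in eigenvalues]


for lam, v in eigen_2x2(2, 1, 1, 2):
    print(lam, v)
```

Eigenvectors are only defined up to a scalar factor, so any non-zero multiple of each returned vector is equally valid.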
6 Distances
Euclidean distance is a measure of the distance between two points in Euclidean space. Mathematically,
$$ dist = \sqrt{\sum_{k=1}^n (p_k - q_k)^2} $$
Where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the $k^{th}$ attributes (components) of data objects p and q. Minkowski distance is a generalization of Euclidean distance and is given as,
$$ dist = \left(\sum_{k=1}^n |p_k - q_k|^r \right)^{\frac{1}{r}} $$
Where r is a parameter, n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the $k^{th}$ attributes (components) of data objects p and q.
- r = 1, it becomes Manhattan distance.
- r = 2, it becomes Euclidean distance.
- $r \to \infty $, it becomes supremum distance.
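The three special cases can be checked with a short sketch. The function name `minkowski` is illustrative; the limit $r \to \infty$ is handled explicitly as the largest absolute coordinate difference.

```python
import math


def minkowski(p, q, r):
    """Minkowski distance of order r between equal-length points p and q."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if r == math.inf:
        # Supremum distance: the limit of the formula as r grows without bound
        return max(diffs)
    return sum(diff ** r for diff in diffs) ** (1 / r)


p, q = (0, 0), (3, 4)
print(minkowski(p, q, 1))         # Manhattan distance
print(minkowski(p, q, 2))         # Euclidean distance
print(minkowski(p, q, math.inf))  # supremum distance
```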
7 Similarity
Similarity is a measure of how alike two data
objects are. In a data mining context it is usually
described as a distance whose dimensions represent features
of the objects: a small distance indicates a high degree of
similarity, and a large distance indicates a low degree of
similarity. Similarity is subjective and highly dependent
on the domain and application.
Cosine Similarity of two document vectors is given as,
$$ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{||d_1|| \, ||d_2||} $$
Where ||d|| is the length of vector d.
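A minimal sketch of cosine similarity for two equal-length, non-zero vectors; the function name and the sample vectors are assumptions for illustration.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(d1, d2))
    # ||d|| is the Euclidean length of the vector
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


print(cosine_similarity((1, 2, 3), (2, 4, 6)))  # same direction
print(cosine_similarity((1, 0), (0, 1)))        # orthogonal vectors
```

Because the result depends only on direction, scaling a document vector (for example, doubling every term count) leaves the similarity essentially unchanged.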
Cosine similarity compares two real-valued vectors,
whereas Jaccard similarity compares two binary vectors
(sets). The generalized Jaccard similarity extends it to
non-negative real-valued vectors. Mathematically,
$$ J_g (a,b) = \frac{\sum_i \min(a_i, b_i)}{\sum_i \max(a_i, b_i)} $$
For example, for $t_1 = (1, 1, 0, 1)$ and $t_2 = (2, 0, 1, 1)$, the generalized Jaccard similarity index can be computed as follows:
$$ J_g(t_1, t_2) = \frac{1+0+0+1}{2+1+1+1} = \frac{2}{5} = 0.4 $$
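The worked example can be reproduced with a short sketch. The function name `generalized_jaccard` is hypothetical, and the vectors are assumed to be non-negative and not both all zeros.

```python
def generalized_jaccard(a, b):
    """Generalized Jaccard similarity of two non-negative vectors."""
    # Element-wise minima over element-wise maxima
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    return numerator / denominator


t1 = (1, 1, 0, 1)
t2 = (2, 0, 1, 1)
print(generalized_jaccard(t1, t2))  # 0.4
```

For binary vectors the minima and maxima reduce to set intersection and union, so the same function gives the ordinary Jaccard index.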
8 PCA
Principal Component Analysis (PCA) is a feature extraction method that uses orthogonal linear projections to capture the underlying variance of the data. The main idea of PCA is to reduce the dimensionality of a data set consisting of many variables that are correlated with each other, either heavily or lightly, while retaining as much of the variation present in the data set as possible. In other words, the method combines highly correlated variables into a smaller set of artificial variables, called "principal components", that account for most of the variance in the data.
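One common way to carry out PCA is to take the eigenvectors of the covariance matrix as the projection directions. The sketch below does this for two-dimensional data only, using the closed-form eigenvalues of a symmetric 2-by-2 matrix; it is an illustration under those assumptions, not a general implementation, and the name `pca_2d` is made up. It also assumes the data has non-zero variance.

```python
import math


def pca_2d(points):
    """First principal component of 2-D points.

    Returns the unit direction, the projected scores, and the fraction of
    total variance explained by that direction.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 sample covariance matrix
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Largest root of the characteristic equation of the covariance matrix
    trace = sxx + syy
    lam = (trace + math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)) / 2
    # Matching eigenvector; fall back to an axis when the features are uncorrelated
    if sxy != 0:
        vx, vy = sxy, lam - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # Project each centered point onto the principal direction
    scores = [x * vx + y * vy for x, y in centered]
    return (vx, vy), scores, lam / trace


direction, scores, explained = pca_2d([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
print(direction, scores, explained)
```

In this toy data the second feature is exactly twice the first, so a single component captures all of the variance and the two features can be replaced by one score per point.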
9 Conclusion
Measures of central tendency, similarity, and the like are part of Exploratory Data Analysis (EDA). EDA by itself does not produce a predictive model, but it is extremely useful for getting a sense of the information in the data and gives an idea of how to get started with it. Impurity indices like entropy, Gini, and classification error help examine how well a classification algorithm separates items based on their attributes. The impurity index also helps decide how deep a decision tree should grow.