Intelligent Data Analysis Extended Assignment

The data set analyzed for this exercise was the Wine recognition data, provided as a potential data set. It describes the chemical analysis of wines grown in the same region of Italy but derived from three different cultivars, which appear as three classes in the data. There are 178 data points in total, split between the classes as follows.

Class 1: 59
Class 2: 71
Class 3: 48

Each data point has 13 dimensions besides its class:

1) Alcohol
2) Malic Acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

The goal of this assignment was to use Principal Component Analysis to reduce the dimensionality of the problem while having the new dimensions provide a noticeable distinction between the three classes.

A preliminary look at the data

To start with, the data was imported into a spreadsheet. The average and variance of the whole set and of each individual class were then found, to give an idea of which factors may be most important in separating the classes. The standard deviation was also calculated for each class to help assess how separated the dimensions are between the classes. The tables are shown below.

There are some clear differences between the classes in the averages of some dimensions, particularly Color intensity, Proline, Magnesium and Alcalinity of ash. However, in all of these cases the standard deviations overlap to some extent, so the classes cannot easily be separated by any one of these dimensions.

Looking further at the average values, and particularly the variance values, it can be seen that the scales differ enough that each data point must have the average of each dimension (taken over the entire set) subtracted from it.
This is quite standard for principal component analysis. However, the different scales of the variances also mean that each column has to be normalised. This is done by dividing each mean-adjusted data point by the standard deviation of that dimension's values, which by extension sets the variance to 1 as well.

To recap: to prepare the data for PCA, each dimension of each data point was normalised by subtracting the average of that dimension over the data set and then dividing by the standard deviation of that dimension over the data set.

Averages and variance of data

Whole Data Set

Dimension                         Average   Variance
Alcohol                            13.001      0.659
Malic Acid                          2.336      1.248
Ash                                 2.367      0.075
Alcalinity of ash                  19.495     11.153
Magnesium                          99.742    203.989
Total Phenols                       2.295      0.392
Flavanoids                          2.029      0.998
Nonflavanoid phenols                0.362      0.015
Proanthocyanins                     1.591      0.328
Color intensity                     5.058      5.374
Hue                                 0.957      0.052
OD280/OD315 of diluted wines        2.612      0.504
Proline                           746.893  99166.717

Averages

Dimension                         Class 1    Class 2    Class 3
Alcohol                            13.745     12.279     13.154
Malic Acid                          2.011      1.933      3.334
Ash                                 2.456      2.245      2.437
Alcalinity of ash                  17.037     20.238     21.417
Magnesium                         106.339     94.549     99.313
Total Phenols                       2.840      2.259      1.679
Flavanoids                          2.982      2.081      0.781
Nonflavanoid phenols                0.290      0.364      0.448
Proanthocyanins                     1.899      1.630      1.154
Color intensity                     5.528      3.087      7.396
Hue                                 1.062      1.056      0.683
OD280/OD315 of diluted wines        3.158      2.785      1.684
Proline                          1115.712    519.507    629.896

Variances

Dimension                         Class 1    Class 2    Class 3
Alcohol                             0.214      0.289      0.281
Malic Acid                          0.474      1.031      1.184
Ash                                 0.052      0.100      0.034
Alcalinity of ash                   6.484     11.221      5.099
Magnesium                         110.228    280.680    118.602
Total Phenols                       0.115      0.297      0.127
Flavanoids                          0.158      0.498      0.086
Nonflavanoid phenols                0.005      0.015      0.015
Proanthocyanins                     0.170      0.362      0.167
Color intensity                     1.534      0.855      5.340
Hue                                 0.014      0.041      0.013
OD280/OD315 of diluted wines        0.128      0.247      0.074
Proline                         49071.450  24715.368  13247.329

Standard Deviation

Dimension                         Class 1    Class 2    Class 3
Alcohol                             0.462      0.538      0.530
Malic Acid                          0.689      1.016      1.088
Ash                                 0.227      0.315      0.185
Alcalinity of ash                   2.546      3.350      2.258
Magnesium                          10.499     16.753     10.890
Total Phenols                       0.339      0.545      0.357
Flavanoids                          0.397      0.706      0.294
Nonflavanoid phenols                0.070      0.124      0.124
Proanthocyanins                     0.412      0.602      0.409
Color intensity                     1.239      0.925      2.311
Hue                                 0.116      0.203      0.114
OD280/OD315 of diluted wines        0.357      0.497      0.272
Proline                           221.521    157.211    115.097

Averages (normalised)

Dimension                         Class 1    Class 2    Class 3
Alcohol                             0.917     -0.889      0.022
Malic Acid                         -0.292     -0.361      0.864
Ash                                 0.325     -0.444      0.243
Alcalinity of ash                  -0.736      0.223      0.655
Magnesium                           0.462     -0.364     -0.124
Total Phenols                       0.871     -0.058     -0.772
Flavanoids                          0.954      0.052     -0.958
Nonflavanoid phenols               -0.577      0.015      0.615
Proanthocyanins                     0.539      0.069     -0.551
Color intensity                     0.203     -0.850      0.685
Hue                                 0.458      0.432     -1.054
OD280/OD315 of diluted wines        0.769      0.245     -0.999
Proline                             1.171     -0.722     -0.465

Variance (normalised)

Dimension                         Class 1    Class 2    Class 3
Alcohol                             0.324      0.439      0.427
Malic Acid                          0.380      0.826      0.948
Ash                                 0.686      1.322      0.453
Alcalinity of ash                   0.581      1.006      0.457
Magnesium                           0.540      1.376      0.581
Total Phenols                       0.293      0.759      0.325
Flavanoids                          0.158      0.499      0.086
Nonflavanoid phenols                0.317      0.992      0.995
Proanthocyanins                     0.518      1.107      0.510
Color intensity                     0.285      0.159      0.994
Hue                                 0.260      0.788      0.251
OD280/OD315 of diluted wines        0.253      0.489      0.147
Proline                             0.495      0.249      0.134

Once the data set had been normalised it was ready to be read into the program that performs the PCA, which was written in Java. The normalised data was read into the program as a CSV file. Once the data set had been created in the program as a Java object, a design matrix containing the data set was formed. The design matrix has the dimensions along one axis and the data point entries along the other, so it holds the value of every dimension for each point in the data set. Next, a covariance matrix must be produced.
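As a rough illustration of the normalisation recapped above (subtract the average of a dimension, then divide by its standard deviation), the following Java sketch standardises a single column of values. It is not the program used for this assignment: the class and method names are hypothetical, the input values are made up, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which form the spreadsheet used.

```java
import java.util.Arrays;

final class ZScore {

    /** Standardises one column: subtract its mean, then divide by its standard deviation. */
    static double[] standardise(double[] column) {
        int n = column.length;
        double mean = Arrays.stream(column).average().orElse(0.0);

        // Sum of squared deviations from the mean.
        double sumSq = 0.0;
        for (double v : column) {
            sumSq += (v - mean) * (v - mean);
        }
        // Assumption: sample standard deviation (division by n - 1).
        double sd = Math.sqrt(sumSq / (n - 1));

        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = (column[i] - mean) / sd;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] toy = {2.0, 4.0, 6.0}; // made-up values, not wine data
        System.out.println(Arrays.toString(standardise(toy))); // prints [-1.0, 0.0, 1.0]
    }
}
```

Applying `standardise` to each of the 13 columns in turn would give a data set in which every column has an average of 0 and a variance of 1. The sketch does not guard against a column with fewer than two values or with zero spread.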
As the average of each dimension was made zero in the preprocessing, the covariance equation can be simplified: the mean terms drop out, leaving only sums of products of the data values. This means the covariance matrix can be found directly from the product of the design matrix with its own transpose, scaled by a constant that depends on the number of data points.

From here, due to the complexity of eigendecomposition on high-dimensional systems, the external library Jama was used to compute the eigendecomposition of the covariance matrix. This supplied both a matrix made up of the eigenvectors and a list of the corresponding eigenvalues.

By looking at the size of the eigenvalues, we can find the eigenvectors whose projections capture a significant amount of the variation in far fewer dimensions.

[Figure: eigenvalue of each eigenvector]

Whilst this graph shows the differences in the eigenvalues and gives an idea of the significance of their eigenvectors in projection, summing these values and dividing each by the total gives a graph showing the percentage of the variation that each of the corresponding eigenvectors describes.

[Figure: percentage of variation described by each eigenvector]

From this we can see that over 35% of the variation is described by the first eigenvector projection and slightly under 20% by the second. When choosing the number of dimensions to reduce the data to, it can be helpful to use a cumulative chart.

[Figure: cumulative percentage of variation]

From here we see that using just the first two eigenvector projections we capture around 55% of the variation in the entire data set. Performing these projections on the data set and plotting the data points on an XY graph in this new plane will hopefully distinguish the classes of wine from each other in just two dimensions. The selected eigenvectors, with the dimension each of their values represents, are shown below. Due to the way the eigendecomposition outputs the eigenvectors, the most significant principal component becomes the y axis and the next the x axis in this particular process.
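The covariance and percentage-of-variation steps described above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the assignment's code: the class and method names are hypothetical, the design matrix is assumed to hold one data point per row, the covariance is assumed to be the sample form C = XᵀX / (n - 1), and the eigenvalues passed in are made-up numbers. The eigendecomposition itself is left out, since the text delegates it to the Jama library.

```java
import java.util.Arrays;

final class CovarianceSketch {

    /**
     * Covariance matrix of zero-mean data.
     * Assumed layout: x[i][j] is dimension j of data point i.
     * Assumption: sample covariance (division by n - 1).
     */
    static double[][] covariance(double[][] x) {
        int n = x.length;
        int d = x[0].length;
        double[][] c = new double[d][d];
        for (int j = 0; j < d; j++) {
            for (int k = 0; k < d; k++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    sum += x[i][j] * x[i][k]; // no mean term: columns already average to 0
                }
                c[j][k] = sum / (n - 1);
            }
        }
        return c;
    }

    /** Share of the total variation described by each eigenvalue (shares sum to 1). */
    static double[] explainedShare(double[] eigenvalues) {
        double total = 0.0;
        for (double e : eigenvalues) {
            total += e;
        }
        double[] share = new double[eigenvalues.length];
        for (int i = 0; i < share.length; i++) {
            share[i] = eigenvalues[i] / total;
        }
        return share;
    }

    /** Running total of the shares, as plotted in a cumulative chart. */
    static double[] cumulative(double[] share) {
        double[] out = new double[share.length];
        double running = 0.0;
        for (int i = 0; i < share.length; i++) {
            running += share[i];
            out[i] = running;
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] toy = {{-1.0, -1.0}, {0.0, 0.0}, {1.0, 1.0}}; // zero-mean toy data
        double[][] c = covariance(toy);
        System.out.println(c[0][0] + " " + c[0][1]); // prints 1.0 1.0

        double[] share = explainedShare(new double[]{3.0, 1.0}); // made-up eigenvalues
        System.out.println(Arrays.toString(cumulative(share))); // prints [0.75, 1.0]
    }
}
```

With the 13 eigenvalues of the wine covariance matrix, sorted from largest to smallest, `cumulative(explainedShare(...))` would give the values behind a cumulative chart of the kind the text describes.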
Dimension                         X (PC2)    Y (PC1)
Alcohol                             0.484      0.144
Malic Acid                          0.225      0.245
Ash                                 0.316      0.002
Alcalinity of ash                   0.011      0.239
Magnesium                           0.300      0.142
Total Phenols                       0.065      0.395
Flavanoids                          0.003      0.423
Nonflavanoid phenols                0.029      0.299
Proanthocyanins                     0.039      0.313
Color intensity                     0.530      0.089
Hue                                 0.279      0.297
OD280/OD315 of diluted wines        0.164      0.376
Proline                             0.365      0.287

The data set, in this case the design matrix, is then multiplied by the transpose of these eigenvectors to produce the projected axes.

[Figure: data points projected onto PC2 (x axis) and PC1 (y axis)]

As can be seen, the principal component analysis does a fairly good job of separating the classes in two dimensions, with only a few data points overlapping into different class clusters. Looking at the data, class 2 seems to have the loosest grouping in these dimensions, so it is reasonable to assume that the overlap is due to class 2 drifting into the tighter groupings of classes 1 and 3. The new variances of the data set in these axes are as follows.

             X (PC2)    Y (PC1)        Sum
Class 1        0.594      0.647      1.241
Class 2        0.672      1.565      2.236
Class 3        0.879      0.416      1.295

When the two axes are combined, class 2 has the highest variance; however, class 3's PC2 variance is higher than class 2's.

Principal Component Makeup

To see what these principal component axes are made up of, we can look at the eigenvectors that created them. By taking the absolute value of each axis contribution and dividing each by the sum of these values, the percentage contribution of each old axis to the new dimensions can be found.
These are listed below.

X (PC2)                                   Y (PC1)
Color intensity                 18.86%    Flavanoids                      13.01%
Alcohol                         17.21%    Total Phenols                   12.14%
Proline                         12.99%    OD280/OD315 of diluted wines    11.57%
Ash                             11.25%    Proanthocyanins                  9.64%
Magnesium                       10.66%    Nonflavanoid phenols             9.18%
Hue                              9.94%    Hue                              9.13%
Malic Acid                       8.00%    Proline                          8.82%
OD280/OD315 of diluted wines     5.85%    Malic Acid                       7.54%
Total Phenols                    2.31%    Alcalinity of ash                7.36%
Proanthocyanins                  1.40%    Alcohol                          4.44%
Nonflavanoid phenols             1.02%    Magnesium                        4.37%
Alcalinity of ash                0.38%    Color intensity                  2.73%
Flavanoids                       0.12%    Ash                              0.06%

As can be seen, many of the original dimensions contribute significantly to the new axes, and no single dimension dominates either of them. 49 percent of PC2's contributions come from Color intensity, Alcohol and Proline. PC1's contributions are more evenly spread, with Flavanoids, Total Phenols and 'OD280/OD315 of diluted wines' providing the largest.
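The percentage-contribution calculation, and the projection of a data point onto an eigenvector, can be sketched in Java as follows. The class and method names are hypothetical and this is not the assignment's code. The PC2 values are copied from the eigenvector table as printed there, to three decimals.

```java
final class ContributionSketch {

    /** Percentage contribution of each original dimension: 100 * |v_i| / sum of |v_j|. */
    static double[] percentContribution(double[] eigenvector) {
        double total = 0.0;
        for (double v : eigenvector) {
            total += Math.abs(v);
        }
        double[] pct = new double[eigenvector.length];
        for (int i = 0; i < pct.length; i++) {
            pct[i] = 100.0 * Math.abs(eigenvector[i]) / total;
        }
        return pct;
    }

    /** Coordinate of one normalised data point along one eigenvector (a dot product). */
    static double project(double[] point, double[] eigenvector) {
        double sum = 0.0;
        for (int i = 0; i < point.length; i++) {
            sum += point[i] * eigenvector[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // PC2 column of the eigenvector table, in the order Alcohol ... Proline.
        double[] pc2 = {0.484, 0.225, 0.316, 0.011, 0.300, 0.065, 0.003,
                        0.029, 0.039, 0.530, 0.279, 0.164, 0.365};
        double[] pct = percentContribution(pc2);
        // Index 9 is Color intensity.
        System.out.println("Color intensity share of PC2 (%): " + pct[9]);
    }
}
```

Run on the printed PC2 values, this gives roughly 18.86% for Color intensity, in line with the table. Some other entries differ in the last digit (Alcohol comes out near 17.22% rather than 17.21%), presumably because the table values are rounded to three decimals.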