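Contents
- 2.2 For each glass category, select suitable chemical components to divide it into subclasses, give the division method and the results, and analyze the reasonableness and sensitivity of the classification.
- Unsupervised variable selection (selecting chemical components before clustering)
- Variance threshold filtering
- Laplacian score
- Clustering
- K-means clustering
- Hierarchical clustering
- Gaussian mixture model (GMM)
- Leiden clustering
- Affinity Propagation
- Agglomerative Clustering
- BIRCH clustering
- OPTICS clustering
- Spectral Clustering
- Supervised variable selection (selecting chemical components after clustering)
- Random forest feature importance
- Univariate feature selection: F-test
- Recursive feature elimination (RFE)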
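2.2 For each glass category, select suitable chemical components to divide it into subclasses, give the division method and the results, and analyze the reasonableness and sensitivity of the classification.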
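import numpy as np

# Columns 6:20 are the 14 chemical components; split the samples by glass type
# ('类型' = type, '高钾' = high-potassium, '铅钡' = lead-barium).
d12_K = d12.iloc[np.where((d12['类型'] == '高钾'))[0],6:20]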
d12_Pb = d12.iloc[np.where((d12['类型'] == '铅钡'))[0],6:20]
print(d12_K.shape)
print(d12_Pb.shape)
(18, 14)
(49, 14)
Lead-barium glass
# Min-Max Normalization
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
d12_Pb_norm = pd.DataFrame(min_max_scaler.fit_transform(d12_Pb), columns=list(d12_Pb.columns))
Unsupervised variable selection (selecting chemical components before clustering)
https://datascience.stackexchange.com/questions/29572/is-it-possible-to-do-feature-selection-for-unsupervised-machine-learning-problem
Variance threshold filtering
https://scikit-learn.org/stable/modules/feature_selection.html
from sklearn.feature_selection import VarianceThreshold

def variance_threshold_selector(data, threshold=0.5):
    # Keep only the columns whose variance exceeds the threshold
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]

d12_Pb_norm_var = variance_threshold_selector(d12_Pb_norm, 0.01)
print(d12_Pb_norm_var.shape)
d12_Pb_norm_var.head()
(49, 14)
| | 二氧化硅(SiO2) | 氧化钠(Na2O) | 氧化钾(K2O) | 氧化钙(CaO) | 氧化镁(MgO) | 氧化铝(Al2O3) | 氧化铁(Fe2O3) | 氧化铜(CuO) | 氧化铅(PbO) | 氧化钡(BaO) | 五氧化二磷(P2O5) | 氧化锶(SrO) | 氧化锡(SnO2) | 二氧化硫(SO2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.453545 | 0.0 | 0.744681 | 0.365625 | 0.432234 | 0.380130 | 0.405229 | 0.024598 | 0.626006 | 0.000000 | 0.252654 | 0.169643 | 0.0 | 0.000000 |
| 1 | 0.228723 | 0.0 | 0.000000 | 0.231250 | 0.000000 | 0.064075 | 0.000000 | 0.984863 | 0.318174 | 0.880959 | 0.254069 | 0.330357 | 0.0 | 0.161755 |
| 2 | 0.012397 | 0.0 | 0.000000 | 0.498437 | 0.000000 | 0.047516 | 0.000000 | 0.297067 | 0.380069 | 0.863752 | 0.535032 | 0.473214 | 0.0 | 0.942320 |
| 3 | 0.416075 | 0.0 | 0.148936 | 0.548437 | 0.260073 | 0.161267 | 0.000000 | 0.466414 | 0.264160 | 0.412130 | 0.663836 | 0.330357 | 0.0 | 0.000000 |
| 4 | 0.361053 | 0.0 | 0.000000 | 0.457813 | 0.216117 | 0.224622 | 0.289760 | 0.332072 | 0.550320 | 0.150917 | 0.624912 | 0.169643 | 0.0 | 0.000000 |
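A note on choosing the threshold for min-max scaled data, with a minimal sketch on hypothetical toy data (the column names and values below are illustrative assumptions, not taken from the glass dataset). A feature confined to [0, 1] has variance of at most 0.25, so the helper's default `threshold=0.5` would reject every column. Conversely, a non-constant scaled column always contains a 0 and a 1, so over n samples its population variance is at least 0.5/n. For 49 samples that is about 0.0102, which would explain why a threshold of 0.01 keeps all 14 columns.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data (not from the glass dataset).
toy = pd.DataFrame({
    "balanced": [0.0, 10.0, 0.0, 10.0, 0.0, 10.0],
    "one_outlier": [1.0, 1.0, 1.0, 1.0, 1.0, 7.0],
    "constant": [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
})
toy_norm = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Population variance (ddof=0) of each scaled column.
variances = toy_norm.var(ddof=0)

def kept_columns(data, threshold):
    """Return the names of the columns that pass the variance filter."""
    selector = VarianceThreshold(threshold=threshold).fit(data)
    return list(data.columns[selector.get_support()])

# A small threshold only removes the constant column.
kept_low = kept_columns(toy_norm, 0.01)

# No [0, 1] feature can reach a variance of 0.5, so nothing passes;
# scikit-learn reports that case by raising a ValueError.
try:
    kept_high = kept_columns(toy_norm, 0.5)
except ValueError:
    kept_high = []

# Lowest possible variance of a non-constant scaled column with 49 samples:
# one value at 0, one at 1, all others at 0.5.
worst_case = np.array([0.0, 1.0] + [0.5] * 47)
min_variance = worst_case.var()  # population variance, equal to 0.5 / 49

print(variances.round(4).to_dict())
print(kept_low, kept_high, round(min_variance, 4))
```

If the goal is to actually discard low-information components, one option is to apply the variance filter to the unscaled data or to pick a threshold relative to the observed variances; which is appropriate depends on the analysis.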
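Laplacian score
Laplacian Score is an algorithm that scores each feature of a training set. Every feature receives a score, and the k features with the lowest scores are kept as the final feature subset. It is a standard filter-style method.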
- https://github.com/ChiHangChen/LaplacianScore
- https://jundongl.github.io/scikit-feature/tutorial.html
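For reference, the score can be written as follows. This is a sketch of the commonly stated form, and the notation is an assumption rather than something given in this notebook: f_r is the column of values of feature r across the samples, S is the weight matrix of the k-nearest-neighbour graph (the weighting scheme is not specified here), D is the diagonal degree matrix, and L is the graph Laplacian. The expression matches what the `cal_lap_score` helper in the code that follows computes from `features`, `D` and `L`.

```latex
\[
\tilde{f}_r = f_r - \frac{f_r^{\top} D \mathbf{1}}{\mathbf{1}^{\top} D \mathbf{1}}\,\mathbf{1},
\qquad
\mathrm{LS}_r = \frac{\tilde{f}_r^{\top} L\, \tilde{f}_r}{\tilde{f}_r^{\top} D\, \tilde{f}_r},
\qquad
L = D - S,\quad D_{ii} = \sum_{j} S_{ij}
\]
```

A small score means the feature varies little between neighbouring samples relative to its overall weighted variance, which is why the lowest-scoring features are the ones retained.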
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

def cal_lap_score(features, D, L):
    # Remove the degree-weighted mean, then take the Rayleigh-type quotient
    features_ = features - np.sum((features @ D) / np.sum(D))
    L_score = (features_ @ L @ features_) / (features_ @ D @ features_)
    return L_score

def get_k_nearest(dist, k, sample_index):
    # dist is zero means it is the sample itself
    return sorted(range(len(dist)), key=lambda i: dist[i] if i != sample_index else np.inf)[:k] + [sample_index]

def laplacian_score(df_arr, label=None, **kwargs):
    kwargs.setdefault("k_nearest", 5)
    # Construct distance matrix, dist_matrix, using euclidean distance
    distances = pdist(df_arr, metric='euclidean')
    dist_matrix = squareform(distances)
    del distances
    # Determine the edge of each sample pairs by k nearest neighbor
    edge_sparse_matrix = pd.DataFrame(np.zeros((df_arr.shape[0], df_arr.shape[0]