2022 Higher Education Press Cup China Undergraduate Mathematical Contest in Modeling (CUMCM), Problem C, Question 2(2): Python Code

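- 2.2 For each category, select suitable chemical components to divide it into subclasses; give the specific division method and results, and analyze the reasonableness and sensitivity of the classification results.
  - Unsupervised variable selection (selecting chemical components before clustering)
    - Variance threshold filtering
    - Laplacian score
  - Clustering
    - K-means clustering
    - Hierarchical clustering
    - Gaussian mixture model (GMM)
    - Leiden clustering
    - Affinity Propagation
    - Agglomerative Clustering
    - BIRCH clustering
    - OPTICS clustering
    - Spectral Clustering
  - Supervised variable selection (selecting chemical components after clustering)
    - Random forest feature importance
    - Univariate feature selection: F-test
    - Recursive feature elimination (RFE)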

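2.2 For each category, select suitable chemical components to divide it into subclasses; give the specific division method and results, and analyze the reasonableness and sensitivity of the classification results.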

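import numpy as np

# d12 is assumed to be the DataFrame built in an earlier part of this series;
# '类型' is its glass-type column and columns 6:20 hold the 14 chemical components.
d12_K = d12.iloc[np.where((d12['类型'] == '高钾'))[0], 6:20]   # high-potassium glass
d12_Pb = d12.iloc[np.where((d12['类型'] == '铅钡'))[0], 6:20]  # lead-barium glass
print(d12_K.shape)
print(d12_Pb.shape)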

(18, 14)
(49, 14)

Lead-barium glass

# Min-Max Normalization
import pandas as pd
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
# Scale every component column to [0, 1] and keep the original column names
d12_Pb_norm = pd.DataFrame(min_max_scaler.fit_transform(d12_Pb), columns=list(d12_Pb.columns))
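
To make the scaling step concrete, here is a minimal sketch on a toy table. The column names and values are hypothetical and are not contest data. It only checks that `MinMaxScaler` matches the formula (x - min) / (max - min) applied column by column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy composition table (hypothetical values, not contest data)
toy = pd.DataFrame({"SiO2": [20.0, 35.0, 50.0], "PbO": [40.0, 25.0, 10.0]})

# Manual min-max scaling: (x - min) / (max - min), column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation via scikit-learn, wrapped back into a DataFrame
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Both routes should agree up to floating-point rounding
assert np.allclose(manual.to_numpy(), scaled.to_numpy())
print(scaled)
```

Each column is scaled independently, so after this step every component spans exactly [0, 1] regardless of its original concentration range.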

Unsupervised variable selection (selecting chemical components before clustering)

https://datascience.stackexchange.com/questions/29572/is-it-possible-to-do-feature-selection-for-unsupervised-machine-learning-problem


Variance threshold filtering

https://scikit-learn.org/stable/modules/feature_selection.html

from sklearn.feature_selection import VarianceThreshold

def variance_threshold_selector(data, threshold=0.5):
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]

d12_Pb_norm_var = variance_threshold_selector(d12_Pb_norm, 0.01)
print(d12_Pb_norm_var.shape)
d12_Pb_norm_var.head()

(49, 14)

	二氧化硅(SiO2)	氧化钾(K2O)	氧化钙(CaO)	氧化镁(MgO)	氧化铝(Al2O3)	氧化铁(Fe2O3)	氧化铜(CuO)	氧化铅(PbO)	氧化钡(BaO)	五氧化二磷(P2O5)	氧化锶(SrO)	二氧化硫(SO2)
0	0.453545	0.744681	0.365625	0.432234	0.380130	0.405229	0.024598	0.626006	0.000000	0.252654	0.169643	0.000000
1	0.228723	0.000000	0.231250	0.000000	0.064075	0.000000	0.984863	0.318174	0.880959	0.254069	0.330357	0.161755
2	0.012397	0.000000	0.498437	0.000000	0.047516	0.000000	0.297067	0.380069	0.863752	0.535032	0.473214	0.942320
3	0.416075	0.148936	0.548437	0.260073	0.161267	0.000000	0.466414	0.264160	0.412130	0.663836	0.330357	0.000000
4	0.361053	0.000000	0.457813	0.216117	0.224622	0.289760	0.332072	0.550320	0.150917	0.624912	0.169643	0.000000
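
To see which columns a given threshold keeps or drops, the fitted selector can be inspected directly. The sketch below uses synthetic data with hypothetical column names (`spread`, `near_constant`), not the contest table. Note that for values scaled to [0, 1] the variance cannot exceed 0.25, so a small threshold such as 0.01 is the kind of value that makes sense here.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    # Varies over the whole [0, 1] range
    "spread": rng.uniform(0.0, 1.0, size=200),
    # Almost always zero: a single 1.0 among 199 zeros
    "near_constant": np.r_[np.zeros(199), 1.0],
})

selector = VarianceThreshold(threshold=0.01).fit(toy)

# Per-column variances computed by the selector, and the resulting split
variances = pd.Series(selector.variances_, index=toy.columns)
mask = selector.get_support()
kept = toy.columns[mask]
dropped = toy.columns[~mask]

print(variances)
print("kept:", list(kept))
print("dropped:", list(dropped))
```

Applying the same inspection to `d12_Pb_norm` would list the components that fall below the 0.01 threshold.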

Laplacian score

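Laplacian Score is an algorithm that scores the features of a training set. It assigns a score to every feature, and the k features with the lowest scores are then kept as the selected feature subset. It is a standard filter method.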

https://github.com/ChiHangChen/LaplacianScore
https://jundongl.github.io/scikit-feature/tutorial.html

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

def cal_lap_score(features, D, L):
    features_ = features - np.sum((features @ D) / np.sum(D))
    L_score = (features_ @ L @ features_) / (features_ @ D @ features_)
    return L_score

def get_k_nearest(dist, k, sample_index):
    # dist is zero means it is the sample itself
    return sorted(range(len(dist)), key=lambda i: dist[i] if i != sample_index else np.inf)[:k] + [sample_index]

def laplacian_score(df_arr, label=None, **kwargs):
    kwargs.setdefault("k_nearest", 5)

    '''Construct distance matrix, dist_matrix, using euclidean distance'''
    distances = pdist(df_arr, metric='euclidean')
    dist_matrix = squareform(distances)
    del distances
    '''Determine the edge of each sample pairs by k nearest neighbor'''
    edge_sparse_matrix = pd.DataFrame(np.zeros((df_arr.shape[0], df_arr.shape[0]
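
For reference, here is a compact, self-contained sketch of the scoring idea described above. It is not the author's code or the code from the linked repositories. It assumes a commonly used formulation: a symmetrised k-nearest-neighbour graph with heat-kernel edge weights. The function name `laplacian_score_sketch`, the kernel width `t`, and the toy data are all illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def laplacian_score_sketch(X, k_nearest=5, t=1.0):
    """Return one Laplacian score per column of X (lower means the feature
    varies more smoothly over the neighbourhood graph)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    dist = squareform(pdist(X, metric="euclidean"))

    # k-nearest-neighbour graph; the infinite diagonal excludes each sample itself
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k_nearest]
    adjacency = np.zeros((n, n), dtype=bool)
    adjacency[np.repeat(np.arange(n), k_nearest), neighbours.ravel()] = True
    adjacency |= adjacency.T  # an edge exists if either sample lists the other

    # Heat-kernel weights on edges (t is an assumed kernel width), zero elsewhere
    S = np.where(adjacency, np.exp(-dist ** 2 / t), 0.0)
    d = S.sum(axis=1)        # node degrees, i.e. the diagonal of D
    L = np.diag(d) - S       # graph Laplacian L = D - S

    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        # Remove the degree-weighted mean; assumes the feature is not constant
        f_tilde = f - (f @ d) / d.sum()
        scores[r] = (f_tilde @ L @ f_tilde) / (f_tilde @ (d * f_tilde))
    return scores

# Toy data (hypothetical): column 0 follows a two-group structure, column 1 is noise
rng = np.random.default_rng(0)
cluster = np.repeat([0.0, 1.0], 20)
X_toy = np.column_stack([
    cluster + rng.normal(0.0, 0.01, size=40),
    rng.uniform(0.0, 1.0, size=40),
])
scores = laplacian_score_sketch(X_toy, k_nearest=5)
print(scores)  # the structured column is expected to score lower than the noise column
```

To select features, sort the scores in ascending order and keep the first k columns, for example with `np.argsort(scores)[:k]`.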
