Machine Learning / Data Analysis: A Plain-Language Introduction to Random Forests, with a Weather Classification Example (Accuracy 0.92)

🍨 This post is a learning log from the 🔗365天深度学习训练营 (365-Day Deep Learning Training Camp)
🍖 Original author: K同学啊

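Preface

Machine learning underpins both deep learning and data analysis. This series covers common machine learning algorithms, each with a worked case study.
Note: machine learning is also used heavily in mathematical modeling competitions, so it is worth learning alongside them.
Bookmarks, likes, and follows are welcome.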

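Contents

1. Introduction
- 1. Bagging ensemble learning
- 2. Introduction to random forests
2. Case study: classifying weather types
- 1. Loading the data
- 2. Data checks and preprocessing
- 3. Data analysis
- 4. Building the model
  - 1. Label encoding
  - 2. Model creation
- 5. Model prediction and evaluation
- 6. Feature importance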

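1. Introduction

1. Bagging ensemble learning

The core idea of Bagging: draw N random subsets from the dataset, train one model on each subset, and let all the models vote on the final result.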

As a diagram:

[Figure: schematic of the Bagging ensemble]
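
To make the train-then-vote idea concrete, here is a minimal pure-Python sketch. It is not how scikit-learn implements bagging: the base model is a hypothetical one-feature threshold classifier (`fit_stump`), and the toy data is made up for illustration.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-feature threshold classifier: midpoint between the two class means."""
    xs0 = [x for x, label in points if label == 0]
    xs1 = [x for x, label in points if label == 1]
    if not xs0 or not xs1:
        # Degenerate sample containing a single class: always predict that class.
        only = 0 if xs0 else 1
        return lambda x: only
    mean0 = sum(xs0) / len(xs0)
    mean1 = sum(xs1) / len(xs1)
    threshold = (mean0 + mean1) / 2
    high_class = 1 if mean1 > mean0 else 0   # class found above the threshold
    return lambda x: high_class if x > threshold else 1 - high_class

def bagging_fit(points, n_models, rng):
    """Train n_models stumps, each on a sample drawn with replacement."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in range(len(points))]
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Majority vote over the individual model predictions."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Toy data (assumption): class 0 clustered near 0, class 1 clustered near 10.
toy = [(rng.gauss(0, 1), 0) for _ in range(50)] + [(rng.gauss(10, 1), 1) for _ in range(50)]
models = bagging_fit(toy, n_models=25, rng=rng)
print(bagging_predict(models, -0.5), bagging_predict(models, 9.5))
```

Using an odd number of models avoids tied votes in the two-class case.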

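Bootstrap sampling:

Bootstrap sampling draws with replacement. Given a dataset of m samples, pick one sample at random, put it back so that it can be picked again, and repeat this m times. For large m, about 63.2% of the distinct samples end up in one such draw; the figure comes from the derivation that follows.

Derivation:

Each draw selects a given sample with probability 1/m, so the probability that this sample is never selected in m draws is:

$(1-\frac1m)^m$

As $m \to \infty$ this tends to $\frac1e \approx 0.368$. Subtracting from 1 gives $1-\frac1e \approx 0.632$, the expected fraction of samples selected at least once.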
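
The 63.2% figure can be checked numerically. The sketch below is only an illustration: the function name `bootstrap_unique_fraction`, the sample size, and the seed are arbitrary choices, not part of the original post.

```python
import math
import random

def bootstrap_unique_fraction(m, rng):
    """Draw m indices with replacement from range(m); return the fraction of distinct indices."""
    drawn = {rng.randrange(m) for _ in range(m)}
    return len(drawn) / m

rng = random.Random(42)
m = 10_000
simulated = bootstrap_unique_fraction(m, rng)
exact = 1 - (1 - 1 / m) ** m   # probability that a given sample is drawn at least once
limit = 1 - 1 / math.e         # value of the expression above as m grows without bound

print(f"simulated: {simulated:.3f}, exact: {exact:.3f}, limit: {limit:.3f}")
```

All three numbers should sit close to 0.632. The roughly 36.8% of samples left out of each draw are what is usually called the out-of-bag data.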

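2. Introduction to random forests

A random forest is an ensemble learning algorithm: it combines many decision trees following the Bagging idea and takes a vote over the trees' predictions. It is well suited to complex classification problems. The overall structure looks like this:

[Figure: overall structure of a random forest]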

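2. Case study: classifying weather types

Overview: this project uses a synthetic weather dataset that simulates four weather types: rainy, sunny, cloudy, and snowy.

Task: analyze the data and build a random forest model that predicts the weather type.

1. Loading the data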

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('weather_classification_data.csv')
data

	Temperature	Humidity	Wind Speed	Precipitation (%)	Cloud Cover	Atmospheric Pressure	UV Index	Season	Visibility (km)	Location	Weather Type
0	14.0	73	9.5	82.0	partly cloudy	1010.82	2	Winter	3.5	inland	Rainy
1	39.0	96	8.5	71.0	partly cloudy	1011.43	7	Spring	10.0	inland	Cloudy
2	30.0	64	7.0	16.0	clear	1018.72	5	Spring	5.5	mountain	Sunny
3	38.0	83	1.5	82.0	clear	1026.25	7	Spring	1.0	coastal	Sunny
4	27.0	74	17.0	66.0	overcast	990.67	1	Winter	2.5	mountain	Rainy
...	...	...	...	...	...	...	...	...	...	...	...
13195	10.0	74	14.5	71.0	overcast	1003.15	1	Summer	1.0	mountain	Rainy
13196	-1.0	76	3.5	23.0	cloudy	1067.23	1	Winter	6.0	coastal	Snowy
13197	30.0	77	5.5	28.0	overcast	1012.69	3	Autumn	9.0	coastal	Cloudy
13198	3.0	76	10.0	94.0	overcast	984.27	0	Winter	2.0	inland	Snowy
13199	-5.0	38	0.0	92.0	overcast	1015.37	5	Autumn	10.0	mountain	Rainy

13200 rows × 11 columns

# Rename the columns to Chinese labels, in the same order as the original English headers
names = ['温度', '湿度', '风速', '降水量(%)', '云量', '气压', '紫外线指数', '季节' ,'能见度(km)', '地点', '天气类型']
data.columns = names
data

	温度	湿度	风速	降水量(%)	云量	气压	紫外线指数	季节	能见度(km)	地点	天气类型
0	14.0	73	9.5	82.0	partly cloudy	1010.82	2	Winter	3.5	inland	Rainy
1	39.0	96	8.5	71.0	partly cloudy	1011.43	7	Spring	10.0	inland	Cloudy
2	30.0	64	7.0	16.0	clear	1018.72	5	Spring	5.5	mountain	Sunny
3	38.0	83	1.5	82.0	clear	1026.25	7	Spring	1.0	coastal	Sunny
4	27.0	74	17.0	66.0	overcast	990.67	1	Winter	2.5	mountain	Rainy
...	...	...	...	...	...	...	...	...	...	...	...
13195	10.0	74	14.5	71.0	overcast	1003.15	1	Summer	1.0	mountain	Rainy
13196	-1.0	76	3.5	23.0	cloudy	1067.23	1	Winter	6.0	coastal	Snowy
13197	30.0	77	5.5	28.0	overcast	1012.69	3	Autumn	9.0	coastal	Cloudy
13198	3.0	76	10.0	94.0	overcast	984.27	0	Winter	2.0	inland	Snowy
13199	-5.0	38	0.0	92.0	overcast	1015.37	5	Autumn	10.0	mountain	Rainy

13200 rows × 11 columns

2. Data checks and preprocessing

# Check for missing values
data.isnull().sum()

温度         0
湿度         0
风速         0
降水量(%)     0
云量         0
气压         0
紫外线指数      0
季节         0
能见度(km)    0
地点         0
天气类型       0
dtype: int64

# Inspect column dtypes and non-null counts
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13200 entries, 0 to 13199
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   温度       13200 non-null  float64
 1   湿度       13200 non-null  int64
 2   风速       13200 non-null  float64
 3   降水量(%)   13200 non-null  float64
 4   云量       13200 non-null  object
 5   气压       13200 non-null  float64
 6   紫外线指数    13200 non-null  int64
 7   季节       13200 non-null  object
 8   能见度(km)  13200 non-null  float64
 9   地点       13200 non-null  object
 10  天气类型     13200 non-null  object
dtypes: float64(5), int64(2), object(4)
memory usage: 1.1+ MB

# List the distinct values of each categorical column: cloud cover, season, location, weather type
columns = ['云量', '季节', '地点', '天气类型']
for i in columns:
    print(f'{i}')
    print(data[i].unique())
    print('*' * 50)

云量
['partly cloudy' 'clear' 'overcast' 'cloudy']
**************************************************
季节
['Winter' 'Spring' 'Summer' 'Autumn']
**************************************************
地点
['inland' 'mountain' 'coastal']
**************************************************
天气类型
['Rainy' 'Cloudy' 'Sunny' 'Snowy']
**************************************************

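Analysis:

Cloud cover (云量): four categories
Season (季节): four categories
Location (地点): three categories
Weather type (天气类型): four categories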

# Box plots to check each numeric column for outliers
import seaborn as sns
from pylab import mpl

# Font settings
mpl.rcParams["font.sans-serif"] = ["SimHei"]  # render Chinese characters
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly

# Column name -> label used in the plot title
feature_map = {
    '温度': '温度',
    '湿度': '湿度百分比',
    '风速': '风速',
    '降水量(%)': '降水量百分比',
    '气压': '大气压力',
    '紫外线指数': '紫外线指数',
    '能见度(km)': '能见度'
}

plt.figure(figsize=(15, 10))
for i, (col, col_name) in enumerate(feature_map.items(), 1):  # the 1 makes the index start at 1
    plt.subplot(2, 4, i)
    sns.boxplot(y=data[col])
    plt.title(f'{col_name}的箱线图', fontsize=14)
    plt.ylabel('数值', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()   # adjust subplot spacing automatically
plt.show()

[Figure: box plots of the seven numeric features]

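Outlier analysis

Temperature (温度): values above 60 are physically implausible, so they are removed
Humidity (湿度): some values exceed 100%, so they are removed
Wind speed (风速): wind speed depends on many factors, so it is left as is
Precipitation (降水量): some values exceed 100%, so they are removed
Atmospheric pressure (气压): pressure depends on many factors, such as high altitude, so it is left as is
Visibility (能见度): may be affected by haze or the rainy season, so it is left as is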
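
For context on what the box plots mark as outliers: by default seaborn draws the whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and plots anything outside as individual points. The sketch below reproduces that rule with the standard library only. The function name `iqr_bounds` and the readings are made up for illustration; note that the cleaning step in this post filters by physical limits (for example, percentages above 100) rather than by the IQR fences.

```python
import statistics

def iqr_bounds(values, whisker=1.5):
    """Return the (lower, upper) fences: Q1 - whisker*IQR and Q3 + whisker*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr

# Made-up humidity-like readings; 109 is an impossible percentage.
readings = [55, 60, 62, 65, 68, 70, 72, 75, 78, 80, 109]
low, high = iqr_bounds(readings)
flagged = [v for v in readings if v < low or v > high]
print(f"fences: ({low}, {high}), flagged: {flagged}")
```

A statistical fence and a domain limit can disagree: a value can be rare yet physically valid, which is why wind speed and pressure are left untouched here.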

# Share of rows with out-of-range values, per column
print('温度: ', data[data['温度'] > 60.0]['温度'].count() / data['温度'].count())
print('湿度: ', data[data['湿度'] > 100.0]['湿度'].count() / data['湿度'].count())
print('降水量(%): ', data[data['降水量(%)'] > 100.0]['降水量(%)'].count() / data['降水量(%)'].count())

温度:  0.015681818181818182
湿度:  0.03151515151515152
降水量(%):  0.029696969696969697

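Analysis:

Each group of outliers accounts for only about 1.6% to 3.2% of the rows, so these rows are dropped.

# Drop the outlier rows and compare the shape before and after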
print(f'删除前数据维度{data.shape}')
data = data[(data['温度'] <= 60.0) & (data['湿度'] <= 100.0) & (data['降水量(%)'] <= 100.0)]
print(f'删除后数据维度{data.shape}')

删除前数据维度(13200, 11)
删除后数据维度(12360, 11)

3. Data analysis

# Summary statistics
data.describe()

	温度	湿度	风速	降水量(%)	气压	紫外线指数	能见度(km)
count	12360.000000	12360.000000	12360.000000	12360.000000	12360.000000	12360.000000	12360.000000
mean	18.071359	66.937460	9.356837	50.864968	1005.713743	3.791262	5.535801
std	15.804363	19.390333	6.318334	30.967846	38.300471	3.720638	3.377554
min	-24.000000	20.000000	0.000000	0.000000	800.120000	0.000000	0.000000
25%	4.000000	56.000000	5.000000	19.000000	994.587500	1.000000	3.000000
50%	21.000000	69.000000	8.500000	54.000000	1007.495000	2.000000	5.000000
75%	30.000000	81.000000	13.000000	79.000000	1016.750000	6.000000	7.500000
max	60.000000	100.000000	48.500000	100.000000	1199.210000	14.000000	20.000000

# Plot every column to check that the cleaned data looks reasonable
plt.figure(figsize=(20, 15))

plt.subplot(3, 4, 1)
sns.histplot(data['温度'], kde=True, bins=20)   # kde: overlay a kernel density curve; bins: number of bars
plt.title('温度分布')
plt.xlabel('温度')
plt.ylabel('频率')

plt.subplot(3, 4, 2)
sns.boxplot(y=data['湿度'])
plt.title('湿度百分比图')
plt.ylabel('湿度占比')

plt.subplot(3, 4, 3)
sns.histplot(data['风速'], kde=True, bins=20)
plt.title('风速分布')
plt.xlabel('风速(km/h)')
plt.ylabel('频率')

plt.subplot(3, 4, 4)
sns.boxplot(y=data['降水量(%)'])
plt.title('降水量百分比箱线图')
plt.ylabel('降水量占比')

plt.subplot(3, 4, 5)
sns.countplot(x='云量', data=data)
plt.title('云量分布')
plt.xlabel('云量描述')
plt.ylabel('频率')

plt.subplot(3, 4, 6)
sns.histplot(data['气压'], kde=True, bins=20)
plt.title('气压分布')
plt.xlabel('气压')
plt.ylabel('频率')

plt.subplot(3, 4, 7)
sns.histplot(data['紫外线指数'], kde=True, bins=20)
plt.title('紫外线等级分布')
plt.xlabel('紫外线')
plt.ylabel('频率')

plt.subplot(3, 4, 8)
season_counts = data['季节'].value_counts()
plt.pie(season_counts, labels=season_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('季节分布')

plt.subplot(3, 4, 9)
sns.histplot(data['能见度(km)'], kde=True, bins=20)
plt.title('能见度分布')
plt.xlabel('能见度(km)')   # visibility is a distance, so the unit is km
plt.ylabel('频率')

plt.subplot(3, 4, 10)
sns.countplot(x='地点', data=data)
plt.title('地点分布')
plt.xlabel('地点')
plt.ylabel('频数')

plt.subplot(3, 4, (11, 12))   # the last plot spans two grid cells
sns.countplot(x='天气类型', data=data)
plt.title('天气类型分布')
plt.xlabel('天气类型')
plt.ylabel('频数')

plt.tight_layout()
plt.show()

[Figure: distribution plots for all eleven columns after cleaning]

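Data analysis

Temperature: values above 60 have been removed; most values fall in (-10, 5) and (10, 40), which is plausible
Humidity: ranges from 20% to 100% and clusters around the median, which is plausible
Wind speed: concentrated between 0 and 20; speeds are mostly low and extreme values are rare, which is plausible
Precipitation: ranges from 0 to 100, mostly between 20 and 80, with a median near 50 and a roughly symmetric shape; this covers the precipitation seen in most weather conditions
Cloud cover: partly cloudy and overcast are the most common; cloudy is the least common
Atmospheric pressure: extreme values are rare and most values sit near 1000, which is plausible
UV index: mostly low, with high values making up a small share, which is plausible
Season: winter accounts for the largest share
Visibility: mostly concentrated around 5 km, which is normal
Location: shows how many records each location contributes
Weather type: the four weather types have similar counts, so the classes are well balanced

Overall, the cleaned data shows no problems and is ready for the next step.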

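4. Building the model

1. Label encoding

from sklearn.preprocessing import LabelEncoder

new_data = data.copy()
columns = ['云量', '季节', '地点', '天气类型']
feture_name = {}   # column name -> fitted LabelEncoder

for i in columns:
    le = LabelEncoder()
    new_data[i] = le.fit_transform(data[i])   # replace the text labels with integer codes
    feture_name[i] = le   # keep the fitted encoder so codes can be mapped back to labels

# Show the integer code assigned to each label
for i in columns:
    print(i, ': ')
    for index, class_ in enumerate(feture_name[i].classes_):
        print(f'index: {index}: {class_}')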

云量 :
index: 0: clear
index: 1: cloudy
index: 2: overcast
index: 3: partly cloudy
季节 :
index: 0: Autumn
index: 1: Spring
index: 2: Summer
index: 3: Winter
地点 :
index: 0: coastal
index: 1: inland
index: 2: mountain
天气类型 :
index: 0: Cloudy
index: 1: Rainy
index: 2: Snowy
index: 3: Sunny
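
The codes above follow alphabetical order because `LabelEncoder` stores the sorted distinct labels in `classes_` and uses each label's position as its code. The following plain-Python imitation (the helper `label_codes` is hypothetical, not part of scikit-learn) reproduces the mapping for two of the columns:

```python
def label_codes(values):
    """Map each distinct label to its index in sorted order."""
    classes = sorted(set(values))
    return {label: index for index, label in enumerate(classes)}

cloud_codes = label_codes(['partly cloudy', 'clear', 'overcast', 'cloudy'])
weather_codes = label_codes(['Rainy', 'Cloudy', 'Sunny', 'Snowy'])
print(cloud_codes)
print(weather_codes)

# Inverting the dictionary recovers the label for a predicted code.
code_to_weather = {code: label for label, code in weather_codes.items()}
print(code_to_weather[3])
```

The integer order carries no meaning for these categories. Tree-based models split on thresholds and usually cope with that, but models that treat the codes as magnitudes may need a different encoding.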

2. Model creation

from sklearn.model_selection import train_test_split

# Split into features and target, then into training and test sets (80% / 20%)
X = new_data.drop('天气类型', axis=1)
y = new_data['天气类型']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.ensemble import RandomForestClassifier

# Create the random forest model with default hyperparameters
model = RandomForestClassifier()
# Train the model
model.fit(X_train, y_train)
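
Two optional settings are worth knowing about, shown here as a hedged sketch on synthetic stand-in data from `make_classification` rather than on the weather dataset. `random_state` fixes the seed so that repeated runs build the same forest (the model above has no seed, so its scores can vary slightly between runs), and `oob_score=True` evaluates each sample using only the trees whose bootstrap sample did not contain it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 4 classes and 10 features, like the case study.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

demo_model = RandomForestClassifier(
    n_estimators=100,   # number of trees
    oob_score=True,     # score each sample with trees that never saw it
    random_state=0,     # fixed seed for reproducible results
)
demo_model.fit(X_demo, y_demo)
print(f"OOB accuracy: {demo_model.oob_score_:.3f}")
```

The out-of-bag score gives a rough validation estimate without setting aside extra data, which can serve as a sanity check next to the held-out test set.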

5. Model prediction and evaluation

from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
model_evaluation = classification_report(y_test, y_pred)
print(model_evaluation)

              precision    recall  f1-score   support

           0       0.89      0.93      0.91       636
           1       0.92      0.90      0.91       622
           2       0.94      0.93      0.93       614
           3       0.92      0.92      0.92       600

    accuracy                           0.92      2472
   macro avg       0.92      0.92      0.92      2472
weighted avg       0.92      0.92      0.92      2472

Analysis:
Average precision, recall, and F1 score are all about 0.92, and overall accuracy on the 2,472 test samples is 0.92, so the model performs well.
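
The two average rows in the report can be recomputed from the per-class rows. The sketch below copies the rounded values from the report, so the results only approximate what scikit-learn computes from the unrounded numbers; the helper names `macro` and `weighted` are illustrative.

```python
# Per-class rows copied from the classification report:
# class code -> (precision, recall, f1, support)
report = {
    0: (0.89, 0.93, 0.91, 636),
    1: (0.92, 0.90, 0.91, 622),
    2: (0.94, 0.93, 0.93, 614),
    3: (0.92, 0.92, 0.92, 600),
}

total = sum(row[3] for row in report.values())

def macro(col):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[col] for row in report.values()) / len(report)

def weighted(col):
    """Mean over classes weighted by support (number of test samples per class)."""
    return sum(row[col] * row[3] for row in report.values()) / total

for name, col in [("precision", 0), ("recall", 1), ("f1", 2)]:
    print(f"{name}: macro={macro(col):.3f} weighted={weighted(col):.3f}")
print("test samples:", total)
```

Because the four classes have similar support, the macro and weighted averages nearly coincide; on an imbalanced dataset they can differ noticeably.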

6. Feature importance

feature_importances = model.feature_importances_
feture_rf = pd.DataFrame({'特征': X.columns, '重要度': feature_importances})
feture_rf.sort_values(by='重要度', inplace=True, ascending=False)  # ascending=False sorts in descending order
plt.figure(figsize=(10, 8))
sns.barplot(x='重要度', y='特征', data=feture_rf)
plt.title('特征影响程度')
plt.xlabel('特征重要占比')
plt.ylabel('特征名字')
plt.show()

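[Figure: bar chart of feature importances]

The chart shows that temperature, precipitation, UV index, visibility, and atmospheric pressure are the most influential features.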
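
`feature_importances_` is computed from impurity reduction on the training data. A commonly used cross-check, which the original post does not perform, is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below runs on synthetic stand-in data, and all names in it are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 6 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test set and record the mean drop in accuracy.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: impurity={forest.feature_importances_[idx]:.3f} "
          f"permutation={result.importances_mean[idx]:.3f}")
```

If both rankings agree on the top features, that is some evidence the chart reflects real predictive signal rather than an artifact of the impurity measure.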
