
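1. Project background

As modern agriculture develops, the effect of environmental factors on plant growth is receiving more and more attention. This analysis uses growth data recorded for 70 lettuce plants between 3 August 2023 and 19 September 2023 to better understand how temperature, humidity, pH and total dissolved solids (TDS) affect lettuce growth.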

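2. Data description

Field	Description
Plant_ID	Plant identifier
Date	Date
Temperature (°C)	Temperature in degrees Celsius
Humidity (%)	Humidity in percent
TDS Value (ppm)	Total dissolved solids (TDS) in ppm
pH Level	pH value
Growth Days	Days of growth
Temperature (F)	Temperature in degrees Fahrenheit
Humidity	Humidity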

3. Importing Python libraries and reading the data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from scipy.stats import spearmanr,mannwhitneyu
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix,roc_curve, auc

data = pd.read_csv("/home/mw/input/12228739/lettuce_dataset_updated.csv")

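Running this code as-is raises an error:

[Image: screenshot of the error message]
The file's encoding is not compatible with the default UTF-8. To load the data correctly, try a different encoding; common ones include UTF-8, UTF-16, ISO-8859-1, GB2312 / GBK / GB18030, BIG5 and ASCII.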

data = pd.read_csv("/home/mw/input/12228739/lettuce_dataset_updated.csv", encoding='ISO-8859-1')
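Trying encodings by hand can be automated. The sketch below uses only the standard library and returns the first candidate that decodes the whole file without error; the function name `first_decodable_encoding` and the candidate list are illustrative assumptions, not part of the original notebook. A single-byte encoding such as ISO-8859-1 accepts any byte sequence, so it has to come last, and a successful decode does not prove the encoding is the right one.

```python
import os
import tempfile
from pathlib import Path


def first_decodable_encoding(path, candidates=("utf-8", "iso-8859-1")):
    """Return the first encoding in `candidates` that decodes the file cleanly.

    Hypothetical helper: order matters, because ISO-8859-1 never fails.
    """
    raw = Path(path).read_bytes()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {candidates} could decode {path}")


# Small demonstration with two temporary files holding the same header text.
header = "Temperature (°C)\n25.0\n"
with tempfile.TemporaryDirectory() as tmp:
    latin_path = os.path.join(tmp, "latin1.csv")
    utf8_path = os.path.join(tmp, "utf8.csv")
    Path(latin_path).write_bytes(header.encode("iso-8859-1"))
    Path(utf8_path).write_bytes(header.encode("utf-8"))
    latin_result = first_decodable_encoding(latin_path)
    utf8_result = first_decodable_encoding(utf8_path)

print(latin_result, utf8_result)
```

The returned name can then be passed to `pd.read_csv(path, encoding=...)`. It is still worth checking that a column header such as `Temperature (°C)` looks right after loading.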

4. Data preview and preprocessing

print('Data overview:')
data.info()

Data overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3169 entries, 0 to 3168
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Plant_ID          3169 non-null   int64  
 1   Date              3169 non-null   object 
 2   Temperature (°C)  3169 non-null   float64
 3   Humidity (%)      3169 non-null   int64  
 4   TDS Value (ppm)   3169 non-null   int64  
 5   pH Level          3169 non-null   float64
 6   Growth Days       3169 non-null   int64  
 7   Temperature (F)   3169 non-null   float64
 8   Humidity          3169 non-null   float64
dtypes: float64(4), int64(4), object(1)
memory usage: 222.9+ KB

print(f'Duplicate rows: {data.duplicated().sum()}')

Duplicate rows: 0

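Temperature and humidity each appear in two columns, so only one of each is needed. Here we keep humidity in percent and temperature in degrees Celsius.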
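Before dropping a column as redundant, it is worth confirming that it really is a unit conversion of the one being kept. The sketch below assumes `Temperature (F)` is the Celsius column converted with F = C × 9/5 + 32 and rounded to one decimal place; that assumption, the helper names and the tolerance are illustrative and not taken from the original analysis.

```python
def celsius_to_fahrenheit(celsius):
    """Standard conversion: F = C * 9/5 + 32."""
    return celsius * 9.0 / 5.0 + 32.0


def is_fahrenheit_copy(celsius_values, fahrenheit_values, tol=0.06):
    """True if every Fahrenheit value matches the converted Celsius value.

    `tol` allows for values rounded to one decimal place (assumption).
    """
    return all(
        abs(celsius_to_fahrenheit(c) - f) <= tol
        for c, f in zip(celsius_values, fahrenheit_values)
    )


# Toy values only; 20.1 °C converts to 68.18 °F and 33.5 °C to 92.3 °F.
consistent = is_fahrenheit_copy([20.1, 33.5], [68.2, 92.3])
inconsistent = is_fahrenheit_copy([20.1, 33.5], [70.0, 92.3])
print(consistent, inconsistent)
```

On the real data the call would be `is_fahrenheit_copy(data['Temperature (°C)'], data['Temperature (F)'])`, run before the drop.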

data['Date'] = pd.to_datetime(data['Date'])

data = data.drop(columns=['Temperature (F)','Humidity'])

# Map each column to the label used in its plot title
feature_map = {
    'Temperature (°C)': 'temperature (°C)',
    'Humidity (%)': 'humidity (%)',
    'TDS Value (ppm)': 'TDS value (ppm)',
    'pH Level': 'pH',
    'Growth Days': 'growth days'
}
plt.figure(figsize=(15, 10))
for i, (col, col_name) in enumerate(feature_map.items(), 1):
    plt.subplot(2, 3, i)
    sns.boxplot(y=data[col])
    plt.title(f'Box plot of {col_name}', fontsize=14)
    plt.ylabel('Value', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
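The box plots above flag points that lie more than 1.5 × IQR beyond the quartiles (seaborn's default whisker rule). The same rule can be checked numerically; the following is a minimal sketch on toy data, with `iqr_outlier_counts` a hypothetical helper name.

```python
import pandas as pd


def iqr_outlier_counts(df, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    sub = df[columns]
    q1 = sub.quantile(0.25)
    q3 = sub.quantile(0.75)
    iqr = q3 - q1
    # Compare every column against its own lower and upper fence.
    outside = sub.lt(q1 - k * iqr, axis=1) | sub.gt(q3 + k * iqr, axis=1)
    return outside.sum()


# Toy data: column 'a' contains one extreme value, column 'b' none.
demo = pd.DataFrame({'a': [1, 2, 3, 4, 100], 'b': [1, 2, 3, 4, 5]})
counts = iqr_outlier_counts(demo, ['a', 'b'])
print(counts)
```

Applied to the lettuce data, the call would be `iqr_outlier_counts(data, list(feature_map))`; a result of all zeros corresponds to box plots with no points beyond the whiskers.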

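The data are clean and contain no outliers. The raw records do not directly show how many days each lettuce plant needed to grow under its conditions, so the data are grouped by Plant_ID for further study.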

# Group by Plant_ID and compute the summary statistics
plant_stats = data.groupby('Plant_ID').agg({
    'Temperature (°C)': ['min', 'max', 'mean'],
    'Humidity (%)': ['min', 'max', 'mean'],
    'TDS Value (ppm)': ['min', 'max', 'mean'],
    'pH Level': ['min', 'max', 'mean'],
    'Growth Days': 'max'
})
# Rename the columns. Names are in Chinese: 最低 = minimum, 最高 = maximum, 平均 = mean,
# 温度 = temperature, 湿度 = humidity, TDS值 = TDS value, pH值 = pH, 最高生长天数 = maximum growth days
plant_stats.columns = [
    '最低温度 (°C)', '最高温度 (°C)', '平均温度 (°C)',
    '最低湿度 (%)', '最高湿度 (%)', '平均湿度 (%)',
    '最低TDS值 (ppm)', '最高TDS值 (ppm)', '平均TDS值 (ppm)',
    '最低pH值', '最高pH值', '平均pH值',
    '最高生长天数'
]
plant_stats.reset_index(inplace=True)
plant_stats = plant_stats.round({'平均温度 (°C)': 1, '平均湿度 (%)': 0, '平均TDS值 (ppm)': 0, '平均pH值': 1})
plant_stats.head()

	Plant_ID	最低温度 (°C)	最高温度 (°C)	平均温度 (°C)	最低湿度 (%)	最高湿度 (%)	平均湿度 (%)	最低TDS值 (ppm)	最高TDS值 (ppm)	平均TDS值 (ppm)	最低pH值	最高pH值	平均pH值	最高生长天数
0	1	20.1	33.5	30.6	50	79	64.0	416	789	615.0	6.0	6.8	6.4	45
1	2	20.1	33.5	30.6	50	79	65.0	413	797	597.0	6.0	6.8	6.4	45
2	3	20.1	33.5	30.6	50	80	69.0	403	799	620.0	6.0	6.8	6.4	47
3	4	20.1	33.5	30.6	50	78	63.0	402	794	598.0	6.0	6.8	6.4	48
4	5	20.1	33.5	30.6	50	80	65.0	410	778	577.0	6.0	6.8	6.4	45
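The per-plant summary above assigns column names by position after aggregating, which silently mislabels columns if the order of the aggregation dictionary changes. pandas' named aggregation ties each output name to its source column and function instead. The sketch below shows the idea on toy data; the English output names and the function `summarize_by_plant` are illustrative assumptions and are not the names used in the rest of this analysis.

```python
import pandas as pd


def summarize_by_plant(df):
    """Per-plant summary using named aggregation (output names are hypothetical)."""
    return (
        df.groupby('Plant_ID')
        .agg(
            temp_min=('Temperature (°C)', 'min'),
            temp_max=('Temperature (°C)', 'max'),
            temp_mean=('Temperature (°C)', 'mean'),
            growth_days_max=('Growth Days', 'max'),
        )
        .reset_index()
    )


# Toy records for two plants.
demo = pd.DataFrame({
    'Plant_ID': [1, 1, 2, 2],
    'Temperature (°C)': [20.0, 30.0, 22.0, 26.0],
    'Growth Days': [1, 2, 1, 3],
})
summary = summarize_by_plant(demo)
print(summary)
```

Each keyword becomes a flat column name, so no separate renaming step is needed.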

5. Descriptive statistics

data.describe()

	Plant_ID	Date	Temperature (°C)	Humidity (%)	TDS Value (ppm)	pH Level	Growth Days
count	3169.000000	3169	3169.000000	3169.000000	3169.000000	3169.000000	3169.000000
mean	35.441780	2023-08-25 03:23:07.062164480	28.142222	64.873462	598.045440	6.399211	23.141054
min	1.000000	2023-08-03 00:00:00	18.000000	50.000000	400.000000	6.000000	1.000000
25%	18.000000	2023-08-14 00:00:00	23.600000	57.000000	498.000000	6.200000	12.000000
50%	35.000000	2023-08-25 00:00:00	30.200000	65.000000	593.000000	6.400000	23.000000
75%	53.000000	2023-09-05 00:00:00	31.500000	73.000000	699.000000	6.600000	34.000000
max	70.000000	2023-09-19 00:00:00	33.500000	80.000000	800.000000	6.800000	48.000000
std	20.243433	NaN	4.670521	8.988985	115.713047	0.234418	13.077107

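Sample size and time span: 3169 records from 3 August 2023 to 19 September 2023, covering 48 days of growth.
Temperature (°C): ranges from 18.0°C to 33.5°C, mean 28.14°C, standard deviation 4.67°C; broadly within a range in which lettuce can grow.
Humidity (%): ranges from 50% to 80%, mean 64.87%, standard deviation 8.99%; humidity fluctuates considerably, possibly because of irrigation or climate control.
TDS value (ppm): ranges from 400 to 800 ppm, mean 598.05 ppm, standard deviation 115.71 ppm; possibly influenced by fertilization or water quality.
pH: ranges from 6.0 to 6.8, mean 6.40, standard deviation 0.23; stable overall and within the slightly acidic range suitable for lettuce.
Growth days: ranges from 1 to 48 days, mean 23.14 days, standard deviation 13.08 days, reflecting records taken at different growth stages.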

plant_stats.describe()

	Plant_ID	最低温度 (°C)	最高温度 (°C)	平均温度 (°C)	最低湿度 (%)	最高湿度 (%)	平均湿度 (%)	最低TDS值 (ppm)	最高TDS值 (ppm)	平均TDS值 (ppm)	最低pH值	最高pH值	平均pH值	最高生长天数
count	70.000000	70.000000	70.000000	70.000000	70.000000	70.000000	70.000000	70.000000	70.000000	70.000000	70.000000	70.000000	70.000000	70.000000
mean	35.500000	19.561429	31.251429	28.122857	50.185714	79.614286	64.885714	408.957143	791.885714	597.985714	6.001429	6.788571	6.397143	45.271429
std	20.351085	0.890578	3.794150	4.091142	0.391684	0.687209	1.450066	9.983226	8.400384	16.055603	0.011952	0.032046	0.023905	0.700340
min	1.000000	18.000000	24.600000	21.000000	50.000000	77.000000	63.000000	400.000000	765.000000	567.000000	6.000000	6.700000	6.300000	45.000000
25%	18.250000	18.325000	26.650000	21.950000	50.000000	79.000000	64.000000	402.000000	789.000000	584.250000	6.000000	6.800000	6.400000	45.000000
50%	35.500000	20.100000	33.500000	30.600000	50.000000	80.000000	65.000000	405.000000	795.000000	598.000000	6.000000	6.800000	6.400000	45.000000
75%	52.750000	20.100000	33.500000	30.600000	50.000000	80.000000	66.000000	412.000000	798.000000	611.750000	6.000000	6.800000	6.400000	45.000000
max	70.000000	20.100000	33.500000	30.700000	51.000000	80.000000	69.000000	449.000000	800.000000	628.000000	6.100000	6.800000	6.500000	48.000000

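Sample size and description: 70 records, one per lettuce plant.
Temperature (°C):
- Minimum temperature: ranges from 18.0°C to 20.1°C, mean 19.56°C, standard deviation 0.89°C.
- Maximum temperature: ranges from 24.6°C to 33.5°C, mean 31.25°C, standard deviation 3.79°C.
- Mean temperature: ranges from 21.0°C to 30.7°C, mean 28.12°C, standard deviation 4.09°C.
Humidity (%):
- Minimum humidity: ranges from 50% to 51%, mean 50.19%, standard deviation 0.39%.
- Maximum humidity: ranges from 77% to 80%, mean 79.61%, standard deviation 0.69%.
- Mean humidity: ranges from 63% to 69%, mean 64.89%, standard deviation 1.45%.
TDS value (ppm):
- Minimum TDS: ranges from 400 to 449 ppm, mean 408.96 ppm, standard deviation 9.98 ppm.
- Maximum TDS: ranges from 765 to 800 ppm, mean 791.89 ppm, standard deviation 8.40 ppm.
- Mean TDS: ranges from 567 to 628 ppm, mean 597.99 ppm, standard deviation 16.06 ppm.
pH:
- Minimum pH: ranges from 6.0 to 6.1, mean 6.00, standard deviation 0.01.
- Maximum pH: ranges from 6.7 to 6.8, mean 6.79, standard deviation 0.03.
- Mean pH: ranges from 6.3 to 6.5, mean 6.40, standard deviation 0.02.
Growth days: ranges from 45 to 48 days, mean 45.27 days, standard deviation 0.70 days; the plants differ very little.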

6. Spearman correlation analysis

def plot_spearmanr(data, features, title, wide, height):
    # Spearman correlation matrix and the matching matrix of p-values
    spearman_corr_matrix = data[features].corr(method='spearman')
    pvals = data[features].corr(method=lambda x, y: spearmanr(x, y)[1]) - np.eye(len(data[features].columns))

    # Convert a p-value to significance stars
    def convert_pvalue_to_asterisks(pvalue):
        if pvalue <= 0.001:
            return "***"
        elif pvalue <= 0.01:
            return "**"
        elif pvalue <= 0.05:
            return "*"
        return ""

    # Apply the conversion (applymap is deprecated in pandas >= 2.1 in favour of DataFrame.map)
    pval_star = pvals.applymap(lambda x: convert_pvalue_to_asterisks(x))
    # Convert to numpy arrays
    corr_star_annot = pval_star.to_numpy()
    # Build the cell labels: coefficient on the first line, stars on the second
    corr_labels = spearman_corr_matrix.to_numpy()
    p_labels = corr_star_annot
    shape = corr_labels.shape
    labels = (np.asarray(["{0:.2f}\n{1}".format(r, p) for r, p in zip(corr_labels.flatten(), p_labels.flatten())])).reshape(shape)

    # Draw the heat map
    fig, ax = plt.subplots(figsize=(height, wide), dpi=100, facecolor="w")
    sns.heatmap(spearman_corr_matrix, annot=labels, fmt='', cmap='coolwarm',
                vmin=-1, vmax=1, annot_kws={"size": 10, "fontweight": "bold"},
                linecolor="k", linewidths=.2, cbar_kws={"aspect": 13}, ax=ax)
    ax.tick_params(bottom=False, labelbottom=True, labeltop=False,
                   left=False, pad=1, labelsize=12)
    ax.yaxis.set_tick_params(labelrotation=0)

    # Customize the colorbar tick labels
    cbar = ax.collections[0].colorbar
    cbar.ax.tick_params(direction="in", width=.5, labelsize=10)
    cbar.set_ticks([-1, -0.5, 0, 0.5, 1])
    cbar.set_ticklabels(["-1.00", "-0.50", "0.00", "0.50", "1.00"])
    cbar.outline.set_visible(True)
    cbar.outline.set_linewidth(.5)
    plt.title(title)
    plt.show()

features = plant_stats.drop(['Plant_ID'], axis=1).columns.tolist()
plot_spearmanr(plant_stats, features, 'Heat map of Spearman correlation coefficients between variables', 12, 15)

Growth days show no significant correlation with temperature, humidity, TDS value or pH.
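Spearman's coefficient is the Pearson correlation computed on ranks, so it measures monotonic association only. The sketch below illustrates this on toy data: a monotonic but non-linear relationship scores 1, while a symmetric U-shaped relationship scores 0 even though the two variables are clearly related. `spearman_by_ranks` is a hypothetical helper name.

```python
import pandas as pd


def spearman_by_ranks(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    x = pd.Series(x, dtype=float)
    y = pd.Series(y, dtype=float)
    return x.rank().corr(y.rank())


# Monotonic, non-linear: y = x**3
x_mono = [1, 2, 3, 4, 5]
y_mono = [v ** 3 for v in x_mono]
rho_mono = spearman_by_ranks(x_mono, y_mono)
pearson_mono = pd.Series(x_mono, dtype=float).corr(pd.Series(y_mono, dtype=float))

# Non-monotonic: y = x**2 on a symmetric range
x_u = [-2, -1, 0, 1, 2]
y_u = [v ** 2 for v in x_u]
rho_u = spearman_by_ranks(x_u, y_u)

print(rho_mono, pearson_mono, rho_u)
```

A non-significant Spearman result therefore rules out a monotonic trend at this sample size, not every possible form of dependence.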

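7. Mann-Whitney U test

The earlier analysis shows that most plants have a maximum of 45 growth days. Plants with more than 45 growth days are treated as anomalous samples and flagged in the `异常样本` (anomalous sample) column.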

plant_stats['异常样本'] = (plant_stats['最高生长天数'] > 45).astype(int)

def mann_whitney_u_test(df, features, target_column):
    results = []
    # Run a Mann-Whitney U test for each feature
    for feature in features:
        # Split the feature by target value 0 and 1
        group0 = df[df[target_column] == 0][feature]
        group1 = df[df[target_column] == 1][feature]
        # Mann-Whitney U test
        u_stat, p_val = mannwhitneyu(group0, group1)
        # Store the result
        results.append([feature, u_stat, p_val])
    # Convert the results to a DataFrame
    results_df = pd.DataFrame(results, columns=['Feature', 'U Statistic', 'P Value'])
    return results_df

# Get the Mann-Whitney U test results
results_df = mann_whitney_u_test(plant_stats, features[:-1], '异常样本')
results_df

	Feature	U Statistic	P Value
0	最低温度 (°C)	322.5	0.975307
1	最高温度 (°C)	320.5	0.942425
2	平均温度 (°C)	275.0	0.336614
3	最低湿度 (%)	326.0	0.980887
4	最高湿度 (%)	256.0	0.165906
5	平均湿度 (%)	354.5	0.626596
6	最低TDS值 (ppm)	324.5	1.000000
7	最高TDS值 (ppm)	263.0	0.323278
8	平均TDS值 (ppm)	278.0	0.457655
9	最低pH值	330.0	0.694663
10	最高pH值	315.5	0.803445
11	平均pH值	313.5	0.673545

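All p-values are above 0.05, so none of the temperature, humidity, TDS or pH features differs significantly between normal and anomalous plants.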
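To read the U statistics in the table above, it helps to see what U counts. The sketch below is a plain-Python illustration rather than SciPy's implementation: U is the number of cross-group pairs in which the first group's value is larger, with ties counted as one half. Dividing by the number of pairs gives a simple effect size between 0 and 1, where 0.5 means the groups overlap completely. The function name `mann_whitney_u` is hypothetical.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U, effect) for group_a relative to group_b.

    U counts pairs (a, b) with a > b; ties contribute 0.5.
    effect = U / (len(group_a) * len(group_b)), the share of pairs "won" by group_a.
    """
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    effect = u / (len(group_a) * len(group_b))
    return u, effect


# Complete separation: every value in the first group is larger.
u_sep, effect_sep = mann_whitney_u([4, 5, 6], [1, 2, 3])
# Identical groups: U is exactly half the number of pairs.
u_same, effect_same = mann_whitney_u([1, 2, 3], [1, 2, 3])
print(u_sep, effect_sep, u_same, effect_same)
```

A U close to half the number of pairs goes with a p-value near 1. This should match the statistic that recent SciPy versions report for the first sample passed to `mannwhitneyu`, but that is worth confirming against the installed version's documentation.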

8. Random forest model

x = plant_stats[features[:-1]]
y = plant_stats['异常样本']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=15)  # 70/30 train/test split

smote = SMOTE(sampling_strategy='auto', random_state=15)
x_train_res, y_train_res = smote.fit_resample(x_train, y_train)

rf_molde = RandomForestClassifier(random_state=15)
rf_molde.fit(x_train_res, y_train_res)

y_pred_rf = rf_molde.predict(x_test)
class_report_rf = classification_report(y_test, y_pred_rf)
print('Random forest model evaluation:')
print(class_report_rf)

Random forest model evaluation:
              precision    recall  f1-score   support

           0       0.89      0.89      0.89        18
           1       0.33      0.33      0.33         3

    accuracy                           0.81        21
   macro avg       0.61      0.61      0.61        21
weighted avg       0.81      0.81      0.81        21
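SMOTE, used above to balance the training set, creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that interpolation step only; it is not imblearn's implementation, and `smote_like` and its parameters are hypothetical.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min.

    Each new point lies on the segment between a random minority sample
    and one of its k nearest minority neighbours (simplified sketch).
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances; a point is never its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                      # pick a minority sample
        nb = neighbours[i, rng.integers(k)]      # pick one of its neighbours
        lam = rng.random()                       # position along the segment
        synthetic[j] = X_min[i] + lam * (X_min[nb] - X_min[i])
    return synthetic


# Toy minority class with three points in two dimensions.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because every synthetic point is a blend of existing minority samples, oversampling adds no new information about the minority class, which is one reason test performance on the few real anomalous plants can stay weak.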

cm = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues', xticklabels=['Predicted 0', 'Predicted 1'], yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion matrix of the random forest predictions')
plt.show()

# Plot the ROC curve
# Note: y_pred_rf holds hard 0/1 labels, so the curve has a single interior point
fpr, tpr, _ = roc_curve(y_test, y_pred_rf)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve of the random forest')
plt.legend(loc="lower right")
plt.show()
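Because the ROC code above is given hard 0/1 predictions rather than predicted probabilities, the area under the curve reduces to the mean of sensitivity and specificity. The sketch below checks this with a pairwise definition of AUC. The confusion-matrix counts are inferred from the precision, recall and support values in the classification report; they are an assumption, since the matrix itself is not printed in the article.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Counts inferred from the report (assumption): 18 negatives, 3 positives.
tn, fp, fn, tp = 16, 2, 2, 1
y_true = [0] * (tn + fp) + [1] * (fn + tp)
y_pred = [0] * tn + [1] * fp + [0] * fn + [1] * tp

auc_hard = pairwise_auc(y_true, y_pred)
balanced = (tp / (tp + fn) + tn / (tn + fp)) / 2  # mean of recall for class 1 and class 0
print(round(auc_hard, 2), round(balanced, 2))
```

Both values round to 0.61, the area shown in the ROC plot. Passing `rf_molde.predict_proba(x_test)[:, 1]` to `roc_curve` instead of the hard labels would trace a full curve and generally give a different AUC.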

The model's predictive performance is mediocre, particularly at identifying plants with more than 45 growth days.

feature_importances = rf_molde.feature_importances_
features_rf = pd.DataFrame({'Feature': x.columns, 'Importance': feature_importances})
features_rf.sort_values(by='Importance', ascending=False, inplace=True)
plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=features_rf)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Random forest feature importances')
plt.show()

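TDS is the most important feature in the classifier. The Spearman correlation analysis and the Mann-Whitney U test did not pick this up because a random forest can capture non-linear relationships, whereas Spearman correlation measures monotonic association and the U test assesses one variable at a time. Given the model's weak performance on the test set, this importance ranking should be read with caution.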

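9. Summary

The analysis of 70 lettuce plants recorded between 3 August 2023 and 19 September 2023 leads to the following conclusions:
The Spearman correlation analysis and the Mann-Whitney U test found no significant effect of temperature, humidity, TDS value or pH on growth days. Even after SMOTE oversampling was used to balance the classes, the random forest model remained poor at identifying anomalous samples (more than 45 growth days), with an overall AUC of 0.61, which is only modest. Within that model, TDS was the most important feature.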