How do I use Python's NLTK library to count word frequencies?

2025/3/6 20:45:46 Source: https://blog.csdn.net/youyouxiong/article/details/139508064

To count word frequencies with Python's NLTK (Natural Language Toolkit) library, follow these steps:

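Import the required modules:
In a Python script or interactive session, import NLTK, its tokenizer and stop word corpus, and Counter from the standard library (FreqDist is reached later as nltk.FreqDist):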

```
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
```

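Download the NLTK resources:
On first use, NLTK needs a few data packages: the Punkt tokenizer models used by word_tokenize and the stop word lists. Newer NLTK releases (3.9 and later, as far as I can tell) load the tokenizer from punkt_tab instead of punkt, so requesting both is the safer choice:
```
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by newer NLTK releases
nltk.download('stopwords')
```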

Tokenize the text:
Split the text into tokens with NLTK's word_tokenize function:

```
text = "This is an example sentence. This is another one!"
tokens = word_tokenize(text)
```

Clean the text:
Remove punctuation and stop words so that only meaningful words remain:

```
stop_words = set(stopwords.words('english'))
words = [word.lower() for word in tokens if word.isalpha() and word.lower() not in stop_words]
```
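
To make the filter easier to follow, here is a minimal sketch of the same cleaning step written as an explicit loop. The token list is typed out by hand to resemble what word_tokenize would typically return for the sample sentence, and the three-word stop list is a hypothetical stand-in for the NLTK English list, so the snippet runs without any NLTK data.

```python
# Tokens written out by hand, resembling word_tokenize output for the
# sample text (an assumption, so the snippet needs no NLTK data).
tokens = ["This", "is", "an", "example", "sentence", ".",
          "This", "is", "another", "one", "!"]

# Hypothetical stand-in for set(stopwords.words('english')).
stop_words = {"this", "is", "an"}

words = []
for token in tokens:
    if not token.isalpha():
        continue  # drops ".", "!", numbers, and mixed tokens such as "3rd"
    lowered = token.lower()
    if lowered in stop_words:
        continue  # drops stop words regardless of original casing
    words.append(lowered)

print(words)  # ['example', 'sentence', 'another', 'one']
```

Note that isalpha() also discards tokens containing an apostrophe or hyphen. Since word_tokenize splits a contraction such as "don't" into "do" and "n't", the "n't" part is dropped by this filter.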

Count the word frequencies:
Use either Counter or NLTK's FreqDist to count how often each word occurs:

```
# Using collections.Counter
word_counts = Counter(words)

# Or using NLTK's FreqDist
freq_dist = nltk.FreqDist(words)
```
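
The two options behave much alike because FreqDist is built on top of Counter. As far as I know, FreqDist additionally provides helpers such as freq() for relative frequencies. The sketch below shows how the same figure could be derived from a plain Counter, using a hypothetical list of already cleaned words.

```python
from collections import Counter

# Hypothetical cleaned tokens; in the article these come from the cleaning step.
words = ["example", "sentence", "another", "one", "example"]

word_counts = Counter(words)
total = sum(word_counts.values())  # total number of counted tokens

# Relative frequency of each word: its count divided by the total.
relative = {word: count / total for word, count in word_counts.items()}

print(word_counts.most_common(2))  # [('example', 2), ('sentence', 1)]
print(relative["example"])         # 0.4
```

Words with equal counts are returned by most_common in the order they were first seen.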

View the results:
Print the most common words and their counts:

```
for word, count in word_counts.most_common(10):  # or freq_dist.most_common(10)
    print(f"{word}: {count}")
```

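This is a basic workflow; adjust the tokenization, cleaning, and counting steps as needed. For example, you may need a stop word list for a different language, or extra preprocessing steps such as stemming or lemmatization.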
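
One way to keep those steps adjustable is to wrap them in a single function with pluggable parts. The following is only a sketch under stated assumptions: top_words, simple_tokenize, and the tokenize/normalize parameters are hypothetical names, not part of NLTK, and the regex tokenizer is a crude stand-in for word_tokenize so the example runs with the standard library alone.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def simple_tokenize(text: str) -> List[str]:
    """Crude regex stand-in for nltk.tokenize.word_tokenize (assumption)."""
    return re.findall(r"[A-Za-z]+", text)


def top_words(
    text: str,
    stop_words: Iterable[str],
    n: int = 10,
    tokenize: Callable[[str], List[str]] = simple_tokenize,
    normalize: Callable[[str], str] = lambda word: word,
) -> List[Tuple[str, int]]:
    """Tokenize, clean, optionally normalize, and count the words in text."""
    stop_set = set(stop_words)
    counts: Counter = Counter()
    for token in tokenize(text):
        if not token.isalpha():
            continue  # skip punctuation and non-alphabetic tokens
        lowered = token.lower()
        if lowered in stop_set:
            continue  # stop words are checked before normalization
        counts[normalize(lowered)] += 1
    return counts.most_common(n)


# Hypothetical sample text and stop list.
result = top_words("The cat saw the other cat.", {"the"}, n=2)
print(result)  # [('cat', 2), ('saw', 1)]
```

With NLTK installed and its data downloaded, passing tokenize=word_tokenize and normalize=PorterStemmer().stem (from nltk.stem) should plug the real tokenizer and a stemmer into the same function.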
