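Datasets provides two kinds of dataset objects: Dataset and ✨ IterableDataset ✨.
- Dataset gives fast random access to rows and supports memory mapping, so even large datasets can be loaded with relatively little memory.
- IterableDataset is suited to very large datasets, including ones that cannot be fully downloaded to disk or loaded into memory. It lets you start accessing and using the data before the dataset has finished downloading.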
0 Loading data
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})
1 Dataset
1.1 Indexing
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .','label': 1}
{'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .','label': 0}
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
1.2 Slicing
{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .','the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .','effective but too-tepid biopic'],'label': [1, 1, 1]}
2 IterableDataset
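Setting streaming=True makes load_dataset return an IterableDataset.
An IterableDataset behaves differently from a Dataset:
- It does not support random access.
- Elements can only be obtained one at a time by iterating, for example with a for loop.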
from datasets import load_dataset

iter_dataset = load_dataset("rotten_tomatoes", split="train", streaming=True)
IterableDataset({
    features: ['text', 'label'],
    n_shards: 1
})
for i in iter_dataset:
    print(i)
    break
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
2.1 Creating an IterableDataset from an existing Dataset
for i in iter_dataset2:
    print(i)
    break
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
2.2 Getting a specified number of examples
[{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .','label': 1},{'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .','label': 1},{'text': 'effective but too-tepid biopic', 'label': 1}]