python - Ingesting data once in Python

Question

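I have a DataFrame in Python that holds all the data I use for binary classification. I ingest the data in two passes: first all the data for one class, then all the data for the other. I then randomize the rows. The problem is that every time I rerun the script, the DataFrame is rebuilt and its rows are reshuffled, so the results are not reproducible.

Should I run the DataFrame creation and randomization from an external file? Is there a common practice for data ingestion when building models?

I haven't tried anything along these lines yet. I'd also like to know whether doing this makes sense from a statistical point of view or by convention. I would try the following: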

import data_ingest
data_ingest.function_data_call()

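However, every time I run the script, it would also call the external script that builds and randomizes the data, so this is not the solution I'm looking for.

I can't really show an example; I'm loading documents (text files) for binary document classification. The DataFrame is structured as follows: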

row|           content         | class
--------------------------------------
1  | the sky is blue           | 0
2  | the river runs deep purple| 0
3  | yellow fever              | 0
4  | red strawberries          | 1
5  | black orchids are nice    | 1

Ingestion code:

import io
import os
import numpy as np
import pandas as pd

# Read every non-hidden file for class "1" and collapse its whitespace
data1 = []
for f in [f for f in os.listdir(path1) if not f.startswith('.')]:
   with io.open(path1+f, "r", encoding="utf-8") as myfile:
     # data1.append(myfile.read().rstrip().replace('-', '').replace('.', '').replace('\n', ''))
     tmp1 = myfile.read().rstrip().replace('-', '').replace('\n', '')
     data1.append(" ".join(tmp1.split()))

df1 = pd.DataFrame(data1, columns=["content"])
df1["class"] = "1"

# Read every non-hidden file for class "0" and collapse its whitespace
data2 = []
for f in [f for f in os.listdir(path2) if not f.startswith('.')]:
   with io.open(path2+f, "r", encoding="utf-8") as myfile:
     # data2.append(myfile.read().rstrip().replace('-', '').replace('.', '').replace('\n', '').replace(' ', ''))
     tmp2 = myfile.read().rstrip().replace('-', '').replace('\n', '')
     data2.append(" ".join(tmp2.split()))

df2 = pd.DataFrame(data2, columns=["content"])
df2["class"] = "0"

# Concatenate the two DataFrames into one and re-index
emails = pd.concat([df1,df2], ignore_index=True)

# Randomize rows
emails = emails.reindex(np.random.permutation(emails.index))

score 1 · Accepted Answer

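If you want to reproduce the same result after (pseudo)randomization, you can set a random seed. Every time you use the same seed, you get the same sequence of random numbers.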
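As a minimal sketch of the seeding idea, the shuffle from the question can draw its permutation from a seeded generator. The helper name `shuffle_rows`, the seed value 42, and the small sample DataFrame are assumptions made for illustration; they are not part of the original code.

```python
import numpy as np
import pandas as pd

SEED = 42  # arbitrary value, chosen only for illustration


def shuffle_rows(df, seed=SEED):
    """Return df with its rows permuted by a generator seeded with `seed`."""
    rng = np.random.RandomState(seed)
    return df.reindex(rng.permutation(df.index))


# Small stand-in for the `emails` DataFrame built in the question
emails = pd.DataFrame({
    "content": ["the sky is blue", "the river runs deep purple",
                "yellow fever", "red strawberries", "black orchids are nice"],
    "class": ["0", "0", "0", "1", "1"],
})

shuffled_a = shuffle_rows(emails)
shuffled_b = shuffle_rows(emails)  # same seed, so same row order
```

Note that `os.listdir` returns entries in arbitrary order, so a seed only gives the same final row order if the files are also read in a fixed order, for example by wrapping the listing in `sorted(...)`.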

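Second, you can save the intermediate result to a file, such as JSON or pickle. You can check whether the file already exists and recreate it only if it does not.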
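The check-then-load pattern could look roughly like the sketch below, which uses pandas' pickle support. The names `load_or_build` and `build_emails`, the file name `emails.pkl`, and the temporary directory are hypothetical; `build_emails` stands in for the ingestion and shuffling code from the question.

```python
import os
import tempfile

import pandas as pd


def load_or_build(cache_path, build_fn):
    """Load the DataFrame from cache_path if it exists, else build and save it."""
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    df = build_fn()
    df.to_pickle(cache_path)
    return df


def build_emails():
    # Hypothetical stand-in for the file ingestion in the question
    df = pd.DataFrame({
        "content": ["the sky is blue", "yellow fever", "red strawberries"],
        "class": ["0", "0", "1"],
    })
    return df.sample(frac=1)  # unseeded shuffle, differs between builds


# A temporary directory keeps the sketch self-contained
cache_path = os.path.join(tempfile.mkdtemp(), "emails.pkl")

first = load_or_build(cache_path, build_emails)   # builds and writes the cache
second = load_or_build(cache_path, build_emails)  # reads the cached copy
```

With this approach the shuffle happens once; later runs reuse the stored row order until the cache file is deleted.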
