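I'm writing unit tests for one of my data-processing projects. I have some scripts that take CSVs, concatenate them with Pandas, and then randomly sample them to make train/dev/test sets for a machine learning task.
I'm writing unit tests that generate some random CSV data to test against. But how do I create reference data for what the script under test is supposed to return?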
# Example of my test setup:
import pytest


@pytest.fixture
def create_reference_input_data():
    # Create some random CSV strings and write them out as test input CSVs.
    ...


@pytest.fixture
def create_reference_output_data(create_reference_input_data):
    # Create some fake output data from the data made in create_reference_input_data().
    # This output should look like what I expect from the script I am testing.
    # I will assert it against what the script actually produces.
    return reference_train_df, reference_test_df, reference_dev_df


def test_collect_data(create_reference_output_data):
    # Run the script under test. It concatenates CSVs like the ones made in the
    # create_reference_input_data() fixture and randomly samples them into
    # train/test/dev CSV data.
    test_data = collect_data(input_path, output_path, test_split=0.10, dev_split=0.20)
    for file1_row, file2_row in zip(create_reference_output_data, test_data):
        # Assert that the lines of text are the same in reference and test.
        assert file1_row == file2_row
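Hopefully this pseudocode makes sense. I know about seeding and the like. But how can I manually create some test data for what my script should generate, and assert that it matches what is actually generated when I call that script?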
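
To make the seeding part concrete, here is a minimal sketch of a seeded split that is checked by properties rather than against a hand-written reference file. `split_rows` and `check_split` are hypothetical stand-ins, not the real `collect_data`, and the sketch assumes the script can be given a seed. It uses the standard library's `random` module instead of Pandas only to stay self-contained.

```python
import random


def split_rows(rows, test_split, dev_split, seed):
    """Hypothetical stand-in for the script under test: shuffle, then slice."""
    shuffled = list(rows)
    # A private Random instance makes the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_split)
    n_dev = round(len(shuffled) * dev_split)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, test, dev


def check_split(rows, test_split=0.10, dev_split=0.20, seed=0):
    """Hypothetical helper: assert properties that hold for any valid split."""
    train, test, dev = split_rows(rows, test_split, dev_split, seed)
    # Property 1: the same seed reproduces the same split.
    assert (train, test, dev) == split_rows(rows, test_split, dev_split, seed)
    # Property 2: split sizes follow the requested fractions.
    assert len(test) == round(len(rows) * test_split)
    assert len(dev) == round(len(rows) * dev_split)
    # Property 3: every input row lands in exactly one split.
    assert sorted(train + test + dev) == sorted(rows)
    return train, test, dev


# Made-up rows that stand in for lines of the concatenated CSVs.
rows = [f"id{i},value{i}" for i in range(100)]
train, test, dev = check_split(rows)
```

With Pandas, the analogous reproducible shuffle would be `DataFrame.sample(frac=1, random_state=seed)`. Checks like these cover determinism, split sizes, and that no row is lost or duplicated, but they still do not say which rows should end up in which split.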