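I'm currently working on a project in which I generate pandas DataFrames as the results of an analysis. I'm developing in Django and would like to use a "data" field in a "Result" model to store the pandas DataFrame.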

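It looks like HDF5 (HDFStore) is the most efficient way to store my pandas DataFrames. However, I don't know how to create a custom field in my model to save it. I'll show simplified views.py and models.py below to illustrate.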

models.py

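class Result(models.Model):
    # on_delete is required since Django 2.0; CASCADE was the earlier implicit default
    scenario = models.ForeignKey(Scenario, on_delete=models.CASCADE)

    # How do I store an HDFStore? (HDF5Field is a placeholder, not a real Django field)
    data = models.HDF5Field()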

views.py

class AnalysisAPI(View):
    model = Result    

    def get(self, request):
        request_dict = request.GET.dict()
        scenario_id = request_dict['scenario_id']
        scenario = Scenario.objects.get(pk=scenario_id)
        result = self.model.objects.get(scenario=scenario)
        analysis_results_df = result.data['analysis_results_df']

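        # to_json() already returns a JSON string, so send it as-is
        # instead of passing it through JsonResponse, which expects a dict
        return HttpResponse(
            analysis_results_df.to_json(orient="records"),
            content_type="application/json",
        )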

    def post(self, request):
        request_dict = request.POST.dict()
        scenario_id = request_dict['scenario_id']
        scenario = Scenario.objects.get(pk=scenario_id)
        record_list = request_dict['record_list']

        analysis_results_df = run_analysis(record_list)
        data = HDFStore('store.h5')
        data['analysis_results_df'] = analysis_results_df         

        new_result = self.model(scenario=scenario, data=data)
        new_result.save()

        return JsonResponse(
            dict(status="OK", message="Analysis results saved.")
        )

I'd appreciate any help. I'm also open to another storage method, such as Pickle, as long as I can use it with Django and it has similar performance.

2 Answers

You can create a custom model field that saves the data to a file in storage and saves the relative file path to the database.

Here is how you could subclass models.CharField in your app's fields.py:

import os

from django.core.exceptions import ValidationError
from django.core.files.storage import default_storage
from django.db import models
from django.utils.translation import gettext_lazy as _
from pandas import DataFrame, read_hdf

class DataFrameField(models.CharField):
    """
    custom field to save Pandas DataFrame to the hdf5 file format
    as advised in the official pandas documentation:
    http://pandas.pydata.org/pandas-docs/stable/io.html#io-perf
    """

    attr_class = DataFrame

    default_error_messages = {
        "invalid": _("Please provide a DataFrame object"),
    }

    def __init__(
        self,
        verbose_name=None,
        name=None,
        upload_to="data",
        storage=None,
        unique_fields=None,
        **kwargs
    ):

        self.storage = storage or default_storage
        self.upload_to = upload_to
        # avoid sharing one mutable default list between field instances
        self.unique_fields = list(unique_fields or [])

        kwargs.setdefault("max_length", 100)
        super().__init__(verbose_name, name, **kwargs)

    def deconstruct(self):
        name, path, args, kwargs = super().deconstruct()
        if kwargs.get("max_length") == 100:
            del kwargs["max_length"]
        if self.upload_to != "data":
            kwargs["upload_to"] = self.upload_to
        if self.storage is not default_storage:
            kwargs["storage"] = self.storage
        kwargs["unique_fields"] = self.unique_fields
        return name, path, args, kwargs

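The __init__ and deconstruct methods are heavily inspired by Django's original FileField. There is an additional unique_fields parameter that is useful for creating predictable, unique file names.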

    def from_db_value(self, value, expression, connection):
        """
        return a DataFrame object from the filepath saved in DB
        """
        if value is None:
            return value

        return self.retrieve_dataframe(value)

    def get_absolute_path(self, value):
        """
        return absolute path based on the value saved in the Database.
        """
        return self.storage.path(value)

    def retrieve_dataframe(self, value):
        """
        return the pandas DataFrame and add filepath as property to Dataframe
        """

        # read dataframe from storage
        absolute_filepath = self.get_absolute_path(value)
        dataframe = read_hdf(absolute_filepath)

        # add relative filepath as instance property for later use
        dataframe.filepath = value

        return dataframe

With the from_db_value method, you load the DataFrame from storage into memory, based on the file path saved in the database.

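When retrieving the DataFrame, you also add the file path to it as an instance attribute, so that the value can be used when saving the DataFrame back to the database.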
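
One caveat with attaching the path as a plain attribute: pandas does not carry ad hoc attributes over to derived objects, so a copy of the DataFrame no longer has filepath. A minimal sketch of that behaviour follows, assuming pandas 1.0 or later. The path string is made up, and DataFrame.attrs (which pandas documents as experimental) is shown only as a possible alternative, not as what the field above does.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]})

# Plain attribute, set the same way as in retrieve_dataframe (hypothetical path).
df.filepath = "data/result_data_example.h5"

# DataFrame.attrs is a dict for metadata that pandas propagates to copies.
df.attrs["filepath"] = "data/result_data_example.h5"

copied = df.copy()

# The plain attribute lives only on the original object.
has_plain_attribute = hasattr(copied, "filepath")

# The attrs entry is still present on the copy.
attrs_filepath = copied.attrs.get("filepath")

print(has_plain_attribute)
print(attrs_filepath)
```

This is mostly harmless here, because save_dataframe_to_file falls back to generate_filepath when no path is found, and that yields the same path as long as unique_fields identify the instance.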

    def pre_save(self, model_instance, add):
        """
        save the dataframe field to an hdf5 file before saving the model
        """
        dataframe = super().pre_save(model_instance, add)

        if dataframe is None:
            return dataframe

        if not isinstance(dataframe, DataFrame):
            raise ValidationError(
                self.error_messages["invalid"], code="invalid",
            )

        self.save_dataframe_to_file(dataframe, model_instance)

        return dataframe

    def get_prep_value(self, value):
        """
        save the value of the dataframe.filepath set in pre_save
        """
        if value is None:
            return value

        # save only the filepath to the database
        if value.filepath:
            return value.filepath

    def save_dataframe_to_file(self, dataframe, model_instance):
        """
        write the Dataframe into an hdf5 file in storage at filepath
        """
        # reuse the filepath attached when loading from the database, if any;
        # DataFrame.get() would look up a column, so read the attribute instead
        if not getattr(dataframe, "filepath", None):
            dataframe.filepath = self.generate_filepath(model_instance)

        full_filepath = self.storage.path(dataframe.filepath)

        # Create any intermediate directories that do not exist.
        # shamelessly copied from Django's original Storage class
        directory = os.path.dirname(full_filepath)
        if not os.path.exists(directory):
            try:
                if self.storage.directory_permissions_mode is not None:
                    # os.makedirs applies the global umask, so we reset it,
                    # for consistency with file_permissions_mode behavior.
                    old_umask = os.umask(0)
                    try:
                        os.makedirs(directory, self.storage.directory_permissions_mode)
                    finally:
                        os.umask(old_umask)
                else:
                    os.makedirs(directory)
            except FileExistsError:
                # There's a race between os.path.exists() and os.makedirs().
                # If os.makedirs() fails with FileExistsError, the directory
                # was created concurrently.
                pass
        if not os.path.isdir(directory):
            raise IOError("%s exists and is not a directory." % directory)

        # save to storage
        dataframe.to_hdf(full_filepath, key="df", mode="w", format="fixed")

    def generate_filepath(self, instance):
        """
        return a filepath based on the model's class name, dataframe_field and unique fields
        """

        # create filename based on instance and field name
        class_name = instance.__class__.__name__

        # generate unique id from unique fields:
        unique_id_values = []
        for field in self.unique_fields:
            unique_field_value = getattr(instance, field)

            # get field value or id if the field value is a related model instance
            unique_id_values.append(
                str(getattr(unique_field_value, "id", unique_field_value))
            )

        # filename, for example: route_data_<uuid>.h5
        filename = "{class_name}_{field_name}_{unique_id}.h5".format(
            class_name=class_name.lower(),
            field_name=self.name,
            unique_id="".join(unique_id_values),
        )

        # generate filepath
        dirname = self.upload_to
        filepath = os.path.join(dirname, filename)
        return self.storage.generate_filename(filepath)

The DataFrame is saved to an hdf5 file in the pre_save method, and the file path is saved to the database in get_prep_value.
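
The directory-creation part of save_dataframe_to_file can probably be shortened on Python 3, because os.makedirs accepts exist_ok=True and then tolerates a concurrently created directory by itself. The sketch below is one way to write it. ensure_directory is a hypothetical helper name, and its mode argument stands in for storage.directory_permissions_mode.

```python
import os
import tempfile


def ensure_directory(directory, mode=None):
    """Create directory and any missing parents (hypothetical helper)."""
    if mode is None:
        os.makedirs(directory, exist_ok=True)
    else:
        # os.makedirs applies the process umask, so clear it temporarily
        old_umask = os.umask(0)
        try:
            os.makedirs(directory, mode, exist_ok=True)
        finally:
            os.umask(old_umask)


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "data", "nested")
    ensure_directory(target)
    ensure_directory(target)  # calling it again does not raise
    created = os.path.isdir(target)

    # A regular file in the way still raises FileExistsError.
    blocker = os.path.join(root, "blocker")
    open(blocker, "w").close()
    try:
        ensure_directory(blocker)
        file_conflict_detected = False
    except FileExistsError:
        file_conflict_detected = True

print(created, file_conflict_detected)
```

The explicit isdir check from the original code becomes unnecessary in this variant, since a non-directory at the target path surfaces as an exception.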

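In my case, it helped to use a uuid model field to build the unique file name, because for new model instances the pk is not yet available in the pre_save method, whereas the uuid value is.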
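
The point about pk versus uuid can be illustrated without Django. The sketch below mirrors the naming logic of generate_filepath under simplifying assumptions: UnsavedInstance is a hypothetical stand-in for a model instance that has not been saved yet, and build_filepath is an illustrative helper, not part of the field above.

```python
import posixpath
from uuid import uuid4


class UnsavedInstance:
    """Hypothetical stand-in for a model instance before its first save."""

    def __init__(self):
        self.pk = None       # an auto-incrementing pk is assigned by the database on save
        self.uuid = uuid4()  # a uuid default is generated in Python at instantiation


def build_filepath(instance, field_name, unique_fields, upload_to="data"):
    """Build <upload_to>/<class>_<field>_<unique id>.h5 from the given fields."""
    class_name = instance.__class__.__name__.lower()
    unique_id = "".join(str(getattr(instance, name)) for name in unique_fields)
    filename = "{}_{}_{}.h5".format(class_name, field_name, unique_id)
    return posixpath.join(upload_to, filename)


instance = UnsavedInstance()

# With pk, every unsaved instance would map to the same "..._None.h5" file.
path_from_pk = build_filepath(instance, "data", ["pk"])

# With uuid, the name is unique and stable for this instance.
path_from_uuid = build_filepath(instance, "data", ["uuid"])

print(path_from_pk)
print(path_from_uuid)
```

Because the uuid does not change after the first save, the same file name is produced on later saves, so the file is overwritten rather than duplicated.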

You can then use this field in your models.py:

from .fields import DataFrameField

# track data as a pandas DataFrame
data = DataFrameField(null=True, upload_to="data", unique_fields=["uuid"])

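Note that you cannot use this field in the Django admin or in model forms. That would require additional work on a custom form widget to edit the DataFrame content in the front end, probably as a table.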

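Also note that for tests, I had to override the MEDIA_ROOT setting with a temporary directory created with tempfile, to prevent useless files from being created in the actual media folder.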
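
The test setup described above might look roughly like the following. To keep the sketch runnable without a Django project, settings is a hypothetical stand-in object and save_payload imitates the field's file write. In a real project, django.test.override_settings(MEDIA_ROOT=...) would take the place of mock.patch.object.

```python
import os
import tempfile
import unittest
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for django.conf.settings.
settings = SimpleNamespace(MEDIA_ROOT="/srv/app/media")


def save_payload(name, payload):
    """Stand-in for the field's file write: store bytes under MEDIA_ROOT."""
    path = os.path.join(settings.MEDIA_ROOT, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(payload)
    return path


class SavePayloadTests(unittest.TestCase):
    def setUp(self):
        # Fresh directory per test, deleted again during cleanup.
        tmp = tempfile.TemporaryDirectory()
        self.addCleanup(tmp.cleanup)
        # Point MEDIA_ROOT at the temporary directory for the duration of the test.
        patcher = mock.patch.object(settings, "MEDIA_ROOT", tmp.name)
        patcher.start()
        self.addCleanup(patcher.stop)

    def test_file_lands_in_temporary_media_root(self):
        path = save_payload("data/result.h5", b"payload")
        self.assertTrue(path.startswith(settings.MEDIA_ROOT))
        self.assertTrue(os.path.isfile(path))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SavePayloadTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

After the run, MEDIA_ROOT is back to its original value and the temporary directory is gone, so nothing is left behind in the real media folder.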

answered 2020-01-01T16:20:11.357

It's not HDF5, but take a look at picklefield:

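from django.db import models
from picklefield.fields import PickledObjectField

class Result(models.Model):
    # on_delete is required since Django 2.0; CASCADE was the earlier implicit default
    scenario = models.ForeignKey(Scenario, on_delete=models.CASCADE)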

    data = PickledObjectField(blank=True, null=True)

https://pypi.python.org/pypi/django-picklefield
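
For context, a pickle-based field boils down to serializing the Python object to bytes and storing them in a database-friendly form. The sketch below shows only that general idea with the standard library. The helper names are hypothetical and are not django-picklefield's API, and a list of dicts stands in for the DataFrame. Keep in mind that pickle should only be used to load data you trust.

```python
import base64
import pickle


def encode_for_text_column(obj):
    """Pickle obj and return ASCII text that fits in a text column."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return base64.b64encode(raw).decode("ascii")


def decode_from_text_column(text):
    """Reverse encode_for_text_column; only call this on trusted data."""
    return pickle.loads(base64.b64decode(text))


# Stand-in for analysis results; a DataFrame can be pickled the same way.
records = [
    {"scenario": 1, "score": 0.25},
    {"scenario": 2, "score": 0.75},
]

stored = encode_for_text_column(records)
restored = decode_from_text_column(stored)

print(restored == records)
```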

answered 2016-02-11T19:30:34.927