
Context: I'm currently using Python to code a data-reduction pipeline for a large astronomical imaging system. The main pipeline class passes experimental data through a number of discrete processing 'stages'.

The stages are written in separate .py files which constitute a package. A list of available stages is generated at runtime so the user can choose which stages to run the data through. The aim of this approach is to allow the user to create additional stages in the future.

Issue: All of the pipeline configuration parameters and data structures are (currently) located within the main pipeline class. Is there a simple way to access these from within the stages which are imported at runtime?

My current best attempt seems 'wrong' and somewhat primitive, as it uses circular imports and class variables. Is there perhaps a way for a pipeline instance to pass a reference to itself as an argument to each of the stages it calls?

This is my first time coding a large python project and my lack of design knowledge is really showing.

Any help would be greatly appreciated.


4 Answers


I've built a similar system; it's called collective.transmogrifier. One of these days I'll make it more generic (it is currently tied to the CMF, one of the underpinnings of Plone).

Decoupling

What you need is a way to decouple the component registration for your pipeline. In Transmogrifier, I use the Zope Component Architecture (embodied in the zope.component package). The ZCA lets me register components that implement a given interface and later look up those components, either as a sequence or by name. There are other ways of doing this too; for example, Python eggs have the concept of entry points.

The point is that each component in the pipeline is referable by a text-only name, de-referenced at construction time. Third-party components can be slotted in for reuse by registering their own components independently of your pipeline package.
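
As a rough illustration of the idea (this is not Transmogrifier's or the ZCA's actual API; the registry and the stage name below are invented), a pipeline can resolve text-only names into stage classes through a small registry:

    # Minimal sketch of name-based stage lookup; names are illustrative only.
    STAGE_REGISTRY = {}

    def register_stage(name):
        """Class decorator that records a stage class under a text-only name."""
        def decorator(cls):
            STAGE_REGISTRY[name] = cls
            return cls
        return decorator

    @register_stage("bias_subtract")
    class BiasSubtract:
        def __call__(self, item):
            return item  # stand-in for real processing

    def build_pipeline(stage_names):
        # Names are de-referenced only at construction time.
        return [STAGE_REGISTRY[name]() for name in stage_names]

    pipeline = build_pipeline(["bias_subtract"])

Third-party packages then only need to import the registry (or register an entry point) to make their stages available to your pipeline.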

Configuration

Transmogrifier pipelines are configured using a textual format based on the Python ConfigParser module, where different components of the pipeline are named, configured, and slotted together. When the pipeline is constructed, each section is thus given a configuration object. Sections don't have to look up configuration centrally; they are configured on instantiation.
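
A rough sketch of that idea using the standard-library configparser (the section and option names here are invented and this is not Transmogrifier's format):

    from configparser import ConfigParser

    # Invented example configuration; in practice this would come from an
    # .ini file via config.read("pipeline.ini").
    config = ConfigParser()
    config.read_dict({
        "pipeline": {"stages": "bias_subtract flat_field"},
        "bias_subtract": {"master_bias": "calib/master_bias.fits"},
        "flat_field": {"master_flat": "calib/master_flat.fits"},
    })

    # Each stage is handed only its own section when it is instantiated,
    # so no stage has to look configuration up centrally.
    stage_names = config["pipeline"]["stages"].split()
    stage_configs = {name: dict(config[name]) for name in stage_names}
    print(stage_configs["bias_subtract"]["master_bias"])  # calib/master_bias.fits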

Central state

I also pass in a central 'transmogrifier' instance, which represents the pipeline. If any component needs to share per-pipeline state (such as caching a database connection for reuse between components), it can do so on that central instance. So in my case, each section does have a reference to the central pipeline.
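
A minimal sketch of that pattern (the class and attribute names here are mine, not Transmogrifier's):

    class Pipeline:
        """Central object that stages can use to share per-pipeline state."""
        def __init__(self):
            self.shared = {}

    def expensive_resource():
        return object()  # stand-in for e.g. opening a database connection

    class SomeStage:
        def __init__(self, pipeline, options=None):
            self.pipeline = pipeline   # each stage keeps a reference to the pipeline
            self.options = options or {}

        def run(self, item):
            # Cache the resource on the central instance so later stages can reuse it.
            if "resource" not in self.pipeline.shared:
                self.pipeline.shared["resource"] = expensive_resource()
            return item

    pipe = Pipeline()
    SomeStage(pipe).run({"frame": 1})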

Individual components and behaviour

Transmogrifier pipeline components are generators that consume elements from the preceding component in the pipeline and yield the results of their own processing. Components thus generally have a reference to the previous stage, but no knowledge of what consumes their output. I say 'generally' because some Transmogrifier pipeline elements can produce elements from an external source instead of consuming a previous element.
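
In plain Python, that structure might look something like this (a simplified sketch, not Transmogrifier code):

    def source():
        # First element: produces items from an external source.
        for i in range(3):
            yield {"id": i}

    def add_flag(previous):
        # Ordinary stage: consumes the previous stage, yields its own results.
        for item in previous:
            item["even"] = item["id"] % 2 == 0
            yield item

    def sink(previous):
        for item in previous:
            print(item)
            yield item

    # Chain the stages; iterating the last one drives the whole pipeline.
    for _ in sink(add_flag(source())):
        pass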

If you do need to alter the behaviour of a pipeline component based on the individual items to be processed, mark those items themselves with extra information for each component to discover. In Transmogrifier, items are dictionaries, and you can add extra keys named after a component, so that each component can look for this extra information and alter its behaviour as needed.
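
For example (the key-naming convention here is invented, not Transmogrifier's):

    def flat_field(previous):
        for item in previous:
            # Look for per-item instructions aimed at this particular stage.
            opts = item.get("_flat_field", {})
            if not opts.get("skip"):
                item["data"] = [x / 2.0 for x in item["data"]]  # stand-in for real work
            yield item

    items = [
        {"data": [1, 2, 3]},
        {"data": [4, 5, 6], "_flat_field": {"skip": True}},  # only this stage skips it
    ]
    print(list(flat_field(items)))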

Summary

  • Decouple your pipeline components by using an indirect lookup of elements based on a configuration.

  • When you instantiate your components, configure them at the same time and give them what they need to do their job. That could include a central object to keep track of pipeline-specific state.

  • When running your pipeline, only pass through items to process, and let each component base its behaviour on that individual item only.

Answered 2012-07-30T08:50:57.927

Ruffus is a Python library "designed to allow scientific and other analyses to be automated with the minimum of fuss and the least effort".

Positive: it allows incremental processing of data and lets you define very complicated sequences. Additionally, tasks are automatically parallelized. It lets you switch functions on and off, and their order is determined automatically from the patterns you specify.

Negative: it is sometimes too Pythonic for my taste, and it only switches and orders functions, not, for example, classes. But then of course you can put the code that initializes the classes inside each function.

For the purpose you want, you put the @active_if decorator above a function to enable or disable it in the pipeline. Whether it is going to be activated can be retrieved from an external configuration file, which you read with a ConfigParser.

In order to load the ConfigParser values, you have to write another Python module that initializes the ConfigParser instance. This module has to be imported in the first lines of the pipeline module.
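
Roughly, the two modules might look like this (an untested sketch spanning two files; the file names, the config section, and the task bodies are invented, and the Ruffus decorator arguments are simplified):

    # pipeline_config.py -- initializes the ConfigParser instance once.
    from configparser import ConfigParser

    config = ConfigParser()
    config.read("pipeline.ini")

    def stage_enabled(name):
        return config.getboolean("stages", name, fallback=False)


    # pipeline.py -- imports the config module at the top, then uses @active_if.
    from ruffus import originate, transform, suffix, active_if, pipeline_run
    from pipeline_config import stage_enabled

    @originate(["raw_001.txt"])
    def make_raw(output_file):
        open(output_file, "w").close()

    @active_if(stage_enabled("flat_field"))
    @transform(make_raw, suffix(".txt"), ".flat")
    def flat_field(input_file, output_file):
        open(output_file, "w").close()

    if __name__ == "__main__":
        pipeline_run()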

Answered 2013-05-31T14:07:13.017

Two options:

  1. Have the configuration somewhere else: have a config module, and use something like the Django settings system to make it available.
  2. Instead of having the stages import the pipeline class, pass them a pipeline instance on instantiation (see the sketch below).
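
A minimal sketch of option 2 (the class names are illustrative):

    class Stage:
        def __init__(self, pipeline):
            self.pipeline = pipeline   # gives access to the pipeline's config and data

        def run(self, data):
            scale = self.pipeline.config.get("scale", 1)
            return [x * scale for x in data]

    class Pipeline:
        def __init__(self, stage_classes, config):
            self.config = config
            # Hand each stage a reference to this pipeline instance; no stage
            # ever needs to import the pipeline module, so no circular imports.
            self.stages = [cls(self) for cls in stage_classes]

        def run(self, data):
            for stage in self.stages:
                data = stage.run(data)
            return data

    print(Pipeline([Stage], {"scale": 2}).run([1, 2, 3]))  # [2, 4, 6]
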
Answered 2012-07-29T21:04:12.867

A colleague of mine has worked on a similar pipeline for astrophysical synthetic emission maps from simulation data (svn checkout https://svn.gforge.hlrs.de/svn//opensesame).

The way he does this is:

The config lives in a separate object (actually a dictionary as in your case).

The stages either (see the sketch after this list):

  • receive the config object at instantiation, as a constructor argument,
  • get the config through assignment later on (e.g. stage.config = config_object), or
  • receive the config object as an argument when executed (e.g. stage.exec(config_object, other_params)).
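
A brief sketch of the three variants (the class and method names are invented for illustration):

    config_object = {"master_bias": "calib/master_bias.fits"}

    # 1. Config passed to the constructor.
    class StageA:
        def __init__(self, config):
            self.config = config
        def run(self, data):
            return data

    # 2. Config assigned after construction.
    class StageB:
        config = None
        def run(self, data):
            return data

    # 3. Config passed at execution time.
    class StageC:
        def exec(self, config, data):
            return data

    a = StageA(config_object)
    b = StageB()
    b.config = config_object
    c = StageC()
    c.exec(config_object, data=[])
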
Answered 2012-07-30T14:42:18.793