2

New to Neo4j but can see so many possibilities in graph databases, in particular IT data workflow and system impact. But unsure of the correct design for maximum efficiency.

Consider a system that takes in files, processes them, stores them in database and makes data available in various reports. However, depending on the file, the data may be in one report, but not the other.

System Architecture and Reality

An important use case is to be able to report the impact on downstream reports if upstream files are missing or components that process those files fail.

Test Cases

I have come up with 4 designs, 3 of which seem to work, but unsure which is best.

Design 1

Design 2

Design 3

Design 4

Would appreciate any help or advice on this.

Code used:

---------------------------------------------------------------------------
-- Design Experiments
---------------------------------------------------------------------------

// 1. Combination of the Workflows with shared nodes where they interact
      with same Process or DataStore
---------------------------------------------------------------------------

MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n, r

CREATE (p1:Provider {name: "Provider 1"})
CREATE (p2:Provider {name: "Provider 2"})
CREATE (f1:File {name: "File 1"})
CREATE (f2:File {name: "File 2"})
CREATE (f3:File {name: "File 3"})
CREATE (pp:PreProcess {name: "PreProcess"})
CREATE (p:Process {name: "Process"})
CREATE (d:DataStore {name: "DataStore"})
CREATE (rA:Report {name: "Report A"})
CREATE (rB:Report {name: "Report B"})
CREATE (p1)-[:PROVIDES{}]->(f1)
CREATE (p1)-[:PROVIDES{}]->(f2)
CREATE (p2)-[:PROVIDES{}]->(f3)
CREATE (f1)-[:DELIVERS_TO{}]->(pp)
CREATE (pp)-[:DELIVERS_TO{}]->(p)
CREATE (f2)-[:DELIVERS_TO{}]->(p)
CREATE (f3)-[:DELIVERS_TO{}]->(p)
CREATE (p)-[:DELIVERS_TO{}]->(d)
CREATE (d)-[:DELIVERS_TO{}]->(rA)
CREATE (d)-[:DELIVERS_TO{}]->(rB)

// Show impacted reports if Provider 1 is down
MATCH (a:Provider {name:"Provider 1"})-[r*]->(rp:Report) RETURN rp

// Show impacted reports if Provider 2 is down
MATCH (a:Provider {name:"Provider 2"})-[r*]->(rp:Report) RETURN rp


// 2. Same node relationship design as #1, but assign a workflow property
      to each node and relationship as a property array
---------------------------------------------------------------------------

MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n, r

CREATE (p1:Provider {name: "Provider 1", workflow: ["workflow1","workflow2"]})
CREATE (p2:Provider {name: "Provider 2", workflow: ["workflow3"]})
CREATE (f1:File {name: "File 1", workflow: ["workflow1"]})
CREATE (f2:File {name: "File 2", workflow: ["workflow2"]})
CREATE (f3:File {name: "File 3", workflow: ["workflow3"]})
CREATE (pp:PreProcess {name: "PreProcess", workflow: ["workflow1"]})
CREATE (p:Process {name: "Process", workflow: ["workflow1","workflow2","workflow3"]})
CREATE (d:DataStore {name: "DataStore", workflow: ["workflow1","workflow2","workflow3"]})
CREATE (rA:Report {name: "Report A", workflow: ["workflow1","workflow3"]})
CREATE (rB:Report {name: "Report B", workflow: ["workflow2"]})
CREATE (p1)-[:PROVIDES{workflow: ["workflow1"]}]->(f1)
CREATE (p1)-[:PROVIDES{workflow: ["workflow2"]}]->(f2)
CREATE (p2)-[:PROVIDES{workflow: ["workflow3"]}]->(f3)
CREATE (f1)-[:DELIVERS_TO{workflow: ["workflow1"]}]->(pp)
CREATE (pp)-[:DELIVERS_TO{workflow: ["workflow1"]}]->(p)
CREATE (f2)-[:DELIVERS_TO{workflow: ["workflow2"]}]->(p)
CREATE (f3)-[:DELIVERS_TO{workflow: ["workflow3"]}]->(p)
CREATE (p)-[:DELIVERS_TO{workflow: ["workflow1","workflow2","workflow3"]}]->(d)
CREATE (d)-[:DELIVERS_TO{workflow: ["workflow1","workflow3"]}]->(rA)
CREATE (d)-[:DELIVERS_TO{workflow: ["workflow2"]}]->(rB)

// Show individual workflows
MATCH (p) WHERE filter(x in p.workflow WHERE x = "workflow1") RETURN p
MATCH (p) WHERE filter(x in p.workflow WHERE x = "workflow2") RETURN p
MATCH (p) WHERE filter(x in p.workflow WHERE x = "workflow3") RETURN p

// Show impacted reports if Provider 1 is down
MATCH (a:Provider {name:"Provider 1"}) WITH a.workflow AS workflows 
MATCH (r:Report) WHERE filter(x in r.workflow WHERE x in workflows)
RETURN r

// Show impacted reports if Provider 2 is down
MATCH (a:Provider {name:"Provider 2"}) WITH a.workflow AS workflows 
MATCH (r:Report) WHERE filter(x in r.workflow WHERE x in workflows)
RETURN r


// 3. Same node relationship design as #1, but create a relationship
      with a workflow property for each workflow, resulting in multiple
      relatinships between nodes.
---------------------------------------------------------------------------

MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n, r

CREATE (p1:Provider {name: "Provider 1"})
CREATE (p2:Provider {name: "Provider 2"})
CREATE (f1:File {name: "File 1"})
CREATE (f2:File {name: "File 2"})
CREATE (f3:File {name: "File 3"})
CREATE (pp:PreProcess {name: "PreProcess"})
CREATE (p:Process {name: "Process"})
CREATE (d:DataStore {name: "DataStore"})
CREATE (rA:Report {name: "Report A"})
CREATE (rB:Report {name: "Report B"})
CREATE (p1)-[:PROVIDES{workflow: "workflow1"}]->(f1)
CREATE (p1)-[:PROVIDES{workflow: "workflow2"}]->(f2)
CREATE (p2)-[:PROVIDES{workflow: "workflow3"}]->(f3)
CREATE (f1)-[:DELIVERS_TO{workflow: "workflow1"}]->(pp)
CREATE (pp)-[:DELIVERS_TO{workflow: "workflow1"}]->(p)
CREATE (f2)-[:DELIVERS_TO{workflow: "workflow2"}]->(p)
CREATE (f3)-[:DELIVERS_TO{workflow: "workflow3"}]->(p)
CREATE (p)-[:DELIVERS_TO{workflow: "workflow1"}]->(d)
CREATE (p)-[:DELIVERS_TO{workflow: "workflow2"}]->(d)
CREATE (p)-[:DELIVERS_TO{workflow: "workflow3"}]->(d)
CREATE (d)-[:DELIVERS_TO{workflow: "workflow1"}]->(rA)
CREATE (d)-[:DELIVERS_TO{workflow: "workflow3"}]->(rA)
CREATE (d)-[:DELIVERS_TO{workflow: "workflow2"}]->(rB)


// Show impacted reports if Provider 1 is down
MATCH (a:Provider {name:"Provider 1"})-[j]->(n)-[r*]->(g)-[t]->(rp:Report) WHERE j.workflow=t.workflow RETURN rp

// Show impacted reports if Provider 2 is down
MATCH (a:Provider {name:"Provider 2"})-[j]->(n)-[r*]->(g)-[t]->(rp:Report) WHERE j.workflow=t.workflow RETURN rp


// 4. Distinct set of nodes and relationships for each workflow, but all
      with same node type so can still be matched
---------------------------------------------------------------------------

MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n, r

CREATE (p1:Provider {name: "Provider 1"})
CREATE (p2:Provider {name: "Provider 1"})
CREATE (p3:Provider {name: "Provider 2"})
CREATE (f1:File {name: "File 1"})
CREATE (f2:File {name: "File 2"})
CREATE (f3:File {name: "File 3"})
CREATE (pp1:PreProcess {name: "PreProcess"})
CREATE (pc1:Process {name: "Process"})
CREATE (pc2:Process {name: "Process"})
CREATE (pc3:Process {name: "Process"})
CREATE (d1:DataStore {name: "DataStore"})
CREATE (d2:DataStore {name: "DataStore"})
CREATE (d3:DataStore {name: "DataStore"})
CREATE (rA1:Report {name: "Report A"})
CREATE (rB2:Report {name: "Report B"})
CREATE (rA3:Report {name: "Report A"})
CREATE (p1)-[:PROVIDES{workflow: "workflow1"}]->(f1)
CREATE (p2)-[:PROVIDES{workflow: "workflow2"}]->(f2)
CREATE (p3)-[:PROVIDES{workflow: "workflow3"}]->(f3)
CREATE (f1)-[:DELIVERS_TO{workflow: "workflow1"}]->(pp1)
CREATE (pp1)-[:DELIVERS_TO{workflow: "workflow1"}]->(pc1)
CREATE (f2)-[:DELIVERS_TO{workflow: "workflow2"}]->(pc2)
CREATE (f3)-[:DELIVERS_TO{workflow: "workflow3"}]->(pc3)
CREATE (pc1)-[:DELIVERS_TO{workflow: "workflow1"}]->(d1)
CREATE (pc2)-[:DELIVERS_TO{workflow: "workflow2"}]->(d2)
CREATE (pc3)-[:DELIVERS_TO{workflow: "workflow3"}]->(d3)
CREATE (d1)-[:DELIVERS_TO{workflow: "workflow1"}]->(rA1)
CREATE (d2)-[:DELIVERS_TO{workflow: "workflow3"}]->(rB2)
CREATE (d3)-[:DELIVERS_TO{workflow: "workflow2"}]->(rA3)


// Show impacted reports if Provider 1 is down
MATCH (a:Provider {name:"Provider 1"})-[j*]->(rp:Report) RETURN rp

// Show impacted reports if Provider 2 is down
MATCH (a:Provider {name:"Provider 2"})-[j*]->(rp:Report) RETURN rp

Following recommendation, have expanded Design 1 to include a direct link between File and Report.

Design 1a

// 1a. Combination of the Workflows with shared nodes where they interact
   with same Process or DataStore. 
---------------------------------------------------------------------------

MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n, r

CREATE (p1:Provider {name: "Provider 1"})
CREATE (p2:Provider {name: "Provider 2"})
CREATE (f1:File {name: "File 1"})
CREATE (f2:File {name: "File 2"})
CREATE (f3:File {name: "File 3"})
CREATE (pp:PreProcess {name: "PreProcess"})
CREATE (p:Process {name: "Process"})
CREATE (d:DataStore {name: "DataStore"})
CREATE (rA:Report {name: "Report A"})
CREATE (rB:Report {name: "Report B"})
CREATE (p1)-[:PROVIDES{}]->(f1)
CREATE (p1)-[:PROVIDES{}]->(f2)
CREATE (p2)-[:PROVIDES{}]->(f3)
CREATE (f1)-[:DELIVERS_TO{}]->(pp)
CREATE (pp)-[:DELIVERS_TO{}]->(p)
CREATE (f2)-[:DELIVERS_TO{}]->(p)
CREATE (f3)-[:DELIVERS_TO{}]->(p)
CREATE (p)-[:DELIVERS_TO{}]->(d)
CREATE (d)-[:DELIVERS_TO{}]->(rA)
CREATE (d)-[:DELIVERS_TO{}]->(rB)
CREATE (f1)-[:USED_BY{}]->(rA)
CREATE (f2)-[:USED_BY{}]->(rB)
CREATE (f3)-[:USED_BY{}]->(rA)

// Show impacted reports (and path) if Provider 1 is down
MATCH path = (:Provider{name:'Provider 1'})-[:PROVIDES|USED_BY*]->(r:Report)
RETURN path, r.name AS report

// Show impacted reports (and path) if Provider 2 is down
MATCH path = (:Provider{name:'Provider 2'})-[:PROVIDES|USED_BY*]->(r:Report)
RETURN path, r.name AS report

4

1 回答 1

1

You've done some thorough exploration here, you've found designs for which your queries work. There is a cost to them, however.

Design 2 doesn't use relationships at all, so the solution doesn't seem very graphy. It also requires you to ensure the workflows lists on the relevant nodes are kept in sync and up to date. That seems to have a higher maintenance cost.

Design 3 has a similar cost, but now the properties are on the relationships, and you also have to provide redundant relationships throughout your model, so the cost is higher.

Design 4 requires redundancy of each used step in the process, where every subgraph is a single path from provider to report. While that is easy to understand and query over, redundant nodes and relationships probably aren't the way to go.

Design 1 is interesting in that it provides the correct answers but only to certain questions...questions about impacts from processors, preprocessors, and datastores in the path, what happens when these hardware and software components go down.

However it doesn't work for data lineage/dependence. Not yet. You may want to consider altering design 1 so that there are separate paths to consider for data dependence vs what you already have for the pipeline process.

Data dependence can be a different thing. If you're asking questions about this, then you're mostly concerned with the inputs and outputs, files to reports. In that case you might consider creating a :DEPENDS_ON relationship between the relevant files and reports nodes.

Consider adding this in to design 1's creation script at the end:

match (f:File), (r:Report{name:'Report A'})
where f.name in ['File 1', 'File 3']
create (r)<-[:USED_BY]-(f)

and

match (f:File), (r:Report{name:'Report B'})
where f.name in ['File 2']
create (r)<-[:USED_BY]-(f)

For questions about the data lineages, your queries can use only the relevant relationships, in this case :PROVIDES and :USED_BY.

match path = (:Provider{name:'Provider 1'})-[:PROVIDES|USED_BY*]->(r:Report)
return path, r.name as report

Or the inverse, what sources does a report draw upon?

match path = (p:Provider)-[:PROVIDES|USED_BY*]->(r:Report{name:'Report A')
return path, p.name as report

And if your model changes so that intermediary reports are modeled (the output of preprocess and process operations), then you can create :USED_BY relationships to those in a chain from the :File to the :Report (instead of directly between the :File and :Report) so you'll see the chain of dependencies during the processing.

于 2018-04-07T21:09:48.047 回答