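First of all, thank you for reading. I have an idea for solving a problem I'm facing, or better said, a requirement. The scenario below is hypothetical, but it represents the problem without giving away confidential information.

Imagine a service with a very large database of the people living in a country, say 100 million records.

Different companies use the service to query data about these people. Typical queries are:

  • Give me all Catholics between 25 and 30 years old who live in city x
  • Give me the people who have more than two children
  • Give me the people with an annual income between 100k and 200k
  • Give me the people who took more than 3 vacations last year, etc.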

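There is one big table called "People" (wow..) that holds all 100 million records and all the attributes needed, in different fields; let's say it has 40 columns (a mix of int, datetime, varchar, char, etc.). There is also a bunch of static tables (25) that the main table references to get descriptions or to resolve more complex attributes.

The company that owns the database offers and sells the ability to query it through a web service. The web service must receive the query, the client ID, the parameters, and so on, and return the result set "as fast as possible".

No problems so far.

The company that owns the database has an internal application, managed by the commercial team, that sets the rules defining each client's ability to access x or y data. The more data a client can access, the more they pay for the service.

Not all clients can see everything. Some clients can only query people from a specific city, others are allowed to see people in a certain age range, others people of a certain religion "and" a certain socioeconomic level, etc.

The company that owns the database can change these rules daily, as needed, allowing a client to see more or less. A change made today only takes effect tomorrow, from one day to the next; not on the same day, in real time, or anything like that.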

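So we have this one big database shared by all clients, all of them running all kinds of queries, but the system needs to filter the data dynamically at two levels:

FIRST - based on the business rules that define what each client can see

SECOND - based on the parameters of the query (the WHERE clause)

My question is how to implement the "FIRST" dynamic filter described above in an efficient way.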

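There are several options, just to name a few:

Option 1: Have a copy of the main table for each client, and every day truncate that table and insert the records again from the main table, checking the filter rules so that only the records the client can see are inserted. This option is good from a query-performance point of view but terrible in processing time, and it scales poorly as we get more clients: more tables, more inserts, and more processing time. And it's... ugly... I don't like it... :)

Option 2: Dynamically add the business filter rules to the WHERE clause of each request. It's good because it needs no big batch process, but I don't like it because the WHERE clause could get too long, which may cause problems: there is no limit on the number of filter rules the owner can define for a client, and the filters can be quite complex (e.g. company A can only access white people, born between 70 and 80, with blonde hair and between 1 and 2 cars in the family, "or" black people etc.. etc.. "or" mixed race that bla bla.. "or"). So.. you get the idea; I don't like it.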
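To make Option 2 concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table layout, the sample rule, and the `build_query` helper are all assumptions made for illustration, not part of the real system. What it shows is that the client's rule predicate and the request's own filter are each wrapped in parentheses before being AND-ed together, so an "or" branch inside the rule cannot bypass the request filter.

```python
import sqlite3

def build_query(rule_sql, rule_params, user_sql, user_params):
    """Combine a client's business-rule predicate with the request's own filter.

    Both predicates are parenthesized because AND binds tighter than OR;
    without the parentheses an OR branch in the rule would escape the filter.
    """
    sql = f"SELECT id FROM people WHERE ({rule_sql}) AND ({user_sql}) ORDER BY id"
    return sql, list(rule_params) + list(user_params)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, city TEXT, age INTEGER, children INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [(1, "X", 27, 3), (2, "X", 45, 0), (3, "Y", 29, 3), (4, "Y", 60, 4)],
)

# Hypothetical rule for one client: people from city X, OR anyone aged 50 or more.
rule_sql, rule_params = "city = ? OR age >= ?", ("X", 50)
# The request itself: people with more than two children.
user_sql, user_params = "children > ?", (2,)

sql, params = build_query(rule_sql, rule_params, user_sql, user_params)
visible_ids = [row[0] for row in conn.execute(sql, params)]
print(visible_ids)
```

The rule text here is assumed to come from the owner's trusted rule store, with only values bound as parameters. A long predicate is not necessarily a slow one, so it may be worth measuring before ruling this option out.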

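Option 3: What I'm thinking of is a table called Row-Client, or something like that, holding the RowID and the client that can access it. We would then need a batch process that populates that table according to the business rules (not a big process, since we only insert two values). Then, in every query, add a join to that table to get only the rows allowed for the client currently making the request.

It could look like this:

Row: 1 Client: 1
Row: 1 Client: 2

Or (although without referential integrity)

Row: 1 Clients: 1,2

We could even add the "Clients" column directly to the main table.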
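A minimal sketch of the one-row-per-pair variant of Option 3, again in Python with sqlite3. The names `row_client`, `rebuild_access`, and `count_high_income`, as well as the sample data and rules, are hypothetical. The batch step recomputes one client's allowed rows from its rule, and each request then joins against that table before any aggregation happens.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, city TEXT, income INTEGER);
-- One row per (person, client) pair that the client is allowed to see.
CREATE TABLE row_client (
    row_id    INTEGER NOT NULL REFERENCES people(id),
    client_id INTEGER NOT NULL,
    PRIMARY KEY (client_id, row_id)
);
""")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "X", 120000), (2, "X", 90000), (3, "Y", 150000)],
)

def rebuild_access(conn, client_id, rule_sql, rule_params=()):
    """Nightly batch step: recompute the rows one client may see.

    rule_sql must come from the trusted rule store, never from a request.
    """
    with conn:  # one transaction: the old grants are replaced atomically
        conn.execute("DELETE FROM row_client WHERE client_id = ?", (client_id,))
        conn.execute(
            "INSERT INTO row_client (row_id, client_id) "
            f"SELECT id, ? FROM people WHERE {rule_sql}",
            (client_id, *rule_params),
        )

def count_high_income(conn, client_id):
    """Example request: aggregate only over rows granted to this client."""
    row = conn.execute(
        "SELECT COUNT(*) FROM people p "
        "JOIN row_client rc ON rc.row_id = p.id "
        "WHERE rc.client_id = ? AND p.income >= ?",
        (client_id, 100000),
    ).fetchone()
    return row[0]

rebuild_access(conn, 1, "city = ?", ("X",))  # client 1 may see city X only
rebuild_access(conn, 2, "1 = 1")             # client 2 may see everything
print(count_high_income(conn, 1), count_high_income(conn, 2))
```

Putting `client_id` first in the primary key lets the join seek directly to one client's rows. In the worst case the table holds one row per person per client, so its size is worth estimating before committing to this design.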

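So my question is whether Option 3 is feasible (with 1050 clients querying 50 times a day during business hours), or whether you know of any other idea, or of an established method, approach, or technique for achieving this efficiently.

Of course, I'm open to your questions/ideas, and thank you for your help.

Note that removing data from the result set after the query has run is not possible, since grouping or aggregation has already been applied.

What I'm looking for is to minimize both batch processing time and query response time. Kind regards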

1 Answer

First of all, I have no actual experience doing this, so I'm being theoretical here. I would not, under any circumstances, repeat your people table anywhere for any reason. I would keep another client limitation rules table that would be used to filter the results of their queries.

So they can send queries for whatever they want, even data they haven't paid for, but before the results are returned, their query results pass through another process that limits their columns and/or rows by what they've paid for.

OK, given your row limitations on aggregate functions, I see you'd need a process with 2 (or more) steps. First, limit by row criteria, then execute their SQL, then limit by columns. Limiting by rows is the tricky part if you need to limit by anything that might get created in the future without schema changes. The easiest thing to do is limit (ha, ha) the row criteria they can limit by, then have one table with a bunch of columns named things like 'isLimitedByAge' and 'isLimitedByRace'.

Depending on your timeline, you might need to implement this in pieces, with a less sophisticated solution now, and a more dynamic one later, after you've learned more about what most clients are querying by and therefore, likely to be willing to pay for.

For a more specific example, let's say a client sends a query, 'select * from people'. The first step would be to query clientLimitRows to see what they've paid for, like people in a certain city, or people in a certain age range. That step builds the WHERE clause for the second step, which actually queries the people table and performs the aggregation. Then a third step checks clientLimitColumns and removes from the results any columns they haven't paid for.
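The three steps could be sketched roughly like this in Python with sqlite3. The two dictionaries stand in for the clientLimitRows and clientLimitColumns tables, and the sample data, the fixed aggregate query, and the `run_for_client` name are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, city TEXT, age INTEGER, income INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [(1, "X", 30, 100), (2, "X", 40, 300), (3, "Y", 50, 500)],
)

# Hypothetical entitlements, standing in for clientLimitRows / clientLimitColumns.
client_limit_rows = {7: ("city = ?", ("X",))}
client_limit_columns = {7: {"city", "avg_income"}}

def run_for_client(conn, client_id):
    # Step 1: the row limits become the WHERE clause, applied before aggregation.
    where_sql, params = client_limit_rows[client_id]
    # Step 2: the aggregate runs only over rows the client may see.
    cur = conn.execute(
        "SELECT city, AVG(age) AS avg_age, AVG(income) AS avg_income "
        f"FROM people WHERE {where_sql} GROUP BY city ORDER BY city",
        params,
    )
    names = [d[0] for d in cur.description]
    rows = cur.fetchall()
    # Step 3: drop result columns the client has not paid for.
    keep = [i for i, name in enumerate(names) if name in client_limit_columns[client_id]]
    return [names[i] for i in keep], [tuple(row[i] for i in keep) for row in rows]

columns, rows = run_for_client(conn, 7)
print(columns, rows)
```

Row limits have to be applied before the aggregate, as the question points out, while column limits can safely be applied to the finished result set.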

Again, just my opinion, but I think you're going to have to break your client rules down. If I had to model "Company A can only access White people, born between 70 and 80, with blonde hair and between 1 and 2 cars in the family, "or" black people etc.. etc.. "or" mixed race that bla bla.. "or"", I'd have tables for rules (one row per rule), conditions (one row per OR set), and clauses (one row per field/operator/value tuple, connected by ANDs).

So for this rule, where you're limiting by race, age, haircolor, and numCars, OR race and blah2 OR mixedRace and blah3 or blah4 you'd have 4 rows of conditions with one or more clauses.

For example:

rule = 1
    condition = 1
        clause1 = 'race = white'
        clause2 = 'age >= 70'
        clause3 = 'age <= 80'
        clause4 = 'haircolor = blonde'
        clause5 = 'numCars >= 1'
        clause6 = 'numCars <= 2'
    condition = 2
        clause1 = 'race = black'
        clause2 = 'field2 = blah2'
    condition = 3
        clause1 = 'race = mixedrace'
        clause2 = 'field3 = blah3'
    condition = 4
        clause1 = 'field4 = blah4'

The clause table has the fields customerID, ruleID, conditionID, clauseID, field, operator, and value.
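As a rough sketch of how rows from such a clause table might be compiled into a predicate, here is a small Python function. The `compile_rule` name, the whitelists, and the sample rows are hypothetical; each row is reduced to (conditionID, field, operator, value) for brevity. Clauses that share a conditionID are joined with AND, and the conditions are joined with OR.

```python
from collections import defaultdict

# Field and operator names are interpolated into the SQL text, so they are
# checked against whitelists. Values are always bound as parameters.
ALLOWED_FIELDS = {"race", "age", "haircolor", "numCars"}
ALLOWED_OPERATORS = {"=", "<>", "<", "<=", ">", ">="}

def compile_rule(clauses):
    """Turn clause rows (conditionID, field, operator, value) into (sql, params)."""
    by_condition = defaultdict(list)
    for condition_id, field, operator, value in clauses:
        if field not in ALLOWED_FIELDS or operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported clause: {field} {operator}")
        by_condition[condition_id].append((field, operator, value))

    parts, params = [], []
    for condition_id in sorted(by_condition):
        group = by_condition[condition_id]
        # AND within one condition, each condition in its own parentheses.
        parts.append("(" + " AND ".join(f"{f} {op} ?" for f, op, _ in group) + ")")
        params.extend(v for _, _, v in group)
    # OR across conditions.
    return " OR ".join(parts), params

clauses = [
    (1, "race", "=", "white"),
    (1, "age", ">=", 70),
    (1, "age", "<=", 80),
    (2, "race", "=", "black"),
    (2, "numCars", ">=", 1),
]
sql, params = compile_rule(clauses)
print(sql)
print(params)
```

The resulting predicate can be used either on the fly in each request or in an overnight job that marks the rows a client may access.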

OK, I'm not sure I 100% understand what you're doing with Option 3, but it sounds like you're either extending your people table with client markers or introducing a rowID/clientID table with a one-to-many relationship with people? Then you'd apply their paid-for rules in an overnight process, so the rows they can access are marked and can be limited with a join? I think it would work, but wouldn't today's queries then only reflect the access they had as of yesterday? If they pay for new data today, it won't be marked as paid until tomorrow. I'm not sure you're missing anything, really, but if someone else has a better response, more power to them.

OK, I think I see where you're going. You want to create a table overnight that extends the people table with a clientID field, so each clientID has their own set of queryable rows? And you can't do that on the fly? When they send a query, first create their set of rows, then apply their query to that set?

Answered 2012-11-16T15:38:35.657