
Edit: Here is a more complete set of code that shows exactly what is going on, for the benefit of the answers below.

libname output '/data/files/jeff';
%let DateStart = '01Jan2013'd;
%let DateEnd = '01Jun2013'd;
proc sql;
/* Step 1: distinct ids that have activity in the chosen categories */
CREATE TABLE output.id AS
  SELECT DISTINCT id
  FROM mydb.sale_volume AS sv
  WHERE sv.category IN ('a', 'b', 'c') AND
    sv.trans_date BETWEEN &DateStart AND &DateEnd
;
/* Step 2: total sales across all categories for those ids */
CREATE TABLE output.sums AS
  SELECT sv.id, SUM(sv.sales) AS total
  FROM mydb.sale_volume AS sv
  INNER JOIN output.id AS ids
    ON ids.id = sv.id
  WHERE sv.trans_date BETWEEN &DateStart AND &DateEnd
  GROUP BY sv.id
;
quit;

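The goal is simply to query a table for certain ids based on category membership. I then sum up the activity of those members across all categories.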

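The approach above is far slower than:

  1. Running the first query to get the subset
  2. Running a second query that sums for every ID
  3. Running a third query that inner joins the two result sets.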

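If I understand correctly, it may be more efficient to make sure that all of my code is passed through completely rather than cross-loading data between SAS and the database.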


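After I posted a question yesterday, a member suggested that I might benefit from asking a separate, more specific question about performance in my situation.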

I'm using SAS Enterprise Guide to write some programs/data queries. I don't have permission to modify the underlying data, which is stored in Teradata.

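My basic problem is writing efficient SQL queries in this environment. For example, I query a large table (tens of millions of records) for a small subset of IDs. Then I use this subset to query the larger table again: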

proc sql;
CREATE TABLE subset AS
  SELECT
    id
  FROM
    bigTable
  WHERE
    someValue = x AND
    date BETWEEN a AND b
;
quit;

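This completes in a few seconds and returns 90k IDs. Next, I want to query this set of IDs against the large table, and this is where the problems start. I want to sum the values for each ID over time: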

proc sql;
CREATE TABLE subset_data AS
  SELECT
    bigTable.id,
    SUM(bigTable.value) AS total
  FROM
    bigTable
  INNER JOIN subset
    ON subset.id = bigTable.id
  WHERE
    bigTable.date BETWEEN a AND b
  GROUP BY
    bigTable.id
;
quit;

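For whatever reason, this takes a very long time. The difference is that the first query filters on 'someValue'. The second looks at all activity, regardless of what is in 'someValue'. For example, I could flag every customer who orders a pizza. Then I would look at every purchase made by all customers who ordered pizza.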

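I'm not overly familiar with SAS, so I'm looking for any advice on how to do this more efficiently or speed things up. I'm open to any thoughts or suggestions, and please let me know if I can offer any more details. I guess I'm just surprised that the second query takes so long to process.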


5 Answers


The most critical thing to understand when using SAS to access data in Teradata (or any other external database for that matter) is that the SAS software prepares SQL and submits it to the database. The idea is to try and relieve you (the user) from all the database-specific details. SAS does this using a concept called "implicit pass-through", which just means that SAS does the translation from SAS code into DBMS code. Among the many things that occur is data type conversion: SAS has two (and only two) data types, numeric and character.

SAS deals with translating things for you, but it can be confusing. For example, I've seen "lazy" database tables defined with VARCHAR(400) columns having values that never exceed some smaller length (like a column for a person's name). In the database this isn't much of a problem, but since SAS does not have a VARCHAR data type, it creates a variable 400 characters wide for each row. Even with data set compression, this can really make the resulting SAS dataset unnecessarily large.

The alternative way is to use "explicit pass-through", where you write native queries using the actual syntax of the DBMS in question. These queries execute entirely on the DBMS and return results back to SAS (which still does the data type conversion for you). For example, here is a "pass-through" query that joins two tables and creates a SAS dataset as a result:

proc sql;
   connect to teradata (user=userid password=password mode=teradata);
   create table mydata as
   select * from connection to teradata (
      select a.customer_id
           , a.customer_name
           , b.last_payment_date
           , b.last_payment_amt
      from base.customers a
      join base.invoices b
      on a.customer_id=b.customer_id
      where b.bill_month = date '2013-07-01'
        and b.paid_flag = 'N'
      );
quit;

Notice that everything inside the pair of parentheses is native Teradata SQL and that the join operation itself is running inside the database.

The example code you have shown in your question is NOT a complete, working example of a SAS/Teradata program. To better assist, you need to show the real program, including any library references. For example, suppose your real program looks like this:

proc sql;
   CREATE TABLE subset_data AS
   SELECT bigTable.id,
          SUM(bigTable.value) AS total
   FROM   TDATA.bigTable bigTable
   JOIN   TDATA.subset subset
   ON     subset.id = bigTable.id
   WHERE  bigTable.date BETWEEN a AND b
   GROUP BY bigTable.id
   ;
quit;

That would indicate a previously assigned LIBNAME statement through which SAS was connecting to Teradata. The syntax of that WHERE clause would be very relevant to whether SAS is even able to pass the complete query to Teradata. (Your example doesn't show what "a" and "b" refer to.) It is very possible that the only way SAS can perform the join is to drag both tables back into a local work session and perform the join on your SAS server.
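One way to check whether the whole query really reaches Teradata is to have SAS write the SQL it generates to the log. The sketch below uses the SASTRACE system option for that. The TDATA libref and the table and column names come from the hypothetical program above, and the date literals stand in for the unspecified "a" and "b", so all of them are assumptions.

```sas
/* Write the SQL that SAS sends to the DBMS into the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc sql;
   /* TDATA is assumed to be a libref assigned with the Teradata engine */
   create table subset_data as
   select b.id,
          sum(b.value) as total
   from   TDATA.bigTable as b
   join   TDATA.subset as s
   on     s.id = b.id
   /* hypothetical dates standing in for "a" and "b" */
   where  b.date between '01Jan2013'd and '01Jun2013'd
   group by b.id
   ;
quit;

/* Turn tracing off again */
options sastrace=off;
```

If the log shows a single statement containing the join and the GROUP BY, the work was done in the database. If it shows separate plain SELECTs against each table, SAS is most likely pulling the rows back and joining them itself.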

One thing I can strongly suggest is that you try to convince your Teradata administrators to allow you to create "driver" tables in some utility database. The idea is that you would create a relatively small table inside Teradata containing the IDs you want to extract, then use that table to perform explicit joins. I'm sure you would need a bit more formal database training to do that (like how to define a proper index and how to "collect statistics"), but with that knowledge and ability, your work will just fly.
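As a rough sketch of the driver-table idea, the following uses a Teradata volatile table (one that exists only for the session) rather than a permanent table in a utility database, so it is a variant of the suggestion rather than the suggestion itself. The database name mydb, the table and column names (borrowed from the question's edit), the credentials, and the date literals are all assumptions.

```sas
proc sql;
   connect to teradata (user=userid password=password mode=teradata);

   /* Build the small driver table inside Teradata */
   execute (
      create volatile table driver_ids as (
         select distinct id
         from mydb.sale_volume
         where category in ('a', 'b', 'c')
           and trans_date between date '2013-01-01' and date '2013-06-01'
      ) with data
      unique primary index (id)
      on commit preserve rows
   ) by teradata;

   /* Join and aggregate in the database; only the totals come back to SAS */
   create table sums as
   select * from connection to teradata (
      select sv.id
           , sum(sv.sales) as total
      from mydb.sale_volume sv
      join driver_ids d
      on d.id = sv.id
      where sv.trans_date between date '2013-01-01' and date '2013-06-01'
      group by sv.id
      );

   disconnect from teradata;
quit;
```

Both statements run over the same connection, which is what keeps the volatile table visible to the second query. Whether this is allowed, and how much spool space it may use, depends on how the Teradata account is set up.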

I could go on and on but I'll stop here. I use SAS with Teradata extensively every day against what I'm told is one of the largest Teradata environments on the planet. I enjoy programming in both.

answered 2013-07-10T23:14:44.640

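If id is unique, you could add UNIQUE PRIMARY INDEX(id) to that table; otherwise it defaults to a non-unique PI. Knowing about uniqueness helps the optimizer produce a better plan.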

Without more information, such as an Explain (just put EXPLAIN in front of the SELECT), it's hard to say how this can be improved.
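From SAS, one way to get that Explain is to send the statement through explicit pass-through and let the plan text print as an ordinary result set. This is only a sketch: the credentials, the database name mydb, and the table and column names (borrowed from the question's edit) are assumptions, and it presumes the connection returns the plan text as rows.

```sas
proc sql;
   connect to teradata (user=userid password=password mode=teradata);

   /* The statement in parentheses is sent to Teradata unchanged */
   select * from connection to teradata (
      explain
      select sv.id
           , sum(sv.sales) as total
      from mydb.sale_volume sv
      where sv.trans_date between date '2013-01-01' and date '2013-06-01'
      group by sv.id
      );

   disconnect from teradata;
quit;
```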

answered 2013-07-10T18:33:25.537

If the ID is unique and is a single value, then you can try constructing a format.

Create a dataset that looks like this:

fmtname, start, label

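where fmtname is the same for all records and is a legal format name (begins and ends with a letter, contains alphanumerics or _); start is the ID value; and label is 1. Then add one row with the same value for fmtname, a blank start, a label of 0, and another variable, hlo='o' (for 'other'). Then import it into PROC FORMAT using the CNTLIN option, and you now have a 1/0 value conversion.

Here's a brief example using SASHELP.CLASS. The ID here is name, but it can be numeric or character, whichever suits your use.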

data for_fmt;
set sashelp.class;
retain fmtname '$IDF'; *Format name is up to you.  Should have $ if ID is character, no $ if numeric;
start=name; *this would be your ID variable - the look up;
label='1';
output;
if _n_ = 1 then do;
  hlo='o';
  call missing(start);
  label='0';
  output;
end;
run;
proc format cntlin=for_fmt;
quit;

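Now, instead of doing a join, you can do your query 'normally' but with an additional where clause of and put(id,$IDF.)='1'. This won't be optimized with an index or anything, but it may be faster than the join. (It also may not be faster, depending on how the SQL optimizer works.)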
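To make the lookup concrete, here is a small sketch that builds the format from a subset of SASHELP.CLASS and then uses it as a filter in place of a join. The age cutoff and the choice of name as the ID are illustrative assumptions only.

```sas
/* Build the control dataset from the "subset" of IDs (here: students aged 15+) */
data for_fmt;
   set sashelp.class(where=(age >= 15)) end=last;
   retain fmtname '$IDF';
   length start $8 label $1 hlo $1;
   start = name;      /* the ID to look up */
   label = '1';       /* IDs in the subset map to '1' */
   hlo   = ' ';
   output;
   if last then do;   /* one extra row: everything else maps to '0' */
      hlo = 'o';
      call missing(start);
      label = '0';
      output;
   end;
   keep fmtname start label hlo;
run;

proc format cntlin=for_fmt;
run;

/* Filter the full table with the format instead of joining to the subset */
proc sql;
   select name, age, height
   from sashelp.class
   where put(name, $IDF.) = '1';
quit;
```

Only the rows whose name appears in the control dataset pass the filter; any other name falls through to the 'other' row and gets '0'.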

answered 2013-07-10T16:46:47.557

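You imply that the 90k records in your first query are all unique ids. Is that definite?

I ask because the implication from your second query is that they're not unique.
- One id can have multiple values over time, and have different somevalues

If the ids are not unique in the first dataset, you need to GROUP BY id or use DISTINCT in the first query.

Imagine that the 90k rows consist of 30k unique ids, so there is an average of 3 rows per id.

Then imagine that each of those 30k unique ids actually has 9 records in your time window, including rows where somevalue <> x.

You will then get 3x9 records per id.

As those two numbers grow, the number of records in your second query grows multiplicatively.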
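The 3x9 effect can be reproduced with a toy example. The dataset names and values below are made up for illustration: one id appears 3 times in the subset and has 9 rows in the big table.

```sas
/* One id repeated 3 times in the (non-unique) subset */
data demo_subset;
   id = 1;
   do i = 1 to 3;
      output;
   end;
   drop i;
run;

/* The same id with 9 rows of activity, each worth 10 */
data demo_big;
   id = 1;
   do i = 1 to 9;
      value = 10;
      output;
   end;
   drop i;
run;

proc sql;
   /* Join to the non-unique subset: 3 x 9 = 27 joined rows, total = 270 */
   select b.id, count(*) as joined_rows, sum(b.value) as total
   from demo_big as b
   inner join demo_subset as s
     on s.id = b.id
   group by b.id;

   /* Join to de-duplicated ids: 9 joined rows, total = 90 */
   select b.id, count(*) as joined_rows, sum(b.value) as total
   from demo_big as b
   inner join (select distinct id from demo_subset) as s
     on s.id = b.id
   group by b.id;
quit;
```

Besides the extra work, the first form also triples the sum, so duplicate ids in the subset are a correctness problem as well as a performance one.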


Alternative Query

If that's not the problem, an alternative query (not ideal, but possible) would be...

SELECT
  bigTable.id,
  SUM(bigTable.value) AS total
FROM
  bigTable
WHERE
  bigTable.date BETWEEN a AND b
GROUP BY
  bigTable.id
HAVING
  MAX(CASE WHEN bigTable.somevalue = x THEN 1 ELSE 0 END) = 1
answered 2013-07-10T16:03:13.640

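One alternative solution is to use SAS procedures. I don't know what your actual SQL is doing, but if you're just doing frequencies (or something else that can be done in a PROC), you could do: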

proc sql;
create view blah as select ... (your join);
quit;

proc freq data=blah;
tables id/out=summary(rename=(count=total) keep=id count);
run;

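Or any number of other options (PROC MEANS, PROC TABULATE, etc.). This may be faster than doing the sum in SQL (depending on some details, such as how your data is organized, what you're actually doing, and how much memory you have available). If you create the view in the database, SAS might choose to do the work in the database, which might be faster. (In fact, it might be faster still to run the frequency directly off the base table and then join the results to the smaller table.)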

answered 2013-07-10T19:59:25.827