This is my first time using SQL. I am using PostgreSQL on 64-bit Windows 7.

I have a (large) .txt file of tweets that looks like this:

T   2009-06-07 02:07:41
U   http://twitter.com/cyberplumber
W   SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw

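As you can see, each of the three "columns" is introduced by a marker letter followed by a tab, T \t (likewise for U and W), rather than being separated by the traditional comma (,). I want to import the whole file into an SQL table with the columns date, user, and text_msg.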

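I guess I will have to parse it somehow. Any ideas on the simplest and most efficient way to get the data into the table? Please also bear in mind that the .txt files in question are quite large (> 4 GB), so I have no easy way of editing them by hand.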

2 Answers

Try the following:

First, create a suitable table in SQL, like this:

CREATE TABLE tweet(
 ts timestamp, -- if inserting the values as timestamps gives errors, change to TEXT
 url TEXT,  -- there are smarter UDTs for URLs available too
 message TEXT
);

Then go ahead and try a standard COPY statement, like this:

COPY tweet
FROM E'c:\\my dir\\filename'  -- file path as an E'' escape string: each backslash between Windows directory names is doubled
WITH (FORMAT text);   -- the default delimiter for FORMAT text is a tab

Finally, pray that you have enough memory and log space for a > 4 GB file. For more on the COPY command, see http://www.postgresql.org/docs/9.2/static/sql-copy.html
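On the log-space worry: one possible way to reduce write-ahead-log volume during a one-off bulk load is an unlogged table (available since PostgreSQL 9.1). This is only a sketch. The table name tweet_staging is hypothetical, and it assumes that losing the table contents after a crash is acceptable because the source file can simply be loaded again.

```sql
-- Hypothetical staging table with the same columns as tweet.
-- UNLOGGED skips write-ahead logging; the table is emptied after a crash.
CREATE UNLOGGED TABLE tweet_staging (
    ts      timestamp,
    url     text,
    message text
);

-- Quick sanity check once the COPY into the staging table has finished.
SELECT count(*) AS n_rows,
       min(ts)  AS first_ts,
       max(ts)  AS last_ts
FROM tweet_staging;
```

The rows could then be moved into a regular table with INSERT ... SELECT once they look right.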

answered 2013-04-28T18:55:14.390

Quick and dirty hack:

DROP SCHEMA IF EXISTS tmp CASCADE;  -- IF EXISTS avoids an error on the first run
CREATE SCHEMA tmp ;
SET search_path=tmp;

CREATE TABLE  lutser
        ( id SERIAL NOT NULL PRIMARY KEY
        , ztxt text
        );

CREATE TABLE  tweetdeck
        ( id SERIAL NOT NULL PRIMARY KEY
        , stamp timestamp NOT NULL
        , zurl text
        , ztxt text
        );

COPY lutser(ztxt)
FROM '/tmp/tweet.dat'
        ;

INSERT INTO tweetdeck (stamp, zurl, ztxt)
SELECT regexp_replace( t.ztxt, E'^[A-Z][ \t]*', '')::timestamp
        , regexp_replace( u.ztxt, E'^[A-Z][ \t]*', '')
        , regexp_replace( w.ztxt, E'^[A-Z][ \t]*', '')
FROM lutser t
JOIN lutser u ON u.id = t.id+1
JOIN lutser w ON w.id = t.id+2
WHERE t.id %3 = 1
AND LEFT(t.ztxt,1) = 'T' -- should be redundant, but does no harm
AND LEFT(u.ztxt,1) = 'U'
AND LEFT(w.ztxt,1) = 'W'
        ;


SELECT * FROM lutser;
SELECT * FROM tweetdeck;

Result:

COPY 9
INSERT 0 3
 id |                                                                       ztxt                                                                       
----+--------------------------------------------------------------------------------------------------------------------------------------------------
  1 | T   2009-06-07 02:07:31
  2 | U   http://twitter.com/cyberplumber
  3 | W   SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
  4 | T   2009-06-07 02:07:41
  5 | U   http://twitter.com/cyberplumber
  6 | W   SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
  7 | T   2009-06-07 02:07:51
  8 | U   http://twitter.com/cyberplumber
  9 | W   SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
(9 rows)


 id |        stamp        |              zurl               |                                                                     ztxt                                                                     
----+---------------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------
  1 | 2009-06-07 02:07:31 | http://twitter.com/cyberplumber | SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
  2 | 2009-06-07 02:07:41 | http://twitter.com/cyberplumber | SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
  3 | 2009-06-07 02:07:51 | http://twitter.com/cyberplumber | SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw
(3 rows)
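
The sample run above appears to use spaces after the T/U/W markers. If the real file has a tab there, as the question describes, COPY in its default text format would treat that tab as a column delimiter and reject the line, because lutser has only one target column. A possible workaround, sketched under the assumption that the bytes 0x01 and 0x02 never occur in the file, is to load in CSV format with delimiter and quote characters that cannot match anything:

```sql
-- Assumption: bytes 0x01 and 0x02 never appear in /tmp/tweet.dat,
-- so every input line arrives as a single, unsplit field.
COPY lutser(ztxt)
FROM '/tmp/tweet.dat'
WITH (FORMAT csv, DELIMITER E'\x01', QUOTE E'\x02');

-- The prefix-stripping pattern already accepts a tab after the marker letter:
SELECT regexp_replace(E'T\t2009-06-07 02:07:41', E'^[A-Z][ \t]*', '')::timestamp;
```

If the file also has blank lines between tweets, they would need separate handling, since the id arithmetic in the INSERT assumes exactly three rows per tweet.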
answered 2013-04-29T14:22:19.713