
I have a relation named conversations_grouped that consists of bags of tuples of varying sizes, like this:

DUMP conversations_grouped;
...
({(L194),(L195),(L196),(L197)})
({(L198),(L199)})
({(L200),(L201),(L202),(L203)})
({(L204),(L205),(L206)})
({(L207),(L208)})
({(L271),(L272),(L273),(L274),(L275)})
({(L276),(L277)})
({(L280),(L281)})
({(L363),(L364)})
({(L365),(L366)})
({(L666256),(L666257)})
({(L666369),(L666370),(L666371),(L666372)})
({(L666520),(L666521),(L666522)})

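Each L[0-9]+ is a tag that corresponds to a string. For example, L194 might be "Hello, how are you?" and L195 might be "Fine, how are you?". This correspondence is maintained by a map named line_map. Here is a sample: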

DUMP line_map;
...
([L666324#Do you think she might be interested in  someone?])
([L666264#Well that's typical of Her Majesty's army. Appoint an engineer to do a soldier's work.])
([L666263#Um. There are rumours that my Lord Chelmsford intends to make Durnford Second in Command.])
([L666262#Lighting COGHILL' 5 cigar: Our good Colonel Dumford scored quite a coup with the Sikali Horse.])
([L666522#So far only their scouts. But we have had reports of a small Impi farther north, over there. ])
([L666521#And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?])
([L666520#Well I assure you, Sir, I have no desire to create difficulties. 45])
([L666372#I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.])
([L666371#Lord Chelmsford seems to want me to stay back with my Basutos.])
([L666370#I'm to take the Sikali with the main column to the river])
([L666369#Your orders, Mr Vereker?])
([L666257#Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot])
([L666256#Colonel Durnford... William Vereker. I hear you 've been seeking Officers?])

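What I want to do now is parse each line and replace each L[0-9]+ tag with the corresponding text from line_map. Is it possible to reference line_map from within a Pig FOREACH statement, or is there something else I need to do?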


1 Answer


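The first problem is that, in a map, the key must be a quoted string, so you cannot use the value of a schema field as the key when accessing a map. For example, this will not work: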

C: {foo: chararray, M: [value:chararray]}
D = FOREACH C GENERATE M#foo ;
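
For contrast, here is a minimal sketch of a lookup that Pig does accept, using a constant quoted key. It assumes C has the schema shown above; the key 'L194' and the alias message are only illustrative.

```pig
-- Assumes C has the schema shown above (a chararray field foo and a map M).
-- The key is a constant, single-quoted string, which the # operator accepts.
-- 'L194' and the alias "message" are illustrative choices, not from the question.
D = FOREACH C GENERATE M#'L194' AS message ;
```

A lookup like this only helps when the key is known when the script is written, which is why the per-row replacement in the question needs a join instead.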

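The solution that comes to mind is to FLATTEN conversations_grouped, then join conversations_grouped and line_map on the L[0-9]+ tag. You will probably want to project out the fields you no longer need (such as the L[0-9]+ tag after the join) to speed up the next step. After that, you have to regroup the data and massage it into the correct form.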

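This will not work unless each bag has its own unique ID to use for the regrouping. However, if each L[0-9]+ tag appears in only one bag (conversation), you can use a tag to create that unique ID.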

-- A is dumped conversations_grouped

B = FOREACH A {
    -- Pulls out an element from the bag to use as the id
    id = LIMIT tags 1 ;
    -- Flattens the result into (id, tag) pairs.  Every tag from the same bag gets the same id.
    GENERATE FLATTEN(id), FLATTEN(tags) ; 
    } 

The schema and output for B are:

B: {id: chararray,tags::tag: chararray}
(L194,L194)
(L194,L195)
(L194,L196)
(L194,L197)
(L198,L198)
(L198,L199)
(L200,L200)
(L200,L201)
(L200,L202)
(L200,L203)
(L204,L204)
(L204,L205)
(L204,L206)
(L207,L207)
(L207,L208)
(L271,L271)
(L271,L272)
(L271,L273)
(L271,L274)
(L271,L275)
(L276,L276)
(L276,L277)
(L280,L280)
(L280,L281)
(L363,L363)
(L363,L364)
(L365,L365)
(L365,L366)
(L666256,L666256)
(L666256,L666257)
(L666369,L666369)
(L666369,L666370)
(L666369,L666371)
(L666369,L666372)
(L666520,L666520)
(L666520,L666521)
(L666520,L666522)

Assuming the tags are unique, the rest is done like so:

-- A2 is line_map, loaded in tag/message pairs instead of a map

-- Joins conversations_grouped and line_map on tag
C = FOREACH (JOIN B by tags::tag, A2 by tag)
    -- This generate removes the tag
    GENERATE id, message ;

-- Regroups C on the id created in B
D = FOREACH (GROUP C BY id) 
    -- This step limits the output to just messages
    GENERATE C.(message) AS messages ;

The schema and output for D:

D: {messages: {(A2::message: chararray)}}
({(Colonel Durnford... William Vereker. I hear you 've been seeking Officers?),(Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot)})
({(Your orders, Mr Vereker?),(I'm to take the Sikali with the main column to the river),(Lord Chelmsford seems to want me to stay back with my Basutos.),(I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.)})
({(Well I assure you, Sir, I have no desire to create difficulties. 45),(And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?),(So far only their scouts. But we have had reports of a small Impi farther north, over there. )})
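
The join above relies on A2 holding line_map as tag/message pairs rather than as a map. One way that relation might be loaded is sketched below. The file name line_map.tsv and the tab delimiter are assumptions made for illustration; the question does not say how the underlying data is stored.

```pig
-- Hypothetical input: 'line_map.tsv' with one tab-separated tag/message pair per line.
-- Both the path and the delimiter are assumptions, not taken from the question.
A2 = LOAD 'line_map.tsv' USING PigStorage('\t')
     AS (tag:chararray, message:chararray) ;
```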

NOTE: In the worst case (the L[0-9]+ tags are not unique), you can give each line of the input file a sequential integer id before loading it into Pig.

UPDATE: If you are using Pig 0.11, you can also use the RANK operator.
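
As a rough sketch of the RANK approach, assuming Pig 0.11 or later and that A is conversations_grouped with a single bag field named tags (as in the code for B above), the id could come from the rank column instead of from a tag:

```pig
-- RANK with no field list prepends a sequential number to every tuple of A.
R = RANK A ;
-- $0 is the prepended rank column; it serves as the conversation id.
-- Note that id is numeric here, not a chararray tag as in the earlier B.
B = FOREACH R GENERATE $0 AS id, FLATTEN(tags) ;
```

With this version the tags no longer need to be unique across conversations, since the id does not depend on them.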

Answered 2013-07-17T03:54:19.453