Here's the scenario:

Say you have a Hive table that stores Twitter data.

Say it has 5 columns, one of which holds the tweet text.

Now, how do you add a 6th column that stores the sentiment value produced by running sentiment analysis on the tweet text? I plan to use a sentiment analysis API such as Sentiment140 or Viralheat.

I would appreciate any tips on how to implement the "derived" column in Hive.

Thanks.

2 Answers

Unfortunately, while Hive lets you add a new column to your table (using ALTER TABLE foo ADD COLUMNS (bar binary)), the new column will be NULL for every existing row and cannot be populated in place. The only way to get data into it is to clear the table's rows and load a new file that already contains the new column's data.

To answer your question: you can't, in Hive. To do what you propose, you would need a file with 6 columns, the 6th already containing the sentiment analysis results. That file could then be loaded into HDFS and queried using Hive.

EDIT: I just tried an example where I exported the table as a .csv after adding the new column (see above) and opened it in Microsoft Excel, where I was able to apply functions to the table values. After adding the functions, I saved and uploaded the .csv and rebuilt the table from it. I'm not sure this helps you specifically (sentiment analysis is unlikely to be doable in Excel), but it may be of use to anyone else who just wants computed columns in Hive.
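For anyone following this route, a minimal HiveQL sketch might look like the one below. The table names (tweets, tweets_scored), the column names, and the HDFS paths are all made up for illustration, and the scoring step itself is assumed to happen outside Hive.

```sql
-- Hypothetical names throughout: tweets, tweets_scored, the columns, and both paths.

-- 1. Dump the existing 5-column table to an HDFS directory.
INSERT OVERWRITE DIRECTORY '/tmp/tweets_export'
SELECT tweet_id, user_name, created_at, lang_code, tweet_text
FROM tweets;

-- 2. (Outside Hive) score each exported row with the sentiment API and
--    write a tab-separated file with the sentiment appended as a 6th field.

-- 3. Define a 6-column table that matches the scored file.
CREATE TABLE tweets_scored (
  tweet_id    BIGINT,
  user_name   STRING,
  created_at  STRING,
  lang_code   STRING,
  tweet_text  STRING,
  sentiment   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- 4. Move the scored file (already in HDFS) into the table.
LOAD DATA INPATH '/tmp/tweets_scored/scored.tsv'
OVERWRITE INTO TABLE tweets_scored;
```

Note that LOAD DATA INPATH moves the file into the table's directory rather than copying it, so keep a separate copy if you need the scored file afterwards.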

References:

https://cwiki.apache.org/Hive/gettingstarted.html#GettingStarted-DDLOperations

http://comments.gmane.org/gmane.comp.java.hadoop.hive.user/6665

Answered 2013-02-27T03:23:40.287

You can do this in two steps without a separate table:

  1. Alter the original table to add the required column.
  2. Run an INSERT OVERWRITE TABLE ... SELECT that reads all existing columns plus your computed column from the original table and writes them back into the original table.
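The two steps could be sketched roughly as follows. The table name tweets and its column names are assumptions, and the CASE expression is only a stand-in for real sentiment scoring, which in practice would probably come from a custom UDF or a TRANSFORM script that calls the sentiment API.

```sql
-- Hypothetical table and column names; adjust to your schema.

-- Step 1: add the new column. Existing rows read it as NULL.
ALTER TABLE tweets ADD COLUMNS (sentiment STRING);

-- Step 2: rewrite the table from itself, supplying the computed column.
-- The CASE expression is a placeholder, not real sentiment analysis.
INSERT OVERWRITE TABLE tweets
SELECT
  tweet_id,
  user_name,
  created_at,
  lang_code,
  tweet_text,
  CASE
    WHEN tweet_text LIKE '%:)%' THEN 'positive'
    WHEN tweet_text LIKE '%:(%' THEN 'negative'
    ELSE 'neutral'
  END AS sentiment
FROM tweets;
```

Because INSERT OVERWRITE replaces the table's contents, it is probably worth trying this on a copy of the table first.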

Caveat: This has not been tested on a clustered installation.

Answered 2013-03-23T08:15:31.343