sql - Word count for all the words appearing in a column in SQL Server 2008

Question

I have a table called 'ticket_diary_comment' with a column called 'comment_text'. This column is populated with text data. I would like to get the frequency of all the words occurring in this entire column. Ex:

Comment_Text
I am a good guy
I am a bad guy
I am not a guy

What I want:

Word    Frequency
I       3
good    1
bad     1
not     1
guy     3

Notice that I have also removed the stop words in the output. I know calculating the frequency of a particular word is not difficult but I am looking for something that counts all the words appearing in a column removing the stop words.

I would appreciate any kind of help on this issue. I would also like to mention that I have to apply this query on a big-ish dataset (about 1 TB), so performance is a concern.

score 4 · Accepted Answer

I would use a table-valued function to split the strings and then group the resulting items in a query. Something like this:

SELECT item, count(1)
FROM ticket_diary_comment 
    CROSS APPLY dbo.fn_SplitString(comment_text, ' ')
GROUP BY item
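
The question also asks for stop words to be excluded. A minimal sketch of one way to do that, assuming a hypothetical StopWords table (its name, column, and contents are illustrative and not part of the original answer), is to filter the split items before grouping:

```sql
-- Hypothetical stop-word list; the table name and the words in it are assumptions.
CREATE TABLE StopWords (Word VARCHAR(100) NOT NULL PRIMARY KEY);

-- Multi-row VALUES is available from SQL Server 2008 onward.
INSERT INTO StopWords (Word) VALUES ('am'), ('a');

SELECT s.Item AS Word, COUNT(*) AS Frequency
FROM ticket_diary_comment AS t
    CROSS APPLY dbo.fn_SplitString(t.comment_text, ' ') AS s
WHERE s.Item <> ''   -- consecutive spaces produce empty items; drop them
  AND NOT EXISTS (SELECT 1 FROM StopWords AS sw WHERE sw.Word = s.Item)
GROUP BY s.Item
ORDER BY Frequency DESC;
```

Whether 'A' and 'a' count as the same word depends on the column collation. Punctuation attached to words is not stripped in this sketch.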

And the definition of fn_SplitString:

CREATE FUNCTION [dbo].[fn_SplitString]   
(   
    @String VARCHAR(8000),   
    @Delimiter VARCHAR(255)   
)   
RETURNS   
@Results TABLE   
(   
    ID INT IDENTITY(1, 1),   
    Item VARCHAR(8000)   
)   
AS   
BEGIN   
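    -- Each Numbers row is a candidate start position; keep only positions that follow a delimiter.
    INSERT INTO @Results (Item)
    SELECT SUBSTRING(@String + @Delimiter, Num,
        CHARINDEX(@Delimiter, @String + @Delimiter, Num) - Num)
    FROM Numbers
    -- LEN ignores trailing spaces, so spaces are replaced with '|' before measuring.
    WHERE Num <= LEN(REPLACE(@String, ' ', '|'))
    AND SUBSTRING(@Delimiter + @String,
                Num,
                LEN(REPLACE(@Delimiter, ' ', '|'))) = @Delimiter
    ORDER BY Num

    RETURN
END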

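This function requires a numbers table, which is basically just CREATE TABLE Numbers(Num int) populated with all the numbers from 1 to 10,000 (or more or fewer, depending on your needs). If you already have a numbers table in your database, you can substitute that table and column instead.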
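
If no such table exists yet, it could be created and filled along these lines. This is a sketch rather than the only approach: the primary key and the use of sys.all_objects as a row source are assumptions made here, and any source with enough rows would do.

```sql
-- Numbers table as described above; the primary key is an added assumption
-- so that lookups by Num can use an index.
CREATE TABLE Numbers (Num INT NOT NULL PRIMARY KEY);

-- Cross-joining a system catalog view with itself yields far more than
-- 10,000 rows; ROW_NUMBER() then assigns 1, 2, 3, ... to the first 10,000.
INSERT INTO Numbers (Num)
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b;
```

Because fn_SplitString takes a VARCHAR(8000) argument, 10,000 rows cover every possible start position in its input.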
