
I am looking to track topic popularity across a very large number of documents. Furthermore, I would like to give recommendations to users based on topics, rather than the usual bag-of-words model. To extract the topics I use natural language processing techniques that are beyond the scope of this post.

My question is how I should persist this data so that: (I) I can quickly fetch trending data for each topic (in principle, every time a user opens a document, the topics in that document should go up in popularity), and (II) I can quickly compare documents to provide recommendations (here I am thinking of using clustering techniques).

More specifically, my questions are: 1) Should I go with the usual way of storing text mining data, meaning a topic occurrence vector for each document, so that I can later measure the Euclidean distance between different documents? 2) Or is there some other way?

I am looking for specific Python ways to do this. I have looked into SQL and NoSQL databases, and also into PyTables and h5py, but I am not sure how I would go about implementing a system like this. One of my concerns is how to deal with an ever-growing vocabulary of topics.

Thank you very much


2 Answers


Why not use simple SQL tables?

Tables:

  • documents, with a primary key such as an id or file name
  • observations, with a foreign key into documents and the term (index both fields; the pair should probably be unique)

The array approach you mentioned seems like a slow way to get at terms. With SQL you can easily allow new terms to be added to the observations table.

It is also easy to aggregate, and you can even track trends by aggregating by date if the documents table includes a timestamp.
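A minimal sketch of this two-table layout, using Python's built-in sqlite3 module. The table and column names (documents, observations, added_at, term), the helper functions, and the sample rows are assumptions made for illustration; they are not part of the answer above. It shows that an unseen term is just another row (no schema change), and that trends fall out of a GROUP BY on the date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        id       INTEGER PRIMARY KEY,
        added_at TEXT NOT NULL            -- ISO date string (assumed column)
    );
    CREATE TABLE observations (
        document_id INTEGER NOT NULL REFERENCES documents(id),
        term        TEXT NOT NULL,
        UNIQUE (document_id, term)        -- one row per (document, term) pair
    );
    CREATE INDEX idx_observations_term ON observations(term);
""")

def add_document(doc_id, added_at, terms):
    """Store a document and its terms; a new term needs no schema change."""
    conn.execute(
        "INSERT INTO documents (id, added_at) VALUES (?, ?)", (doc_id, added_at)
    )
    conn.executemany(
        "INSERT OR IGNORE INTO observations (document_id, term) VALUES (?, ?)",
        [(doc_id, term) for term in terms],
    )

def term_counts_by_date():
    """Count how many documents mention each term, per day."""
    return conn.execute("""
        SELECT d.added_at, o.term, COUNT(*) AS n
        FROM observations AS o
        JOIN documents AS d ON d.id = o.document_id
        GROUP BY d.added_at, o.term
        ORDER BY d.added_at, n DESC, o.term
    """).fetchall()

# Hypothetical sample data.
add_document(1, "2012-06-28", ["python", "databases"])
add_document(2, "2012-06-29", ["python", "clustering"])
add_document(3, "2012-06-29", ["python"])
print(term_counts_by_date())
```

The same statements should carry over to other SQL databases with small syntax changes; SQLite is used here only because it ships with Python.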

Answered 2012-06-29T18:48:07.317

I would suggest that you do this work in a SQL database. You may not want to store the documents themselves there, but it is an appropriate place for the topics.

You want one table just for the topics:

create table Topics (
    TopicId int identity(1,1), -- SQL Server syntax for an auto-increment column
    TopicName varchar(255),
    CreatedBy varchar(255) default system_user,
    CreatedAt datetime default getdate()
)

You want another table for the topics assigned to documents, assuming that you have some sort of document id to identify documents:

create table DocumentTopics (
    DocumentTopicId int identity(1,1), -- SQL Server syntax for an auto-increment column
    TopicId int,
    DocumentId int,
    CreatedBy varchar(255) default system_user,
    CreatedAt datetime default getdate()
)

And another table for document views:

create table DocumentView (
    DocumentViewId int identity(1,1), -- SQL Server syntax for an auto-increment column
    DocumentId int,
    ViewedAt datetime,
    ViewedBy int, -- some sort of user id
    CreatedBy varchar(255) default system_user,
    CreatedAt datetime default getdate()
)

Now you can get the topics by popularity for a given date range using a query such as:

-- Count views per topic within a date range (one row counted per view of a document carrying the topic)
select t.TopicId, t.TopicName, count(*) as cnt
from DocumentView dv join
     DocumentTopics dt
     on dv.DocumentId = dt.DocumentId join
     Topics t
     on dt.TopicId = t.TopicId
where dv.ViewedAt between <date1> and <date2>
group by t.TopicId, t.TopicName
order by cnt desc

You can also get information about users, changes over time, and other information. You could have a users table, which could provide weights for the topics (more reliable users, less reliable users). This aspect of the system should be done in SQL.
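One possible reading of the user-weight idea, sketched with Python's sqlite3 module: each topic assignment is credited to the user who made it, and a view contributes that user's reliability weight instead of a flat count of 1. The Users table, its Weight column, the AssignedBy column, and the function name weighted_topic_scores are all assumptions for illustration; the answer above does not specify them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (
        UserId INTEGER PRIMARY KEY,
        Weight REAL NOT NULL DEFAULT 1.0      -- assumed reliability weight
    );
    CREATE TABLE DocumentTopics (
        TopicId    INTEGER,
        DocumentId INTEGER,
        AssignedBy INTEGER REFERENCES Users(UserId)  -- assumed column
    );
    CREATE TABLE DocumentView (DocumentId INTEGER, ViewedAt TEXT);
""")

def weighted_topic_scores(conn):
    """Sum the assigner's weight over every view of a document with the topic."""
    return conn.execute("""
        SELECT dt.TopicId, SUM(u.Weight) AS score
        FROM DocumentView AS dv
        JOIN DocumentTopics AS dt ON dt.DocumentId = dv.DocumentId
        JOIN Users AS u ON u.UserId = dt.AssignedBy
        GROUP BY dt.TopicId
        ORDER BY score DESC, dt.TopicId
    """).fetchall()

# Hypothetical data: user 1 is fully trusted, user 2 only half.
conn.executemany("INSERT INTO Users (UserId, Weight) VALUES (?, ?)", [(1, 1.0), (2, 0.5)])
conn.executemany(
    "INSERT INTO DocumentTopics (TopicId, DocumentId, AssignedBy) VALUES (?, ?, ?)",
    [(1, 10, 2), (2, 11, 1)],
)
conn.executemany(
    "INSERT INTO DocumentView (DocumentId, ViewedAt) VALUES (?, ?)",
    [(10, "2012-06-29"), (10, "2012-06-30"), (11, "2012-06-29"), (11, "2012-06-30")],
)
print(weighted_topic_scores(conn))
```

Both topics have two views here, so a plain count would tie them; the weights are what separate them.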

Answered 2012-06-29T22:47:30.850