Does it make sense to compute Pearson correlation coefficients from a tf-idf matrix to see which terms occur in combination with other terms? Is it mathematically correct?
My output is a correlation matrix where each cell holds the correlation coefficient for a pair of terms.
         term1  term2  term3
term1
term2
term3
It depends on your definition of 'occurs in combination with other terms'. To clarify this, here are some more points:
idf is irrelevant when doing a Pearson product-moment correlation (PMC). All tf values for the same term are multiplied by the same idf value to yield the final tf-idf, and the PMC is invariant with respect to scaling of the input, so the idf cancels out here. Hence all that matters in your proposed idea is the tf. You might save some calculations if you do not compute the idf at all, but it won't hurt much if you do.
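As a quick sanity check, here is a minimal numpy sketch (the counts and idf values are made up) showing that the term-term correlation matrix is identical whether you compute it on tf or on tf-idf:

```python
import numpy as np

# Toy term-frequency matrix: rows = documents, columns = terms (made-up counts).
tf = np.array([
    [3, 0, 1],
    [1, 2, 0],
    [0, 4, 2],
    [2, 1, 3],
], dtype=float)

# Arbitrary idf weights, one per term; each term column is scaled by a constant.
idf = np.array([0.2, 1.5, 0.7])
tfidf = tf * idf

# Pearson correlation between term columns (rowvar=False: columns = variables).
print(np.allclose(np.corrcoef(tf, rowvar=False),
                  np.corrcoef(tfidf, rowvar=False)))  # True: idf cancels out
```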
Now about the usage of the tf. Let's make an example to figure out what you might need:
Let's say TermA appears in Document1 very often and only a little in Document2, while TermB on the other hand appears in Document1 a little and very often in Document2. Would you say that these two terms appear together or not? They occur in the same documents, but at very different frequencies. If you use the PMC of tf-idf, the result will be that they do not co-occur (because of the differences in frequency).
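To make that concrete, here is a tiny sketch with made-up counts: with such opposite frequency profiles the Pearson correlation comes out strongly negative.

```python
import numpy as np

# Counts in [Document1, Document2]: TermA frequent in Doc1, TermB frequent in Doc2.
termA = np.array([10.0, 1.0])
termB = np.array([1.0, 10.0])

print(np.corrcoef(termA, termB)[0, 1])  # -1.0: PMC says they do not co-occur
```

(With only two documents the coefficient is exactly -1; with more documents it would simply be strongly negative.)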
At this point you should also note that the PMC ranges from -1 to 1. I.e. you could have words which co-occur (PMC=1), words which are independent (PMC=0), and words which are opposite (PMC=-1). Does this fit the domain you are modelling? If not, just add 1 to the PMC.
Another alternative would be to use cosine similarity, which is very similar to PMC but has some different characteristics. Also, in some other cases you might only be interested in actual co-occurrence and not care about the frequency at all.
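If you go the frequency-free route, one simple option (sketched below with a made-up matrix) is to binarize the counts to presence/absence before computing the cosine similarity between term columns:

```python
import numpy as np

# Toy document-term count matrix: rows = documents, columns = terms.
F = np.array([
    [5, 1, 0],
    [2, 0, 3],
    [0, 4, 1],
], dtype=float)

B = (F > 0).astype(float)                            # presence/absence only
B_c = B / np.linalg.norm(B, axis=0, keepdims=True)   # l^2-normalize term columns
print(B_c.T @ B_c)                                   # term-term cosine similarity, frequency ignored
```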
All these methods are 'correct', so to say. The more important question is which of them fits best to the problem you are modelling. In many cases this cannot be determined theoretically, but only by trying out different alternatives and testing which one fits your problem domain best.
EDIT (some remarks about the comments below):
Cosine similarity does actually help, but you have to think differently in that case. You can of course produce term-frequency vectors for the documents and then calculate the cosine similarity between these document vectors. You pointed out correctly that this would give you the similarity of posts to each other, but that is not what I meant. If you have your complete term-frequency matrix, you can also produce vectors which describe, for a single term, how often that term appears in each document. You can calculate the cosine similarity of these vectors as well, which gives you the similarity of terms based on document co-occurrence.
Think about it this way (but first we will need some notation):
Let f_{i,j} denote the number of times term i appears in document j (note that I am ignoring idf here, since it just cancels out when handling terms instead of documents). Also let F denote the whole document-term matrix built from these counts, with the M documents in rows and the N terms in columns (so row j, column i holds f_{i,j}). Then we will call |F|_c the matrix F with each column normalized according to the l^2 norm, and |F|_r the matrix F with each row normalized according to the l^2 norm. And of course, as usual, A^T denotes the transpose of A. In that case you get the cosine similarity between all documents, based on terms, as
(|F|_r) * (|F|_r)^T
This gives you an MxM matrix which describes the similarity of the documents.
If you want to calculate term similarity instead, you simply calculate
(|F|_c)^T * (|F|_c)
which gives you an NxN matrix describing the term similarity based on co-occurrences in documents.
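A small numpy sketch of both products, using a made-up toy matrix F with M=3 documents and N=4 terms:

```python
import numpy as np

# Toy document-term matrix F: M=3 documents in rows, N=4 terms in columns.
F = np.array([
    [2, 0, 1, 3],
    [0, 1, 4, 1],
    [1, 2, 0, 0],
], dtype=float)

def l2_normalize(X, axis):
    """Divide each row (axis=1) or column (axis=0) by its l^2 norm."""
    norms = np.linalg.norm(X, axis=axis, keepdims=True)
    norms[norms == 0] = 1.0          # guard against empty rows/columns
    return X / norms

F_r = l2_normalize(F, axis=1)        # |F|_r: rows (documents) normalized
F_c = l2_normalize(F, axis=0)        # |F|_c: columns (terms) normalized

doc_sim  = F_r @ F_r.T               # (|F|_r)(|F|_r)^T -> MxM document similarity
term_sim = F_c.T @ F_c               # (|F|_c)^T(|F|_c) -> NxN term similarity

print(doc_sim.shape, term_sim.shape)  # (3, 3) (4, 4)
```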
Note that the calculation of the PMC would basically be the same; it only differs in the type of normalisation applied to the rows and columns in each of the matrix multiplications.
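For instance, in a self-contained sketch (same made-up toy matrix as above): centre each term column on its mean before the l^2 normalization, and the very same matrix product yields the Pearson correlations.

```python
import numpy as np

# Same toy document-term matrix: rows = documents, columns = terms.
F = np.array([
    [2, 0, 1, 3],
    [0, 1, 4, 1],
    [1, 2, 0, 0],
], dtype=float)

# Centre each term column on its mean, then l^2-normalize -> PMC instead of cosine.
Fc = F - F.mean(axis=0, keepdims=True)
Fc = Fc / np.linalg.norm(Fc, axis=0, keepdims=True)

term_pmc = Fc.T @ Fc
print(np.allclose(term_pmc, np.corrcoef(F, rowvar=False)))  # True
```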
Now to your other post: you say that you would like to find out how likely it is that, if termA appears in a document, termB also appears in the same document; or, formally speaking, p(termB | termA), where p(termX) denotes the probability of termX appearing in a document. That is a different beast altogether, but again very simple to calculate:
1. Count the number of documents in which `termA` appears (call it num_termA)
2. Count the number of documents in which both `termA` and `termB` appear (call it num_termA_termB)
then p(termB | termA) = num_termA_termB / num_termA
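A small counting sketch on a made-up binary incidence matrix (1 means the term occurs in the document; termA and termB are just columns 0 and 1 here):

```python
import numpy as np

# Rows = documents, columns = terms; 1 if the term occurs in the document.
incidence = np.array([
    [1, 1, 0],   # termA and termB
    [1, 1, 1],   # termA and termB
    [1, 0, 0],   # only termA
    [0, 0, 1],   # neither termA nor termB
])

def p_given(incidence, b, a):
    """Estimate p(term_b | term_a) by counting documents."""
    num_a  = incidence[:, a].sum()                        # docs containing term_a
    num_ab = (incidence[:, a] & incidence[:, b]).sum()    # docs containing both
    return num_ab / num_a

print(p_given(incidence, b=1, a=0))  # p(termB | termA) = 2/3
print(p_given(incidence, b=0, a=1))  # p(termA | termB) = 1.0 -> not symmetric
```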
This is an actual statistical measure of the likelihood of co-occurrence. However, be aware that the relationship p(termB | termA) == p(termA | termB) will most likely (no pun intended) not hold, so this measure of co-occurrence is not symmetric and therefore not directly usable for clustering via MDS.
My suggestion is to try both PMC and cosine similarity (as you can see above, they only differ in normalisation, so both should be fast to implement) and then check which one looks better after clustering.
There are some more advanced techniques for clustering topics based on a set of documents. Principal component analysis (PCA) or non-negative matrix factorisation of the term-document matrix is also frequently used (see latent semantic analysis, or LSA, for more info). However, this might be overkill for your use case, and these techniques are much harder to do. PMC and cosine similarity have the absolute benefit of being dead simple to implement (cosine similarity being a bit simpler, because the normalisation is easier) and are thus hard to get wrong.
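For completeness, a minimal LSA-style sketch using scikit-learn (assuming scikit-learn is available; the example documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors trade stocks",
]

X = TfidfVectorizer().fit_transform(docs)      # sparse document-term tf-idf matrix
lsa = TruncatedSVD(n_components=2).fit(X)      # low-rank "topic" space

# Each row of components_ weights the terms for one latent topic.
print(lsa.components_.shape)                   # (2, vocabulary_size)
```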