python - cosine similarity for latex code of an equation

Question

I have extended this SO question and am comparing two LaTeX equations. Here is an example with two quadratic equations.

eqn1 = r"*=\frac{-*\pm\sqrt{*^2-4ac}}{2a}"
eqn2 = r"x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}"

I need these two to compare as equal, because I have used * in place of x and b. All I am doing is converting each equation to a word list.

eqn1_word = ['*', 'frac', '*', 'pm', 'sqrt', '*', '2', '4ac', '2a']
eqn2_word = ['x', 'frac', 'b', 'pm', 'sqrt', 'b', '2', '4ac', '2a']

So the vectors are:

eqn1_vec= Counter({'*': 3, 'frac': 1, 'sqrt': 1, '2': 1, '2a': 1, '4ac': 1, 'pm': 1})
eqn2_vec = Counter({'b': 2, 'frac': 1, 'sqrt': 1, '2': 1, '2a': 1, '4ac': 1, 'x': 1, 'pm': 1})
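
The question does not show get_word, so the following is only a guess at what such a tokenizer could look like. It is a minimal sketch that assumes a word is either a run of letters and digits or a single * placeholder; the regular expression and the raw-string inputs are assumptions made for illustration, not the asker's code.

```python
import re
from collections import Counter

# Hypothetical stand-in for the asker's get_word (not shown in the question).
# Assumption: a "word" is a run of letters/digits, or a single * placeholder.
TOKEN = re.compile(r"[A-Za-z0-9]+|\*")

def get_word(eqn):
    """Split a LaTeX string into the word list used to build the vectors."""
    return TOKEN.findall(eqn)

# Raw strings keep the backslashes literal (in a normal string "\f" is a form feed).
eqn1 = r"*=\frac{-*\pm\sqrt{*^2-4ac}}{2a}"
eqn2 = r"x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}"

eqn1_word = get_word(eqn1)
eqn2_word = get_word(eqn2)
print(eqn1_word)
print(eqn2_word)
print(Counter(eqn1_word)["*"])  # 3
```

With this tokenizer, braces, backslashes, and operators act only as separators, which is why 4ac and 2a stay together as single words.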

My extension is this: I first compute the percentage of * tokens in eqn1_word, then compute the normal cosine similarity as given in that answer. Finally, I add the two values, and the sum should be nearly equal to 1 when the equations match.

This works fine for most scenarios (when a single variable is replaced by *). Here the * count is 3 in eqn1_vec, while in eqn2_vec b = 2 and x = 1.

For more detail and a better understanding, please check this. Based on that reference, my code looks like this.

import math
import traceback
from collections import Counter

def get_cosine(self, c_eqn1_eqn, c_eqn2_eqn):
    print('c_eqn1_eqn = %s' % c_eqn1_eqn)
    print('c_eqn2_eqn = %s' % c_eqn2_eqn)
    # Number of * placeholders in the first equation string
    _special_symbol = float(c_eqn1_eqn.count("*"))
    cos_result = 0
    symbol_percentage = 0
    try:
        # get_word returns the word list for an equation
        eqn1_vector = Counter(self.get_word(c_eqn1_eqn))
        eqn2_vector = Counter(self.get_word(c_eqn2_eqn))
        _words = sum(eqn1_vector.values())
        if "*" in eqn2_vector:
            _special_symbol -= eqn2_vector["*"]
        print('_special_symbol = %s' % _special_symbol)
        print('_words @ last = %s' % _words)
        try:
            symbol_percentage = _special_symbol / _words
        except ZeroDivisionError:
            symbol_percentage = 0.0
    except Exception as exp:
        print("Exception at converting equation to vector: %s" % exp)
        traceback.print_exc()
    else:
        # Standard cosine similarity over the words both equations share
        intersection = set(eqn1_vector.keys()) & set(eqn2_vector.keys())
        numerator = sum(eqn1_vector[x] * eqn2_vector[x] for x in intersection)
        _sum1 = sum(eqn1_vector[x] ** 2 for x in eqn1_vector.keys())
        _sum2 = sum(eqn2_vector[x] ** 2 for x in eqn2_vector.keys())
        denominator = math.sqrt(_sum1) * math.sqrt(_sum2)
        print('numerator = %s' % numerator)
        print('denominator = %s' % denominator)
        if not denominator:
            cos_result = 0
        else:
            cos_result = float(numerator) / denominator
        print(cos_result)
    # Add the * percentage to the cosine value and cap the total at 1
    final_result = float(symbol_percentage) + cos_result
    return final_result if final_result <= 1.0 else 1

The problem is that the numerator comes out small because the intersection is small. I copied this from my class, so please ignore self.
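
To make the problem concrete, here is a small sketch that recomputes both parts of the score for the two example vectors listed earlier. The helper name plain_cosine is made up for this illustration; the counts are copied from eqn1_vec and eqn2_vec.

```python
import math
from collections import Counter

# Vectors copied from the example in the question.
eqn1_vec = Counter({'*': 3, 'frac': 1, 'sqrt': 1, '2': 1, '2a': 1, '4ac': 1, 'pm': 1})
eqn2_vec = Counter({'b': 2, 'frac': 1, 'sqrt': 1, '2': 1, '2a': 1, '4ac': 1, 'x': 1, 'pm': 1})

def plain_cosine(v1, v2):
    """Standard cosine similarity between two count vectors (illustrative helper)."""
    shared = set(v1) & set(v2)
    numerator = sum(v1[k] * v2[k] for k in shared)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0

cos_result = plain_cosine(eqn1_vec, eqn2_vec)
symbol_percentage = eqn1_vec['*'] / float(sum(eqn1_vec.values()))

print(round(cos_result, 3))                      # 0.467, i.e. 6 / sqrt(15 * 11)
print(round(symbol_percentage, 3))               # 0.333, i.e. 3 / 9
print(round(cos_result + symbol_percentage, 3))  # 0.8
```

Only the six shared words contribute to the numerator, while *, b, and x still enlarge the denominator, so the sum stops at about 0.80 instead of reaching 1.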

How can I solve this? Thanks in advance. If there is any mistake in the question or my concept is wrong, please let me know.

score 1 · Accepted Answer

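I found a solution to this problem.

Since we cannot (and should not) increase the numerator, I decided to work on the denominator instead. My logic is to reduce the denominator when the number of * tokens and the number of non-intersecting values in eqn2 are the same. If they are not the same, the denominator is left as it is. Now I do not have to compute the percentage of * or add it to the cosine result.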

import math
import traceback
from collections import Counter

def get_cosine(c_eqn1, c_eqn2):
    cos_result = 0
    try:
        # get_word (not shown) returns the word list for an equation
        eqn1_vector = Counter(get_word(c_eqn1))
        eqn2_vector = Counter(get_word(c_eqn2))
        _special_symbol = 0
        spe_list = list()
        # Count the * occurrences and record every word that contains *
        for _val in eqn1_vector.keys():
            if "*" in _val:
                _special_symbol += eqn1_vector[_val]
                spe_list.append(_val)
        if "*" in eqn2_vector:
            _special_symbol -= eqn2_vector["*"]
    except Exception as exp:
        print("Exception at converting equation to vector: %s" % exp)
        traceback.print_exc()
    else:
        intersection = set(eqn1_vector.keys()) & set(eqn2_vector.keys())
        numerator = sum([eqn1_vector[x] * eqn2_vector[x]
                         for x in intersection])
        non_intersection_sum = 0
        non_intersection_value = list()
        # Count the words of eqn2 that have no match in eqn1 and record them
        for _val in eqn2_vector.keys():
            if _val not in intersection:
                non_intersection_sum += eqn2_vector[_val]
                non_intersection_value.append(_val)
        # Join both non-intersecting lists
        if non_intersection_value:
            non_intersection_value.extend(spe_list)
        # If the * count and the unmatched count differ,
        # empty the list so that nothing is excluded
        if _special_symbol != non_intersection_sum:
            non_intersection_value = list()
        # Cosine similarity formula, skipping the excluded words in the denominator
        _sum1 = sum([eqn1_vector[x]**2 for x in eqn1_vector.keys() if x not in non_intersection_value])
        _sum2 = sum([eqn2_vector[x]**2 for x in eqn2_vector.keys() if x not in non_intersection_value])
        denominator = math.sqrt(_sum1) * math.sqrt(_sum2)
        if not denominator:
            cos_result = 0
        else:
            cos_result = float(numerator) / denominator
    return cos_result if cos_result <= 1.0 else 1
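
As a quick check of the idea, here is a condensed restatement of the same denominator adjustment, run on the two example equations from the question. This is a sketch under assumptions: wildcard_cosine is a hypothetical name, and get_word is a guessed regex tokenizer, since the real get_word is not shown.

```python
import math
import re
from collections import Counter

def get_word(eqn):
    # Hypothetical tokenizer: runs of letters/digits, or a single * placeholder.
    return re.findall(r"[A-Za-z0-9]+|\*", eqn)

def wildcard_cosine(c_eqn1, c_eqn2):
    """Condensed sketch of the denominator adjustment described above."""
    v1, v2 = Counter(get_word(c_eqn1)), Counter(get_word(c_eqn2))
    # Words of eqn1 that contain *, and how many placeholders remain unexplained.
    star_tokens = [t for t in v1 if "*" in t]
    star_count = sum(v1[t] for t in star_tokens) - v2.get("*", 0)
    shared = set(v1) & set(v2)
    numerator = sum(v1[t] * v2[t] for t in shared)
    # Words of eqn2 that have no counterpart in eqn1.
    unmatched = [t for t in v2 if t not in shared]
    unmatched_count = sum(v2[t] for t in unmatched)
    # Exclude the unmatched words and the * words only when the counts agree.
    ignored = set()
    if unmatched and star_count == unmatched_count:
        ignored = set(unmatched) | set(star_tokens)
    sum1 = sum(c * c for t, c in v1.items() if t not in ignored)
    sum2 = sum(c * c for t, c in v2.items() if t not in ignored)
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    return min(numerator / denominator, 1.0) if denominator else 0.0

template = r"*=\frac{-*\pm\sqrt{*^2-4ac}}{2a}"
matching = r"x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}"
extra_term = r"x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}+c"

match_score = wildcard_cosine(template, matching)
extra_score = wildcard_cosine(template, extra_term)
print(round(match_score, 3))  # 1.0
print(round(extra_score, 3))  # 0.447
```

In the first call the three * occurrences line up with the three unmatched occurrences (b twice, x once), so those words are dropped from the denominator and the score reaches 1. In the second call the extra c makes the counts differ (3 versus 4), so the function falls back to the plain cosine value.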
