
I have 350 document scores that, when I plot them, have this shape:

docScores = [(0, 68.62998962), (1, 60.21374512), (2, 54.72480392), 
             (3, 50.71389389), (4, 49.39723969), ...,  
             (345, 28.3756237), (346, 28.37126923), 
             (347, 28.36397934), (348, 28.35762787), (349, 28.34219933)]

I posted the complete array here on pastebin (it corresponds to the dataPoints list in the code below).

[Plot: score distribution of the 350 documents]

Now, I originally needed to find the elbow point of this L-shaped curve, which I found thanks to this post.

Now, on the following plot, the red vector p points to the elbow point. I would like to find the point x = (?, ?) (the yellow star) on the vector b that corresponds to the orthogonal projection of p onto b.

[Plot: the L-curve, the line b from s to l, the vector p to the elbow point, the yellow star x, and the red point I obtain]

The red point on the plot is the one I obtain (which is obviously wrong). I obtain it as follows:

b_hat = b / np.linalg.norm(b)          # unit vector of b
proj_p_onto_b = p.dot(b_hat) * b_hat   # projection of p onto b (as a vector from s)
red_point = proj_p_onto_b + s          # translate back into data coordinates

Now, if the projection of p onto b is defined by its starting and ending points, namely s and x (the yellow star), then proj_p_onto_b = x - s, and therefore x = proj_p_onto_b + s, right?

Did I make a mistake here?
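
To make the setup concrete, here is a minimal, standalone version of that computation with made-up numbers (the values of s, l and the elbow below are only illustrative, not my real data):

import numpy as np

# Made-up points, just to illustrate the computation (not my real data).
s = np.array([0.0, 68.63])      # first point of the curve
l = np.array([349.0, 28.34])    # last point of the curve
elbow = np.array([50.0, 37.0])  # the elbow point found on the curve

b = l - s        # vector from s to l
p = elbow - s    # vector from s to the elbow point

b_hat = b / np.linalg.norm(b)          # unit vector of b
proj_p_onto_b = p.dot(b_hat) * b_hat   # projection of p onto b

x = proj_p_onto_b + s   # the yellow star, back in data coordinates
print(x)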

EDIT: In answer to @cxw, here is the code for computing the elbow point:

from operator import itemgetter
import numpy as np

def findElbowPoint(self, rawDocScores):
    # list() is needed under Python 3, where zip() returns an iterator
    dataPoints = list(zip(range(len(rawDocScores)), rawDocScores))
    s = np.array(dataPoints[0])              # first point of the curve
    l = np.array(dataPoints[-1])             # last point of the curve
    b_vect = l - s
    b_hat = b_vect / np.linalg.norm(b_vect)  # unit vector of b
    distances = []
    for scoreVec in dataPoints[1:]:
        p = np.array(scoreVec) - s
        proj = p.dot(b_hat) * b_hat
        d = np.linalg.norm(p - proj)         # orthogonal distance between b and the L-curve
        distances.append((scoreVec[0], scoreVec[1], proj, d))

    # the elbow is the point with the largest orthogonal distance
    elbow_x, elbow_y, proj, max_distance = max(distances, key=itemgetter(3))

    red_point = proj + s
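
For reference, the same perpendicular-distance computation can also be written without the explicit loop. A rough vectorized sketch of the same idea (find_elbow_index is just a throwaway name, not the code I actually run):

import numpy as np

def find_elbow_index(scores):
    """Index of the point farthest from the straight line joining the
    first and last points of the curve (same idea as findElbowPoint)."""
    pts = np.column_stack([np.arange(len(scores)), scores]).astype(float)
    s, l = pts[0], pts[-1]
    b_hat = (l - s) / np.linalg.norm(l - s)   # unit vector of b
    p = pts - s                               # vectors from s to every point
    proj = (p @ b_hat)[:, None] * b_hat       # projections onto b
    d = np.linalg.norm(p - proj, axis=1)      # perpendicular distances
    return int(np.argmax(d))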

EDIT: Here is the code for the plot:

>>> l_curve_x_values = [x[0] for x in docScores]
>>> l_curve_y_values = [x[1] for x in docScores]
>>> b_line_x_values = [x[0] for x in docScores]
>>> b_line_y_values = np.linspace(s[1], l[1], len(docScores))
>>> p_line_x_values = l_curve_x_values[:elbow_x]
>>> p_line_y_values = np.linspace(s[1], elbow_y, elbow_x)
>>> plt.plot(l_curve_x_values, l_curve_y_values, b_line_x_values, b_line_y_values, p_line_x_values, p_line_y_values)
>>> red_point = proj + s
>>> plt.plot(red_point[0], red_point[1], 'ro')
>>> plt.show()

2 Answers


If you are using the plot to visually determine if the solution looks correct, you must plot the data using the same scale on each axis, i.e. use plt.axis('equal'). If the axes do not have equal scales, the angles between lines are distorted in the plot.
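
For example, a minimal sketch (using docScores and red_point exactly as defined in the question):

import matplotlib.pyplot as plt

# docScores and red_point as defined in the question
xs = [pt[0] for pt in docScores]
ys = [pt[1] for pt in docScores]

plt.plot(xs, ys)
plt.plot(red_point[0], red_point[1], 'ro')
plt.axis('equal')   # same scale on both axes, so perpendicularity is not distorted
plt.show()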

answered 2016-10-06T14:02:54.353

First of all, is the point at ~(50, 37) p or s+p? If p, that might be your problem right there! If the Y component of your p variable is positive, you won't get the results you expect when you do the dot product.

Assuming that point is s+p, if a bit of Post-It scribbling is correct,

p_len = np.linalg.norm(p)
p_hat = p / p_len
red_len = p_hat.dot(b_hat) * p_len   # red_len = |x-s|
    # because p_hat . b_hat = 1 * 1 * cos(angle) = |x-s| / |p|
red_point = s + red_len * b_hat

Not tested! YMMV. Hope this helps.
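
A quick way to sanity-check it with made-up vectors (not your actual data): the residual between s + p and the projected point should come out perpendicular to b.

import numpy as np

# Made-up vectors, only to check the formula -- not the data from the question.
s = np.array([0.0, 68.0])
b = np.array([349.0, -40.0])   # vector from s to the last point of the curve
p = np.array([50.0, -31.0])    # vector from s to the elbow point

b_hat = b / np.linalg.norm(b)

p_len = np.linalg.norm(p)
p_hat = p / p_len
red_len = p_hat.dot(b_hat) * p_len   # red_len = |x - s|
red_point = s + red_len * b_hat

# The residual between the tip of p (i.e. s + p) and the projected point
# should be perpendicular to b:
assert np.isclose((s + p - red_point).dot(b_hat), 0.0)
print(red_point)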

answered 2016-10-06T12:12:43.370