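How to compute Krippendorff's alpha for multi-label annotations?
In the multi-class case (say 3 coders annotate 4 texts with 3 labels: a, b, c), I first build the reliability data matrix, then the coincidence matrix, and from the coincidences I can compute alpha:
The question is: for a multi-label classification problem (as in the example below), how do I prepare the coincidences and compute alpha?
A Python implementation, or even an Excel one, would be much appreciated.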
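To make the multi-class procedure described above concrete, here is a minimal sketch of it for nominal labels: each item contributes its ordered label pairs to the coincidence counts with weight 1/(m - 1), where m is the number of ratings on that item, and alpha is 1 minus observed over expected disagreement. The function names and the sample ratings are hypothetical and only serve as an illustration.

```python
from collections import Counter
from itertools import permutations


def coincidence_matrix(units):
    """Build coincidence counts o[(c, k)] from one list of labels per item."""
    o = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single rating cannot be paired
        # every ordered pair of ratings on the item, weighted by 1 / (m - 1)
        for c, k in permutations(labels, 2):
            o[(c, k)] += 1 / (m - 1)
    return o


def nominal_alpha(units):
    """Krippendorff's alpha for nominal labels, via the coincidence matrix."""
    o = coincidence_matrix(units)
    totals = Counter()  # marginal totals n_c
    for (c, _k), weight in o.items():
        totals[c] += weight
    n = sum(totals.values())  # number of pairable ratings
    observed = sum(w for (c, k), w in o.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label in use, so no disagreement is possible
    return 1 - observed / expected


# Hypothetical reliability data: 4 texts, 3 coders, labels a/b/c.
texts = [
    ["a", "a", "a"],
    ["a", "b", "b"],
    ["c", "c", "c"],
    ["b", "b", "c"],
]
print(nominal_alpha(texts))
```

This version only counts exact matches, which is why it does not carry over directly to multi-label data, where two annotations can overlap partially.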
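I came across your question while looking for similar information. We used the code below, with nltk.agreement for the metrics and pandas_ods_reader to read the data from a LibreOffice spreadsheet. Our data has two annotators, and some items can carry two labels (for example, one coder assigns a single label while the other assigns two).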
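The spreadsheet screenshot below shows the structure of the input data. The column for the annotated items is called annotItem, and the annotation columns are called coder1 and coder2. When an item has more than one label, the separator is a pipe, unlike the comma in your example.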
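Because a cell can hold several labels, each annotation is treated as a set, and agreement is scored with a set distance instead of an exact match. The following is a rough pure-Python sketch of that idea: parse_labels, jaccard and masi are hypothetical helper names, cells are assumed to be non-empty strings, and the MASI weights (1, 0.67, 0.33, 0) follow my understanding of how NLTK's masi_distance behaves rather than being copied from it.

```python
def parse_labels(cell, sep="|"):
    """Turn a cell such as '0|1' into frozenset({'0', '1'})."""
    return frozenset(part.strip() for part in str(cell).split(sep))


def jaccard(a, b):
    """Jaccard distance: 1 - |a & b| / |a | b| (assumes non-empty sets)."""
    return 1 - len(a & b) / len(a | b)


def masi(a, b):
    """MASI distance: Jaccard similarity scaled by a monotonicity weight."""
    inter = len(a & b)
    if a == b:
        weight = 1.0       # identical sets
    elif inter == min(len(a), len(b)):
        weight = 0.67      # one set is contained in the other
    elif inter > 0:
        weight = 0.33      # overlap, but neither contains the other
    else:
        weight = 0.0       # disjoint sets
    return 1 - inter / len(a | b) * weight


one = parse_labels("1")
both = parse_labels("0|1")
print(jaccard(one, both), masi(one, both))
```

Both distances are 0 for identical label sets and 1 for disjoint ones; MASI penalizes partial overlap more heavily than Jaccard does.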
The code is inspired by this SO post: Low alpha for NLTK agreement using MASI distance
from nltk import agreement
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance
import pandas_ods_reader as pdreader

annotfile = "test-iaa-so.ods"
df = pdreader.read_ods(annotfile, "Sheet1")
annots = []


def create_annot(an):
    """
    Create frozensets with the unique label
    or with both labels splitting on pipe.
    Unique label has to go in a list so that
    frozenset does not split it into characters.
    """
    if "|" in str(an):
        an = frozenset(an.split("|"))
    else:
        # single label has to go in a list
        # need to cast or not depends on your data
        an = frozenset([str(int(an))])
    return an


# one (coder, item, labels) triple per coder and row
for idx, row in df.iterrows():
    annot_id = row.annotItem + str(idx).zfill(3)
    annot_coder1 = ['coder1', annot_id, create_annot(row.coder1)]
    annot_coder2 = ['coder2', annot_id, create_annot(row.coder2)]
    annots.append(annot_coder1)
    annots.append(annot_coder2)

# based on https://stackoverflow.com/questions/45741934/
jaccard_task = agreement.AnnotationTask(distance=jaccard_distance)
masi_task = agreement.AnnotationTask(distance=masi_distance)
tasks = [jaccard_task, masi_task]
for task in tasks:
    task.load_array(annots)
    print("Statistics for dataset using {}".format(task.distance))
    print("C: {}\nI: {}\nK: {}".format(task.C, task.I, task.K))
    print("Pi: {}".format(task.pi()))
    print("Kappa: {}".format(task.kappa()))
    print("Multi-Kappa: {}".format(task.multi_kappa()))
    print("Alpha: {}".format(task.alpha()))
For the data in the screenshot linked from this answer, this prints:
Statistics for dataset using <function jaccard_distance at 0x7fa1464b6050>
C: {'coder1', 'coder2'}
I: {'item3002', 'item1000', 'item6005', 'item5004', 'item2001', 'item4003'}
K: {frozenset({'1'}), frozenset({'0'}), frozenset({'0', '1'})}
Pi: 0.1818181818181818
Kappa: 0.35714285714285715
Multi-Kappa: 0.35714285714285715
Alpha: 0.02941176470588236
Statistics for dataset using <function masi_distance at 0x7fa1464b60e0>
C: {'coder1', 'coder2'}
I: {'item3002', 'item1000', 'item6005', 'item5004', 'item2001', 'item4003'}
K: {frozenset({'1'}), frozenset({'0'}), frozenset({'0', '1'})}
Pi: 0.09181818181818181
Kappa: 0.2864285714285714
Multi-Kappa: 0.2864285714285714
Alpha: 0.017962466487935425
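If you want to see how alpha is put together without NLTK, the sketch below computes it directly as 1 - D_o / D_e for any distance function, which is what lets it handle set-valued (multi-label) annotations. It follows the standard textbook definition as I understand it; the names krippendorff_alpha and jaccard and the sample data are hypothetical, it is not NLTK's implementation, and I have not checked it against the numbers printed above.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard distance between two non-empty label sets."""
    return 1 - len(a & b) / len(a | b)


def krippendorff_alpha(units, distance):
    """
    units: one list per item, holding the annotation of each coder
           who rated that item (labels may be frozensets).
    distance: function returning 0 for identical annotations.
    """
    units = [u for u in units if len(u) >= 2]  # unpaired ratings are dropped
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable annotations")
    # observed disagreement: pairs within each item, weighted by 1 / (m - 1)
    observed = sum(
        sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # expected disagreement: all pairs of annotations, regardless of item
    expected = sum(
        distance(a, b) for a, b in permutations(values, 2)
    ) / (n * (n - 1))
    if expected == 0:
        return 1.0  # every annotation is identical
    return 1 - observed / expected


# Hypothetical data: two coders, labels given as sets.
items = [
    [frozenset({"1"}), frozenset({"1"})],
    [frozenset({"0"}), frozenset({"0", "1"})],
    [frozenset({"0", "1"}), frozenset({"0", "1"})],
    [frozenset({"0"}), frozenset({"1"})],
]
print(krippendorff_alpha(items, jaccard))
```

Passing a 0/1 exact-match distance reduces this to the nominal multi-class case from the question, so the same function covers both situations.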