tensorflow - Understanding the L-infinity norm used in TFDV

Question

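I am trying to use TensorFlow Data Validation (TFDV) to check a dataset for drift/skew. It uses the L-infinity norm as the measure, and I don't understand the concept. Can anyone explain how it is calculated, and why a threshold of 0.01 is used here?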

import tensorflow_data_validation as tfdv

# Compute statistics for the day-1 training data (schema and train_day2_stats are assumed to exist already).
train_day1_stats = tfdv.generate_statistics_from_tfrecord(data_location=train_day1_data_path)
# Add a drift comparator to schema for 'payment_type' and set the threshold of L-infinity norm for triggering drift anomaly to be 0.01.
tfdv.get_feature(schema, 'payment_type').drift_comparator.infinity_norm.threshold = 0.01
drift_anomalies = tfdv.validate_statistics(
    statistics=train_day2_stats, schema=schema, previous_statistics=train_day1_stats)

[Image from the TensorFlow website]

score 0 · Accepted Answer

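COMPARATOR_L_INFTY_HIGH is triggered as follows:

Schema fields used:
* feature.skew_comparator.infinity_norm.threshold
* feature.drift_comparator.infinity_norm.threshold
Statistics fields:
* feature.string_stats.rank_histogram
Detection condition: the L-infinity norm of the vector representing the difference between the normalized counts from feature.string_stats.rank_histogram in the control statistics (i.e., serving statistics for skew, or previous statistics for drift) and in the treatment statistics (i.e., training statistics for skew, or current statistics for drift) > feature.skew_comparator.infinity_norm.threshold or feature.drift_comparator.infinity_norm.threshold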

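The L-infinity norm is basically max(abs(x1), ..., abs(xn)), where in this case x1 = count(values in bucket 1) / total values in the control set - count(values in bucket 1) / total values in the treatment set, and likewise for each remaining bucket. Once we have the L-infinity norm, we check whether it is > the applicable threshold (feature.skew_comparator.infinity_norm.threshold or feature.drift_comparator.infinity_norm.threshold); if so, COMPARATOR_L_INFTY_HIGH is triggered. With a threshold of 0.01, the anomaly fires as soon as the relative frequency of any single value differs by more than one percentage point between the two sets of statistics. The actual value (0.01) is not special; it needs to be chosen based on your specific use case and data statistics.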
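To make the calculation concrete, here is a minimal sketch in plain Python. It does not call TFDV; the helper names (normalized_counts, l_infinity_distance) and the 'payment_type' values for the two days are made up for illustration, and both inputs are assumed to be non-empty. It only mirrors the description above: normalize the per-value counts in each set, take the per-value differences, and keep the largest absolute difference.

```python
from collections import Counter


def normalized_counts(values):
    """Map each distinct value to its share of the total (hypothetical helper)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(control_values, treatment_values):
    """Largest absolute difference in relative frequency over all values."""
    control = normalized_counts(control_values)
    treatment = normalized_counts(treatment_values)
    # A value missing from one side contributes a frequency of 0 there.
    all_values = set(control) | set(treatment)
    return max(
        abs(control.get(value, 0.0) - treatment.get(value, 0.0))
        for value in all_values
    )


# Made-up 'payment_type' values: day 1 is the control, day 2 the treatment.
day1 = ["Cash"] * 50 + ["Credit Card"] * 45 + ["No Charge"] * 5
day2 = ["Cash"] * 40 + ["Credit Card"] * 52 + ["No Charge"] * 8

distance = l_infinity_distance(day1, day2)
threshold = 0.01  # same value as in the question
drift_detected = distance > threshold

print(f"L-infinity distance: {distance:.2f}, drift detected: {drift_detected}")
```

In this made-up data the per-value differences are 0.10 (Cash), 0.07 (Credit Card) and 0.03 (No Charge), so the distance is about 0.10 and a threshold of 0.01 would flag drift. A threshold that tight flags even small shifts in the distribution, which is why it is usually tuned per feature.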

score 0 · Accepted Answer

The detailed detection conditions are explained in the TensorFlow documentation (link below):

https://www.tensorflow.org/tfx/data_validation/anomalies

For the case you mention:

COMPARATOR_L_INFTY_HIGH

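Schema fields:

feature.skew_comparator.infinity_norm.threshold
feature.drift_comparator.infinity_norm.threshold

Statistics fields:

feature.string_stats.rank_histogram*

Detection condition: the L-infinity norm of the vector representing the difference between the normalized counts from feature.string_stats.rank_histogram in the control statistics (i.e., serving statistics for skew, or previous statistics for drift) and in the treatment statistics (i.e., training statistics for skew, or current statistics for drift) > feature.skew_comparator.infinity_norm.threshold or feature.drift_comparator.infinity_norm.threshold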
