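I'm using the Natural Language framework to find a person's name on a credit card. First I read the card's text with the Vision framework, then I concatenate the recognized strings.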

So I end up with text in a format similar to this:

"Citi 6011 1111 1111 1117 07/25 ELON MUSK Discover Debit"

I've tried simply looking for .personalName NLTags in a string like this, but the results aren't perfect.

let tagger = NLTagger(tagSchemes: [.nameType])
tagger.string = text
tagger.setLanguage(.english, range: range)

var fullNames = [String]()

// 1) personalName
let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace, .omitOther, .joinNames]
let foundTags = tagger.tags(in: range, unit: .word, scheme: .nameType, options: options)

foundTags.forEach { (tag, tokenRange) in
    if tag == .personalName {
        let name = text[tokenRange]
        fullNames.append(String(name))
    }
}

So I'm trying to use CreateML in a macOS playground to create a CoreML model that tags a custom CARDHOLDER label in strings like this.

How much labeled sample data is needed to train such a model before it becomes usable?

Right now I only have a fairly simple set of samples like this:

[
    {
        "tokens": ["Inteligo", "4242 4242 4242 4242", "09/21", "JOHN SMITH", "VISA", "Debit"],
        "labels": ["ORG", "NONE", "NONE", "CARDHOLDER", "ORG", "NONE"]
    },
    {
        "tokens": ["mBank", "4000 0566 5566 5556", "10/23", "STEVE JOBS", "VISA", "Debit"],
        "labels": ["ORG", "NONE", "NONE", "CARDHOLDER", "ORG", "NONE"]
    },
    {
        "tokens": ["ING", "5555 5555 5555 4444", "03/22", "EMMA WATSON", "mastercard", "Debit"],
        "labels": ["ORG", "NONE", "NONE", "CARDHOLDER", "ORG", "NONE"]
    },
    {
        "tokens": ["Bank of America", "3782 822463 10005", "05/24", "JULIA ROBERTS", "American Express", "Debit"],
        "labels": ["ORG", "NONE", "NONE", "CARDHOLDER", "ORG", "NONE"]
    },
    {
        "tokens": ["Citi", "6011 1111 1111 1117", "07/25", "ELON MUSK", "Discover", "Debit"],
        "labels": ["ORG", "NONE", "NONE", "CARDHOLDER", "ORG", "NONE"]
    }
]
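One thing that may matter besides the number of samples: the samples use multi-word tokens such as "JOHN SMITH", while a tagger queried with unit: .word sees single words. A minimal sketch, assuming whitespace splitting is close enough to the tagger's tokenization (it may not be for tokens like "07/25"; NLTokenizer would match more closely), that converts a sample to word-level tokens. The Sample type and the wordLevel helper are hypothetical names introduced only for this illustration:

```swift
import Foundation

// One training sample in the JSON layout used in this question.
struct Sample: Codable, Equatable {
    var tokens: [String]
    var labels: [String]
}

// Hypothetical helper: splits every multi-word token on spaces and
// repeats the token's label for each resulting word.
func wordLevel(_ sample: Sample) -> Sample {
    var tokens = [String]()
    var labels = [String]()
    for (token, label) in zip(sample.tokens, sample.labels) {
        for word in token.split(separator: " ") {
            tokens.append(String(word))
            labels.append(label)
        }
    }
    return Sample(tokens: tokens, labels: labels)
}

let sample = Sample(
    tokens: ["Citi", "6011 1111 1111 1117", "07/25", "ELON MUSK", "Discover", "Debit"],
    labels: ["ORG", "NONE", "NONE", "CARDHOLDER", "ORG", "NONE"])

let converted = wordLevel(sample)
print(converted.tokens)
print(converted.labels)
```

With this conversion "ELON" and "MUSK" each carry the CARDHOLDER label, so consecutive CARDHOLDER words would have to be joined again when reading the tagger's output.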

Then I create a CoreML model from it:

import Foundation
import CreateML

let trainingData = try MLDataTable(contentsOf:
    Bundle.main.url(forResource: "data", withExtension: "json")!)

let model = try MLWordTagger(trainingData: trainingData, tokenColumn: "tokens", labelColumn: "labels")

try model.write(to: URL(fileURLWithPath: "#path/to/Desktop/creditcardtagger.mlmodel"))
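One way to judge whether the data is sufficient might be to hold some samples out and measure the error on them. A rough sketch, assuming the same data.json resource and column names as in this question; the 80/20 ratio and the seed are arbitrary choices, and with only five samples the held-out set is far too small for the number to mean anything:

```swift
import Foundation
import CreateML

let data = try MLDataTable(contentsOf:
    Bundle.main.url(forResource: "data", withExtension: "json")!)

// Hold out roughly 20% of the samples; the seed makes the split repeatable.
let (trainingSet, testSet) = data.randomSplit(by: 0.8, seed: 42)

let wordTagger = try MLWordTagger(trainingData: trainingSet,
                                  tokenColumn: "tokens",
                                  labelColumn: "labels")

// Fraction of held-out tokens that received the wrong label.
let metrics = wordTagger.evaluation(on: testSet,
                                    tokenColumn: "tokens",
                                    labelColumn: "labels")
print("Held-out tagging error: \(metrics.taggingError)")
```

Repeating this while adding samples would show whether the held-out error is still falling or has levelled off.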

Then I use it:

let tagger = CreditCardTagger.shared
tagger.string = text
tagger.setLanguage(.english, range: range)
tagger.enumerateTags(in: range, unit: .word, scheme: CreditCardTagger.Scheme, options: options) { (tag, tokenRange) -> Bool in
    if tag == CreditCardTagger.cardHolderTag {
        let cardHolder = text[tokenRange]
        print("CARD HOLDER: \(cardHolder)")
    }
    return true
}
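The CreditCardTagger wrapper itself is not shown here. For context, a minimal sketch of how such a wrapper could attach the custom model to an NLTagger; the scheme name "CreditCard", the function names, and the modelURL parameter are assumptions for illustration, not the actual implementation:

```swift
import Foundation
import CoreML
import NaturalLanguage

// Hypothetical scheme name; the tag string has to match the training label.
let creditCardScheme = NLTagScheme("CreditCard")
let cardHolderTag = NLTag("CARDHOLDER")

// Builds an NLTagger backed by the custom word-tagging model at `modelURL`.
func makeCreditCardTagger(modelURL: URL) throws -> NLTagger {
    // A raw .mlmodel file has to be compiled before NLModel can load it.
    let compiledURL = try MLModel.compileModel(at: modelURL)
    let model = try NLModel(contentsOf: compiledURL)
    let tagger = NLTagger(tagSchemes: [creditCardScheme])
    tagger.setModels([model], forTagScheme: creditCardScheme)
    return tagger
}

// Collects every word the custom model labels as CARDHOLDER.
func cardHolders(in text: String, using tagger: NLTagger) -> [String] {
    tagger.string = text
    let range = text.startIndex..<text.endIndex
    var names = [String]()
    tagger.enumerateTags(in: range, unit: .word, scheme: creditCardScheme,
                         options: [.omitWhitespace]) { tag, tokenRange in
        if tag == cardHolderTag {
            names.append(String(text[tokenRange]))
        }
        return true
    }
    return names
}
```

Inside an app target, the class Xcode generates for the .mlmodel file could be passed to NLModel(mlModel:) instead of compiling the file at run time.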

But I suspect the data I'm training on is insufficient. Do you know how many such records are needed to cover most credit card cases?
