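I'm using the Natural Language framework to find a person's name on a credit card. First, I read the text on the card with the Vision framework. Then I concatenate the recognized pieces.
So I end up with text in a format similar to this: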
"Citi 6011 1111 1111 1117 07/25 ELON MUSK Discover Debit"
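For context, the Vision step that produces a string like the one above might look roughly like the following. This is only a minimal sketch, not my exact code: the function names `recognizeCardText(in:)` and `joinRecognizedLines(_:)` are made up for illustration, and it assumes the card is already available as a `CGImage`.

```swift
import Foundation
import Vision
import CoreGraphics

/// Hypothetical helper: trims each recognized line, drops empty ones,
/// and joins the rest with single spaces.
func joinRecognizedLines(_ lines: [String]) -> String {
    return lines
        .map { $0.trimmingCharacters(in: .whitespacesAndNewlines) }
        .filter { !$0.isEmpty }
        .joined(separator: " ")
}

/// Hypothetical helper: runs Vision text recognition on a card image
/// and returns all recognized lines concatenated into one string.
func recognizeCardText(in image: CGImage) throws -> String {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    // Take the best candidate string of every text observation.
    let observations = request.results as? [VNRecognizedTextObservation] ?? []
    let lines = observations.compactMap { $0.topCandidates(1).first?.string }
    return joinRecognizedLines(lines)
}
```

Joining with a single space is what flattens the card layout into one line, which is why the cardholder name ends up surrounded by numbers and brand names.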
I have tried simply looking for .personalName NLTag values in such a string, but it doesn't work perfectly.
import NaturalLanguage

// `text` is the concatenated string recognized by Vision (see the sample above).
let range = text.startIndex..<text.endIndex
let tagger = NLTagger(tagSchemes: [.nameType])
tagger.string = text
tagger.setLanguage(.english, range: range)
var fullNames = [String]()
// 1) personalName
let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace, .omitOther, .joinNames]
let foundTags = tagger.tags(in: range, unit: .word, scheme: .nameType, options: options)
foundTags.forEach { (tag, tokenRange) in
    if tag == .personalName {
        let name = text[tokenRange]
        fullNames.append(String(name))
    }
}
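So I'm trying to use CreateML in a macOS playground to create a CoreML model that tags such strings with a custom CARDHOLDER label.
How much labeled example data is needed to train such a model before it becomes usable?
Right now I only have this fairly simple sample: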
[
    {
        "tokens": ["Inteligo", "4242 4242 4242 4242", "09/21", "JOHN SMITH", "VISA", "Debit"],
        "labels": ["ORG", "NONE", "NONE", "CARDHOLDER", "ORG", "NONE"]
    },
    {
        "tokens": ["mBank", "4000 0566 5566 5556", "10/23", "STEVE JOBS", "VISA", "Debit"],
        "labels": ["ORG", "NONE", "NONE", "CARDHOLDER", "ORG", "NONE"]
    },
    {
        "tokens": ["ING", "5555 5555 5555 4444", "03/22", "EMMA WATSON", "mastercard", "Debit"],
        "labels": ["ORG", "NONE", "NONE", "CARDHOLDER", "ORG", "NONE"]
    },
    {
        "tokens": ["Bank of America", "3782 822463 10005", "05/24", "JULIA ROBERTS", "American Express", "Debit"],
        "labels": ["ORG", "NONE", "NONE", "CARDHOLDER", "ORG", "NONE"]
    },
    {
        "tokens": ["Citi", "6011 1111 1111 1117", "07/25", "ELON MUSK", "Discover", "Debit"],
        "labels": ["ORG", "NONE", "NONE", "CARDHOLDER", "ORG", "NONE"]
    }
]
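Since every record has to pair each token with exactly one label, a small sanity check on the JSON before training may be useful. The sketch below is only an illustration: `TaggedSample` and `misalignedSampleIndices(in:)` are hypothetical names, and the second record in `sampleJSON` is deliberately broken, made-up data that is not part of my training set.

```swift
import Foundation

/// Mirrors one record of the training JSON (hypothetical type name).
struct TaggedSample: Decodable {
    let tokens: [String]
    let labels: [String]
}

/// Returns the indices of records whose token and label counts differ.
func misalignedSampleIndices(in jsonData: Data) throws -> [Int] {
    let samples = try JSONDecoder().decode([TaggedSample].self, from: jsonData)
    return samples.enumerated()
        .filter { $0.element.tokens.count != $0.element.labels.count }
        .map { $0.offset }
}

// One valid record from the data above, plus one intentionally broken record.
let sampleJSON = """
[
  {"tokens": ["Citi", "6011 1111 1111 1117", "07/25", "ELON MUSK", "Discover", "Debit"],
   "labels": ["ORG", "NONE", "NONE", "CARDHOLDER", "ORG", "NONE"]},
  {"tokens": ["ING", "03/22"],
   "labels": ["ORG"]}
]
"""

let misaligned = try misalignedSampleIndices(in: Data(sampleJSON.utf8))
print(misaligned) // prints "[1]"
```

A check like this only catches structural mistakes; it says nothing about whether the labels themselves are right or whether there are enough records.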
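Then I create a CoreML model from it:
import Foundation
import CreateML

// Load the labeled samples from data.json in the playground's resources.
let trainingData = try MLDataTable(contentsOf:
    Bundle.main.url(forResource: "data", withExtension: "json")!)
// Train a word tagger on the "tokens" and "labels" columns.
let model = try MLWordTagger(trainingData: trainingData, tokenColumn: "tokens", labelColumn: "labels")
// Export the trained model.
try model.write(to: URL(fileURLWithPath: "#path/to/Desktop/creditcardtagger.mlmodel"))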
Then I use it:
let tagger = CreditCardTagger.shared
tagger.string = text
tagger.setLanguage(.english, range: range)
tagger.enumerateTags(in: range, unit: .word, scheme: CreditCardTagger.Scheme, options: options) { (tag, tokenRange) -> Bool in
    if tag == CreditCardTagger.cardHolderTag {
        let cardHolder = text[tokenRange]
        print("CARD HOLDER: \(cardHolder)")
    }
    return true
}
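`CreditCardTagger` is a small wrapper of my own that is not shown in full here. A minimal sketch of what such a wrapper could look like follows; the scheme name "CreditCard" and the bundled resource name "creditcardtagger.mlmodelc" are assumptions made for this illustration.

```swift
import Foundation
import NaturalLanguage

/// Hypothetical wrapper that exposes an NLTagger backed by the custom model.
enum CreditCardTagger {
    /// Custom tag scheme; the name "CreditCard" is an assumption.
    static let Scheme = NLTagScheme("CreditCard")

    /// Tag matching the CARDHOLDER label used in the training data.
    static let cardHolderTag = NLTag("CARDHOLDER")

    /// Shared tagger with the custom model attached to the custom scheme.
    static let shared: NLTagger = {
        let tagger = NLTagger(tagSchemes: [CreditCardTagger.Scheme])
        // Assumes the compiled model ships in the app bundle under this name.
        if let url = Bundle.main.url(forResource: "creditcardtagger", withExtension: "mlmodelc"),
           let model = try? NLModel(contentsOf: url) {
            tagger.setModels([model], forTagScheme: CreditCardTagger.Scheme)
        }
        return tagger
    }()
}
```

If the model cannot be loaded, this sketch silently returns a tagger with no model attached, so a real implementation would probably want to report that failure.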
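But I think the data I'm using for training is insufficient. Do you know how many records like these are needed to cover most credit card cases?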