I'm trying to generate my training samples using a JSONL spec I found somewhere on the Google ML site. But when I import my data, I get:

Error: Annotation 1 in line 1 of gs://tx_harris_rel_0/tx_harris_rel_0.jsonl referenced by gs://tx_harris_rel_0/tx_harris_rel_0.csv: The text content is not interchange valid.

(This is repeated for all of my annotations, for every JSONL line/document.)


{ "annotations": [{ "text_extraction": { "text_segment": { "end_offset": 96, "start_offset": 84} }, "display_name": "_NAME_TO_" }, { "text_extraction": { "text_segment": { "end_offset": 184, "start_offset": 167} }, "display_name": "_NAME_FROM_" }], "text_snippet": {"content": "RELEASE OF CHILD SUPPORT LIEN\n\nTo the County Clerk: Harris County, Texas\nObligor:\tPETE VASQUEZ\n\nDate of Birth: 11/21/1971\n\nDL#:\txxxxx413\nSSN:\txxx-xx-x629\nObligee:\tCHRISTINA L PYRON\n\nCourt:\t311 TH JUDICIAL DISTRICT, HARRIS COUNTY, TEXAS\nCause #:\t9443004\n\nAG#:\t0213575481\tUNIT:0615E\n\nChild support lien being released: U396484 filed on May 18,2000.\n\nIn accordance with Texas Family Code § 157.321, the Office of the Attorney General of the State of Texas releases the child support lien described above.\n\nUnder penalty of perjury, I affirm and declare the foregoing to be a true statement.\n\nMaribdl Davila\nOffice of the Attorney General\nChild Support Division\n\nState of Texas\n\nCounty of Travis\n\nBefore me, the undersigned notary public, on this day personally appeared Maribel Davila known to me to be the person whose name is subscribed to the foregoing instrument and acknowledged to me that he/she executed the same for the purposes and consideration therein expressed.\n\nGiven under my hand and seal of office on December 28,2016.\n\nNotary Public\n\nLAURA DICKERSON\nNotary Public.State of Texas\n\nRELEASE OF LIEN\nPage I of 1\nNotary ID #12890916 3\nCommission ExpA'ARCH 09,2020\nNotary without Bond '\n\f"} }

1 Answer


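The problem is \v (vertical tab) and some other control characters (< \u0020). Nothing in the documentation mentions it, but Google ML does not like vertical tabs in ML text content. Unfortunately, my OCR engine tends to produce them (and other surprises) here and there. The documentation at https://cloud.google.com/natural-language/automl/docs/prepare?_ga=2.263860879.-2053288092.1582141786 is helpful (it covers entity extraction extensively), but I can't see much there about text content.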
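For anyone hitting the same thing, here is a minimal sketch of how the OCR text could be cleaned before it goes into the JSONL file. It is not taken from any Google documentation. The function names `sanitize_text` and `sanitize_jsonl_line` are made up for illustration. The exact set of rejected characters is an assumption: the sketch keeps tab, line feed and carriage return and replaces every other character below \u0020 (including \v and \f). Each offending character is swapped for a single space rather than deleted, so the `start_offset` / `end_offset` values in the annotations should still line up.

```python
import json
import re

# Assumption: tab (\x09), line feed (\x0a) and carriage return (\x0d) are
# accepted by the importer; every other character below U+0020 is treated
# as a problem. Adjust the class if your import still fails.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def sanitize_text(text: str) -> str:
    """Replace each disallowed control character with one space.

    A one-for-one replacement keeps the string length unchanged, so
    existing annotation offsets remain valid.
    """
    return _CONTROL_CHARS.sub(" ", text)


def sanitize_jsonl_line(line: str) -> str:
    """Clean the text_snippet.content field of one JSONL document."""
    doc = json.loads(line)
    snippet = doc["text_snippet"]
    snippet["content"] = sanitize_text(snippet["content"])
    return json.dumps(doc, ensure_ascii=False)
```

Running every line of the `.jsonl` file through `sanitize_jsonl_line` before uploading it to the bucket is one way to apply this; if you would rather delete the characters outright, the annotation offsets have to be recomputed afterwards.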

Answered 2020-03-19T20:47:42.567