
See the edit below.

We use MarkLogic Content Pump (mlcp) to load data into an ML8 database. We have a development environment where everything works fine, and a production environment where mlcp never gets past evaluating the number of files to process.

We have 2.1 million JSON documents to load.

On the development server (ML8 + CentOS 6) we see:

15/07/13 13:19:35 INFO contentpump.ContentPump: Hadoop library version: 2.0.0-alpha
15/07/13 13:19:35 INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of the  inserted documents will be determined by the MIME  type specification configured on MarkLogic Server.
15/07/13 13:19:35 WARN util.KerberosName: Kerberos krb5 configuration not found, setting default realm to empty
15/07/13 13:23:06 INFO input.FileInputFormat: Total input paths to process : 2147329
15/07/13 13:24:08 INFO contentpump.LocalJobRunner:  completed 0%
15/07/13 13:34:43 INFO contentpump.LocalJobRunner:  completed 1%
15/07/13 13:43:42 INFO contentpump.LocalJobRunner:  completed 2%
15/07/13 13:51:15 INFO contentpump.LocalJobRunner:  completed 3%

It completes fine and the data loads correctly.

Now we use the same data on a different machine, the production server (ML8 + CentOS 7), and we get:

15/07/14 17:02:21 INFO contentpump.ContentPump: Hadoop library version: 2.6.0
15/07/14 17:02:21 INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of the  inserted documents will be determined by the MIME  type specification configured on MarkLogic Server.

Apart from the different OS, the prod server also runs a newer mlcp build, reporting Hadoop library 2.6.0 instead of 2.0.0. If we use the same command to import a directory with only 2,000 files, it works fine on prod...

When counting the files to process, the job gets stuck...
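On dev, merely enumerating the 2,147,329 input paths already took about 3.5 minutes (13:19:35 to 13:23:06 in the log above), so one sanity check is whether the prod filesystem can even list the input tree in reasonable time; a minimal sketch (the input path is hypothetical):

time find /data/json -type f | wc -l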

What could the problem be?

Edit: we switched mlcp to DEBUG logging and tested with a small sample.zip.
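For reference, DEBUG output like the trace below can be switched on through the log4j configuration that ships with mlcp; a minimal sketch, assuming mlcp is unpacked under /opt/mlcp (a hypothetical path) with the stock conf/log4j.properties:

# Raise the root logger from INFO to DEBUG; the exact appender name
# may differ per distribution, e.g.:
#   log4j.rootLogger=DEBUG, stdout
vi /opt/mlcp/conf/log4j.properties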

The result:

[ashraf@77-72-150-125 ~]$ mlcp.sh import -host localhost -port 8140 -username ashraf -password duurz44m -input_file_path /home/ashraf/sample2.zip -input_compressed true  -mode local -output_uri_replace  "\".*,''\"" -output_uri_prefix incoming/linkedin/ -output_collections incoming,incoming/linkedin -output_permissions slush-dikw-node-role,read
15/07/16 16:36:31 DEBUG contentpump.ContentPump: Command: IMPORT
15/07/16 16:36:31 DEBUG contentpump.ContentPump: Arguments: -host localhost -port 8140 -username ashraf -password duurz44m -input_file_path /home/ashraf/sample2.zip -input_compressed true -mode local -output_uri_replace ".*,''" -output_uri_prefix incoming/linkedin/ -output_collections incoming,incoming/linkedin -output_permissions slush-dikw-node-role,read 
15/07/16 16:36:31 INFO contentpump.ContentPump: Hadoop library version: 2.6.0
15/07/16 16:36:31 DEBUG contentpump.ContentPump: Running in: localmode
15/07/16 16:36:31 INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of the  inserted documents will be determined by the MIME  type specification configured on MarkLogic Server.
15/07/16 16:36:32 DEBUG contentpump.LocalJobRunner: Thread pool size: 4
15/07/16 16:36:32 INFO input.FileInputFormat: Total input paths to process : 1
15/07/16 16:36:33 DEBUG contentpump.LocalJobRunner: Thread Count for Split#0 : 4
15/07/16 16:36:33 DEBUG contentpump.CompressedDocumentReader: Starting file:/home/ashraf/sample2.zip
15/07/16 16:36:33 DEBUG contentpump.MultithreadedMapper: Running with 4 threads
15/07/16 16:36:33 DEBUG mapreduce.ContentWriter: Connect to localhost
15/07/16 16:36:33 DEBUG mapreduce.ContentWriter: Connect to localhost
15/07/16 16:36:33 DEBUG mapreduce.ContentWriter: Connect to localhost
15/07/16 16:36:33 DEBUG mapreduce.ContentWriter: Connect to localhost
15/07/16 16:36:34 INFO contentpump.LocalJobRunner:  completed 0%
15/07/16 16:36:39 INFO contentpump.LocalJobRunner:  completed 100%
2015-07-16 16:39:11.483 WARNING [19] (AbstractRequestController.runRequest): Error parsing HTTP headers: Premature EOF, partial header line read: ''
15/07/16 16:39:12 DEBUG contentpump.CompressedDocumentReader: Closing file:/home/ashraf/sample2.zip
15/07/16 16:39:12 INFO contentpump.LocalJobRunner: com.marklogic.contentpump.ContentPumpStats: 
15/07/16 16:39:12 INFO contentpump.LocalJobRunner: ATTEMPTED_INPUT_RECORD_COUNT: 1993
15/07/16 16:39:12 INFO contentpump.LocalJobRunner: SKIPPED_INPUT_RECORD_COUNT: 0
15/07/16 16:39:12 INFO contentpump.LocalJobRunner: Total execution time: 160 sec

Only the first JSON file made it into the database; were the rest dropped/lost?
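A quick way to verify how many documents actually landed is to count the target collection in Query Console; a small sketch using standard MarkLogic built-ins, run against the target database:

(: count the documents mlcp was writing to this collection :)
xdmp:estimate(fn:collection("incoming/linkedin"))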

Could the newlines in the JSON files be a problem?

(AbstractRequestController.runRequest): Error parsing HTTP headers: Premature EOF, partial header line read: ''

Any hints would be great.

Hugo


1 Answer


I honestly don't know what is going on here. I think support would be interested in this case. Could you send them, or me, an email with more details (and perhaps the files)?

As a workaround: it shouldn't be hard to use the same MLCP version on the prod server as on dev. Just put it next to the other one (or anywhere you like), and make sure you reference that copy (hint: in Roxy you have the mlcp-home setting).
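For instance, along these lines (paths are hypothetical; the point is simply to invoke the copy you trust):

# Keep the known-good mlcp next to the new one and call it explicitly:
/opt/mlcp-dev/bin/mlcp.sh import -host localhost -port 8140 ...
# Or, when deploying with Roxy, point it at that copy in build.properties:
#   mlcp-home=/opt/mlcp-dev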

You could also consider zipping the JSON documents and using the -input_compressed option.
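A minimal sketch of that approach (file and directory names are hypothetical; the flags are the same ones used in the question):

# Pack the JSON documents into one archive so mlcp enumerates a single file
# instead of millions:
zip -rq /tmp/docs.zip /data/json
# Then load the archive in a single pass:
mlcp.sh import -host localhost -port 8140 -username user -password pass \
  -input_file_path /tmp/docs.zip -input_compressed true -mode local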

Answered 2015-07-15T17:56:33.697