我XmlSource.from
用来读取存储在 Cloud Storage 存储桶中的 XML 文件。
XmlSource<Data> source = XmlSource.<Data>from("gs://<my-url>/TestData.xml")
.withRootElement("data")
.withRecordElement("record")
.withRecordClass(Data.class);
p.apply(Read.from(source))
.apply(RemoveDuplicates.<Data>create())
.apply(ParDo.of(new XMLPipeline.CreateItemQtyMapping()))
.apply(Combine.<String, Integer>perKey(new SumIntegers()))
.apply("FormatResults", MapElements.via(
new SimpleFunction<KV<String, Integer>, String>() {
@Override
public String apply(KV<String, Integer> input) {
return input.getKey() + "," + input.getValue();
}
}))
.apply(TextIO.Write.to("gs://<my-url>.appspot.com/pos-pipeline-output/ItemCounts"));
p.run();
但我得到了这个例外:
017-01-09T14:01:31.107Z: Error: (c88c756cabe0dbec): java.io.IOException: Failed to start reading from source: StaticValueProvider{value=gs://<my-url>/TestData.xml} range [48524, 97048)
at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:534)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:387)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:217)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:182)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:69)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:284)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:220)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:170)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:192)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:172)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:159)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: com.sun.xml.internal.stream.XMLInputFactoryImpl cannot be cast to org.codehaus.stax2.XMLInputFactory2
at com.google.cloud.dataflow.sdk.io.XmlSource$XMLReader.setUpXMLParser(XmlSource.java:490)
at com.google.cloud.dataflow.sdk.io.XmlSource$XMLReader.startReading(XmlSource.java:356)
at com.google.cloud.dataflow.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:528)
at com.google.cloud.dataflow.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:281)
at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:531)
... 14 more
这些是我的 pom.xml 中的依赖项:
<dependencies>
<dependency>
<groupId>com.google.cloud.dataflow</groupId>
<artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
<version>1.9.0</version>
</dependency>
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-storage</artifactId>
<version>0.7.0</version>
</dependency>
<dependency>
<groupId>org.codehaus.woodstox</groupId>
<artifactId>stax2-api</artifactId>
<version>4.0.0</version>
</dependency>
我不确定这里有什么问题。有人可以指点一下吗?
谢谢,
阿布舍克