
I am trying to export a large number of rows (160,000,000+) from InfluxDB to a CSV file. So far all I have managed to do is blow up the memory on the machine running the query. I'm at a loss as to how to export this many rows without exhausting the memory of the machine running the export. Any ideas? I have also tried the CLI without any luck.

I have tried the code below:

def export_to_csv_file(self, file_name, header, query):
    logger.info("Executing query {}".format(query))
    # get_points() walks every chunk, so the entire result set is
    # materialized into a single DataFrame here
    dfs = pd.DataFrame(self.client.query(query, chunked=True, chunk_size=10000).get_points())
    dfs.to_csv(file_name, index=False, columns=header, encoding='utf-8')
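
What I was aiming for is to write each chunk out as it arrives instead of collecting everything into one DataFrame first. A minimal sketch of that idea (assuming influxdb-python >= 5.0.0, where chunked=True makes query() return a generator of per-chunk ResultSet objects):

import csv

def export_to_csv_streaming(client, file_name, header, query):
    # chunked=True on influxdb-python >= 5.0.0 yields one ResultSet
    # per chunk instead of parsing the whole response at once
    chunks = client.query(query, chunked=True, chunk_size=10000)
    with open(file_name, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=header, extrasaction='ignore')
        writer.writeheader()
        for chunk in chunks:
            # each chunk is written out and discarded, so memory use
            # stays bounded by the chunk size
            writer.writerows(chunk.get_points())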

Any tips or hints on how to export this data successfully?


1 Answer


This can be done with the influx_inspect CLI tool plus some bash/grep/tr/cut postprocessing. It worked for me without memory problems when exporting >300M rows from InfluxDB v1.2.4.

The key was to use influx_inspect; commands like influx -database 'metrics' -execute 'select * from cpu' -format 'csv' failed miserably.

A script like this will create files with your data in InfluxDB line protocol format; exporting one day per run keeps each output file to a manageable size:

#!/bin/bash
month=2017-04

db=YOUR_DBNAME
rp=autogen
datadir=/data/influxdb/data
waldir=/data/influxdb/wal
outdir=/somepath/influx_export

# days 01..10 of the month; extend the list to cover the days you need
for d in 0{1..9} 10 ; do
    cmd="influx_inspect export -database $db -retention $rp -datadir $datadir -waldir $waldir -compress -start ${month}-${d}T00:00:00Z -end ${month}-${d}T23:59:59Z -out $outdir/export.${month}-${d}.lineproto.gz"
    echo "$(date) Running time $cmd"
    time $cmd
    echo "$(date) Done"
done
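
To sanity-check one day's export before postprocessing, you can count the lines in the compressed file, e.g. with a Python one-off run in the $outdir directory (file name per the naming scheme above; the count includes the header lines influx_inspect writes before the data):

import gzip

# count lines in one day's compressed export; the total includes the
# few header lines that precede the data
with gzip.open('export.2017-04-01.lineproto.gz', 'rt') as f:
    print(sum(1 for _ in f))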

These lineproto files can then be converted to CSV in a postprocessing step.

In my case, the data lines in the output file looked like this:

# some header lines then data lines:
device_interfaces,device=10.99.0.6,iface_in=998,iface_out=87 packets=1030000i 1488358500000000000
device_interfaces,device=10.99.0.6,iface_in=998,iface_out=87 packets=2430000i 1488358800000000000
device_interfaces,device=10.99.0.6,iface_in=998,iface_out=875 bytes=400000i 1488355200000000000
device_interfaces,device=10.99.0.6,iface_in=998,iface_out=875 bytes=400000i 1488356400000000000
device_interfaces,device=10.99.0.6,iface_in=998,iface_out=875 packets=10000i 1488355200000000000

The bad thing here is that each measurement's data fields come on separate rows, in random order.

In my case the conversion script just put each measurement data field (packets and bytes) into a separate CSV file (I joined them back together later in the database). You may need to customize it or write your own; see the join sketch after the script.

MEASUREMENT=YOUR_MEASUREMENT_NAME
for file in *lineproto.gz ; do
    echo "--- $(date) Processing file $file ...."

    for field in packets bytes ; do
        # uncompress, keep only this measurement's lines for this field,
        # delete junk chars and the measurement name, replace spaces with
        # commas, then cut off the leading comma left behind
        gzip -dc ${file} | grep "${MEASUREMENT},device" | grep $field | tr -d a-zA-Z_=- | tr -s ' ' , | cut -b1 --complement >> field_${field}.csv
        echo "Conversion for field ${field} done"
    done
    echo "--- File $file processed at $(date)"
done
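
For completeness, a hypothetical sketch of joining the per-field files back together outside the database, with the column layout produced by the pipeline above (device, iface_in, iface_out, value, timestamp). It keeps one file's worth of keys in memory, so run it per daily export rather than on the full >300M-row set:

import csv

def join_fields(packets_csv, bytes_csv, out_csv):
    # key = (device, iface_in, iface_out, timestamp); value = [packets, bytes]
    rows = {}
    with open(packets_csv, newline='') as f:
        for device, iface_in, iface_out, packets, ts in csv.reader(f):
            rows[(device, iface_in, iface_out, ts)] = [packets, '']
    with open(bytes_csv, newline='') as f:
        for device, iface_in, iface_out, nbytes, ts in csv.reader(f):
            rows.setdefault((device, iface_in, iface_out, ts), ['', ''])[1] = nbytes
    with open(out_csv, 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(['device', 'iface_in', 'iface_out', 'timestamp', 'packets', 'bytes'])
        for (device, iface_in, iface_out, ts), (packets, nbytes) in sorted(rows.items()):
            w.writerow([device, iface_in, iface_out, ts, packets, nbytes])

# e.g. join_fields('field_packets.csv', 'field_bytes.csv', 'joined.csv')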
answered 2018-01-11T18:13:43.630