python - Extracting and sorting data from .mdb files using mdbtools in Python

Question

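I'm very new to Python, so any help would be appreciated. I'm trying to extract and sort data from 2000 .mdb files using mdbtools on Linux. So far I've been able to take an .mdb file and dump all of its tables to .csv. That creates a huge mess, because there are so many files to process.

What I need is to extract specific, sorted data from a specific table. For example, I need the table called "Voltage". The table consists of many cycles, and each cycle has several rows. The cycles are usually in chronological order, but in some cases the timestamp is recorded with a delay, so the rows within a cycle are not necessarily in time order: the first row of a cycle can have a later time than a row that follows it. For the first or last five cycles, I need to extract the latest row of each cycle based on time. For example, in the table below, I would need the second row.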

Cycle#    Time        Data
  1      100.59        34
  1      101.34        54
  1      98.78         45  
  2      
  2
  2   ...........

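Here is the script I use. I currently run it with the command python extract.py table_files.mdb, but I would like to invoke the script with just ./extract.py. The paths of the files should be set in the script itself.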

import sys, subprocess, os

DATABASE = sys.argv[1]

subprocess.call(["mdb-schema", DATABASE, "mysql"])

# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", DATABASE],
                               stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()

print "BEGIN;" # start a transaction, speeds things up when importing
sys.stdout.flush()

# Dump each table as a CSV file using "mdb-export",
# converting " " in table names to "_" for the CSV filenames.
for table in tables:
    if table != '':
        filename = table.replace(" ","_") + ".csv"
        file = open(filename, 'w')
        print("Dumping " + table)
        contents = subprocess.Popen(["mdb-export", DATABASE, table],
                                    stdout=subprocess.PIPE).communicate()[0]
        file.write(contents)
        file.close()

score 3 · Accepted Answer

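Personally, I wouldn't spend a lot of time fussing around trying to get mdbtools, unixODBC and pyodbc to work together. As Pedro suggested in his comment, if you can get mdb-export to dump the tables to CSV files, then you will probably save a fair bit of time by just importing those CSV files into SQLite or MySQL, i.e. something more robust than using mdbtools on the Linux platform.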

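A few suggestions:

1. Given the sheer number of .mdb files (and hence .csv files) involved, you will probably want to import the CSV data into one big table, with an additional column indicating the source filename. That will be much easier to manage than roughly 2000 separate tables.
2. When creating the target table in the new database, you will probably want to use a decimal (rather than float) data type for the [Time] column.
3. At the same time, rename the [Cycle#] column to just [Cycle]. "Funny characters" in column names can be a real nuisance.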
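
The suggestions above could be sketched roughly as follows, using only the Python standard library and SQLite. This is a minimal sketch under several assumptions that are not in the original post: the function name import_voltage_csvs is hypothetical, each CSV file is assumed to hold the "Voltage" table of one source .mdb file, and the CSV header is assumed to be Cycle#, Time, Data as in the question's table. Note also that SQLite has no true fixed-point decimal type, so the [Time] column is declared NUMERIC here; in MySQL it could be declared DECIMAL instead.

```python
import csv
import sqlite3
from pathlib import Path


def import_voltage_csvs(csv_dir, db_path):
    """Load every *.csv in csv_dir into one Voltage table (hypothetical helper)."""
    conn = sqlite3.connect(db_path)
    # One big table; SourceFile records which CSV each row came from,
    # and the [Cycle#] column is stored under the plain name Cycle.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Voltage ("
        "SourceFile TEXT, Cycle INTEGER, [Time] NUMERIC, Data NUMERIC)"
    )
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        with open(csv_path, newline="") as handle:
            reader = csv.DictReader(handle)
            # Assumes every row has a non-empty Cycle# value.
            rows = [
                (csv_path.name, int(row["Cycle#"]), row["Time"], row["Data"])
                for row in reader
            ]
        conn.executemany("INSERT INTO Voltage VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn
```

Calling import_voltage_csvs("csv_dump", "voltage.db") (both names are placeholders) would then leave a single table that can be queried per source file and cycle.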

Finally, to select the "last" reading (the largest [Time] value) for a given [SourceFile] and [Cycle], you can use a query like this:

SELECT
    v1.SourceFile, 
    v1.Cycle,
    v1.Time, 
    v1.Data 
FROM 
    Voltage v1 
    INNER JOIN 
    (
        SELECT
            SourceFile, 
            Cycle, 
            MAX([Time]) AS MaxTime 
        FROM Voltage 
        GROUP BY SourceFile, Cycle
    ) v2 
        ON v1.SourceFile=v2.SourceFile 
           AND v1.Cycle=v2.Cycle 
           AND v1.Time=v2.MaxTime
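
To see what this query returns, here is a small sketch that runs it against an in-memory SQLite database loaded with the three sample rows from the question. The source file name a.mdb is made up for illustration, and [Time] is declared REAL here only to keep the demo short. An ORDER BY clause has been added so that the output order is predictable.

```python
import sqlite3

# Same join as in the answer, with an ORDER BY added for stable output.
LATEST_PER_CYCLE = """
SELECT v1.SourceFile, v1.Cycle, v1.[Time], v1.Data
FROM Voltage v1
INNER JOIN (
    SELECT SourceFile, Cycle, MAX([Time]) AS MaxTime
    FROM Voltage
    GROUP BY SourceFile, Cycle
) v2
    ON v1.SourceFile = v2.SourceFile
   AND v1.Cycle = v2.Cycle
   AND v1.[Time] = v2.MaxTime
ORDER BY v1.SourceFile, v1.Cycle
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Voltage (SourceFile TEXT, Cycle INTEGER, [Time] REAL, Data INTEGER)"
)
# The three cycle-1 rows from the question; "a.mdb" is a hypothetical file name.
conn.executemany(
    "INSERT INTO Voltage VALUES (?, ?, ?, ?)",
    [
        ("a.mdb", 1, 100.59, 34),
        ("a.mdb", 1, 101.34, 54),
        ("a.mdb", 1, 98.78, 45),
    ],
)
rows = conn.execute(LATEST_PER_CYCLE).fetchall()
print(rows)  # [('a.mdb', 1, 101.34, 54)]
```

The row returned is the second row of the question's table, which is the one the asker wanted. If two rows in the same cycle share the maximum time, the join returns both of them.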

score 2 · Accepted Answer

To get the data straight into Pandas in python3, I wrote this little snippet:

import sys, subprocess, os
from io import StringIO
import pandas as pd
VERBOSE = True
def mdb_to_pandas(database_path):
    subprocess.call(["mdb-schema", database_path, "mysql"])
    # Get the list of table names with "mdb-tables"
    table_names = subprocess.Popen(["mdb-tables", "-1", database_path],
                                   stdout=subprocess.PIPE).communicate()[0]
    tables = table_names.splitlines()
    sys.stdout.flush()
    # Dump each table as a stringio using "mdb-export",
    out_tables = {}
    for rtable in tables:
        table = rtable.decode()
        if VERBOSE: print('running table:',table)
        if table != '':
            if VERBOSE: print("Dumping " + table)
            contents = subprocess.Popen(["mdb-export", database_path, table],
                                        stdout=subprocess.PIPE).communicate()[0]
            temp_io = StringIO(contents.decode())
            print(table, temp_io)
            out_tables[table] = pd.read_csv(temp_io)
    return out_tables
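
Once a table is in a DataFrame, the "latest row per cycle" selection from the question can be done in Pandas itself. The following is a sketch only: the helper name latest_row_per_cycle is hypothetical, the column names Cycle# and Time are taken from the question's table, and the cycle 2 values in the sample data are invented for illustration.

```python
import pandas as pd


def latest_row_per_cycle(df, cycle_col="Cycle#", time_col="Time"):
    """Return, for each cycle, the row with the largest time value."""
    # idxmax yields the index label of the maximum time within each cycle.
    idx = df.groupby(cycle_col)[time_col].idxmax()
    return df.loc[idx].sort_values(cycle_col).reset_index(drop=True)


# Cycle 1 rows come from the question; cycle 2 rows are made-up sample data.
voltage = pd.DataFrame(
    {
        "Cycle#": [1, 1, 1, 2, 2],
        "Time": [100.59, 101.34, 98.78, 205.10, 204.75],
        "Data": [34, 54, 45, 61, 58],
    }
)

latest = latest_row_per_cycle(voltage)
first_five = latest.head(5)  # latest rows of the first five cycles
last_five = latest.tail(5)   # latest rows of the last five cycles
```

With the snippet above, something like latest_row_per_cycle(mdb_to_pandas(path)["Voltage"]) would presumably apply the same selection to one .mdb file, provided its Voltage table uses these column names.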
