java - 如何对文本文件执行二进制搜索

Question

我在我的 Android 应用程序中使用了一个大文本文件 (5Mb)。我将文件创建为预先排序的字符串列表，并且文件一旦创建就不会更改。如何在不逐行读取匹配字符串的情况下对此文件的内容执行二进制搜索？

score 7 · Accepted Answer

由于文件的内容没有改变，您可以将文件分成多个部分。说 AG、HN、0-T 和 UZ。这使您可以检查第一个字符并立即将可能的集合剪切为原始大小的四分之一。现在线性搜索将不会花费很长时间，或者可以选择读取整个文件。如果 n/4 仍然太大，这个过程可以扩展，但想法是一样的。将搜索细分构建到文件结构中，而不是尝试全部在内存中完成。

score 1 · Accepted Answer

一个 5MB 的文件并没有那么大 - 您应该能够将每一行读入一个String[]数组，然后您可以使用它java.util.Arrays.binarySearch()来查找您想要的行。这是我推荐的方法。

如果您不想将整个文件读入您的应用程序，那么它会变得更加复杂。如果文件的每一行长度相同，并且文件已经排序，那么您可以在 RandomAccessFile 中打开文件并自己执行二进制搜索，seek()如下所示...

// open the file for reading
RandomAccessFile raf = new RandomAccessFile("myfile.txt","r");
String searchValue = "myline";
int lineSize = 50;
int numberOfLines = raf.length() / lineSize;

// perform the binary search...
byte[] lineBuffer = new byte[lineSize];
int bottom = 0;
int top = numberOfLines;
int middle;
while (bottom <= top){
  middle = (bottom+top)/2;
  raf.seek(middle*lineSize); // jump to this line in the file
  raf.read(lineBuffer); // read the line from the file
  String line = new String(lineBuffer); // convert the line to a String

  int comparison = line.compareTo(searchValue);
  if (comparison == 0){
    // found it
    break;
    }
  else if (comparison < 0){
    // line comes before searchValue
    bottom = middle + 1;
    }
  else {
    // line comes after searchValue
    top = middle - 1;
    }
  }

raf.close(); // close the file when you're finished

但是，如果文件没有固定宽度的行，那么如果不先将其加载到内存中，您将无法轻松执行二进制搜索，因为您无法像使用 fixed 那样快速跳转到文件中的特定行-宽度线。

score 1 · Accepted Answer

在统一字符长度的文本文件中，您可以按字符方式寻找有问题的间隔的中间，开始读取字符直到您击中分隔符，然后使用后续字符串作为元素中间值的近似值。但是，在android中这样做的问题是您显然无法随机访问资源（尽管我想您每次都可以重新打开它）。此外，这种技术不能推广到地图和其他类型的集合。

另一种选择是（使用RandomAccessFile）在文件的开头写入一个整数的“数组” - 每个字符串一个 - 然后返回并使用相应字符串的位置更新它们。再次搜索将需要跳来跳去。

我会做（并且在我自己的应用程序中做的）是在文件中实现哈希集。这个确实与树分开链接。

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.Set;

class StringFileSet {

    private static final double loadFactor = 0.75;

    public static void makeFile(String fileName, String comment, Set<String> set) throws IOException {
        new File(fileName).delete();
        RandomAccessFile fout = new RandomAccessFile(fileName, "rw");

        //Write comment
        fout.writeUTF(comment);

        //Make bucket array
        int numBuckets = (int)(set.size()/loadFactor);

        ArrayList<ArrayList<String>> bucketArray = new ArrayList<ArrayList<String>>(numBuckets);
        for (int ii = 0; ii < numBuckets; ii++){
            bucketArray.add(new ArrayList<String>());
        }

        for (String key : set){
            bucketArray.get(Math.abs(key.hashCode()%numBuckets)).add(key);
        }

        //Sort key lists in preparation for creating trees
        for (ArrayList<String> keyList : bucketArray){
            Collections.sort(keyList);
        }

        //Make queues in preparation for creating trees
        class NodeInfo{

            public final int lower;
            public final int upper;
            public final long callingOffset;

            public NodeInfo(int lower, int upper, long callingOffset){
                this.lower = lower;
                this.upper = upper;
                this.callingOffset = callingOffset;
            }

        }

        ArrayList<LinkedList<NodeInfo>> queueList = new ArrayList<LinkedList<NodeInfo>>(numBuckets);
        for (int ii = 0; ii < numBuckets; ii++){
            queueList.add(new LinkedList<NodeInfo>());
        }

        //Write bucket array
        fout.writeInt(numBuckets);
        for (int index = 0; index < numBuckets; index++){
            queueList.get(index).add(new NodeInfo(0, bucketArray.get(index).size()-1, fout.getFilePointer()));
            fout.writeInt(-1);
        }

        //Write trees
        for (int bucketIndex = 0; bucketIndex < numBuckets; bucketIndex++){
            while (queueList.get(bucketIndex).size() != 0){
                NodeInfo nodeInfo = queueList.get(bucketIndex).poll();
                if (nodeInfo.lower <= nodeInfo.upper){
                    //Set respective pointer in parent node
                    fout.seek(nodeInfo.callingOffset);
                    fout.writeInt((int)(fout.length() - (nodeInfo.callingOffset + 4))); //Distance instead of absolute position so that the get method can use a DataInputStream
                    fout.seek(fout.length());

                    int middle = (nodeInfo.lower + nodeInfo.upper)/2;

                    //Key
                    fout.writeUTF(bucketArray.get(bucketIndex).get(middle));

                    //Left child
                    queueList.get(bucketIndex).add(new NodeInfo(nodeInfo.lower, middle-1, fout.getFilePointer()));
                    fout.writeInt(-1);

                    //Right child
                    queueList.get(bucketIndex).add(new NodeInfo(middle+1, nodeInfo.upper, fout.getFilePointer()));
                    fout.writeInt(-1);
                }
            }
        }

        fout.close();
    }

    private final String fileName;
    private final int numBuckets;
    private final int bucketArrayOffset;

    public StringFileSet(String fileName) throws IOException {
        this.fileName = fileName;

        DataInputStream fin = new DataInputStream(new BufferedInputStream(new FileInputStream(fileName)));

        short numBytes = fin.readShort();
        fin.skipBytes(numBytes);
        this.numBuckets = fin.readInt();
        this.bucketArrayOffset = numBytes + 6;

        fin.close();
    }

    public boolean contains(String key) throws IOException {
        boolean containsKey = false;

        DataInputStream fin = new DataInputStream(new BufferedInputStream(new FileInputStream(this.fileName)));

        fin.skipBytes(4*(Math.abs(key.hashCode()%this.numBuckets)) + this.bucketArrayOffset);

        int distance = fin.readInt();
        while (distance != -1){
            fin.skipBytes(distance);

            String candidate = fin.readUTF();
            if (key.compareTo(candidate) < 0){
                distance = fin.readInt();
            }else if (key.compareTo(candidate) > 0){
                fin.skipBytes(4);
                distance = fin.readInt();
            }else{
                fin.skipBytes(8);
                containsKey = true;
                break;
            }
        }

        fin.close();

        return containsKey;
    }

}

一个测试程序

import java.io.File;
import java.io.IOException;
import java.util.HashSet;

class Test {
    public static void main(String[] args) throws IOException {
        HashSet<String> stringMemorySet = new HashSet<String>();

        stringMemorySet.add("red");
        stringMemorySet.add("yellow");
        stringMemorySet.add("blue");

        StringFileSet.makeFile("stringSet", "Provided under ... included in all copies and derivatives ...", stringMemorySet);
        StringFileSet stringFileSet = new StringFileSet("stringSet");

        System.out.println("orange -> " + stringFileSet.contains("orange"));
        System.out.println("red -> " + stringFileSet.contains("red"));
        System.out.println("yellow -> " + stringFileSet.contains("yellow"));
        System.out.println("blue -> " + stringFileSet.contains("blue"));

        new File("stringSet").delete();

        System.out.println();
    }
}

如果您为 android 修改它，您还需要将Context 传递给它，以便它可以访问 getResources() 方法。

您可能还想阻止 android 构建工具压缩文件，这显然只能通过将文件的扩展名更改为 jpg 之类的东西来完成 - 如果您正在使用 GUI。这使我的应用程序中的过程快了大约 100 到 300 倍。

您还可以考虑使用NDK 给自己更多内存。

score 1 · Accepted Answer

这是我快速整理的内容。它使用两个文件，一个带有单词，另一个带有偏移量。偏移文件的格式是这样的：前 10 位包含字长，后 22 位包含偏移（字位置，例如 aaah 为 0，abasementable 为 4，等等）。它以大端（java标准）编码。希望它可以帮助某人。

word.dat：

aahabasementableabnormalabnormalityabnormalityabortionistortion-rightsabracadabra

wordx.dat：

00 80 00 00 01 20 00 04 00 80 00 0D 01 00 00 11   _____ __________
01 60 00 19 01 60 00 24 01 E0 00 2F 01 60 00 3E   _`___`_$___/_`_>

我在 C# 中创建了这些文件，但这是它的代码（它使用一个 txt 文件，其中的单词由 crlfs 分隔）

static void Main(string[] args)
{
    const string fIn = @"C:\projects\droid\WriteFiles\input\allwords.txt";
    const string fwordxOut = @"C:\projects\droid\WriteFiles\output\wordx.dat";
    const string fWordOut = @"C:\projects\droid\WriteFiles\output\word.dat";

    int i = 0;
    int offset = 0;
    int j = 0;
    var lines = File.ReadLines(fIn);

    FileStream stream = new FileStream(fwordxOut, FileMode.Create, FileAccess.ReadWrite);
    using (EndianBinaryWriter wwordxOut = new EndianBinaryWriter(EndianBitConverter.Big, stream))
    {
        using (StreamWriter wWordOut = new StreamWriter(File.Open(fWordOut, FileMode.Create)))
        {
            foreach (var line in lines)
            {
                wWordOut.Write(line);
                i = offset | ((int)line.Length << 22); //first 10 bits to the left is the word size
                offset = offset + (int)line.Length;
                wwordxOut.Write(i);
                //if (j == 7)
                  //  break;
                j++;
            }
        }
    }
}

这是用于二进制文件搜索的 Java 代码：

public static void binarySearch() {
    String TAG = "TEST";
    String wordFilePath = Environment.getExternalStorageDirectory().getAbsolutePath() + "/word.dat";
    String wordxFilePath = Environment.getExternalStorageDirectory().getAbsolutePath() + "/wordx.dat";

    String target = "abracadabra"; 
    boolean targetFound = false; 
    int searchCount = 0; 

    try {
        RandomAccessFile raf = new RandomAccessFile(wordxFilePath, "r");
        RandomAccessFile rafWord = new RandomAccessFile(wordFilePath, "r");
        long low = 0;
        long high = (raf.length() / 4) - 1;
        int cur = 0;
        long wordOffset = 0;
        int len = 0;

        while (high >= low) {
            long mid = (low + high) / 2;
            raf.seek(mid * 4);
            cur = raf.readInt();
            Log.v(TAG + "-cur", String.valueOf(cur));

            len = cur >> 22; //word length

            cur = cur & 0x3FFFFF;  //first 10 bits are 0

            rafWord.seek(cur);
            byte [] bytes = new byte[len];

            wordOffset = rafWord.read(bytes, 0, len);
            Log.v(TAG + "-wordOffset", String.valueOf(wordOffset));

            searchCount++;

            String str = new String(bytes);

            Log.v(TAG, str);

            if (target.compareTo(str) < 0) {
                high = mid - 1;
            } else if (target.compareTo(str) == 0) {
                targetFound = true;
                break;
            } else {
                low = mid + 1;
            }
        }

        raf.close();
        rafWord.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    if (targetFound == true) {
        Log.v(TAG + "-found " , String.valueOf(searchCount));
    } else {
        Log.v(TAG + "-not found " , String.valueOf(searchCount));
    }

}

score 0 · Accepted Answer

尽管这听起来有点矫枉过正，但不要将执行此操作所需的数据存储为平面文件。建一个数据库，查询数据库中的数据。这应该既有效又快速。

score 0 · Accepted Answer

这是一个我认为有效的功能（在实践中使用它）。线条可以有任何长度。您必须提供一个名为“nav”的 lambda 来进行实际的行检查，以便您可以灵活地处理文件的顺序（区分大小写、不区分大小写、按某个字段排序等）。

import java.io.File;
import java.io.RandomAccessFile;

class main {
    // returns pair(character range in file, line) or null if not found
    // if no exact match found, return line above
    // nav takes a line and returns -1 (move up), 0 (found) or 1 (move down)
    // The line supplied to nav is stripped of the trailing \n, but not the \r
    // UTF-8 encoding is assumed

    static Pair<LongRange, String> binarySearchForLineInTextFile(File file, IF1<String, Integer> nav) {
        long length = l(file);
        int bufSize = 1024;
        RandomAccessFile raf = randomAccessFileForReading(file);
        try {
            long min = 0, max = length;
            int direction = 0;
            Pair<LongRange, String> possibleResult = null;
            while (min < max) {
                ping();
                long middle = (min + max) / 2;
                long lineStart = raf_findBeginningOfLine(raf, middle, bufSize);
                long lineEnd = raf_findEndOfLine(raf, middle, bufSize);

                String line = fromUtf8(raf_readFilePart(raf, lineStart, (int) (lineEnd - 1 - lineStart)));
                direction = nav.get(line);
                possibleResult = (Pair<LongRange, String>) new Pair(new LongRange(lineStart, lineEnd), line);

                if (direction == 0) return possibleResult;
                // asserts are to assure that loop terminates
                if (direction < 0) max = assertLessThan(max, lineStart);
                else min = assertBiggerThan(min, lineEnd);

            }


            if (direction >= 0) return possibleResult;

            long lineStart = raf_findBeginningOfLine(raf, min - 1, bufSize);
            String line = fromUtf8(raf_readFilePart(raf, lineStart, (int) (min - 1 - lineStart)));

            return new Pair(new LongRange(lineStart, min), line);
        } finally {
            _close(raf);
        }
    }


    static int l(byte[] a) {
        return a == null ? 0 : a.length;
    }

    static long l(File f) {
        return f == null ? 0 : f.length();
    }


    static RandomAccessFile randomAccessFileForReading(File path) {
        try {

            return new RandomAccessFile(path, "r");


        } catch (Exception __e) {
            throw rethrow(__e);
        }
    }

    // you can change this function to allow interrupting long calculations from the outside. just throw a RuntimeException.
    static boolean ping() {
        return true;
    }


    static long raf_findBeginningOfLine(RandomAccessFile raf, long pos, int bufSize) {
        try {
            byte[] buf = new byte[bufSize];
            while (pos > 0) {
                long start = Math.max(pos - bufSize, 0);
                raf.seek(start);
                raf.readFully(buf, 0, (int) Math.min(pos - start, bufSize));
                int idx = lastIndexOf_byteArray(buf, (byte) '\n');
                if (idx >= 0) return start + idx + 1;
                pos = start;
            }
            return 0;
        } catch (Exception __e) {
            throw rethrow(__e);
        }
    }

    static long raf_findEndOfLine(RandomAccessFile raf, long pos, int bufSize) {
        try {
            byte[] buf = new byte[bufSize];
            long length = raf.length();
            while (pos < length) {
                raf.seek(pos);
                raf.readFully(buf, 0, (int) Math.min(length - pos, bufSize));
                int idx = indexOf_byteArray(buf, (byte) '\n');
                if (idx >= 0) return pos + idx + 1;
                pos += bufSize;
            }
            return length;
        } catch (Exception __e) {
            throw rethrow(__e);
        }
    }

    static String fromUtf8(byte[] bytes) {
        try {
            return bytes == null ? null : new String(bytes, "UTF-8");
        } catch (Exception __e) {
            throw rethrow(__e);
        }
    }

    static byte[] raf_readFilePart(RandomAccessFile raf, long start, int l) {
        try {
            byte[] buf = new byte[l];
            raf.seek(start);
            raf.readFully(buf);
            return buf;
        } catch (Exception __e) {
            throw rethrow(__e);
        }
    }

    static <A> A assertLessThan(A a, A b) {
        assertTrue(cmp(b, a) < 0);
        return b;
    }

    static <A> A assertBiggerThan(A a, A b) {
        assertTrue(cmp(b, a) > 0);
        return b;
    }

    static void _close(AutoCloseable c) {
        try {
            if (c != null)
                c.close();
        } catch (Throwable e) {
            throw rethrow(e);
        }
    }


    static RuntimeException rethrow(Throwable t) {

        throw t instanceof RuntimeException ? (RuntimeException) t : new RuntimeException(t);
    }

    static int lastIndexOf_byteArray(byte[] a, byte b) {
        for (int i = l(a) - 1; i >= 0; i--)
            if (a[i] == b)
                return i;
        return -1;
    }

    static int indexOf_byteArray(byte[] a, byte b) {
        int n = l(a);
        for (int i = 0; i < n; i++)
            if (a[i] == b)
                return i;
        return -1;
    }

    static boolean assertTrue(boolean b) {
        if (!b)
            throw fail("oops");
        return b;
    }

    static int cmp(Object a, Object b) {
        if (a == null) return b == null ? 0 : -1;
        if (b == null) return 1;
        return ((Comparable) a).compareTo(b);
    }

    static RuntimeException fail(String msg) {
        throw new RuntimeException(msg == null ? "" : msg);
    }


    final static class LongRange {
        long start, end;

        LongRange(long start, long end) {
            this.end = end;
            this.start = start;
        }

        public String toString() {
            return "[" + start + ";" + end + "]";
        }
    }

    interface IF1<A, B> {
        B get(A a);
    }

    static class Pair<A, B> {
        A a;
        B b;

        Pair(A a, B b) {
            this.b = b;
            this.a = a;
        }

        public String toString() {
            return "<" + a + ", " + b + ">";
        }
    }
}

java - 如何对文本文件执行二进制搜索

6 回答 6

Related

Reference