java - How to find identical byte[] objects in two arrays concurrently?

Question

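I am trying to implement a collision attack on hashes (I am taking a "Cryptography" course). I therefore have two arrays of hashes (= byte sequences, byte[]) and want to find the hashes that are present in both arrays. After some research and a lot of thinking, I am convinced that the best solution on a single-core machine would be a HashSet (add all elements of the first array, then use contains to check whether elements of the second array are already present).

However, I want to implement a concurrent solution, because I have access to a machine with 8 cores and 12 GB of RAM. The best solution I can think of is a ConcurrentHashSet, which can be created via Collections.newSetFromMap(new ConcurrentHashMap<A,B>()). Using this data structure I could add all elements of the first array in parallel and, once all elements have been added, concurrently check via contains for identical hashes.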
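The two-phase idea described above might look roughly like the following sketch. It assumes Java 8 or later (for parallel streams), and the class and method names (ConcurrentHashMatch, commonHashes) are made up for illustration. One caveat it works around: byte[] inherits identity-based equals and hashCode from Object, so a set of raw byte[] would never report two distinct arrays with equal contents as the same element. Here each hash is wrapped in a ByteBuffer, which compares by content.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class ConcurrentHashMatch {
    // Returns the hashes that occur in both arrays, compared by content.
    static Set<ByteBuffer> commonHashes(byte[][] first, byte[][] second) {
        // Concurrent set built the way the question suggests.
        Set<ByteBuffer> seen =
                Collections.newSetFromMap(new ConcurrentHashMap<ByteBuffer, Boolean>());

        // Phase 1: insert every hash of the first array in parallel.
        Arrays.stream(first).parallel().map(ByteBuffer::wrap).forEach(seen::add);

        // Phase 2: runs only after phase 1 has completed; parallel lookups.
        return Arrays.stream(second).parallel()
                .map(ByteBuffer::wrap)
                .filter(seen::contains)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        byte[][] a = { {1, 2, 3}, {(byte) 0xFF, 0, 7} };
        byte[][] b = { {9, 9, 9}, {(byte) 0xFF, 0, 7} };
        System.out.println(commonHashes(a, b).size()); // prints 1
    }
}
```

The wrapped arrays must not be modified while they are in the set, because a ByteBuffer's hash code depends on its current contents.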

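So my question is: do you know of an algorithm designed for this exact problem? If not, do you have experience with such a ConcurrentHashSet, in terms of problems and effective runtime complexity? Or can you recommend another prebuilt data structure that could help me?

PS: In case anyone is interested in the details: I plan to use Skandium to parallelize my program.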

score 5 · Accepted Answer

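I think there is little point in using any form of HashMap. I am guessing you are computing multi-byte hashes of various data; these are already hashes, so there is no need to hash them any further.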

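Although you do not state it, I am guessing your hashes are byte sequences. Clearly either a trie or a dawg would be ideal for storing these.

I therefore suggest that you implement a trie/dawg and use it to store all of the hashes in the first array. You can then use all of your computing power in parallel to look up each element of the second array in this trie. No locks are needed.

Added

Here is a simple Dawg implementation I knocked together. It seems to work.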

public class Dawg {
  // All my children.
  Dawg[] children = new Dawg[256];
  // Am I a leaf.
  boolean isLeaf = false;

  // Add a new word.
  public void add ( byte[] word ) {
    // Finds its location, growing as necessary.
    Dawg loc = find ( word, 0, true );
    loc.isLeaf = true;
  }

  // String form.
  public void add ( String word ) {
    add(word.getBytes());
  }

  // Returns true if word is in the dawg.
  public boolean contains ( byte [] word ) {
    // Finds its location, no growing allowed.
    Dawg d = find ( word, 0, false );
    return d != null && d.isLeaf; 
  }

  // String form.
  public boolean contains ( String word ) {
    return contains(word.getBytes());
  }

  // Find the Dawg - growing the tree as necessary if requested.
  private Dawg find ( byte [] word, int i, boolean grow ) {
    // Java bytes are signed (-128..127); mask to get an index in 0..255.
    int idx = word[i] & 0xFF;
    Dawg child = children[idx];
    if ( child == null ) {
      // Not present!
      if ( grow ) {
        // Grow the tree.
        child = new Dawg();
        children[idx] = child;
      }
    }
    // Found it?
    if ( child != null ) {
      // More to find?
      if ( i < word.length - 1 ) {
        child = child.find(word, i+1, grow);
      }
    }
    return child;
  }

  public static void main ( String[] args ) {
    Dawg d = new Dawg();
    d.add("H");
    d.add("Hello");
    d.add("World");
    d.add("Hell");
    System.out.println("Hello is "+(d.contains("Hello")?"in":"out"));
    System.out.println("World is "+(d.contains("World")?"in":"out"));
    System.out.println("Hell is "+(d.contains("Hell")?"in":"out"));
    System.out.println("Hal is "+(d.contains("Hal")?"in":"out"));
    System.out.println("Hel is "+(d.contains("Hel")?"in":"out"));
    System.out.println("H is "+(d.contains("H")?"in":"out"));
  }
}
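To illustrate the lookup step the answer describes (build the structure first, then query it from all cores without locks), here is a minimal self-contained sketch. It assumes Java 8 or later and uses hypothetical names (ParallelTrieLookup, Node, countMatches) that are not part of the answer's code. It also masks each byte with 0xFF, since real hash bytes are often negative in Java and cannot be used directly as array indices.

```java
import java.util.Arrays;

public class ParallelTrieLookup {
    // Hypothetical minimal trie node: one child slot per possible byte value.
    static final class Node {
        final Node[] children = new Node[256];
        boolean terminal;
    }

    private final Node root = new Node();

    // Build phase: called from a single thread only.
    void add(byte[] word) {
        Node n = root;
        for (byte b : word) {
            int idx = b & 0xFF; // map signed byte -128..127 to 0..255
            if (n.children[idx] == null) n.children[idx] = new Node();
            n = n.children[idx];
        }
        n.terminal = true;
    }

    // Read-only lookup: no locks needed once the build phase is finished.
    boolean contains(byte[] word) {
        Node n = root;
        for (byte b : word) {
            n = n.children[b & 0xFF];
            if (n == null) return false;
        }
        return n.terminal;
    }

    // Counts how many entries of second are also present in first.
    static long countMatches(byte[][] first, byte[][] second) {
        ParallelTrieLookup trie = new ParallelTrieLookup();
        for (byte[] h : first) trie.add(h);          // sequential build
        return Arrays.stream(second).parallel()      // parallel queries
                .filter(trie::contains)
                .count();
    }

    public static void main(String[] args) {
        byte[][] first = { {(byte) 0xCA, (byte) 0xFE}, {1, 2} };
        byte[][] second = { {1, 2}, {3, 4}, {(byte) 0xCA, (byte) 0xFE} };
        System.out.println(countMatches(first, second)); // prints 2
    }
}
```

Each node holds 256 references, so memory use can grow quickly for large sets of long hashes; whether that fits in the available RAM would need to be measured.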

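Added

This may be a good start on a concurrent lock-free version. These things are notoriously hard to test, so I cannot guarantee that it works, but as far as I can see it certainly should.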

import java.util.concurrent.atomic.AtomicReferenceArray;


public class LFDawg {
  // All my children.
  AtomicReferenceArray<LFDawg> children = new AtomicReferenceArray<LFDawg> ( 256 );
  // Am I a leaf. Volatile so a mark set by one thread is visible to the others.
  volatile boolean isLeaf = false;

  // Add a new word.
  public void add ( byte[] word ) {
    // Finds its location, growing as necessary.
    LFDawg loc = find( word, 0, true );
    loc.isLeaf = true;
  }

  // String form.
  public void add ( String word ) {
    add( word.getBytes() );
  }

  // Returns true if word is in the dawg.
  public boolean contains ( byte[] word ) {
    // Finds its location, no growing allowed.
    LFDawg d = find( word, 0, false );
    return d != null && d.isLeaf;
  }

  // String form.
  public boolean contains ( String word ) {
    return contains( word.getBytes() );
  }

  // Find the Dawg - growing the tree as necessary if requested.
  private LFDawg find ( byte[] word, int i, boolean grow ) {
    // Java bytes are signed (-128..127); mask to get an index in 0..255.
    int idx = word[i] & 0xFF;
    LFDawg child = children.get( idx );
    if ( child == null ) {
      // Not present!
      if ( grow ) {
        // Grow the tree.
        child = new LFDawg();
        if ( !children.compareAndSet( idx, null, child ) ) {
          // Someone else got there before me. Get the one they set.
          child = children.get( idx );
        }
      }
    }
    // Found it?
    if ( child != null ) {
      // More to find?
      if ( i < word.length - 1 ) {
        child = child.find( word, i + 1, grow );
      }
    }
    return child;
  }

  public static void main ( String[] args ) {
    LFDawg d = new LFDawg();
    d.add( "H" );
    d.add( "Hello" );
    d.add( "World" );
    d.add( "Hell" );
    System.out.println( "Hello is " + ( d.contains( "Hello" ) ? "in" : "out" ) );
    System.out.println( "World is " + ( d.contains( "World" ) ? "in" : "out" ) );
    System.out.println( "Hell is " + ( d.contains( "Hell" ) ? "in" : "out" ) );
    System.out.println( "Hal is " + ( d.contains( "Hal" ) ? "in" : "out" ) );
    System.out.println( "Hel is " + ( d.contains( "Hel" ) ? "in" : "out" ) );
    System.out.println( "H is " + ( d.contains( "H" ) ? "in" : "out" ) );
  }
}

score 0 · Accepted Answer

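A simpler approach would be to split the first array into N equal (or nearly equal) parts (with 8 cores, N=8 seems reasonable). Then solve the problem in the "normal" way, by checking whether any hash in the second array is present in each of the N smaller parts of the first array. This can be done in parallel.

That said, I had never heard of tries/dawgs before, and I found the main discussion fascinating and informative. (I mostly work with numbers, not words.)

This assumes that the byte[] hashes have some finite, fairly short length, so that you really can split up the original file to process it in parallel. Is that the case?

Edited to add

For an example of this idea, see Chapter 11 of GPU Computing Gems, edited by Wen-Mei W. Hwu, an article by Ligowski, Rudnicki, Liu and Schmidt. They parallelize a massive protein sequence database search by splitting the huge single database into many smaller pieces and then running the normal algorithm on each piece. I like this quote: "The described algorithm is embarrassingly parallel". In their case they used CUDA and had to do a lot of memory optimization, but the principle should still apply to multi-core machines.

Semi-pseudocode follows. I will use Lists for the incoming byte[] hashes; I hope that is OK.

Original, one-core method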

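// Needs java.nio.ByteBuffer, java.util.HashSet, java.util.List, java.util.Set.
static void originalProcess(List<byte[]> list1, List<byte[]> list2) {
   // byte[] uses identity equals/hashCode, so a HashSet<byte[]> would never match
   // equal contents. ByteBuffer.wrap gives content-based comparison instead.
   Set<ByteBuffer> bigHugeHashOfList1 = new HashSet<ByteBuffer>();
   for (byte[] hash : list1)
      bigHugeHashOfList1.add(ByteBuffer.wrap(hash));
   for (byte[] hash : list2)
      if (bigHugeHashOfList1.contains(ByteBuffer.wrap(hash))) {
         // do something
      }
}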

New method. It uses exactly the same processing method (later on). No DAWGS or TRIES here...

// Needs java.util.ArrayList and java.util.List.
static void preprocess(List<byte[]> list1, List<byte[]> list2) {
   // Java does not allow creating generic arrays, so use a list of lists.
   List<List<byte[]>> splitLists = new ArrayList<List<byte[]>>();
   for (int i=0; i<8; i++)
      splitLists.add(new ArrayList<byte[]>());
   for (byte[] hash : list1) {
      int idx = hash[0]&7; // I'm taking the 3 low order bits, YMMV
      splitLists.get(idx).add(hash);
      // a minor speedup would be to create the HashSet here instead of in originalProcess()
   }

   // now, using your favorite parallel/concurrency technique,
   // do the equivalent of
   for (int i=0; i<8; i++)
      originalProcess(splitLists.get(i), list2);
}
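As one possible way to fill in the "favorite parallel/concurrency technique" step, the sketch below runs the eight partitions through a parallel stream. It assumes Java 8 or later and non-empty hashes; the names PartitionedMatch, split, matchPart and findCommon are hypothetical. It also departs from the pseudocode above in one respect that should be read as a variation, not as the answer's method: the second list is partitioned by the same 3 low-order bits, since a hash can only match inside the partition its own first byte selects.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PartitionedMatch {
    static final int PARTS = 8; // power of two, so (PARTS - 1) works as a bit mask

    // Distributes hashes over PARTS lists by the low-order bits of the first byte.
    static List<List<byte[]>> split(List<byte[]> list) {
        List<List<byte[]>> parts = new ArrayList<>();
        for (int i = 0; i < PARTS; i++) parts.add(new ArrayList<>());
        for (byte[] h : list) parts.get(h[0] & (PARTS - 1)).add(h);
        return parts;
    }

    // The "normal" single-threaded method, applied to one partition.
    static List<byte[]> matchPart(List<byte[]> a, List<byte[]> b) {
        Set<ByteBuffer> set = new HashSet<>();
        for (byte[] h : a) set.add(ByteBuffer.wrap(h)); // content-based keys
        List<byte[]> hits = new ArrayList<>();
        for (byte[] h : b) {
            if (set.contains(ByteBuffer.wrap(h))) hits.add(h);
        }
        return hits;
    }

    // Runs matchPart on every partition pair in parallel and merges the hits.
    static List<byte[]> findCommon(List<byte[]> list1, List<byte[]> list2) {
        List<List<byte[]>> p1 = split(list1);
        List<List<byte[]>> p2 = split(list2);
        return IntStream.range(0, PARTS).parallel()
                .mapToObj(i -> matchPart(p1.get(i), p2.get(i)))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<byte[]> l1 = Arrays.asList(new byte[] {1, 2}, new byte[] {3, 4});
        List<byte[]> l2 = Arrays.asList(new byte[] {3, 4}, new byte[] {5, 6});
        System.out.println(findCommon(l1, l2).size()); // prints 1
    }
}
```

Because the partitions share no data, no synchronization is needed between them, which is what makes the approach embarrassingly parallel.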
