3

I am writing a multi-threaded program that scrapes a certain site and collects IDs. It stores these IDs in a shared static List<string> object.

Before any item is added to the List<string>, it is first checked against a HashSet<string> that contains a blacklist of already collected IDs.

I do this as follows:

private static HashSet<string> Blacklist = new HashSet<string>();
private static List<string> IDList = new List<string>();

public static void AddIDToIDList(string ID)
{
    lock (IDList)
    {
        if (IsIDBlacklisted(ID))
            return;
        IDList.Add(ID);
    }
}
public static bool IsIDBlacklisted(string ID)
{
    lock (Blacklist)
    {
        if (Blacklist.Contains(ID))
            return true;
    }
    return false;
}

The Blacklist is saved to a file when the program finishes and is loaded every time the program starts, so it will grow fairly large over time (up to 50k records). Is there a more efficient way both to store this blacklist and to check each ID against it?

Thanks!

4 Answers

3

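To improve performance, try using a ConcurrentBag<T> collection. There is also no need to lock the Blacklist, since it is not being modified. For example: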

private static HashSet<string> Blacklist = new HashSet<string>();
private static ConcurrentBag<string> IDList = new ConcurrentBag<string>();

public static void AddIDToIDList(string ID)
{
    if (Blacklist.Contains(ID))
    {
        return;
    }

    IDList.Add(ID);
}
Answered 2013-08-01T04:11:42.643
2

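Read operations on a HashSet are thread safe, so as long as Blacklist is not being modified you do not need to lock it. Also, take the IDList lock only after the blacklist check has passed, so that the lock is acquired less often; this will improve your performance as well.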

private static HashSet<string> Blacklist = new HashSet<string>();
private static List<string> IDList = new List<string>();

public static void AddIDToIDList(string ID)
{
    if (IsIDBlacklisted(ID))
        return;
    lock (IDList)
    {
        IDList.Add(ID);
    }
}
public static bool IsIDBlacklisted(string ID)
{
    return Blacklist.Contains(ID);
}

If Blacklist is being modified, the best way to lock around it is with a ReaderWriterLock (use the Slim version if you are on a newer .NET):

private static HashSet<string> Blacklist = new HashSet<string>();
private static List<string> IDList = new List<string>();
private static ReaderWriterLockSlim BlacklistLock = new ReaderWriterLockSlim();

public static void AddIDToIDList(string ID)
{
    if (IsIDBlacklisted(ID))
        return;
    lock (IDList)
    {
        IDList.Add(ID);
    }
}
public static bool IsIDBlacklisted(string ID)
{
    BlacklistLock.EnterReadLock();
    try
    {
        return Blacklist.Contains(ID);
    }
    finally
    {
        BlacklistLock.ExitReadLock();
    }
}

public static bool AddToIDBlacklist(string ID)
{
    BlacklistLock.EnterWriteLock();
    try
    {
        return Blacklist.Add(ID);
    }
    finally
    {
        BlacklistLock.ExitWriteLock();
    }
}
Answered 2013-08-01T04:06:27.540
1

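In your scenario, yes, HashSet is the best option, because it stores only the value to look up, whereas a Dictionary needs both a key and a value for each entry.

Of course, as others have said, if the HashSet is not being modified, you do not need to lock it. Also consider marking it readonly.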
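
A minimal sketch of the readonly suggestion, assuming the blacklist is filled once at startup and never modified afterwards. The class name `IdCollector`, the `TryAddId` method, and the constructor parameter are hypothetical and not from the question. Note that `readonly` only prevents the field from being reassigned; it does not freeze the contents of the set, so reading without a lock is safe only as long as nothing calls `Add` on it after construction.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper: the blacklist is built once and then only read.
public sealed class IdCollector
{
    // readonly: the field can never point at a different set.
    // The set itself is still mutable, so nothing here modifies it after construction.
    private readonly HashSet<string> _blacklist;

    private readonly List<string> _idList = new List<string>();

    // Dedicated lock object instead of locking on the list itself.
    private readonly object _idListLock = new object();

    public IdCollector(IEnumerable<string> previouslyCollectedIds)
    {
        _blacklist = new HashSet<string>(previouslyCollectedIds, StringComparer.Ordinal);
    }

    // Returns false if the ID is blacklisted, true if it was stored.
    public bool TryAddId(string id)
    {
        // Lock-free read: safe only because the set is never modified.
        if (_blacklist.Contains(id))
            return false;

        lock (_idListLock)
        {
            _idList.Add(id);
        }
        return true;
    }

    public int Count
    {
        get { lock (_idListLock) { return _idList.Count; } }
    }
}
```

In this sketch the IDs loaded from the blacklist file would be passed to the constructor once, before any worker threads start.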

Answered 2013-08-01T04:29:14.537
1

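Two considerations. First, if you use the indexer of a .NET dictionary (i.e. System.Collections.Generic.Dictionary) like this, rather than calling the Add() method:

idList[id] = id;

it will add the item if it does not already exist; otherwise, it will replace the existing item at that key. Second, you can use a ConcurrentDictionary (in the System.Collections.Concurrent namespace) for thread safety, so you do not have to worry about doing your own locking. The same comment about using the indexer applies there too.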
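
As a rough sketch of the ConcurrentDictionary idea, assuming the dictionary is used as a thread-safe set and the stored value is irrelevant. The class name `IdRegistry` and its method names are hypothetical. It contrasts the indexer, which adds or overwrites, with `TryAdd`, which leaves an existing entry alone and reports whether the ID was new.

```csharp
using System.Collections.Concurrent;

// Hypothetical helper: a ConcurrentDictionary used as a thread-safe set of IDs.
public static class IdRegistry
{
    // The byte value is a placeholder; only the keys matter here.
    private static readonly ConcurrentDictionary<string, byte> Ids =
        new ConcurrentDictionary<string, byte>();

    // TryAdd is atomic: it returns true only for the first caller that adds the ID.
    public static bool TryRegister(string id)
    {
        return Ids.TryAdd(id, 0);
    }

    // The indexer adds the key if missing and overwrites the value otherwise;
    // it never throws on a duplicate key.
    public static void Register(string id)
    {
        Ids[id] = 0;
    }

    public static bool Contains(string id)
    {
        return Ids.ContainsKey(id);
    }

    public static int Count
    {
        get { return Ids.Count; }
    }
}
```

With `TryRegister`, the check and the insert happen as one step, so no separate lock is needed around them.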

Answered 2013-08-01T04:00:38.547