url - HashSet handling to avoid getting stuck in a loop during iteration

Question

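I am working on an image-mining project. I use a HashSet instead of an array so that duplicate URLs are not added while I collect them. I have reached the point in the code where I iterate over the HashSet that holds the main URLs: inside the loop I download the page for each main URL, add the URLs found on it to the HashSet, and continue. During the iteration I need to remove every URL that has been scanned, and also remove every URL ending in jpg, until the HashSet's count reaches 0. The problem is that this iteration never ends. Suppose I get a URL (call it X):

1- I scan the page at URL X
2- I get all the URLs on page X (after applying the filters)
3- I add those URLs to the HashSet using UnionWith
4- I remove the scanned URL X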

The problem appears when one of those URLs, Y, is scanned and brings X back again.
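To make the cycle concrete, here is a minimal sketch rather than the code from the project. It assumes a hypothetical two-page site in which X links to Y and Y links back to X, and the names `HashSetLoopDemo`, `Links`, and `PendingAfter` are invented for the illustration. Once a URL is removed, the set has no record of it, so `UnionWith` can add it again.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HashSetLoopDemo
{
    // Hypothetical link graph: X links to Y and Y links back to X.
    private static readonly Dictionary<string, string[]> Links =
        new Dictionary<string, string[]>
        {
            ["X"] = new[] { "Y" },
            ["Y"] = new[] { "X" },
        };

    // Runs the "scan, UnionWith, Remove" loop for at most maxSteps iterations
    // and returns how many URLs are still pending afterwards.
    public static int PendingAfter(int maxSteps)
    {
        var pending = new HashSet<string> { "X" };
        for (int step = 0; step < maxSteps && pending.Count > 0; step++)
        {
            string current = pending.First();
            pending.UnionWith(Links[current]); // add the links found on the page
            pending.Remove(current);           // the set no longer remembers "current"
        }
        return pending.Count;
    }

    public static void Main()
    {
        // The count never reaches 0: X and Y keep re-adding each other.
        Console.WriteLine(PendingAfter(1000)); // prints 1
    }
}
```

The step limit exists only so the demonstration terminates; with `while (pending.Count != 0)` the loop would run forever on this graph.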

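Should I use a Dictionary and use the key to mark a URL as "scanned"? I will try that and post the result here. Sorry, the idea only came to me after I posted the question...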

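I managed to solve it for one URL, but the same thing seems to happen with other URLs and produces a loop. So how should I handle the HashSet to avoid duplicates even after links have been removed? I hope my point is clear.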

while (URL_Can.Count != 0)
{
    tempURL = URL_Can.First();

    if (tempURL.EndsWith("jpg"))
    {
        // Image URL: keep it for saving and drop it from the candidates
        URL_CanToSave.Add(tempURL);
        URL_Can.Remove(tempURL);
    }
    else
    {
        if (ExtractUrlsfromLink(client, tempURL, filterlink1).Contains(toAvoidLoopinLinks))
        {
            // The page links back to the previously scanned URL: drop both
            URL_Can.Remove(tempURL);
            URL_Can.Remove(toAvoidLoopinLinks);
        }
        else
        {
            // Add the links found on the page, then drop the page itself
            URL_Can.UnionWith(ExtractUrlsfromLink(client, tempURL, filterlink1));
            URL_Can.UnionWith(ExtractUrlsfromLink(client, tempURL, filterlink2));
            URL_Can.Remove(tempURL);

            richTextBox2.PerformSafely(() => richTextBox2.AppendText(tempURL + "\n"));
        }
    }

    // Only the most recently scanned URL is remembered
    toAvoidLoopinLinks = tempURL;
}
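One common way to stop removed URLs from coming back, sketched here under assumptions and not taken from the thread, is to keep a second HashSet that records every URL ever queued and never shrinks. The names `SeenSetCrawl`, `CollectImages`, and `getLinks` are hypothetical; `getLinks` stands in for `ExtractUrlsfromLink`, and the case-insensitive `EndsWith` check is a choice made for the sketch.

```csharp
using System;
using System.Collections.Generic;

public static class SeenSetCrawl
{
    // getLinks is a hypothetical stand-in for ExtractUrlsfromLink.
    public static List<string> CollectImages(
        string startUrl, Func<string, IEnumerable<string>> getLinks)
    {
        var seen = new HashSet<string> { startUrl }; // every URL ever queued
        var pending = new Queue<string>();           // URLs still to process
        pending.Enqueue(startUrl);
        var images = new List<string>();

        while (pending.Count > 0)
        {
            string url = pending.Dequeue();
            if (url.EndsWith("jpg", StringComparison.OrdinalIgnoreCase))
            {
                images.Add(url);
                continue;
            }
            foreach (string link in getLinks(url))
            {
                // HashSet.Add returns false when the item is already present,
                // so a URL is queued at most once even if many pages link to it.
                if (seen.Add(link)) pending.Enqueue(link);
            }
        }
        return images;
    }

    public static void Main()
    {
        // Hypothetical site with a cycle between X and Y.
        var site = new Dictionary<string, string[]>
        {
            ["X"] = new[] { "Y", "a.jpg" },
            ["Y"] = new[] { "X", "a.jpg", "b.jpg" },
        };
        var images = CollectImages("X", url => site[url]);
        Console.WriteLine(string.Join(", ", images)); // prints a.jpg, b.jpg
    }
}
```

Because `seen` only grows, the loop ends once every reachable URL has been queued once, regardless of how the pages link to each other.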

score 0 · Accepted Answer

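Thanks, everyone. I managed to solve this by using a Dictionary instead of a HashSet: the key holds the URL and the value holds an int, 1 if the URL has been scanned and 0 if it has not been processed yet. My code is below. I use another Dictionary, "URL_CANtoSave", to hold the URLs that end in jpg ("my target"). The while loop keeps running until every URL on the website that matches the values you set in the filter string variables has been used up, so the URLs are parsed according to those filters.

So, to break out of the loop, you can specify how many image URLs you want to collect in URL_CantoSave.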

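return Task.Factory.StartNew(() =>
{
    try
    {
        string tempURL;
        int i = 0;

        // Each Dictionary key has a value of 1 or 0 (1 = scanned, 0 = not yet scanned, and depending on what is in the other dictionary ...

        while (URL_Can.Values.Any(value => value == 0))
        {
            // Take one key and put it in a temporary variable
            tempURL = URL_Can.ElementAt(i).Key;

            // Check whether it ends with the target file extension, in this case an image file
            if (tempURL.EndsWith("jpg"))
            {
                // The indexer is used because Dictionary.Add throws if the key is already present
                URL_CanToSave[tempURL] = 0;

                URL_Can.Remove(tempURL);
            }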

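            // If it is not an image, download the page for the URL and keep analysing
            else
            {
                // Only if the URL has not been scanned before
                if (URL_Can[tempURL] != 1)
                {
                    // Add2Dic adds entries to the dictionary without adding a key a second time (this solves the main problem!).
                    // ExtractUrlsfromLink is another procedure; it returns a dictionary holding every link found by downloading
                    // the URL's document string and analysing it. You can add or remove filter strings to suit your analysis.
                    Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink1), URL_Can, false);
                    Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink2), URL_Can, false);

                    URL_Can[tempURL] = 1;  // mark it as a scanned link

                    richTextBox2.PerformSafely(() => richTextBox2.AppendText(tempURL + "\n"));
                }
            }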


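            statusStrip1.PerformSafely(() => toolStripProgressBar1.PerformStep());

            // Another trick: wrap the index around so the iteration keeps going until every collected link has been scanned
            i++;
            if (i >= URL_Can.Count) { i = 0; }

            // Stop early once enough image URLs have been collected
            if (URL_CanToSave.Count >= 150) { break; }
        }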


        richTextBox2.PerformSafely(() => richTextBox2.Clear());

        textBox1.PerformSafely(() => textBox1.Text = URL_Can.Count.ToString());

        return ProcessCompleted = true;
    }
    catch (Exception aih)
    {
        MessageBox.Show(aih.Message);

        return ProcessCompleted = false;
    }

    // Not reachable: both the try block and the catch block return
    {
        richTextBox2.PerformSafely(() => richTextBox2.AppendText(url + "\n"));
    }
});
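The answer does not show Add2Dic, so the following is only a sketch of how the scanned-flag idea could be put together, assuming an in-memory link graph in place of real downloads. `ScannedFlagCrawl`, `AddMissingKeys`, `Crawl`, and `getLinks` are hypothetical names; `AddMissingKeys` is a guess at what Add2Dic does (its third parameter is unknown and is left out), and unlike the answer's code this sketch keeps image URLs in the dictionary, flagged 1, so they cannot be added a second time.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ScannedFlagCrawl
{
    // Hypothetical stand-in for Add2Dic: copies only the keys the target lacks,
    // so a URL already flagged 1 (scanned) is never reset to 0.
    public static void AddMissingKeys(IEnumerable<string> urls, Dictionary<string, int> target)
    {
        foreach (string url in urls)
        {
            if (!target.ContainsKey(url)) target.Add(url, 0);
        }
    }

    // getLinks is a hypothetical stand-in for ExtractUrlsfromLink.
    public static List<string> Crawl(string startUrl, Func<string, IEnumerable<string>> getLinks)
    {
        var state = new Dictionary<string, int> { [startUrl] = 0 }; // 0 = pending, 1 = scanned
        var images = new List<string>();

        while (true)
        {
            // Pick any URL still flagged 0; the query finishes before the dictionary changes.
            string next = state.Where(p => p.Value == 0).Select(p => p.Key).FirstOrDefault();
            if (next == null) break;

            state[next] = 1; // flag it first so it is never picked again
            if (next.EndsWith("jpg", StringComparison.OrdinalIgnoreCase))
                images.Add(next);
            else
                AddMissingKeys(getLinks(next), state);
        }
        return images;
    }

    public static void Main()
    {
        // Hypothetical site with a cycle between X and Y.
        var site = new Dictionary<string, string[]>
        {
            ["X"] = new[] { "Y", "a.jpg" },
            ["Y"] = new[] { "X", "a.jpg", "b.jpg" },
        };
        Console.WriteLine(Crawl("X", url => site[url]).Count); // prints 2
    }
}
```

Each URL moves from 0 to 1 exactly once and is never removed, so the loop ends after as many iterations as there are distinct URLs.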
