4

假设我有以下字符串数组作为输入:

foo-139875913
foo-aeuefhaiu
foo-95hw9ghes
barbazabejgoiagjaegioea
barbaz8gs98ghsgh9es8h
9a8efa098fea0
barbaza98fyae9fghaefag
bazfa90eufa0e9u
bazgeajga8ugae89u
bazguea9guae
aifeaufhiuafhe

这里使用了 3 个不同的前缀,“foo-”、“barbaz”和“baz”——但是这些前缀并不提前知道(它们可能是完全不同的东西)。

你怎么能确定不同的公共前缀是什么,以便它们可以被分组?这有点棘手,因为在我提供的数据中,有两个以“bazg”开头,一个以“bazf”开头,当然“baz”是前缀。

到目前为止,我尝试将它们按字母顺序排序,然后按顺序循环它们并计算一行中有多少个字符与前一个相同。如果数字不同或 0 个字符相同,则开始一个新组。问题在于它落在了我之前提到的“bazg”和“bazf”问题上,并将它们分成两个不同的组(其中一个只有一个元素)

编辑:好吧,让我们再添加一些规则:

  • 较长的潜在组通常应优先于较短的组,除非有一个紧密匹配的组,其长度差异小于 X 个字符。(所以当 X 为 2 时,baz 比 bazg 更受欢迎)
  • 一个组必须至少有 Y 个元素,否则根本不是一个组
  • 可以简单地丢弃与上述规则中的任何“组”都不匹配的元素。

为了澄清与第二条相关的第一条规则,如果 X 为 0 且 Y 为 2,则两个 'bazg' 条目将在一个组中,并且 'bazf' 将被丢弃,因为它是独立的。

4

3 回答 3

5

好吧,这里有一个快速破解,可能O(something_bad)

IEnumerable<Tuple<String, IEnumerable<string>>> GuessGroups(IEnumerable<string> source, int minNameLength=0, int minGroupSize=1)
{
    // TODO: error checking
    return InnerGuessGroups(new Stack<string>(source.OrderByDescending(x => x)), minNameLength, minGroupSize);
}

IEnumerable<Tuple<String, IEnumerable<string>>> InnerGuessGroups(Stack<string> source, int minNameLength, int minGroupSize)
{
    if(source.Any())
    {
        var tuple = ExtractTuple(GetBestGroup(source, minNameLength), source);
        if (tuple.Item2.Count() >= minGroupSize)
            yield return tuple;
        foreach (var element in GuessGroups(source, minNameLength, minGroupSize))
            yield return element;   
    }
}

Tuple<String, IEnumerable<string>> ExtractTuple(string prefix, Stack<string> source)
{
    return Tuple.Create(prefix, PopWithPrefix(prefix, source).ToList().AsEnumerable());
}

IEnumerable<string> PopWithPrefix(string prefix, Stack<string> source)
{
    while (source.Any() && source.Peek().StartsWith(prefix))
        yield return source.Pop();
}

string GetBestGroup(IEnumerable<string> source, int minNameLength)
{
    var s = new Stack<string>(source);
    var counter = new DictionaryWithDefault<string, int>(0);
    while(s.Any())
    {
        var g = GetCommonPrefix(s);
        if(!string.IsNullOrEmpty(g) && g.Length >= minNameLength)
            counter[g]++;
        s.Pop();
    }
    return counter.OrderBy(c => c.Value).Last().Key;
}

string GetCommonPrefix(IEnumerable<string> coll)
{
    return (from len in Enumerable.Range(0, coll.Min(s => s.Length)).Reverse()
            let possibleMatch = coll.First().Substring(0, len)
            where coll.All(f => f.StartsWith(possibleMatch))
            select possibleMatch).FirstOrDefault();
}

public class DictionaryWithDefault<TKey, TValue> : Dictionary<TKey, TValue>
{
  TValue _default;
  public TValue DefaultValue {
    get { return _default; }
    set { _default = value; }
  }
  public DictionaryWithDefault() : base() { }
  public DictionaryWithDefault(TValue defaultValue) : base() {
    _default = defaultValue;
  }
  public new TValue this[TKey key]
  {
    get { return base.ContainsKey(key) ? base[key] : _default; }
    set { base[key] = value; }
  }
}

示例用法:

string[] input = {
    "foo-139875913",
    "foo-aeuefhaiu",
    "foo-95hw9ghes",
    "barbazabejgoiagjaegioea",
    "barbaz8gs98ghsgh9es8h",
    "barbaza98fyae9fghaefag",
    "bazfa90eufa0e9u",
    "bazgeajga8ugae89u",
    "bazguea9guae",
    "9a8efa098fea0",
    "aifeaufhiuafhe"
};

GuessGroups(input, 3, 2).Dump();

在此处输入图像描述

于 2013-05-06T13:16:36.030 回答
1

好的,正如讨论的那样,这个问题最初并没有很好地定义,但这是我将如何解决的问题。

Create a tree T
Parse the list, for each element:
    for each letter in that element
        if a branch labeled with that letter exists then 
            Increment the counter on that branch
            Descend that branch
        else 
            Create a branch labelled with that letter
            Set its counter to 1
            Descend that branch

这为您提供了一棵树,其中每个叶子代表您输入中的一个单词。每个非叶子节点都有一个计数器,表示有多少叶子(最终)连接到该节点。现在您需要一个公式来加权前缀长度(节点的深度)与前缀组的大小。目前:

S = (a * d) + (b * q) // d = depth, q = quantity, a, b coefficients you'll tweak to get desired behaviour

因此,现在您可以遍历每个非叶节点并为它们分配一个分数 S。然后,要计算出您的组,您将

For each non-leaf node
    Assign score S
    Insertion sort the node in to a list, so the head is the highest scoring node

Starting at the root of the tree, traverse the nodes
    If the node is the highest scoring node in the list
        Mark it as a prefix 
        Remove all nodes from the list that are a descendant of it
        Pop itself off the front of the list
        Return up the tree

这应该给你一个前缀列表。最后一部分感觉像是一些聪明的数据结构或算法可以加快速度(去除所有孩子的最后一部分感觉特别弱,但如果你输入的大小很小,我想速度并不是太重要)。

于 2013-05-06T12:52:55.910 回答
0

我想知道您的要求是否没有关闭。似乎您正在寻找特定的分组大小,而不是特定的密钥大小要求。我下面有一个程序,它会根据指定的组大小,将字符串分解成尽可能大的组,并包括指定的组大小。因此,如果您指定组大小为 5,那么它将按可能的最小键对项目进行分组,以使组大小为 5。在您的示例中,它将foo- 分组f,因为不需要将更复杂的键设置为一个标识符。

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConsoleApplication2
{
    class Program
    {
        /// <remarks><c>true</c> in returned dictionary key are groups over <paramref name="maxGroupSize"/></remarks>
        public static Dictionary<bool,Dictionary<string, List<string>>> Split(int maxGroupSize, int keySize, IEnumerable<string> items)
        {
            var smallItems = from item in items
                             where item.Length < keySize
                             select item;
            var largeItems = from item in items
                             where keySize < item.Length
                             select item;
            var largeItemsq = (from item in largeItems
                               let key = item.Substring(0, keySize)
                               group item by key into x
                               select new { Key = x.Key, Items = x.ToList() } into aGrouping
                               group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
                               select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
            if (smallItems.Any())
            {
                var smallestLength = items.Aggregate(int.MaxValue, (acc, item) => Math.Min(acc, item.Length));
                var smallItemsq = (from item in smallItems
                                   let key = item.Substring(0, smallestLength)
                                   group item by key into x
                                   select new { Key = x.Key, Items = x.ToList() } into aGrouping
                                   group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
                                   select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
                return Combine(smallItemsq, largeItemsq);
            }
            return largeItemsq;
        }

        static Dictionary<bool, Dictionary<string,List<string>>> Combine(Dictionary<bool, Dictionary<string,List<string>>> a, Dictionary<bool, Dictionary<string,List<string>>> b) {
            var x = new Dictionary<bool,Dictionary<string,List<string>>> {
                { true, null },
                { false, null }
            };
            foreach(var condition in new bool[] { true, false }) {
                var hasA = a.ContainsKey(condition);
                var hasB = b.ContainsKey(condition);
                x[condition] = hasA && hasB ? a[condition].Concat(b[condition]).ToDictionary(c => c.Key, c => c.Value)
                    : hasA ? a[condition]
                    : hasB ? b[condition]
                    : new Dictionary<string, List<string>>();
            }
            return x;
        }

        public static Dictionary<string, List<string>> Group(int maxGroupSize, IEnumerable<string> items, int keySize)
        {
            var toReturn = new Dictionary<string, List<string>>();
            var both = Split(maxGroupSize, keySize, items);
            if (both.ContainsKey(false))
                foreach (var key in both[false].Keys)
                    toReturn.Add(key, both[false][key]);
            if (both.ContainsKey(true))
            {
                var keySize_ = keySize + 1;
                var xs = from needsFix in both[true]
                         select needsFix;
                foreach (var x in xs)
                {
                    var fixedGroup = Group(maxGroupSize, x.Value, keySize_);
                    toReturn = toReturn.Concat(fixedGroup).ToDictionary(a => a.Key, a => a.Value);
                }
            }
            return toReturn;
        }

        static Random rand = new Random(unchecked((int)DateTime.Now.Ticks));
        const string allowedChars = "aaabbbbccccc"; // "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";
        static readonly int maxAllowed = allowedChars.Length - 1;

        static IEnumerable<string> GenerateText()
        {
            var list = new List<string>();
            for (int i = 0; i < 100; i++)
            {
                var stringLength = rand.Next(3,25);
                var chars = new List<char>(stringLength);
                for (int j = stringLength; j > 0; j--)
                    chars.Add(allowedChars[rand.Next(0, maxAllowed)]);
                var newString = chars.Aggregate(new StringBuilder(), (acc, item) => acc.Append(item)).ToString();
                list.Add(newString);
            }
            return list;
        }

        static void Main(string[] args)
        {
            // runs 1000 times over autogenerated groups of sample text.
            for (int i = 0; i < 1000; i++)
            {
                var s = GenerateText();
                Go(s);
            }
            Console.WriteLine();
            Console.WriteLine("DONE");
            Console.ReadLine();
        }

        static void Go(IEnumerable<string> items)
        {
            var dict = Group(3, items, 1);
            foreach (var key in dict.Keys)
            {
                Console.WriteLine(key);
                foreach (var item in dict[key])
                    Console.WriteLine("\t{0}", item);
            }
        }

    }
}
于 2013-05-06T18:36:37.323 回答