I wanted to store some pixels locations without allowing duplicates, so the first thing comes to mind is HashSet or similar classes. However this seems to be very slow compared to something like HashSet.

For example, this code:

HashSet points = new HashSet();
using (Bitmap img = new Bitmap(1000, 1000))
{
    for (int x = 0; x < img.Width; x++)
    {
        for (int y = 0; y < img.Height; y++)
        {
            points.Add(new Point(x, y));
        }
    }
}

takes about 22.5 seconds.

While the following code (which is not a good choice for obvious reasons) takes only 1.6 seconds:

HashSet points = new HashSet();
using (Bitmap img = new Bitmap(1000, 1000))
{
    for (int x = 0; x < img.Width; x++)
    {
        for (int y = 0; y < img.Height; y++)
        {
            points.Add(x + "," + y);
        }
    }
}

So, my questions are:

  • Есть ли для этого причина? Я проверил этот ответ , но 22,5 секунды - это намного больше, чем цифры, указанные в этом ответе.
  • Есть ли лучший способ хранить баллы без дубликатов?

Ответы (2)

There are two perf problems induced by the Point struct. Something you can see when you add Console.WriteLine(GC.CollectionCount(0)); to the test code. You'll see that the Point test requires ~3720 collections but the string test only needs ~18 collections. Not for free. When you see a value type induce so many collections then you need to conclude "uh-oh, too much boxing".

At issue is that HashSet needs an IEqualityComparer to get its job done. Since you did not provide one, it needs to fall back to one returned by EqualityComparer.Default(). That method can do a good job for string, it implements IEquatable. But not for Point, it is a type that harks from .NET 1.0 and never got the generics love. All it can do is use the Object methods.

The other issue is that Point.GetHashCode() does not do a stellar job in this test, too many collisions, so it hammers Object.Equals() pretty heavily. String has an excellent GetHashCode implementation.

You can solve both problems by providing the HashSet with a good comparer. Like this one:

class PointComparer : IEqualityComparer {
    public bool Equals(Point x, Point y) {
        return x.X == y.X && x.Y == y.Y;
    }

    public int GetHashCode(Point obj) {
        // Perfect hash for practical bitmaps, their width/height is never >= 65536
        return (obj.Y << 16) ^ obj.X;
    }
}

And use it:

HashSet list = new HashSet(new PointComparer());

And it is now about 150 times faster, easily beating the string test.

The main reason for the performance drop is all the boxing going on (as already explained in Hans Passant's answer).

Apart from that, the hash code algorithm worsens the problem, because it causes more calls to Equals(object obj) thus increasing the amount of boxing conversions.

Также обратите внимание, что хэш-кодPoint вычисляется x ^ y. Это приводит к очень небольшому разбросу в вашем диапазоне данных, и поэтому сегменты HashSetпереполнены - чего не происходит с string, где разброс хэшей намного больше.

Вы можете решить эту проблему, реализовав свою собственную Pointструктуру (тривиальную) и используя лучший алгоритм хеширования для ожидаемого диапазона данных, например, сдвинув координаты:

(x << 16) ^ y

For some good advice when it comes to hash codes, read Eric Lippert's blog post on the subject.

2022 WebDevInsider