 |
|
 |
|
On Tue, 31 Jan 2012 23:28:07 +0400, Andrey Semashev <...@gmail.com
Right.
Wrong, there's no such requirement. It must work, not necessarilly well.
Simple: it is possible to design a hash function that will cause suboptimal
performance with the container. No matter how you implement the container,
with bit mixing or not, this statement still holds.
I understand your reasoning. You're trying to reduce the probability of
unordered container being inefficient because of an inefficient hash function.
This also approach has the benefit of simplification of hash functions
definition.
What I'm saying is that this pessimization is the wrong thing to do because
you make users pay for what they don't need. Most of the time, when you deal
with hash containers, you have a pretty good idea of the key nature and the
better ways to hash key values. There are times when the key values already
have a good distribution (e.g. the keys are GUIDs) and there is no need in
additional bit mixing. For other cases there are known good hashing algorithms
so you often just use the one which better fits your needs. And for standard
types I would expect std::hash to do the right thing. I know, the standard
doesn't spell out that explicitly (which is a bad thing), but this is how it
works in practice (from my experience).
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 20:06:23 +0000, Daniel James <...@gmail.com
On 31 January 2012 19:28, Andrey Semashev <...@gmail.com
Wow. I give up.
template <struct _LIBCPP_VISIBLE hash<int : public unary_function<int, size_t{
_LIBCPP_INLINE_VISIBILITY
size_t operator()(int __v) const _NOEXCEPT {return
static_cast<size_t};
Last time I tried it, libstdc++ was similar. I do want the containers
to work well with popular implementations of std::hash.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 01 Feb 2012 00:17:22 +0400, Andrey Semashev <...@gmail.com
Too bad. I really hope these libraries will get improved eventually.
Otherwise, std::hash is useless aside from std::unordered* containers.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 1 Feb 2012 10:57:18 -0500, Howard Hinnant <...@gmail.com
You've alluded to the crux of the issue:
Is std::hash<scalar
The standard is silent on this question.
A good general purpose std::hash<scalar
A std::hash<scalar
As a judgement call, I calculated that I would have *many* more clients who are casual users of the std::unordered containers, defaulting most options such as the hash function, than I would have clients interested in using std::hash<scalar
So I stand by the current implementation of std::hash<int
The current hash<int
Implications of that assumption: If you're in need of a hash function for some need other than a good match for the std::unordered containers, std::hash may not be the solution you're looking for. The only std::solutions I'm aware of that might come close to fitting such a need are found in <random
Howard
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 01 Feb 2012 21:15:48 +0400, Andrey Semashev <...@gmail.com
Thank you for your careful answer. As I understand, in order to compensate the
std::hash simplicity you do additional hashing (bit mixing or modulus) in the
container. May I ask, have you considered the other approach - to make the
container simpler but add bit mixing to std::hash? Why you chose not to follow
this path?
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 1 Feb 2012 18:20:33 +0100, Roman Perepelitsa <...@gmail.com
2012/2/1 Andrey Semashev <...@gmail.com
std::hash can be specialized for user defined types. If unordered
containers were to assume that std::has does bit mixing, users would have
harder time specializing std::hash which could lead to performance bugs.
Roman Perepelitsa.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 01 Feb 2012 21:35:06 +0400, Andrey Semashev <...@gmail.com
When writing a custom hash function, whether std::hash specialization or not,
one cannot rely on bit mixing in the container because it is not mandated by
the standard. So one would write the function with bit mixing anyway.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 1 Feb 2012 12:52:44 -0500, Howard Hinnant <...@gmail.com
Yes, I've considered that path. The libc++ containers are constrained to hold only a prime number of buckets. And all hashes are modulo'd to fit within the number of buckets. The use of modulus combined with a prime number of buckets has a long and successful history for hash containers, dating back longer than I've been in this industry.
Another approach I've read about is to use a power of 2 number of buckets and an & operation to bring the hash into the bucket range. If I had chosen this path I would probably also choose to use a stronger bit mixing for all default hash functions. Otherwise, poor collision performance is all but assured.
One of the reasons I prefer the prime solution to the power of 2 solution is I believe it provides better protection against user-written hash functions that happen to be poorly chosen. I have decided to opt against "well, you wrote a poor hash function, you get what you deserve." Had this protection been unreasonably expensive, I would not have chosen this path. However, in my estimation this cost is very, very low. The use of the modulus is more expensive than an &, but not greatly so on modern hardware.
Expensive things to do on older hardware include instructions like 64 bit modulus and floating point divide. When I wrote Fortran 30 years ago you could literally predict the performance of numerical code by just counting the floating point divides. You could completely ignore the additions and multiplications because they were practically free by comparison.
On modern hardware expensive things to do include flushing the L1 cache and allocating memory. By comparison, individual instructions (except for memory barriers) are practically free.
The power of 2 solution is often associated with "incremental rehashing". As I recall this has the advantage of spreading a rehash out over many insertions so that no one insertion is very expensive. However the API of the std::unordered containers is very "vector-like": You can predict when rehash's are coming by inspecting things like the number of buckets, the load factor and the maximum load factor. You can even plan when your rehashes are going to happen by pre-allocating a predicted number of needed buckets, and or manipulating max_load_factor, and then calling rehash explicitly at convenient times.
So because of the ability to fine tune when and how rehashes happen, the chief selling point of incremental rehashing (and subsequently power of 2 number of buckets) is somewhat muted, imho.
I am not always right. Indeed I pride myself on continually learning new things. Another way of pronouncing "continually learning" is: I was wrong, I learned something new, and now I think I'm right. So I could be wrong here too, and am about to learn something. I accept that. However I've been studying and shipping hash containers for about a dozen years now, and my current position is based on my customer feedback over that period of time.
Howard
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 1 Feb 2012 17:59:10 +0000, "Stephan T. Lavavej" <...@exchange.microsoft.com
[Howard Hinnant]
Integer divisions are surprisingly expensive, relatively speaking. While changing deque, we introduced integer divisions that noticeably harmed performance. We got rid of them (restoring the perf and then some) by introducing a power-of-2 guarantee that let us use bitwise operations instead.
Of course, this will only matter in very high performance code, like the fundamental operations performed by a container.
STL
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 01 Feb 2012 17:03:14 -0500, Topher Cooper <...@topherc.net
Ah, good. Something non-contentious that I can contribute.
There is a standard technique used in hashing and some other situations,
that allows a divide/mod to reduce a value to an arbitrary range with a
multiply and shift (frequently, in practice, multiplying into a wider
width location, then indexing). For this to work well, however, higher
order bits/digits/oits/hits/... must be well distributed. If this
cannot be relied upon, techniques like bit rotates, swapping halves,
xor-oring low bytes onto the high ones or doing a multiply by a constant
ignoring overflow, before applying the algorithm can fix things.
The technique is to multiply the original value by the range and just
use the overflow. Just think of the value to be hashed as the numerator
of a rational with one plus the maximum value as the denominator.
Topher
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 16:12:35 +0100, Olaf van der Spek <...@vdspek.org
On Tue, Jan 31, 2012 at 4:10 PM, Kai Schroeder
<...@googlemail.com
Yes, sounds like a bug in TBB. I don't think it's a good idea to discard bits.
Olaf
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 11:19:29 -0500, Dave Abrahams <...@boostpro.com
on Tue Jan 31 2012, Olaf van der Spek <ml-AT-vdspek.org
Perhaps not, but it seems to me that we have a problem, too.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 20:50:40 +0400, Andrey Semashev <...@gmail.com
In case you missed it, there was this [1] conversation which basically boiled
down to where bit mixing of the hash value should be implemented - in the hash
function or the container itself. The standard doesn't mandate this, AFAICT,
but my understanding is that it's the hash function's job.
[1] <http://news.gmane.org/find-
root.php?group=gmane.comp.lib.boost.devel&article=227609
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 18:33:25 +0100, Thomas Klimpel <...@synopsys.com
At least that is how I understood that conversation
What do you mean by "my understanding"? Your understanding of the standard, or your understanding of the best solution, or your understanding of that conversation? You also advocated your "understanding" during that conversation. However, I got the impression that the other party didn't really wanted to discuss about that:
Regards,
Thomas
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 21:57:32 +0400, Andrey Semashev <...@gmail.com
The standard doesn't favor or preclude either approach, so either way the
implementation would be conforming. So, you might say it's my understanding of
both the standard and the "best" solution. I realize that my opponent had a
different opinion on this and I didn't convince him.
I remember. And it saddens me.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 18:55:21 +0000, Daniel James <...@gmail.com
On 31 January 2012 17:57, Andrey Semashev <...@gmail.com
The requirements for the hash function are:
For two different values
t1 and t2, the probability that h(t1) and h(t2) compare
equal should be very small, approaching 1.0 / numeric_-
limits<size_t
There's no requirement about how h(t1) and h(t2) are distributed so it
must be the container's responsibility, since the container must work
well with any hash function that meets these requirements. How else
can you interpret it?
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 01 Feb 2012 10:11:27 -0500, Dave Abrahams <...@boostpro.com
on Tue Jan 31 2012, Daniel James <dnljms-AT-gmail.com
You can interpret it as a QOI issue, and we should aim to provide the
highest QOI.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 1 Feb 2012 16:02:35 +0000, Daniel James <...@gmail.com
On 1 February 2012 15:11, Dave Abrahams <...@boostpro.com
I think you might have misunderstood my point, the question here is
whether the containers should work well with hash functions (i.e.
other than std::hash and boost::hash) that meet the standard's
requirements but don't have a good distribution, even though that
makes the container a bit slower (this came up on the other thread
because the hit from doing a modulus is much higher on 64-bit
machines). Andrey thinks they shouldn't, I think they should.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 19:44:57 -0500, Topher Cooper <...@topherc.net
As what it says.
It does not say anything about the distribution from which the two
different values are drawn so the implication is that for /any /two
values the probability must approach 1/max-value. Technically, of
course, this is impossible -- you can always choose the values from a
distribution based on inverting the hash (that this might be extremely
costly or even infeasible is irrelevant). A reasonable view, however,
is that if the values are chosen using any reasonable method that is not
too specific to the particular hash method then the probability of a
collision is as stated. Any simple pattern in the input that results in
a pattern in the output -- a restriction over the possible range of
output values for meaningful subsets of all possible values -- will
produce collision probabilities much higher than that.
If we do not read it that way -- if we take "two different values" to
mean "two values uniformly selected from all possible input values" --
then the requirement is trivial. For any kind of value that is
represented in log-2(numeric_limits<size_tuse the identity function, for anything larger, just use the first
bits-per-size_t-value bits and throw out the rest. That is all it takes
to conform to the standard by that reading.
But even if this was not the case and this hash function can be taken to
conform to the standard --
There is no explicit requirement for quality anywhere in the standards
to the best of my knowledge. For example, a linear asymptotic
performance requirement does not /specify/ that the there isn't a two
minute overhead independent of N for each operation -- avoiding that in
an implementation is about /quality/ rather than /conformance/. You can
consider it to be an accurate implementation of the standard but you
can't consider it a good one.
There are specialized applications where a poorly distributed hash (at
least poorly distributed in just the right way) is not only acceptable
but required. Sometimes you want to ignore some specific differences so
that values that differ only by those characteristics still hash to the
same value (obvious example -- case-insensitive string hashes). When
hashing is used to locate similar objects the hash function must
transform values that differ in small ways -- that are similar -- to
hash values close to each other. The classic Gorin spelling correction
algorithm hashed each word from a document by its length and first two
letters allowing any word with a single letter insertion, deletion or
replacement or a two letter transposition to be easily found, and there
are lots of geometric algorithms that reduce dimensionality or range in
a desirable way so that "close" points are still close, or at least
likely to be.
But there are a number of characteristics that are routinely considered
the characteristics of a /general purpose/ hash function of at least
moderate quality and the avoidance of this kind of regularity is one of
them.
The suggestion that the output could be fixed by adding a "bit mixing
function" is not really relevant, and neither is the statement that
because the standard does not specify a good quality hash that it is the
container's responsibility to "fix it". The only way for the container
to do that is to rehash the hash value -- which is essentially what the
bit-mixing function does (such a function would be a not-terrible hash
function for fixed width values for many purposes -- as long as input
and output having the same number of 1-bits is not a problem. Note,
however, that the specific example that demonstrated the problem
produced lots of bits the same, which would still reduce the number of
possible values for structured inputs however much the bits were
shuffled around).
To specify that a hash-based algorithm should be written to be proof
against poorly distributed inputs essentially means that the expectation
is that the input hash value should be rehashed. Of course, its
standard for have hash tables to have prime-like lengths (generally
taken to mean to avoid lengths with small -- especially powers of 2 --
divisors) but that has to do with the quality of the necessary range
reduction (a.k.a., value folding). This is generally considered good
enough precaution to allow "raw integer values" to be used directly for
many applications, but for less casual use, this is not acceptable
(values that differ by the table size end up in the same bin, and values
that differ by a small amount end up in nearby bins -- which can cause
excess collisions for data with some corelational structure).
Lets turn one of the arguments made on its head. Why should someone who
finds it convenient to use a reasonable quality hash on their data type
pay the overhead of using a container that compensates for the
possibility of a poor one? Is it a more reasonable expectation that
someone who wishes a different balance of quality vs performance then
the default should rewrite the standard hash or rewrite the standard
unordered_map?
Sorry for the length -- I had a lot to say, I guess.
Topher Cooper
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 1 Feb 2012 08:59:56 +0000, Daniel James <...@gmail.com
On 1 February 2012 00:44, Topher Cooper <...@topherc.net
Is that likely though? And any more likely than other causes of collisions?
The STL hash containers are pretty much required to be implemented
using a chained hash table.
There are very good alternative open source implementations out there.
You shouldn't need to rewrite anything.
Why do you think I said I didn't want to get into this debate? I do
realise that there's a lot to say, and I don't have any enthusiasm for
saying it. I wrote the thing 7 years ago.
IMO a more appropriate forum would be one of the C++ newsgroups, as
this is more about how the standard should be implemented, rather than
my particular implementation which should really become obsolete over
the coming years. Boost.Hash's advantages over standard
implementations are its portability and its customisation abilities.
The former will hopefully become less relevant, the latter doesn't
seem to have been that popular. In other respects the standard
implementations should be better. I'm not going to compete with them.
If anyone does want to give it a shot, I'd recommend starting from
scratch (probably using existing hash algorithms) or using something
like libc++'s implementation as a starting point. Most of the effort
that's gone into Boost.Hash was to get it working with older
compilers, and the implementation has been wrapped around that.
There's also the problem with implicit casts which is something of a
flaw in the customisation mechanism, so that could do with a rethink.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 01 Feb 2012 14:08:17 -0500, Topher Cooper <...@topherc.net
It is immensely more likely than any of the causes of /systematic/
collisions in a reasonable implementations that avoids destroying
expected performance (O(1)) and replacing it with worst case performance
(O(N)) for simple regularities in the input data. The expected
performance takes into account the occurrence of non-systematic
collisions. No algorithm can guarantee that there cannot be systematic
collisions (since it must be deterministic) but it is easy to make sure
that regularities in the input data would have to be incredibly
arbitrary to cause them.
Topher Cooper
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 1 Feb 2012 13:43:30 +0400, Andrey Semashev <...@gmail.com
On Wed, Feb 1, 2012 at 12:59 PM, Daniel James <...@gmail.com
I think, this list is appropriate for the discussion, as it is about
Boost.Hash (and Boost.Unordered, as long as it is intended to be used
with Boost.Hash). At least, it's how it started.
I don't think Boost.Hash and Boost.Unordered have to die out
eventually. One possible direction of further development of these
libraries could be an attempt to fix flaws in the STL implementations
by providing superior tools.
A few posts back you cited std::hash implementation from libc++. I
doubt it can be worse than that.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 01 Feb 2012 15:43:51 -0500, Topher Cooper <...@topherc.net
But if, as you say, the standard implies this trade-off, then conformant
implementations will end up with roughly the same trade-off. This is
only an option if your justification for a low-quality hash function is
incorrect. Hashing functions are easily implemented and fairly easily
adjusted, replacing a complete container implementation or adjusting one
is much more work. If one has a separate hash function and hash
container it makes more sense to depend on the hash function to be
responsible for the quality of the hash rather than the container, in
addition to it being simpler for the developer using it.
Since there are /very good/ implementations available, what is the
justification for a poor one in the Boost library?
Besides, it would not occur to most developers that the Boost
implementation of standard hash is not up to the quality standards that
Boost is known for overall. It never occurred to me to check, and I
have the necessary background to determine that it isn't if I had
checked, and to understand the consequences that it is not. I have some
code to replace, and some performance results to review to see if they
were effected (probably not, but I need to check to be sure).
Topher
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 1 Feb 2012 21:02:16 +0000, Daniel James <...@gmail.com
On 1 February 2012 20:43, Topher Cooper <...@topherc.net
I meant alternative open source hash tables, not necessarily ones that
meet the standard's unordered container requirements.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 01 Feb 2012 16:39:51 -0500, Topher Cooper <...@topherc.net
So your suggestion is that if they discover a problem that requires a
different performance trade-off, that they should be expected to either
rewrite their code or write an interface adaptor, instead of being
expected to rewrite a few lines of hash function code to meet their needs?
Topher
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Thu, 2 Feb 2012 22:29:17 +0000, Daniel James <...@gmail.com
On 1 February 2012 21:39, Topher Cooper <...@topherc.net
No, my suggestion is that they do whatever floats their boat.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 01 Feb 2012 16:13:01 -0500, Topher Cooper <...@topherc.net
I'm sympathetic -- I don't have time for this either, but I didn't feel
that leaving these points of your justification unanswered was responsible.
I've made lots of much worse errors in the past -- nothing wrong about
that. And you are to be commended, like all the Boost library
implementors, for having made a substantial contribution. If you don't
want to defend the decisions you made, though, just admit to
misjudgement or let it pass.
The quality of the Boost implementation and whether it conforms to the
requirements and Boost quality standards belongs in the Boost list.
There may also be a place for part of the discussion in more general C++
lists, if the belief that the specs imply low quality hash functions is
widespread in the community. Could be, but I have no evidence to that
effect.
The point is that there is a case to be made that this implementation
should have been obsolete seven years ago at its birth. That this
wasn't caught by the community back then is a failure of the whole
community -- especially those of us who know something about the
subject, even those like me whose community participation is sporadic.
We can't do anything about the last seven years, but I think that it is
clear that it should already be considered obsolete, and that replacing
it should be a group priority. Unfortunately, my clients would be
justified, I'm afraid, in considering the time I've spent writing on
this as being somewhat irresponsible to them (deadlines long since unmet
on work already paid for). If someone hasn't done something when I come
up for air, I'll see about fixing this myself.
As for its customisation (cap)abilities you argued earlier in the same
message that a major aspect of its customizability is that it can be
replaced with something else.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Thu, 2 Feb 2012 22:26:19 +0000, Daniel James <...@gmail.com
On 1 February 2012 21:13, Topher Cooper <...@topherc.net
Funnily enough, that wasn't actually my design decision. Well, I
suppose I chose to follow someone else's design. Given the
requirements (meeting the draft standard, being highly portable
without requiring too much development effort), I do think it was the
right decision. And you've provided no evidence to the contrary, which
is
I do feel an obligation to respond to emails concerning my libraries.
I know I shouldn't but there you go.
I mentioned the implementation of std::hash<intlibstdc++ earlier.
Blimey. If it wasn't clear, I'm very happy for other people to propose
a new hash library. This one was written to meet a need, not to be the
bestest hash function ever. The only real problem has been the
implicit cast issue.
I'm sorry I forced you to waste your time.
You seem to be arguing against a point of view that no one holds by
using my explanation for not holding that point of view.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Thu, 02 Feb 2012 20:07:04 -0500, Topher Cooper <...@topherc.net
If I considered it a waste, I wouldn't have devoted the time -- there
was no "force" involved.
Things have gotten a bit heated, my fault, I apologize.
My misunderstanding. I argued that the choice required that a user
wishing some different trade-offs would need to modify or rewrite a
large complex piece of code instead of a small, simple one. Your
response was that they could just use a completely different,
non-standard conformant, library instead of customizing the code. I'm
not sure what you are now saying (really) -- that I was right and that
there is no reasonable way for library users to choose different
performance trade-offs?
I really am not trying to attack, just to understand.
Topher Cooper
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Sat, 4 Feb 2012 09:52:19 +0000, Daniel James <...@gmail.com
On 3 February 2012 01:07, Topher Cooper <...@topherc.net
It isn't that bad. STL style containers are pretty well established so
containers are often replaceable. For example, Google's hash
containers are written to have a similar API to the standard
containers, although they add some extra requirements to their
elements. I don't think they implement anything from C++11 (yet?).
Bizarrely, they even supply local iterators, even though they don't
really do what they should, and I don't think anyone would miss them.
To some extent, you need to accept reduced performance when using
generic components. They do tend to be less efficient than more
specialised ones. The STL and Boost have both reduced that cost, but
it is still there. An example of this is allocators, fully supporting
C++11 allocators can introduce inefficiencies or prevent optimisations
because of the need to support custom pointer classes. A certain
amount of template magic can reduce that cost though. Something I've
considered in the past is adding a type trait for hash functions which
specifies when power of two buckets can be used, although I'm always
wary of introducing extra complexity to the implementation.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 01 Feb 2012 15:15:59 -0500, Topher Cooper <...@topherc.net
I've seen that claim, and I might well be missing something, but I
believe that coalesced hashing would also work. Deletion (erasure)
algorithms that retain an expectation of a few probes up to very high
load factors are complicated and somewhat costly though (as I remember
-- its been a long time since I worked with this) asymptotically within
bounds. I don't know of any shortcuts in resizing beyond rehashing all
the entries, but the requirements are consistent with that, and resizing
a growing table needs to be done less often for chained hashing due to
good performance up to much higher load factors than external chaining.
The cost is a less trivial implementation (but that is what libraries
are for), and constant higher expected cost of element erasure. The
gain is frequent much better memory usage.
In any case, how collisions are handled is irrelevant to my point. If
it wasn't clear, the structure that I was describing as possibly
"corelational" is structure in the data to be hashed rather than the
structure of the container.
Let's take an example. Let's say that I'm mapping usernames to some
kind of object, and that usernames consist of a last name plus a number
assigned in order when there is a duplicate -- a common practice. The
presence of a last name increases the probability that a last name is
common, so the presence of "cooper3" makes it more likely that there
will be a "cooper4". If the hashing scheme hashes those "adjacent"
usernames to adjacent hash-values/bins then if "james23" happens to
collide with "cooper3" then "james24" will also collide with "cooper4".
Consequently, under these conditions, moderate load factors will result
in somewhat worse than the constant performance expected from truly
uniform hashes.
For what it is worth, FNV1 hashing suffers from precisely this
correlational sensitivity under these circumstances (although having the
next last byte being likely to end up in either the next /or/ the
previous bin but will do so consistently and this may result in the
third in the sequence being more likely to collide with the first),
which is why the so called "slightly better" distribution properties of
FNV1a (which corrects this defect) are not as trivial as the developers
imply.
Topher
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 1 Feb 2012 21:01:49 +0000, Daniel James <...@gmail.com
On 1 February 2012 20:15, Topher Cooper <...@topherc.net
Note that I said 'pretty much required', not 'required'.
Container elements can be uncopyable and unmovable.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 01 Feb 2012 16:32:36 -0500, Topher Cooper <...@topherc.net
Ah, I did miss that, but it just means that the underlying table include
indirection as open chaining generally does intrinsically (I say
generally, because it is possible to implement open chaining if that
were not a requirement with the first entry contained in the table).
Topher Cooper
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Thu, 2 Feb 2012 22:28:35 +0000, Daniel James <...@gmail.com
On 1 February 2012 21:32, Topher Cooper <...@topherc.net
Wouldn't that defeat the purpose? If you have to follow an indirection
for every key comparison, then you've lost the main advantage of using
coalesced hashing.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Thu, 02 Feb 2012 19:04:46 -0500, Topher Cooper <...@topherc.net
I'd have to look at the figures in detail, but I don't think so. You do
straight-forward coalesced hashing either with the key kept in the
table, or perhaps its size_t sized hash-values as proxies, and pointers
to separately allocated vector for the values pointed at. You still get
all the goodness -- essentially better paging, less space, and compared
to external chaining with as many entries as bins, about equal failure
search probes and slightly better success search probes.
In any case, the issue is whether it conforms to the standard and is a
reasonably reasonable choice. I'm not recommending it -- and certainly
not as the primary implementation. Its possible that a library might
provide it as an alternative, chosen either manually or via policies.
Topher Cooper
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 17:19:28 +0100, Kai Schroeder <...@googlemail.com
I have found that this behaviour of boost hash is described in the
documentation:
http://www.boost.org/doc/libs/1_48_0/doc/html/hash/rationale.html
So, this is by design. This rationale is understandable, however it
can lead to difficult to find performance problems as in our case.
Btw.: std::hash in MSVC10 does not show this behaviour. Well, I tend
to agree that this is a bug in TBB, still I would personally prefer a
better hash function (even if it is a bit slower) by default. Maybe a
fast_hash option could be provided as an alternative.
Best regards,
Kai
On 1/31/12, Olaf van der Spek <...@vdspek.org
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 17:59:36 +0000, "Stephan T. Lavavej" <...@exchange.microsoft.com
[Kai Schroeder]
In VC11, we've reimplemented std::hash. We now use FNV-1a for everything, which is incredibly fast (significantly faster than before, according to my measurements) and pretty good at mixing bits.
STL
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 19:58:04 +0000, Daniel James <...@gmail.com
On 31 January 2012 16:19, Kai Schroeder <...@googlemail.com
I actually don't. If TBB requires a better quality hash function, then
that's fine. I also wouldn't use boost::hash or std::hash with
google's hash tables.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 20:31:20 +0000, Christopher Jefferson <...@bubblescope.net
It would be easy to provide a boost::hash_shuffle, that could be applied to any boost::hash and provide this stronger requirement (that there is no corrolation between (a-b) and (hash(a)-hash(b)). This would avoid the need to re-write all the existing hash functions.
Chris
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Tue, 31 Jan 2012 20:57:11 +0000, Daniel James <...@gmail.com
On 31 January 2012 20:31, Christopher Jefferson <...@bubblescope.net
Feel free to do so. Although, I'd probably write it to use better
algorithms where possible, and just use boost::hash as a fallback for
when they're not. I think I've mentioned before that I regret calling
it 'boost::hash', 'boost::functional::hash' would have been a better
name.
Btw. to go back to the original post, I might change the mix function
for floats, it might be problematic as it is. Also, the float hash is
actually a bit slow because it's doesn't rely on the binary
representation of floating point numbers. But when the binary
representation is known to be safely hashable, it might be better to
hash that. I already do that for cygwin because the floating point
functions weren't available there. The problem is accounting for the
different representations, which I don't want to spend too much time
dealing with.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
 |
|
 |
|
On Wed, 1 Feb 2012 16:04:56 +0000, Daniel James <...@gmail.com
On 1 February 2012 15:09, Dave Abrahams <...@boostpro.com
There has been talk of other, more comprehensive, hashing libraries
since which would like to use the namespace 'boost::hash', but it
isn't available. It feels wrong to me that such a slight library
grabbed the name. Since it was accepted boost has moved more towards
putting classes in sub-namespaces. 'functional' isn't that a big a
deal, it could be 'boost::container::hash', or
'boost::unordered::hash'.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
|
|
 |
|
 |
|
|