Imagine there is a population/database/dictionary and we would like to distinguish its elements.

So for each element, let us somehow encode its individual features (e.g. using a hash function) as a bit sequence - the most dense way is to use sequence of uncorrelated P(0)=P(1)=1/2 bits.

We can now create minimal prefix tree required to distinguish these sequences, like in the figure below.

For such ensemble of random trees of given number of leaves (), we can calculate Shannon entropy () - the amount of information it contains.

It turns out that it asymptotically grows with at average 2.77544 bits per element .

The calculations can be found here: [1206.4555] Optimal compression of hash-origin prefix trees

Is it the minimal cost of distinguishability/individuality?

How to understand it?

ps. related question: can anyone the sum for average depth of node: ?

Clarification:

The first thought about this distinctiveness is probably n! combinatorial term while increasing the size of the system, but n! is about the order and its logarithm grows faster than linearly.

Distinctiveness is something much more subtle.

It is well seen in equation (10) from the paper (which should like):

where is the average depth of leaves of such n leaf tree - average amount of bits distinguishing given element.

So this equation says:

distinctiveness/individualism + order = total amount of information

distinctiveness grows linearly with n (2.77544 asymptotic linear coefficient)

information about their ordering grows faster: .

Update: Slides about it and graph compression: http://dl.dropbox.com/u/12405967/hashsem.pdf