# Thread: How much one's individuality cost? At least 2.77544 bits of information

1. Imagine there is a population/database/dictionary and we would like to distinguish its elements.
So for each element, let us somehow encode its individual features (e.g. using a hash function) as a bit sequence - the most dense way is to use sequence of uncorrelated P(0)=P(1)=1/2 bits.
We can now create minimal prefix tree required to distinguish these sequences, like in the figure below.
For such ensemble of random trees of given number of leaves (), we can calculate Shannon entropy () - the amount of information it contains.
It turns out that it asymptotically grows with at average 2.77544 bits per element .
The calculations can be found here: [1206.4555] Optimal compression of hash-origin prefix trees

Is it the minimal cost of distinguishability/individuality?
How to understand it?

ps. related question: can anyone the sum for average depth of node: ?

Clarification:
The first thought about this distinctiveness is probably n! combinatorial term while increasing the size of the system, but n! is about the order and its logarithm grows faster than linearly.
Distinctiveness is something much more subtle.
It is well seen in equation (10) from the paper (which should like):

where is the average depth of leaves of such n leaf tree - average amount of bits distinguishing given element.
So this equation says:
distinctiveness/individualism + order = total amount of information

distinctiveness grows linearly with n (2.77544 asymptotic linear coefficient)
information about their ordering grows faster: .

Update: Slides about it and graph compression: http://dl.dropbox.com/u/12405967/hashsem.pdf

2.

3. Even individuality is devaluating in our times - to about 2.3327464 bits per element/specimen (thanks to James Dow Allen): http://groups.google.com/forum/#!top...on/j7i-eTXR14E
But this is the final minimal price tag - it cannot be reduced further.
Specifically, for the minimal prefix tree, a random sequence (representing individual features of a specimen) has about 0.721 probability of being identified as belonging to the population ... so if we are interested only in distinguishing inside the population, we can afford increasing this probability up to 1.
To reduce the amount of information in the minimal prefix tree, let us observe that if there appears degree 1 node inside the tree, all sequences from the population going through that node will certainly go in the corresponding direction - we can save 1 bit about the information which exactly is this direction.
In standard prefix tree these degree 1 nodes were the place where it could turn out that an outsider does not belong to the population - removing this information would raise false positive probability from 0.721 to 1.

So if for sequences (0101.., 1010.., 1001..),
the minimal prefix tree remembers (without ordering!): (0....., 101..., 100...),
such reduced one remembers only (0....., 1.1..., 1.0...)

What decreases its asymptotic cost from 2.77544 bits/specimen to about 2.332746.

4. If we can save lg(n!) bits of information about the order of unordered elements, one can ask if we can do the same with graphs having unlabeled vertices?
I've just found such recent paper of Yongwook Choi: "Fast Algorithm for Optimal Compression of Graphs" about Erdos-Renyi graphs (for each two vertices, with p probability there is undirected edge between them):
http://www.siam.org/proceedings/anal..._005_choiy.pdf
So straightforward compression would use h(p)*n*(n-1)/2 bits for encoding the adjacency matrix, but we would like to subtract about lg(n!) bits about the order of vertices (neglecting automorphisms).

The paper uses quite a natural method for representing the graph (explained in a complex way):
- choose an order of vertices(!!!)
- assign length k-1 bit sequence to k-th vertex : 0 on i-th position if it's adjacent to i-th vertex, 1 otherwise (e.g. 010 if it's fourth and adjacent to first and third, not second).
- create binary tree from these sequences (fig. 4 in the paper)
- encode this tree - exactly like in my paper.

I see the first step extremely suspicious - choosing a different order should usually lead to a different tree and so a different encoding sequence. So this encoding seems to be far from being bijection from equivalence classes to encoding sequences, what is required for the optimality. If each of n! ordering would lead to a different tree/encoded sequence, we should completely loose the lg(n!) income ... does anybody understand why this algorithm works so well??

How it could be improved? The encoding algorithm is fine, but we need to make it independent from the ordering and essentially use this independence to get better probabilities.
One way to do it is sorting elements, like sorting the numbers and then encode the small nonnegative differences.
Sorting vertices of graph with unlabeled vertices is faaaar from simple - having polynomial algorithm for it would solve graph isomorphism problem in polynomial time ... but there are possible very near algorithms: for example to each vertex assign a vector, which k-th coordinate is the number of length k cycles from it ((M^k)_ii) and sort these vectors lexicographically ( [0804.3615] Combinatorial invariants for graph isomorphism problem ). However, there are really nasty graphs these invariants don't distinguish: so called strongly regular graphs...

The other problem is using this ordering to reduce the required amount of information ... what generally seems extremely difficult.
So let us start with simpler vertex ordering - just accordingly to their degree (k=2). It makes we know the expected probability distributions varying with depth of nodes: the deeper, the more go right. It would save some information, especially for sparse graphs.

Has anyone any idea how to do it in a really optimal way?

 Bookmarks
##### Bookmarks
 Posting Permissions
 You may not post new threads You may not post replies You may not post attachments You may not edit your posts BB code is On Smilies are On [IMG] code is On HTML code is Off Trackbacks are Off Pingbacks are Off Refbacks are On Terms of Use Agreement