multidim.covertree.CoverTree

class multidim.covertree.CoverTree(pointcloud, ratio=0.41421356237309503, exchange_teens=True, sort_orphans_by_mean=True)[source]

An efficient and convenient implementation of the “Cover Tree with Friends” algorithm. This implementation follows the notation and terminology of the paper [CDER1] as carefully as possible; they were written in concert.

A CoverTree is an Python iterator object [iter1] [iter2]. The __next__() and __getitem__() methods yield the CoverLevel with that index. The entire “friends” algorithm happens in multidim.covertree.CoverTree.__next__()

Parameters:
pointcloud : multidim.PointCloud

The original data from which to construct a cover tree. Note that the labeling/weighting/indexing system requires the use of multidim.PointCloud input, not merely a numpy.ndarray. However, CoverTree ignores all of the higher strata (edges, faces, and so on) of the multidim.PointCloud. Only the points in stratum[0] are used.

ratio : float

Ratio \(\theta\) to shrink radii by at each step. Must satisfy \(0<\theta<1\). Good values are 0.5 or ratio_Ag or ratio_Au. Default: ratio_Ag

exchange_teens : bool

Should teens be exchanged at each step, using Type-2 friends? Default: True

sort_orphans_by_mean : bool

Should orphans be re-ordered by their proximity to weighted mean of the labels? This is particularly useful for improving the cross-validation score of the multidim.models.CDER classifier. Disable for speed ordering of adults is irrelevant for your needs. Default: True

Yields:
:class:`multidim.covertree.CoverLevel`

From level 0 (one ball) until all points are separated. Each CoverLevel is cached once computed.

Notes

This section is excerpted and condensed from [CDER1]

Definition

Let \(X\) be a finite subset of \(\mathbb{R}^d\). The purpose of a cover tree is to build a filtration \(\emptyset \subset CL_0 \subset CL_1 \subset \cdots \subset CL_{\text{max}} = X\) by covering it with balls of smaller and smaller radius centered at points in the set. The points in \(CL_\ell\) are called the adults at level \(\ell\).

Specifically, a cover tree is a filtration of \(X\) with the following additional properties:

  • \(CL_0\) contains a single point, \(a_0\). (see _adult0)
  • There is a radius \(r_0\) (see _r0) such that \(X\) is contained in the ball \(B(a_0, r_0)\) of radius \(r_0\) around \(a_0\)
  • There is a real number \(0< \theta < 1\) (see ratio) such that, for every \(\ell\), the set \(X\) is a subset of \(\cup_{a_i \in CL_\ell} B(a_i, r_\ell)\) where \(r_\ell = r_0 \theta^\ell\)
  • For each \(\ell\), if \(a_i, a_j \in CL_\ell\), then | a_i - a_j| > r_ell. No two adults lie in the same ball.
  • For each \(\ell\), each point \(x \in X\) is assigned to a guardian \(a_i \in CL_\ell\) such that \(x\) lies in the ball \(B(a_i, r_\ell)\). We say \(x\) is a child of \(a_i\) at level \(\ell\). Each \(a_i \in CL_\ell\) is its own guardian and its own child.
  • There is a tree structure on the (level, adult) pairs of the filtration \((\ell, a_i)\), where the tree relation \((\ell, a_i) \to (\ell+1, a_k)\) holds if \(a_k\) was a child of \(a_i\) at level \(\ell\). We say \(a_k\) is a successor of \(a_i\), and \(a_i\) is a predecessor of \(a_k\). Note that \((\ell, a_i) \to (\ell+1, a_i)\) for all \(a_i \in CL_\ell\)

Extending the maturation/reproduction metaphor of adults, children, and guardians above, a child \(x\) with guardian \(a_i\) at level \(\ell\) is called a teen if \(\frac12 r_\ell < \|a_i - x\|\), and it is called a youngin if \(\|a_i - x\| \leq \frac12 r_\ell\). The point of this is that we may require the additional condition:

  • (Optional) On the previous condition, we can additionally require that each \(x\) is the child of the nearest adult, if it lies in the intersection of two or more balls of \(B(a_i, r_\ell)\). If two adults are equally distant, choose the one of the lowest index. This option is enforced by the exchange_teens flag.

When changing from level math:ell to level \(\ell+1\), the radius of each ball shrinks to \(r_{\ell+1} = \theta r_\ell\). Children farther than \(r_{\ell+1}\) from their guardians become orphans. We must decide whether these orphans should be adopted by other adults at level \(\ell+1\), or if the orphans should be emancipated as new adults at level \(\ell+1\). That is, the newly emancipated adults at level \(\ell+1\) comprise the cohort (see cohort) at level $ell+1$.

We say \(a_j \in CL_{\ell}\) is an elder of \(a_k \in CL_{\ell+1}\) if the distance \(\|a_j - a_k\|\) is sufficiently small that \(a_j\) could have been emancipated from \(a_k\) between levels \(\ell\) and \(\ell+1\). That is, if the tree structure were unknown, then elders of \(a_j$ are the possible predecessors. If :math:\) is its own predecessor (because it was already an adult in \(CL_{\ell}\), then the only elder of \(a_k\) is itself.

Example

Consider this point cloud in \(\mathbb{R}^2\)

\[X = \{(0,0.1),(1,2),(0,1),(0,0),(2,2),(2,2.2),(3,3),(1,1)\}\]

We index these points from 0 to 7 in the given order. We have the following filtration

\[\begin{split}&CL_0 = \{7\}\\ &CL_1 = \{3, 4, 6, 7\}\\ &CL_2 = \{1, 2, 3, 4, 6, 7\}\\ &CL_3 = \{1, 2, 3, 4, 6, 7\}\\ &CL_4 = \{1, 2, 3, 4, 5, 6, 7\}\\ &CL_5 = \{0, 1, 2, 3, 4, 5, 6, 7\}\\\end{split}\]

We have the following cover ball radii

\[\begin{split}&r_0 = 2\sqrt{2}\\ &r_1 = \sqrt{2}\\ &r_2 = \frac{\sqrt{2}}{2}\\ &r_3 = \frac{\sqrt{2}}{4}\\ &r_4= \frac{\sqrt{2}}{8}\\ &r_5 = \frac{\sqrt{2}}{16}\end{split}\]

Here we have \(a_0 = (1,1)\), \(r_0 = 2\sqrt{2}\), and \(\theta = 1/2\).

The Friends Algorithm

Our algorithm is based upon the concept of friends. To each adult there will be associated three types of friends. Types 1, 2, and 3 are used to build the CoverTree in typically linear time.

Let \(a_i \in CL_\ell\), that is, \(a_i\) is an adult at level \(\ell\). Define the following thresholds

\[\begin{split}T_1(\ell) &= (2 + \theta)r_l \\ T_2(\ell) &= (2 + 2\theta)r_l \\ T_3(\ell) &= \frac{2}{1 - \theta}r_l.\end{split}\]

It is elementary to show that \(T_1(l) < T_2(l) < T_3(l)\). Moreover, we have the recursion relation \(T_3(l) < T_3(l-1)\).

Each level of the filtreation and all of this associated data is stored in a CoverLevel object.

The algorithm works like this, using a “reproduction” metaphor:

  • Level 0 (see covertree[0] of type CoverLevel) has a single adult. All points are its children. Its only friends are itself.

  • Level \(\ell\) (see covertree[l] of type CoverLevel) has known adults, friends1, friends3, friends3, and children. We now compute level \(\ell+1.\) in __next__()

    1. Shrink the radius by a factor of \(\theta\). Some children become orphans.

    2. Orphans are adopted or become newly emanicpated adults. This uses \(T_1(\ell)\).

    3. If exhange_teens is True, then children who are teens are re-assigned to the closest possible adult. This uses \(T_2(\ell)\).

    1. Compute new friends3.
    2. Use new friends3 to compute new friends1, friends2, friends3.
  • Level \(\ell+1\) (see covertree[l+1] of type CoverLevel) has known adults, friends1, friends3, friends3, and children. We now compute level \(l+2\)

  • Stop when all points are adults.

Levels are evaluated lazily and cached. For example, if no levels have been computed, then covertree[3] will compute levels 0, 1, 2, and 3. Then covertree[5] will use those values for 0, 1, 2, 3 to compute 4 and 5.

References

[CDER1](1, 2, 3) Supervised Learning of Labeled Pointcloud Differences via Cover-Tree Entropy Reduction https://arxiv.org/abs/1702.07959
[CDER2]CDER, Learning with Friends https://www.ima.umn.edu/2016-2017/DSS9.6.16-5.30.17/26150
[iter1](1, 2) https://docs.python.org/3/library/stdtypes.html?highlight=iterator#iterator-types
[iter2](1, 2) https://wiki.python.org/moin/Iterator

Examples

>>> pc = PointCloud.from_multisample_multilabel(
...     [np.array([[0,0.1],[1,2],[0,1],[0,0],[2,2],[2,2.2],[3,3],[1,1]])], [None])
>>> ct = CoverTree(pc, ratio=0.5, sort_orphans_by_mean=False)
>>> cl=ct.next()
>>> list(cl.adults)
[7]
>>> pc.coords.values[7,:]
array([ 1.,  1.])
>>> cl
Level 0 using 1 adults at radius 2.8284271247...
>>> ct.next()
Level 1 using 2 adults at radius 1.4142135623...
>>> for cl in ct:
...     print(cl.exponent, list(cl.adults))
0 [7]
1 [7, 5]
2 [7, 5, 0, 1, 2, 6]
3 [7, 5, 0, 1, 2, 6]
4 [7, 5, 0, 1, 2, 6, 4]
5 [7, 5, 0, 1, 2, 6, 4, 3]
>>> ct.cohort
array([2, 2, 2, 5, 4, 1, 2, 0])
Attributes:
pointcloud : multidim.PointCloud

The original dataset.

ratio : numpy.float64

Ratio \(\theta\) by which to shrink the ball radius between levels.

_r0 : numpy.float64

The initial radius at level 0.

_adult0 : numpy.int64

The index of the original adult. Typically, this is the index of the point nearest the weighted mean of the PointCloud

_levels : collections.OrderedDict

An ordered dictionary to cache the levels computed so far, keyed by the index. Typically, a user would never access this directly. Insead, use covertree[i]

cohort : numpy.ndarray

An array of numpy.int64, which keeps track of the cohort (that is, the level in the filtration) of each point. If a point has not been born as an adult yet, the value is -1

level_pointer : int

Index of the currently referenced CoverLevel, for iteration purposes. Setting this is like using file.seek() on file objects. Usually, you don’t want to mess with it, but it is used internally in mutlidim.models.CDER for comparing entropy between levels.

N : int

The number of points in self.pointcloud

allpoints : numpy.ndarray

The raw NumPY array underlying self.pointcloud.

CoverTree.make_edges([min_distance, …]) Iterate over the edges between the points of the underlying PointCloud, where min_distance < length <= max_distance.
CoverTree.next() See __next__()
CoverTree.plot(canvas, **kwargs) Interactive plot of a CoverTree, with dynamic computation of levels.
CoverTree.plot_tree(canvas[, show_balls, …]) Plot the tree of a CoverTree.
CoverTree.reset() Go to level -1.
CoverTree.sparse_complex([level]) Make a sparse complex from this CoverTree, using the type-4 friends algorithm.