June 30, 2015

Matthieu Brucher

Audio Toolkit: Anatomy of a transient shaper

When I first about transient shaper, I was like “what’s the difference with a compressor? Is there one?”. And I tried to see how to get these transient without relying on the transient energy, with a relative power (ratio between the instant power and the mean power) filter, or its derivative, but nothing worked. Until someone explained that the gain was driven by the difference between a fast attack filtered power and a slower one. So here it goes.

First, let’s have a look on the filter graph:
Transient shaper pipelineTransient shaper pipeline

I’ve surrounded the specific transient shaper part in with a dotted line. This is the difference with a compressor/limiter/expander: the way the signal steered the gain computation is generated.

Let’s start from a kick. The fast envelope follower can then be generated (red curve) as well as the slow envelope follower (green curve). The difference is always positive (if the two follower have the same release time value), so we can use it to compute a gain through our usual GainCompressorFilter.
Depending on whether you want to increase the transient or attenuate it, the ratio will be below or higher than 1 (respectively), which is what is shown in the last two plots here:
Transient shaper plotTransient shaper plot

In the end, it’s all about the right algorithms. If you have a proper framework, you may already have everything you need for some filters. In the case of a transient shaper, I already had all the filters in my toolbox. Now I just need to make a plugin out of this simple pipeline!

The code for the last plots can be found on Github here: https://github.com/mbrucher/AudioTK/blob/master/Examples/dynamic/display_transient_shaper.py

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matt at June 30, 2015 07:52 AM

June 29, 2015

Wei Xue

GSoC Week 5

The week 5 began with a discussion with whether we should deprecate params. I fixed some bugs in checking functions, random number generator and one of covariance updating methods. In the following days, I completed the main functions of GaussianMixutre and all test cases, except AIC, BIC and sampling functions. The tests are some kind of challenging, sine the current implementation in the master branch contains very old test cases imported from Weiss's implementation which is never got improved. I simplified the test cases, and wrote more tests that are not covered by the current implementation, such as covariance estimation, ground truth parameter prediction, and other user-friendly warnings and errors.

Next week, I will begin to code BayesianGaussianMixture.

June 29, 2015 04:03 PM

June 26, 2015

Matthew Rocklin

Write Complex Parallel Algorithms

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We discuss the use of complex dask graphs for non-trivial algorithms. We show off an on-disk parallel SVD.

Most Parallel Computation is Simple

Most parallel workloads today are fairly trivial:

>>> import dask.bag as db
>>> b = db.from_s3('githubarchive-data', '2015-01-01-*.json.gz')
          .map(lambda d: d['type'] == 'PushEvent')

Graphs for these computations look like the following:

Embarrassingly parallel dask graph

This is great; these are simple problems to solve efficiently in parallel. Generally these simple computations occur at the beginning of our analyses.

Sophisticated Algorithms can be Complex

Later in our analyses we want more complex algorithms for statistics , machine learning, etc.. Often this stage fits comfortably in memory, so we don’t worry about parallelism and can use statsmodels or scikit-learn on the gigabyte result we’ve gleaned from terabytes of data.

However, if our reduced result is still large then we need to think about sophisticated parallel algorithms. This is fresh space with lots of exciting academic and software work.

Example: Parallel, Stable, Out-of-Core SVD

I’d like to show off work by Mariano Tepper, who is responsible for dask.array.linalg. In particular he has a couple of wonderful algorithms for the Singular Value Decomposition (SVD) (also strongly related to Principal Components Analysis (PCA).) Really I just want to show off this pretty graph.

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd(x)

Parallel SVD dask graph

This algorithm computes the exact SVD (up to numerical precision) of a large tall-and-skinny matrix in parallel in many small chunks. This allows it to operate out-of-core (from disk) and use multiple cores in parallel. At the bottom we see the construction of our trivial array of ones, followed by many calls to np.linalg.qr on each of the blocks. Then there is a lot of rearranging of various pieces as they are stacked, multiplied, and undergo more rounds of np.linalg.qr and np.linalg.svd. The resulting arrays are available in many chunks at the top and second-from-top rows.

The dask dict for one of these arrays, s, looks like the following:

>>> s.dask
{('x', 0, 0): (np.ones, (1000, 1000)),
 ('x', 1, 0): (np.ones, (1000, 1000)),
 ('x', 2, 0): (np.ones, (1000, 1000)),
 ('x', 3, 0): (np.ones, (1000, 1000)),
 ('x', 4, 0): (np.ones, (1000, 1000)),
 ('tsqr_2_QR_st1', 0, 0): (np.linalg.qr, ('x', 0, 0)),
 ('tsqr_2_QR_st1', 1, 0): (np.linalg.qr, ('x', 1, 0)),
 ('tsqr_2_QR_st1', 2, 0): (np.linalg.qr, ('x', 2, 0)),
 ('tsqr_2_QR_st1', 3, 0): (np.linalg.qr, ('x', 3, 0)),
 ('tsqr_2_QR_st1', 4, 0): (np.linalg.qr, ('x', 4, 0)),
 ('tsqr_2_R', 0, 0): (operator.getitem, ('tsqr_2_QR_st2', 0, 0), 1),
 ('tsqr_2_R_st1', 0, 0): (operator.getitem,('tsqr_2_QR_st1', 0, 0), 1),
 ('tsqr_2_R_st1', 1, 0): (operator.getitem, ('tsqr_2_QR_st1', 1, 0), 1),
 ('tsqr_2_R_st1', 2, 0): (operator.getitem, ('tsqr_2_QR_st1', 2, 0), 1),
 ('tsqr_2_R_st1', 3, 0): (operator.getitem, ('tsqr_2_QR_st1', 3, 0), 1),
 ('tsqr_2_R_st1', 4, 0): (operator.getitem, ('tsqr_2_QR_st1', 4, 0), 1),
 ('tsqr_2_R_st1_stacked', 0, 0): (np.vstack,
                                   [('tsqr_2_R_st1', 0, 0),
                                    ('tsqr_2_R_st1', 1, 0),
                                    ('tsqr_2_R_st1', 2, 0),
                                    ('tsqr_2_R_st1', 3, 0),
                                    ('tsqr_2_R_st1', 4, 0)])),
 ('tsqr_2_QR_st2', 0, 0): (np.linalg.qr, ('tsqr_2_R_st1_stacked', 0, 0)),
 ('tsqr_2_SVD_st2', 0, 0): (np.linalg.svd, ('tsqr_2_R', 0, 0)),
 ('tsqr_2_S', 0): (operator.getitem, ('tsqr_2_SVD_st2', 0, 0), 1)}

So to write complex parallel algorithms we write down dictionaries of tuples of functions.

The dask schedulers take care of executing this graph in parallel using multiple threads. Here is a profile result of a larger computation on a 30000x1000 array:

Low Barrier to Entry

Looking at this graph you may think “Wow, Mariano is awesome” and indeed he is. However, he is more an expert at linear algebra than at Python programming. Dask graphs (just dictionaries) are simple enough that a domain expert was able to look at them say “Yeah, I can do that” and write down the very complex algorithms associated to his domain, leaving the execution of those algorithms up to the dask schedulers.

You can see the source code that generates the above graphs on GitHub.

Approximate SVD dask graph

Randomized Parallel Out-of-Core SVD

A few weeks ago a genomics researcher asked for an approximate/randomized variant to SVD. Mariano had a solution up in a few days.

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd_compressed(x, k=100, n_power_iter=2)

I’ll omit the full dict for obvious space reasons.

Final Thoughts

Dask graphs let us express parallel algorithms with very little extra complexity. There are no special objects or frameworks to learn, just dictionaries of tuples of functions. This allows domain experts to write sophisticated algorithms without fancy code getting in their way.

June 26, 2015 12:00 AM

June 24, 2015

Titus Brown

A review of "Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees"

(This is a review of Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees, Solomon and Kingsford, 2015.)

In this paper, Solomon and Kingsford present Sequence Bloom Trees (SBTs). SBT provides an efficient method for indexing multiple sequencing datasets and finding in which datasets a query sequence is present.

The new method is based on using multiple Bloom filters and organizing them in a binary tree, where leaves represent specific datasets and internal nodes contain all the k-mers present in their subtrees. A query starts by breaking the sequence into a set of k-mers and checking if they are present in the node Bloom filter at a specific threshold. If yes then the query is repeated for children nodes, but if it isn't the subtree is pruned and search proceeds on other nodes. If all searches are pruned before reaching a leaf then the sequence is not present in any dataset. They prove the false positive rate for a k-mer can be quite higher than traditional applications of Bloom filters, since they are interested in finding if the whole set of k-mers is over a threshold. This leads to very small data structures that remain capable of approximating the correct answer.

Compared to alternative software (like SRA-BLAST or STAR) it has both decreased runtime and memory consumption, and it also can be used as a filter to make these tools faster.

Overall review

The paper is well written, clear, mostly expert in the area (but see below), and lays out the approach and tool well.

The approach is novel within bioinformatics, as far as we know. More, we think it's a tremendously important approach; it's by far the most succinct representation of large data sets we've seen (and Bloom filters are notoriously efficient), and it permits efficient indexing, storage of indices, and queries of indices.

A strange omission is the work that has been done by our group and others with Bloom filters. Pell et al., 2012 (pmid 22847406), showed that implicit De Bruijn graphs could be stored in Bloom filters in exactly the way the authors are doing here; work by Chikhi and Rizk, 2013 (pmid 24040893) implemented exact De Bruijn graphs efficiently using Bloom filters; and Salikhov et al, 2014 (pmid 24565280) further used Cascading Bloom filters. Our group has also used the median k-mer abundance (which, in a Bloom filter, equals median k-mer presence) to estimate read presence and coverage in a very similar way to Solomon and Kingsford (Brown et al., 2012, "digital normalization"). We also showed experimentally that this is very robust to high false positive rates (Zhang et al., 2014, pmid 25062443, buried in the back).

There are three points to make here --

  1. Previous work has been done connecting Bloom filters and k-mer storage, in ways that seem to be ignored by this paper; the authors should cite some of this literature. Given citation space limitations, this doesn't need to be exhaustive, but either Salikhov or Pell seems particularly relevant.
  2. The connection between Bloom filters and implicit De Bruijn graphs should be explicitly made in the paper, as it's a powerful theoretical connection.
  3. All of our previous result support the conclusions reached in this paper, and this paper makes the false-positive robustness argument much more strongly, which is a nice conclusion!


We have found that users are often very confused about how to pick the size of Bloom filters. My sense here is that the RRR compression means that very large Bloom filters will be stored efficiently, so you might as well start big, because there's no way to do progressive size increases on the Bloom filter; do the authors agree with that conclusion, or am I missing something?

One possible writing improvement is to add another level under the leaves in Supp Fig 1 to make it clear that traditional alignment or other alternatives are required, since SBT only finds if the query is present in the dataset (but not where). The speed comparisons in the paper could be qualified a bit more to make it clear that this is only for basic search, although some of us think it's already clear enough so it's advice, not a requested or required change.

However, there is a solid point to be made that (in our opinion) the true value of the SBT approach is not necessarily in speeding up the overall process (3.5x speedup) but in doing the search in very low memory across an index that can be distributed independently of the data.

page 16, Theorem 2 says the probability that ... is nearly 0 when FPR is << theta, fraction threshold. But next in the example, theta is 0.5 and FPR is also 0.5, so here the FPR is NOT << theta, as in Theorem 2. How to conclude that "by Theorem 2, we will be unlikely to observe > theta fraction of false positive kmers in the filter."?

Software and tool publication

Bioinformatics paper checklist (http://ivory.idyll.org/blog/blog-review-criteria-for-bioinfo.html):

The software is directly available for download: Yes, https://github.com/Kingsford-Group/bloomtree

The software license lets readers download and run it: The license is not specified; this needs to be fixed. But 'bloomtree' makes use of several GPL toolkits.

The software source code is available to readers: Yes, https://github.com/Kingsford-Group/bloomtree

We successfully downloaded and ran the software.

The data for replication is available for download: Yes, public data from SRA; it's listed on supp materials, but could be added to the tool site too.

The data format is either standard, or straightforward, or documented. Yes

Other comments:


  • we strongly recommend that a lab-independent URL be used as the official URL for the software (e.g. the github page, instead of the CMU page). Lab Web sites tend to fall out of date or otherwise decay.
  • One of the big drawbacks to Bloom filters is that they are fixed in size. Guidance on choosing Bloom filter size would be welcome. One way to do this is to use an efficient method to calculate cardinality, and khmer has a BSD- licensed implementation of the HyperLogLog cardinality counter that they'd be welcome to copy wholesale.


C. Titus Brown

Luiz Irber

Qingpeng Zhang

by C. Titus Brown, Luiz Irber, Qingpeng Zhang at June 24, 2015 10:00 PM

Thoughts on Sequence Bloom Trees

We just submitted our review of the paper Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees., by Brad Solomon and Carl Kingsford.

The paper outlines a fairly simple and straightforward way to query massive amounts of sequence data (5 TB of mRNAseq!) in very small disk (~70 GB) and memory (~under a GB), fairly quickly (~2.5 days).

The short version is that I think this is an incredibly powerful approach that I am now planning to build on for our own Moore DDD project.

The review was a lot of fun; it's right up our alley, and we've thought a lot about related (but somewhat distinct) issues. Here are some extended comments that I thought were probably not appropriate for the official review because they're somewhat forward looking.

First, we did the review in approved Open Science style. Since Brad and Carl did us the favor of using GitHub (source code) and bioRxiv (preprint), and I sign my reviews, there need be no mystery in this process. Therefore I am posting our review publicly with only minor edits. Moreover, I filed three issues on GitHub (#1, #2, and #4) and submitted two pull requests (#3 and #5), both of which were merged.

Second, because we work in this area and are very interested in it, I put together both a demo of their software (see 2015-sbt-demo) and also did a simple reimplementation in khmer (see 2015-khmer-sequence-bloom-trees) to make sure that their software worked, and that I thoroughly understood the basic issues involved.

Note that we are unable to use their implementation as licensed, as it is under the GPL and would contaminate our source tree, which is under BSD :(.

Third, I have a lot of suggestions and thoughts! For example,

  • The use of RRR compression is awesome and is something we should look into for khmer (dib-lab/khmer#1074).
  • We get a 20% performance increase from a simple optimization applied to our k-mer lookups that could apply to SBTs -- see just-merged dib-lab/khmer#862, by Camille Scott.
  • The authors might also be interested in making use of our HyperLogLog implementation for k-mer cardinality counting, which could help users choose the right size for their Bloom filters.
  • Streaming diginorm/semi-streaming in general (see Crossing the streams) could be a very useful pre-filter for building SBTs. My guess is that with k-mer prefiltering a la digital normalization, there would be no loss to sensitivity but a substantial decrease in memory requirements.
  • It would be really interesting to brainstorm about how far this can be taken. We have reasonably strong evidence & intuition that you can do straight abundance estimation directly off of counting Bloom filters, and it doesn't seem like a stretch to say, "hey, let's store EVERYTHING in an sequence Bloom tree analog and do comparative expression analysis that way!" We don't have an immediate use case for this ourselves, but I'm sure one will present itself...
  • Qingpeng Zhang and I immediately started talking about how to apply this to metagenomics, and the main worry is that the method seems to depend on low diversity of true k-mers. This can partly be mitigated by diginorm and/or semi-streaming k-mer abundance trimming, but ultimately things are going to scale at best with the true size of the De Bruijn graph. It will be interesting to see how this plays out.


by C. Titus Brown at June 24, 2015 10:00 PM

June 23, 2015

Matthieu Brucher

Announcement: Audio TK 0.6.0

The main changes for this release are first trials at modulated filters, C++11 usage (nullptr, override and final), and some API changes (the main process_impl function is now const).

Download link: ATK 0.6.0


* Added override and final keywords in virtual calls
* Changed the API so that process_impl is now const
* Exposed full_setup to the user (direct reset of the internal state, already called when changing sample rate)
* Added LinkWitz-Riley second order low and high path filters
* Fix resetting the internal state of all delays by using full_setup
* Added a CustomFIRFilter with Python wrapper

* Added time-varying IIR filters (variable frequency, coded as transposed direct form II)
* Added second order time varying filter implementations
* Added a RelativePowerFilter with Python wrappers
* Added a DerivativeFilter with Python wrappers
* Added Python wrappers for the InWavFilter
* Fixed some warnings during compilation

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matt at June 23, 2015 07:01 AM

Matthew Rocklin

Distributed Scheduling

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We evaluate dask graphs with a variety of schedulers and introduce a new distributed memory scheduler.

Dask.distributed is new and is not battle-tested. Use at your own risk and adjust expectations accordingly.

Evaluate dask graphs

Most dask users use the dask collections, Array, Bag, and DataFrame. These collections are convenient ways to produce dask graphs. A dask graph is a dictionary of tasks. A task is a tuple with a function and arguments.

The graph comprising a dask collection (like a dask.array) is available through its .dask attribute.

>>> import dask.array as da
>>> x = da.arange(15, chunks=(5,))  # 0..14 in three chunks of size five

>>> x.dask  # dask array holds the graph to create the full array
{('x', 0): (np.arange, 0, 5),
 ('x', 1): (np.arange, 5, 10),
 ('x', 2): (np.arange, 10, 15)}

Further operations on x create more complex graphs

>>> z = (x + 100).sum()
>>> z.dask
{('x', 0): (np.arange, 0, 5),
 ('x', 1): (np.arange, 5, 10),
 ('x', 2): (np.arange, 10, 15),
 ('y', 0): (add, ('x', 0), 100),
 ('y', 1): (add, ('x', 1), 100),
 ('y', 2): (add, ('x', 2), 100),
 ('z', 0): (np.sum, ('y', 0)),
 ('z', 1): (np.sum, ('y', 1)),
 ('z', 2): (np.sum, ('y', 2)),
 ('z',): (sum, [('z', 0), ('z', 1), ('z', 2)])}

Hand-made dask graphs

We can make dask graphs by hand without dask collections. This involves creating a dictionary of tuples of functions.

>>> def add(a, b):
...     return a + b

>>> # x = 1
>>> # y = 2
>>> # z = add(x, y)

>>> dsk = {'x': 1,
...        'y': 2,
...        'z': (add, 'x', 'y')}

We evaluate these graphs with one of the dask schedulers

>>> from dask.threaded import get
>>> get(dsk, 'z')   # Evaluate graph with multiple threads

>>> from dask.multiprocessing import get
>>> get(dsk, 'z')   # Evaluate graph with multiple processes

We separate the evaluation of the graphs from their construction.

Distributed Scheduling

The separation of graphs from evaluation allows us to create new schedulers. In particular there exists a scheduler that operates on multiple machines in parallel, communicating over ZeroMQ.

This system has a single centralized scheduler, several workers, and potentially several clients.

Clients send graphs to the central scheduler which farms out those tasks to workers and coordinates the execution of the graph. While the scheduler centralizes metadata, the workers themselves handle transfer of intermediate data in a peer-to-peer fashion. Once the graph completes the workers send data to the scheduler which passes it through to the appropriate user/client.


And so now we can execute our dask graphs in parallel across multiple machines.

$ ipython  # On your laptop                 $ ipython  # Remote Process #1:  Scheduler
>>> def add(a, b):                          >>> from dask.distributed import Scheduler
...     return a + b                        >>> s = Scheduler(port_to_workers='4444',
                                            ...               port_to_clients='5555',
>>> dsk = {'x': 1,                          ...               hostname='notebook')
...        'y': 2,
...        'z': (add, 'x', 'y')}            $ ipython  # Remote Process #2:  Worker
                                            >>> from dask.distributed import Worker
>>> from dask.threaded import get           >>> w = Worker('tcp://notebook:4444')
>>> get(dsk, 'z')  # use threads
3                                           $ ipython  # Remote Process #3:  Worker
                                            >>> from dask.distributed import Worker
                                            >>> w = Worker('tcp://notebook:4444')

>>> from dask.distributed import Client
>>> c = Client('tcp://notebook:5555')

>>> c.get(dsk, 'z') # use distributed network

Choose Your Scheduler

This graph is small. We didn’t need a distributed network of machines to compute it (a single thread would have been much faster) but this simple example can be easily extended to more important cases, including generic use with the dask collections (Array, Bag, DataFrame). You can control the scheduler with a keyword argument to any compute call.

>>> import dask.array as da
>>> x = da.random.normal(10, 0.1, size=(1000000000,), chunks=(1000000,))

>>> x.mean().compute(get=get)    # use threads
>>> x.mean().compute(get=c.get)  # use distributed network

Alternatively you can set the default scheduler in dask with dask.set_options

>>> import dask
>>> dask.set_options(get=c.get)  # use distributed scheduler by default

Known Limitations

We intentionally made the simplest and dumbest distributed scheduler we could think of. Because dask separates graphs from schedulers we can iterate on this problem many times; building better schedulers after learning what is important. This current scheduler learns from our single-memory system but is the first dask scheduler that has to think about distributed memory. As a result it has the following known limitations:

  1. It does not consider data locality. While linear chains of tasks will execute on the same machine we don’t think much about executing multi-input tasks on nodes where only some of the data is local.
  2. In particular, this scheduler isn’t optimized for data-local file-systems like HDFS. It’s still happy to read data from HDFS, but this results in unnecessary network communication. We’ve found that it’s great when paired with S3.
  3. This scheduler is new and hasn’t yet had its tires kicked. Vocal beta users are most welcome.
  4. We haven’t thought much about deployment. E.g. somehow you need to ssh into a bunch of machines and start up workers, then tear them down when you’re done. Dask.distributed can bootstrap off of an IPython Parallel cluster, and we’ve integrated it into anaconda-cluster but deployment remains a tough problem.

The dask.distributed module is available in the last release but I suggest using the development master branch. There will be another release in early July.

Further Information

Blake Griffith has been playing with dask.distributed and dask.bag together on data from http://githubarchive.org. He plans to write a blogpost to give a better demonstration of the use of dask.distributed on real world problems. Look for that post in the next week or two.

You can read more about the internal design of dask.distributed at the dask docs.


Special thanks to Min Regan-Kelley, John Jacobsen, Ben Zaitlen, and Hugo Shi for their advice on building distributed systems.

Also thanks to Blake Griffith for serving as original user/developer and for smoothing over the user experience.

June 23, 2015 12:00 AM

June 22, 2015

Titus Brown

We're throwing a Software Carpentry! And it's already full...

Yesterday morning, we announced a Software Carpentry workshop here at UC Davis, running July 6-7 -- see the Web site for more information. I'm organizing, and Easton White and Noam Ross are co-lead instructors. (This is the first workshop I'm running since we became an affiliate!)

I'd love it if you could all come!

...but it's already full-ish.

I announced the workshop on Monday morning via the dib-training mailing list. We opened up 30 seats to all comers, and reserved 30 seats for people from the UC Davis School of Veterinary Medicine, which is hosting the workshop (by employing me).

Within 12 hours, the 30 open seats were filled. Wow.

The VetMed seats will be opened to the waiting list next Monday, and (if no one else signs up ;) there is still plenty of room. So if you're in the area, take a look and sign yourself up!


by C. Titus Brown at June 22, 2015 10:00 PM

June 19, 2015

Titus Brown

Some personal perspectives on academic freedom and free speech

Some background: I'm a white, male, tenured faculty member at UC Davis, and a 3rd generation academic. I work in relatively uncontroversial areas of science (primarily bioinformatics & genomics) at a university that is about as protective of academic freedom as you can get these days. I also live in a country that is (at least formally and legally if not always in practice) more protective of free speech than most countries in the world. For these reasons, I have less to fear from expressing my opinions -- on my blog, on Twitter, or in person -- than virtually anyone, anywhere, ever has.

A week ago, I tweeted Dr. Alice Dreger's article, Wondering if I'm the Next Tim Hunt. I was surprised, frustrated, and upset by the response of a number of colleagues on Twitter, who essentially said "speech that I disagree with is not protected by academic freedom." (This is a bit of a paraphrase, but I think an accurate one; judge for yourselves.)

I must say that I don't really care about Dr. Tim Hunt per se, and that whole issue has been covered very well elsewhere. He made clearly sexist and harmful remarks and is not a credible role model. To the extent that I have anything useful to say, I am worried about the reported actions of UCL in this interview with Mary Collins.

What I am much more worried about is the degree to which academics seem oblivious to the issue of academic freedom. It takes a special kind of obliviousness and subjectivity to look at the history of science and argue that academic researchers should be restricted in the scope of their opinions by their employer. For more on the sordid history of academic freedom, see the wikipedia page.

But, you know what? I don't work on academic freedom, and I'm not a lawyer, so I can't comment on nuances of employment contracts vs teaching vs publication, speech vs action, etc. All I can do is tell you why I care about free speech and academic freedom, and what my personal experiences are in this area, and try to explain my perspective.

Communism and my family

In a very real sense, my brothers and sisters and I only exist because of practical limits on free speech.

My father left the United States after receiving his PhD because he'd briefly been a member of the Communist Party. From personal conversations with him, I believe he was keenly aware of the danger of staying in the US, and knew that his academic career would have been in danger had he remained here.

In England, he met his first wife with whom he had my three oldest siblings, all of whom were born in Europe. (He met my mother when he returned to Princeton.)

Ironically, he managed to run afoul of both the US Government (by being a former member of the Communist Party and leaving their reach) and the Communist Party (from which he was evicted for asking too many questions).

I grew up on stories of him being called into British government buildings to meet with US officials who wanted him to surrender his passport (which would have prevented him from traveling); when he wouldn't surrender his passport, the US officials tried to get the Brits to take it from him. That never happened; the Brit response was to ask for a letter from the US, which of course wasn't forthcoming because all of this was unofficial persecution.

I also grew up on stories of him being wined and dined by the Stasi when he was visiting his first wife's family in Eastern Europe, in attempts to recruit him. This, together with his habit of sending theoretical nuclear physics journals to colleagues in Russia, led to a frightening-in-hindsight visit by several serious men in dark sunglasses from the FBI in the early 80s (I was about 10).

Interestingly, despite having been very politically active early on, my father was completely apolitical during my life. In the last years of his life, I got some minimal insight into this from his recounting of the Peekskill riots, but I never got the whole story.

Academic freedom doesn't protect you from being sued personally

Some 20-odd years ago (I'm feeling old today) I mirrored a spam blacklist site on my employee account at Caltech. (This was back when the Internet was new enough that such things could be maintained somewhat manually. ;) One of the people on the spam blacklist got very upset and sent some very nasty e-mails threatening everyone and everything at Caltech with lawsuits unless we removed it.

The resulting conversation escalated all the way to the provost (who I was actually doing research for at the time - see below), and I had the awkward conversation where I was told:

  • we have no problems hosting this, if you make the following modifications to make it clear this isn't Caltech's official opinion;
  • we're not going to fire you or anything like that;
  • but we won't protect you yourself from libel litigation, so good luck!

Nothing ever happened, but it made the indelible point on me that academic freedom is a thin reed to clutch - people can still bring legal action against you for what you said, even if you face no blowback from your employer.

Climate studies and modeling

I worked for a few years on climate studies with Dr. Steven Koonin, back when he was provost at Caltech (and before he took up his position at BP). My scientific career exists in part because of the work I did with him. I regard him as a mentor, colleague, and friend.

In 2014, Dr. Koonin wrote an op-ed on climate change (here) that I think makes many good points. Knowing him personally, I trust his judgement; having worked (a small bit) in this area, and having a reasonable amount of experience in modeling, I am in agreement with many of his central points.

This measured response is a good example of true scientific debate in action. We need more of this, in general. (I'm a particular fan of Dr. Koonin's suggestion on model evaluation; he's a smart scientist.)

Several people privately told me that they thought Dr. Koonin was an idiot for writing this, and others told me it was our responsibility as scientists to toe the climate change line for fear of doing further damage to the environment. I disagree with both of these groups of people, even though I believe that climate change is anthropogenic and we need to do something about it. I think Dr. Koonin made some good points that needed to be made.

Blogging and intellectual community

About 2-3 times a year, I get a request to change something in a blog post. Very rarely is it because what I've said is wrong; it's usually because it makes someone uncomfortable or unhappy. As a matter of policy, I refuse to do so (plus my blog is under version control, and I'm certainly not going to rewrite my git history :).

(I don't have any problem with posting explicit corrections when I'm wrong, obviously.)

A key point is that I don't expect to be fired for anything I say in my blog posts. Completely apart from having an awful lot of privilege (white, male, tenure, supportive family, no health problems), there's an expectation that what I say on my blog is subject to academic freedom. I've never gotten any pushback from my employer and I don't expect to, no matter how critical I am of them.

Joe Pickrell makes a very good point that intellectual community is key to academia. How can we have robust discussion and without academic freedom? (Rebecca Schuman makes an excellent related point about adjuncts, job security and academic freedom, here, with which I greatly sympathize.)

Privilege, and free speech, and academic freedom

(I'm not a lawyer, so please correct me. This is my understanding.)

Free speech is a constitutional right in the US; as such it only applies to government action. If my employer is upset with my speech, they are free to fire me; Twitter is under no obligation to allow me to tweet whatever I want; etc.

Academic freedom is, essentially, free speech commuted to academic employees: basically, universities should not fire people for something they said. While I am still individually liable for what I say under the law of the country I'm in, my employer cannot fire me without some substantial process (if at all) for what I say.

There are a lot of tricky bits in there, though.

For example, when I wrote on Twitter, "academic ideal: I should be able to hold & defend ideas w/o fear of losing my job", I got a very important response from a colleague -- White men exercising their entitlement to this ideal seems to be at odds with marginalized people gaining the same privileges.

(Please read the rest of that Twitter commentary if you're at all interested in this!)

I don't have a sophisticated response to offer; as a tenured white guy whose research isn't in this area, I am only slowly learning about this area, and a large part of that learning is being open to colleagues who tell me about their experiences (latest horrific example, of many: Julie Libarkin, with whom I work on learning evaluation). For this reason I tend to simply stay quiet and do what I can to foster a welcoming environment. I certainly don't feel qualified to say anything intelligent on the specific question of marginalization.

I do have two tentative thoughts that I keep on coming back to, though, and I'd welcome feedback.

One thought is this: we can only have conversations about sexism and privilege and systemic oppression because of free speech, and, in the university, because discussions of these controversial topics are protected by academic freedom. I have colleagues and mentees who come from "free speech challenged" countries (I'm not being more specific in order to protect them), and the stories they tell me of government and institutional oppression are horrifying. With respect to one actual real-life example that happened to the family of a colleague, I can confirm that I would say virtually anything you want me to if you took my children, put them in a jail cell, and threatened them until I acquiesced. We are fairly far from that in the US (with national security and terrorism being one horrible counterexample), and I value that tremendously. I would hate to see that weakened even in the service of efforts that I believe in passionately.

My other thought is this: limits to academic freedom and free speech are and always have been a double edged sword. This is almost the definition of a "slippery slope" situation - it's very hard to enact precise limitations on free speech that don't have seriously unintended consequences. It's pretty easy to find pairs of examples to juxtapose -- consider gun rights vs animal rights. I bet relatively few people are sympathetic to both lawsuits on any grounds other than academic freedom! But most people will be sympathetic to at least one. How else to square this but academic freedom??

So inasmuch as I have anything to say, it's this: we should be careful what we wish for, because your well-intentioned limits on free speech and academic freedom today will be used used against you tomorrow. And if you don't agree that happens, you are taking an ahistorical position.

Concluding thoughts

There's a long and righteous history of defending the most disgusting and horrifying actions based on due process. For one example, Miranda rights rest on a despicable character, Ernesto Miranda, who was later convicted of some horrible crimes. Presumably most of my readers would agree that Miranda rights are a net win for the rights of the accused, but note that it was controversial -- for example, the Supreme Court decision was 5-4. (The wikipedia page is a very good read.)

So, ultimately, I don't think there's any conflict in arguing for due process or legal protections of free speech, academic freedom, or anything else, no matter how heinous the speech being protected is. And if you disagree, then I think you're not only wrong but dangerously so.

That having been said, I'm unsympathetic to people who want me to host their obnoxious speech. I can't see any reason why I, personally, am required to pay attention to what anyone else is saying. I don't have any reason to put up with (say) sexist speech within my lab, or on my blog. Nor do I have to engage with, pay attention to, or promote, those who have opinions I find to be silly or nonsensical. (One exception here - academic norms require me to engage with those opinions that bear on my own academic research.)


p.s. Respectful comments only, abiding by the Principle of Charity; others may be deleted without notice, and commenters may be banned. My blog, my rules. Read the above if you're confused :).

by C. Titus Brown at June 19, 2015 10:00 PM

Abraham Escalante

Scipy and the first few GSoC weeks

Hi all,

We're about three (and a half) weeks into the GSoC and it's been one crazy ride so far. Being my first experience working in OpenSource projects and not being much of an expert in statistics was challenging at first, but I think I might be getting the hang of it now.

First off, for those of you still wondering what I'm actually doing, here is an abridged version of the abstract from my proposal to the GSoC (or you can click here for the full proposal):

"scipy.stats is one of the largest and most heavily used modules in Scipy. [...] it must be ensured that the quality of this module is up to par and [..] there are still some milestones to be reached. [...] Milestones include a number of enhancements and [...] maintenance issues; most of the scope is already outlined and described by the community in the form of open issues or proposed enhancements."

So basically, the bulk of my project consists on getting to work on open issues for the StatisticsCleanup milestone within the statistics module of SciPy (a Python-based OpenSource library for scientific computing). I suppose this is an unusual approach for a GSoC project since it focuses on maintaining and streamlining an already stable module (in preparation for the release of SciPy 1.0), rather than adding a new module or a specific function within.

The unusual approach allows me to make several small contributions and it gives me a wide (although not as deep) scope, rather than a narrow one. This is precisely the reason why I chose it. I feel like I can benefit (and contribute) a lot more this way, while I get acquainted with the OpenSource way and it also helps me to find new personal interests (win-win).

However, there are also some nuances that may be uncommon. During the first few weeks I have discovered that my proposal did not account for the normal life-cycle of issues and PRs in scipy; my estimations we're too hopeful.

One of OpenSource's greatest strengths is the community getting involved in peer reviews; this allows a developer to "in the face of ambiguity, refuse the temptation to guess". If you didn't get that [spoiler alert] it was a reference to the zen of python (and if you're still reading this and your name is Hélène, I love you).

The problem with this is that even the smooth PRs can take much longer than one week to be merged because of the back and forth with feedback from the community and code update (if it's a controversial topic, discussions can take months). Originally, I had planned to work on four or five open issues a week, have the PRs merged and then continue with the next four or five issues for the next week but this was too naive so I have had to make some changes.

I spent the last week compiling a list of next steps for pretty much all of the open issues and I am now trying to work on as many as I can at a time, thus minimising the impact of waiting periods between feedback cycles for each PR. I can already feel the snowball effect it is having on the project and on my motivation. I am learning a lot more (and in less time) than before which was the whole idea behind doing the Summer of Code.

I will get back in touch soon. I feel like I have rambled on for too long, so I will stop and let you continue to be awesome and get on with your day.


by noreply@blogger.com (Abraham Escalante) at June 19, 2015 12:19 AM

June 18, 2015

Matthew Rocklin

Pandas Categoricals

tl;dr: Pandas Categoricals efficiently encode and dramatically improve performance on data with text categories

Disclaimer: Categoricals were created by the Pandas development team and not by me.

There is More to Speed Than Parallelism

I usually write about parallelism. As a result people ask me how to parallelize their slow computations. The answer is usually just use pandas in a better way

  • Q: How do I make my pandas code faster with parallelism?
  • A: You don’t need parallelism, you can use Pandas better

This is almost always simpler and more effective than using multiple cores or multiple machines. You should look towards parallelism only after you’ve made sane choices about storage format, compression, data representation, etc..

Today we’ll talk about how Pandas can represent categorical text data numerically. This is a cheap and underused trick to get an order of magnitude speedup on common queries.


Often our data includes text columns with many repeated elements. Examples:

  • Stock symbols – GOOG, APPL, MSFT, ...
  • Gender – Female, Male, ...
  • Experiment outcomes – Healthy, Sick, No Change, ...
  • States – California, Texas, New York, ...

We usually represent these as text. Pandas represents text with the object dtype which holds a normal Python string. This is a common culprit for slow code because object dtypes run at Python speeds, not at Pandas’ normal C speeds.

Pandas categoricals are a new and powerful feature that encodes categorical data numerically so that we can leverage Pandas’ fast C code on this kind of text data.

>>> # Example dataframe with names, balances, and genders as object dtypes
>>> df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'Danielle'],
...                    'balance': [100.0, 200.0, 300.0, 400.0],
...                    'gender': ['Female', 'Male', 'Male', 'Female']},
...                    columns=['name', 'balance', 'gender'])

>>> df.dtypes                           # Oh no!  Slow object dtypes!
name        object
balance    float64
gender      object
dtype: object

We can represent columns with many repeats, like gender, more efficiently by using categoricals. This stores our original data in two pieces

  • Original data

     Female, Male, Male, Female
  1. Index mapping each category to an integer

    Female: 0
    Male: 1
  2. Normal array of integers

    0, 1, 1, 0

This integer array is more compact and is now a normal C array. This allows for normal C-speeds on previously slow object dtype columns. Categorizing a column is easy:

In [5]: df['gender'] = df['gender'].astype('category')  # Categorize!

Lets look at the result

In [6]: df                          # DataFrame looks the same
       name  balance  gender
0     Alice      100  Female
1       Bob      200    Male
2   Charlie      300    Male
3  Danielle      400  Female

In [7]: df.dtypes                   # But dtypes have changed
name         object
balance     float64
gender     category
dtype: object

In [8]: df.gender                   # Note Categories at the bottom
0    Female
1      Male
2      Male
3    Female
Name: gender, dtype: category
Categories (2, object): [Female, Male]

In [9]: df.gender.cat.categories    # Category index
Out[9]: Index([u'Female', u'Male'], dtype='object')

In [10]: df.gender.cat.codes        # Numerical values
0    0
1    1
2    1
3    0
dtype: int8                         # Stored in single bytes!

Notice that we can store our genders much more compactly as single bytes. We can continue to add genders (there are more than just two) and Pandas will use new values (2, 3, …) as necessary.

Our dataframe looks and feels just like it did before. Pandas internals will smooth out the user experience so that you don’t notice that you’re actually using a compact array of integers.


Lets look at a slightly larger example to see the performance difference.

We take a small subset of the NYC Taxi dataset and group by medallion ID to find the taxi drivers who drove the longest distance during a certain period.

In [1]: import pandas as pd
In [2]: df = pd.read_csv('trip_data_1_00.csv')

In [3]: %time df.groupby(df.medallion).trip_distance.sum().sort(ascending=False,
CPU times: user 161 ms, sys: 0 ns, total: 161 ms
Wall time: 175 ms

1E76B5DCA3A19D03B0FB39BCF2A2F534    870.83
6945300E90C69061B463CCDA370DE5D6    832.91
4F4BEA1914E323156BE0B24EF8205B73    811.99
191115180C29B1E2AF8BE0FD0ABD138F    787.33
B83044D63E9421B76011917CE280C137    782.78
Name: trip_distance, dtype: float64

That took around 170ms. We categorize in about the same time.

In [4]: %time df['medallion'] = df['medallion'].astype('category')
CPU times: user 168 ms, sys: 12.1 ms, total: 180 ms
Wall time: 197 ms

Now that we have numerical categories our computaion runs 20ms, improving by about an order of magnitude.

In [5]: %time df.groupby(df.medallion).trip_distance.sum().sort(ascending=False,
CPU times: user 16.4 ms, sys: 3.89 ms, total: 20.3 ms
Wall time: 20.3 ms

1E76B5DCA3A19D03B0FB39BCF2A2F534    870.83
6945300E90C69061B463CCDA370DE5D6    832.91
4F4BEA1914E323156BE0B24EF8205B73    811.99
191115180C29B1E2AF8BE0FD0ABD138F    787.33
B83044D63E9421B76011917CE280C137    782.78
Name: trip_distance, dtype: float64

We see almost an order of magnitude speedup after we do the one-time-operation of replacing object dtypes with categories. Most other computations on this column will be similarly fast. Our memory use drops dramatically as well.


Pandas Categoricals efficiently encode repetitive text data. Categoricals are useful for data like stock symbols, gender, experiment outcomes, cities, states, etc.. Categoricals are easy to use and greatly improve performance on this data.

We have several options to increase performance when dealing with inconveniently large or slow data. Good choices in storage format, compression, column layout, and data representation can dramatically improve query times and memory use. Each of these choices is as important as parallelism but isn’t overly hyped and so is often overlooked.

Jeff Reback gave a nice talk on categoricals (and other featuress in Pandas) at PyData NYC 2014 and is giving another this weekend at PyData London.

June 18, 2015 12:00 AM

June 17, 2015

Wei Xue

GSoC Week 4: Progress Report

Updated in Jun 24.

Here is the task check-list.

  1. [x] Completes derivation report.
  2. [x] Adds new classes. One abstract class _BasesMixture. Three derived classes GaussianMixture, BayesianGaussianMixture, DirichletProcessGaussianMixture
  3. [ ] Decouples large functions, especially in DirichletProcessGaussianMixture and BayesianGaussianMixture
  4. [x] Removes numerical stability fixes for HMM. It seems that whenever there is a numerical issue, the code always adds 10*EPS in the computation. I think in some cases there is a better way to address the problem, such as normalization the extremely small variables earlier, or we just simply remove 10*EPS which is only for HMM.
  5. [ ] Writes updating functions for BayesianGaussianMixtureModel and DirichletProcessGaussianMixtureModel according to the report.
  6. [x] Provides methods that allow users to initialize the model with user-provided data
  7. [x] Corrects kmeans initialization. It is weird when using kmeans initialization, only means is initialized. The weights and covariances are initialized by averaging.
  8. [x] Writes several checking functions for the initialization data
  9. [x] Adjusts the verbose messages. When verbose>1, it display log-likelihood and time used in the same line of the message Iteration x
  10. [ ] Adjusts the time to compute log-likelihood. The code in the current master branch compute the log-likelihood of the model after E-step which is actually the score of the last iteration, and misses the score immediately after the initialization.
  11. [x] Simplify fit_predict
  12. [x] Adds warning for params!='wmc'
  13. [ ] Studies and contrasts the convergence of classical MLE / EM GMM with Bayesian GMM against the number of samples and the number of components
  14. [ ] Friendly warning and error messages, or automatically addressing if possible (e.g. random re-init of singular components)
  15. [ ] Examples that shows how models can over-fit by comparing likelihood on training and validation sets (normalized by the number of samples). For instance extend the BIC score example with a cross-validated likelihood plot
  16. [ ] Testing on 1-D dimensions
  17. [ ] Testing on Degenerating cases
  18. [ ] AIC, BIC for VBGMM DPGMM
  19. [ ] Old faithful geyser data set
  20. [optional] add a partial_fit function for incremental / out-of-core fitting of (classical) GMM, for instance http://arxiv.org/abs/0712.4273
  21. [optional] ledoit_wolf covariance estimation

The most important progress I have done is the derivation report which include the updating functions, log-probability, and predictive distribution for all three models, and the implementation of the base class. Compared with the current scikit-learn math derivation documents, my report is consistent to PRML. It clearly depicts the updating functions of three models share a lot of patterns. We could abstract common functions into the abstract base class _MixtureBase. The three models could inherit it and override the updating methods.

Next week I will finish the GaussianMixture model with necessary testing functions.

June 17, 2015 08:20 PM

June 16, 2015

Matthieu Brucher

Audio Toolkit: Anatomy of a middle-side compressor

Sometimes images are worth a thousand words, so let’s look at some pictures of a middle-side compressor behavior.

A middle-side compressor like ATKStereoCompressor can work in a middle-side workflow. This means that the stereo signal is split in a center/middle channel and a side (L-R) channel. Then each channel is processed through the compressor independently and then recreated after:

Stereo compressor pipelineStereo compressor pipeline

So let’s take a stereo signal (I won’t extract the middle/side channels, it is easy to get them from AudioTK):
Stereo signal inputStereo signal input

From this, it is easy to generate the RMS power for each channel:
Middle-side power signals (RMS)Middle-side power signals (RMS)

Now, after an attack-release filter, we can apply the gain stage with the desired ratio, knee…
Middle-side gainMiddle-side gain
The gain processed for the two channels are similar, following the same trend, but with vastly different features.

Finally, we can apply the gain on the middle and side channels before regenerating the stereo channels and get the following result:
Stereo output signalStereo output signal

Of course, changing attack/release values will also change the shape of the signals, as well as the ratio: with a higher ratio on the side, the signal will sound more like a mono signal, whereas a higher ratio on the middle, the stereo image will be broadened.

The code to generate these images for any stereo signal is available on github.

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matt at June 16, 2015 07:01 AM

June 15, 2015

Wei Xue

GSoC Week 3

The week 3 has a very exciting start. I finished the derivation of DPGMM, as well as the lower bound and the predictive probability for each model.

The difference between my derivation and the current document is that the current models assume a simpler approximation. The model defined in PRML is more accurate and provides more knobs. The two approximations both appear in the literature. Maybe we should do some experiments to decide which one is better.

With regards to the new names of DPGMM and VBGMM, I think these two names are not suitable, just like someone calls SVM as SMO. Actually, the models are Bayesian GMM, Dirichlet Process Bayesian GMM (DPGMM is often used) respectively. Both of them are solved by variational inference. In other words, VBGMM is not a good name. The new names, I think, should have the meaning of 'Bayesian GMM solved by VB', 'DP(B)GMM solved by VB'.

I also took a close look at the code base. The code is not maintained well. The problem I am going to address is as follows.

  • decouple some large functions, such as _fit
  • use abstract class and inheritance to reduce code redundancy
  • numerical stability. It seems that whenever there is a numerical issue. The code always like to add EPS. I think in some place there is a better way to address the problem, such as normalization the extremely small variables earlier.
  • write updating functions for BayesianGaussianMixtureModel and DirichletProcessGaussianMixtureModel
  • provide methods that allow users to initialize the model before fit
  • correct kmeans initialization. It is weird when using kmean initialization, only means is initialized. The weights and covariances are initialized by averaging.
  • write several checking functions for the initialization data
  • [optional] add a partial_fit function for incremental / out-of-core fitting of (classical) GMM, for instance http://arxiv.org/abs/0712.4273
  • [optional] ledoit_wolf covariance estimation

The last days of this week I implemented the structure of new classes. _MixtureModelBase, GaussianMixtureModel, BayesianMixtureModel, DirichletProcessMixtureModel. It provides us a big picture of the classes I am going to implement. I am looking forward the feedback.

June 15, 2015 01:00 AM

June 13, 2015

Jake Vanderplas

Fast Lomb-Scargle Periodograms in Python

Image source: astroML. Source code here

The Lomb-Scargle periodogram (named for Lomb (1976) and Scargle (1982)) is a classic method for finding periodicity in irregularly-sampled data. It is in many ways analogous to the more familiar Fourier Power Spectral Density (PSD) often used for detecting periodicity in regularly-sampled data.

Despite the importance of this method, until recently there have not been any (in my opinion) solid implementations of the algorithm available for easy use in Python. That has changed with the introduction of the gatspy package, which I recently released. In this post, I will compare several available Python implementations of the Lomb-Scargle periodogram, and discuss some of the considerations required when using it to analyze data.

To cut to the chase, I'd recommend using the gatspy package for Lomb-Scargle periodograms in Python, and particularly its gatspy.periodic.LombScargleFast algorithm which implements an efficient pure-Python version of Press & Rybicki's \(O[N\log N]\) periodogram. Below, I'll dive into the reasons for this recommendation.

Example: Lomb-Scargle on Variable Stars

As an motivation, let's briefly consider some data from my own field: observations of an RR Lyrae-type variable star. RR Lyrae are small stars – about 50% the mass of our sun – which pulsate with a regular period on order half a day. Their relatively consistent peak intrinsic brightness allows for an accurate estimation of their distance from the sun, and thus they are important for studies such as understanding the substructure of the Milky Way galaxy. Because of this and other similar applications, detecting the telltale periodic variation of RR Lyrae stars within noisy data is an important statistical task for astronomers.

Here we will quickly demonstrate what this looks like in practice, using tools from the astroML package to download some data, and tools from the gatspy package to detect the periodicity.

We'll start with some typical Python import statements:

In [1]:
# Do preliminary imports and notebook setup
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# use seaborn for plot styles
import seaborn; seaborn.set()

Now we'll download some data from the LINEAR dataset, using tools in astroML. We'll plot the data to see what we're working with:

In [2]:
from astroML.datasets import fetch_LINEAR_sample
LINEAR_data = fetch_LINEAR_sample()
star_id = 10040133
t, mag, dmag = LINEAR_data.get_light_curve(star_id).T

fig, ax = plt.subplots()
ax.errorbar(t, mag, dmag, fmt='.k', ecolor='gray')
ax.set(xlabel='Time (days)', ylabel='magitude',
       title='LINEAR object {0}'.format(star_id))

This data has around 250 observations spread across about 2000 days, and we're hoping to detect a period of order 0.5 days. If the series were regularly-sampled, we'd be far above the Nyquist limit and all hope would be lost. Fortunately for astronomers, the assumptions behind the Nyquist sampling limit do not hold for irregular sampling rates, and we can proceed with no problem.

Let's start by computing and plotting the Lomb-Scargle Periodogram for this data, using tools from gatspy:

In [3]:
from gatspy.periodic import LombScargleFast
model = LombScargleFast().fit(t, mag, dmag)
periods, power = model.periodogram_auto(nyquist_factor=100)

fig, ax = plt.subplots()
ax.plot(periods, power)
ax.set(xlim=(0.2, 1.4), ylim=(0, 0.8),
       xlabel='period (days)',
       ylabel='Lomb-Scargle Power');

The periodogram gives a measure of periodic content as a function of period; we see here a strong peak at around 0.61 days. Other lower peaks are due to some combination of higher-order harmonics in the data and effects of the irregular survey window. While we could find this maximum manually from the above grid, gatspy provides a better way: a built-in two-stage grid-search that accurately determines the best period in a specified range:

In [4]:
# set range and find period
model.optimizer.period_range=(0.2, 1.4)
period = model.best_period
print("period = {0}".format(period))
Finding optimal frequency:
 - Estimated peak width = 0.0032
 - Using 5 steps per peak; omega_step = 0.00064
 - User-specified period range:  0.2 to 1.4
 - Computing periods at 42104 steps
Zooming-in on 5 candidate peaks:
 - Computing periods at 1000 steps
period = 0.6105387801103276

We see that the optimizer determined that it needed a grid of over 40,000 points to adequately cover the frequency grid (more on this below), and in the end arrived at a best period of 0.6105 days. Given this detected period, we can fold the input data and over-plot a best-fit empirical RR Lyrae template to see the fit:

In [5]:
# Compute phases of the obsevations
phase = (t / period) % 1

# Compute best-fit RR Lyrae template
from gatspy.periodic import RRLyraeTemplateModeler
model = RRLyraeTemplateModeler('r').fit(t, mag, dmag)
phase_fit = np.linspace(0, 1, 1000)
mag_fit = model.predict(period * phase_fit, period=period)

# Plot the phased data & model
fig, ax = plt.subplots()
ax.errorbar(phase, mag, dmag, fmt='.k', ecolor='gray', alpha=0.5)
ax.plot(phase_fit, mag_fit, '-k')
ax.set(xlabel='Phase', ylabel='magitude')

This very close template fit gives a strong indication that the star in question is an RR Lyrae.

Computational Considerations for Lomb-Scargle

The Lomb-Scargle periodogram involves the computation of a power \(P(\omega)\) at a set of frequencies \(\omega_i\). For data \(\{y_k\}\) pre-centered such that \(\sum_k y_k = 0\), the expression for the power is:

\[ P(\omega) \propto \frac{\left[\sum_k y_k \cos\omega(t_k - \tau)\right]^2} {\sum_k \cos^2\omega(t_k - \tau)} + \frac{\left[\sum_k y_k \sin\omega(t_k - \tau)\right]^2} {\sum_k \sin^2\omega(t_k - \tau)} \]

where \(\tau\) is an easily computed time-offset which orthogonalizes the model and makes \(P(\omega)\) independent of a translation in \(t\).

Rather than get lost in the math, I want to emphasize the key feature of this expression: for any frequency \(\omega\), the power is an \(O[N]\) computation involving simple trigonometric sums over the data, where \(N\) is the number of observed data poitns. The main computational question then becomes: how many frequencies must you compute? In my experience, the most common mistake people make when doing this sort of periodic analysis is not thinking hard enough about the frequency grid. It turns out that the grid-spacing question is very important. If you choose too fine a grid, you do much more computation than is required. Worse, if you choose too coarse a grid, the periodogram peak may fall between grid points and you'll miss it entirely!

Let's think about the required frequency range and frequency spacing for Lomb-Scargle.

Frequency spacing

First we'll choose the spacing of the frequency grid. If you're asking about a candidate frequency \(f\), then data with range \(T = t_{max} - t_{min}\) contains \(T \cdot f\) complete cycles. If our error in frequency is \(\delta f\), then \(T\cdot\delta f\) is the error in number of cycles between the endpoints of the data. It's clear that this error must not be a significant fraction of a cycle, or the fit could be drastically affected. This leads to an approximate grid-spacing criterion:

\[ T\cdot\delta f \ll 1 \]

Commonly, we'll choose some oversampling factor (say, 5) and use \(\delta f = (5T)^{-1}\) as our frequency grid spacing.

Frequency limits

Next, we need to choose the upper and lower limits of the frequency grid. On the low end, \(f=0\) is suitable, but causes some numerical problems – we'll go one step away and use \(\delta f\) as our minimum frequency. But on the high end, we need to make a choice: what's the highest frequency we'd trust our data to be sensitive to? At this point, many people are tempted to mis-apply the Nyquist-Shannon sampling theorem, and choose some version of the Nyquist limit for the data (based on, say, the minimum or mean spacing between observations). But this is entirely wrong! The Nyquist frequency is derived from special properties of regularly-sampled data, and does not apply – even approximately – to irregularly-sampled time-series. In fact, as we saw above, irregularly-sampled data can be sensitive to much, much higher frequencies than even the minimum spacing between observations. With this in mind, the upper limit for frequencies should be determined based on what kind of signal you are looking for.

Still, a common (if dubious) rule-of-thumb is that the high frequency is some multiple of what Press & Rybicki call the "average" Nyquist frequency,

\[ \hat{f}_{Ny} = \frac{N}{2T} \]

This means that the "typical" number of frequencies you'll need is

\[ N_{freq} \sim O\left[\frac{\hat{f}_{Ny}}{\delta f}\right] \sim O\left[\frac{N/(2T)}{1/T}\right] \sim O[N] \]

That is, the number of frequencies to search will scale with the number of data points!

Computational Complexity

From the above considerations, we see that the determination of the optimal Lomb-Scargle period within \(N\) points requires computing an \(O[N]\) expression for power across \(O[N]\) grid points; that is, Lomb-Scargle is naively an \(O[N^2]\) algorithm.

This computational complexity can be improved in one of several ways. Most notably, in a 1989 paper, Press and Rybicki proposed a clever method whereby a Fast Fourier Transform is used on a grid extirpolated from the original data, such that this naively \(O[N^2]\) problem can be solved in \(O[N\log N]\) time. The broad idea is that when you compute sums of sines and cosines for one frequency, this gives you some amount of information about those sums computed at another frequency, and by carefully using all information across a frequency grid, you can significantly reduce the number of required operations.

Thus the fundamental divide between Lomb-Scargle implementations is whether they use the naive \(O[N^2]\) algorithm or the \(O[N\log N]\) algorithm of Press & Rybicki and other similar approaches.

Lomb-Scargle Algorithms in Python

Now we get to the meat of this post: Lomb-Scargle implementations written in Python. If you search this on Google, you'll currently find links to several available implementations. Here I'm going to delve into and compare the following four implementations:

  • scipy.signal.lombscargle, an \(O[N^2]\) implementation from SciPy.
  • astroML.time_series.lomb_scargle, an \(O[N^2]\) implementation from astroML.
  • gatspy.periodic.LombScargle, an \(O[N^2]\) implementation from gatspy.
  • gatspy.periodic.LombScargleFast, an \(O[N\log N]\) implementation, also from gatspy.

Let's see some examples of the above tools:


The SciPy Lomb-Scargle periodogram is a C implementation of the naive \(O[N^2]\) algorithm. The algorithm cannot account for noise in the data, and has some other quirks as well:

  • it requires you to center your data (by subtracting the mean) before computing the periodogram. If you do not, the results will be garbage.
  • it computes the unnormalized periodogram, which can be normalized manually as we'll see below.
  • it takes angular frequencies as the argument.

Let's use scipy's algorithm to plot the periodogram of the data shown above. Note that the results will not be identical, because this algorithm ignores the noise in the data and doesn't fit for the data mean.

Against the above recommendations, we'll choose a simple regular grid in period for the plot:

In [6]:
from scipy.signal import lombscargle

# Choose a period grid
periods = np.linspace(0.2, 1.4, 4000)
ang_freqs = 2 * np.pi / periods

# compute the (unnormalized) periodogram
# note pre-centering of y values!
power = lombscargle(t, mag - mag.mean(), ang_freqs)

# normalize the power
N = len(t)
power *= 2 / (N * mag.std() ** 2)

# plot the results
fig, ax = plt.subplots()
ax.plot(periods, power)
ax.set(ylim=(0, 0.8), xlabel='period (days)',
       ylabel='Lomb-Scargle Power');

Comparing to the first periodogram plot, we see that becuase our period grid here is too coarse at low frequencies, some of the peak structure is missed by this visualization. Consider this a warning against arbitrarily choosing a period gridding!


AstroML has two \(O[N^2]\) implementations of Lomb-Scargle: one in astroML and one in astroML_addons, which is a collection of C extensions which replace slower functionality in the pure-python astroML package. In order to use the faster version, make sure you install both packages; e.g.

$pip install astroML$ pip install astroML_addons

Some important features of astroML's Lomb Scargle periodogram:

  • unlike scipy, it uses an extended periodogram model which can correctly account for uncorrelated Gaussian measurement error.
  • like scipy, it takes angular frequencies as its argument.
  • unlike scipy, it implements a floating mean periodogram, meaning that the data centering required for scipy is not required here, but it goes beyond simple centering: the mean of the data is fit as part of the model, which has advantages in many real-world scenarios. To directly compare to scipy's standard Lomb Scargle pass generalized=False.

Let's repeat the above plot with this periodogram:

In [7]:
from astroML.time_series import lomb_scargle
power = lomb_scargle(t, mag, dmag, ang_freqs)

# plot the results
fig, ax = plt.subplots()
ax.plot(periods, power)
ax.set(ylim=(0, 0.8), xlabel='period (days)',
       ylabel='Lomb-Scargle Power');


Gatspy's basic Lomb-Scargle algorithm is an \(O[N^2]\) implementation, but is implemented differently than either of the above versions. It uses a direct linear algebra approach which carries some additional computational and memory overhead. The reason for this approach is that it naturally accommodates several extensions to the periodogram, including floating mean, multiple terms, regularization, and multi-band models (more details in VanderPlas & Ivezic (2015), the paper that inspired gatspy).

Gatspy is a pure python package, and thus installation is easy and requires no compilation of C or Fortran code:

$ pip install gatspy

Some important features of this implementation:

  • like astroML, it uses an extended periodogram model which correctly accounts for uncorrelated Gaussian measurement error.
  • unlike astroML, it takes periods as its argument.
  • like astroML, it uses a floating mean model by default. To compare directly to scipy's non-floating-mean model, set fit_offset=False.
  • it has an API inspired by scikit-learn, where the model itself is a class instance, the model is applied to data with a fit() method, and the periodogram is computed via a score() method.

Let's repeat the above periodogram using this tool:

In [8]:
from gatspy.periodic import LombScargle

model = LombScargle(fit_offset=True).fit(t, mag, dmag)
power = model.score(periods)

# plot the results
fig, ax = plt.subplots()
ax.plot(periods, power)
ax.set(ylim=(0, 0.8), xlabel='period (days)',
       ylabel='Lomb-Scargle Power');


Gatspy's fast Lomb-Scargle is an \(O[N\log N]\) algorithm built on a pure Python/numpy implementation of the Press & Rybicki FFT/extirpolation method. Note that a requirement of this fast algorithm is that it be computed on a regular grid of frequencies (not periods), and so to attain this performance it provides the score_frequency_grid() method which takes 3 arguments: the minimum frequency f0, the frequency spacing df, and the number of grid points N.

Some features of the model

  • like astroML, it uses an extended periodogram model which correctly accounts for uncorrelated Gaussian measurement error.
  • it takes a regular frequency grid as its argument for the fast computation; note that the score() function itself falls back on the slower LombScargle approach above.
  • like astroML, it uses a floating mean model by default. To compare directly to scipy, set fit_offset=False.
  • it has an identical API to the LombScargle object above.

Let's take a look at computing the periodogram:

In [9]:
from gatspy.periodic import LombScargleFast

fmin = 1. / periods.max()
fmax = 1. / periods.min()
N = 10000
df = (fmax - fmin) / N

model = LombScargleFast().fit(t, mag, dmag)
power = model.score_frequency_grid(fmin, df, N)
freqs = fmin + df * np.arange(N)

# plot the results
fig, ax = plt.subplots()
ax.plot(1. / freqs, power)
ax.set(ylim=(0, 0.8), xlabel='period (days)',
       ylabel='Lomb-Scargle Power');

You'll notice here that this approach shows a lot more high-frequency peaks than any of the above versions. This is not because it is computing a different model; it is because we are using a finer frequency grid which does not miss these peaks. The above versions, with a regular grid of 4000 periods miss these important features, and give the user absolutely no warning that these features are missed! Keep this in mind as you choose grid parameters while following the above discussion.

If you want to make sure you're using a sufficient grid, you can use the periodogram_auto() method of LombScargleFast, which computes a sufficient frequency grid for you using the rules-of-thumb discussed in the previous section:

In [10]:
model = LombScargleFast().fit(t, mag, dmag)

period, power = model.periodogram_auto(nyquist_factor=200)

print("period range: ({0}, {1})".format(period.min(), period.max()))
print("number of periods: {0}".format(len(period)))
period range: (0.0764511670428014, 9823.97496499998)
number of periods: 128500

The model decided that we needed over 100,000 periods, between about 0.1 days (which was tuned by the nyquist_factor argument) and about 10,000 days (which is derived from the time-span of the data). Plotting the results as above, we see a similar periodogram:

In [11]:
# plot the results
fig, ax = plt.subplots()
ax.plot(period, power)
ax.set(xlim=(0.2, 1.4), ylim=(0, 1.0),
       xlabel='period (days)',
       ylabel='Lomb-Scargle Power');

The LombScargleFast algorithm computes these \(10^5\) periodogram steps very quickly; I wouldn't suggest any of the other methods with a grid of this size!

Benchmarking Lomb-Scargle Implementations

As a final piece of the picture, let's compare the execution speed of the four approaches. We can do this with IPython's %timeit magic function using the following script. Note that this script will take several minutes to run, as it automatically does multiple passes of each benchmark to minimize system timing variation. For efficiency, we cut-off the slower algorithms at high \(N\):

In [12]:
from scipy.signal import lombscargle as ls_scipy
from astroML.time_series import lomb_scargle as ls_astroML

def create_data(N, rseed=0, period=0.61):
    """Create noisy data"""
    rng = np.random.RandomState(rseed)
    t = 52000 + 2000 * rng.rand(N)
    dmag = 0.1 * (1 + rng.rand(N))
    mag = 15 + 0.6 * np.sin(2 * np.pi * t / period) + dmag * rng.randn(N)
    return t, mag, dmag

def compute_frequency_grid(t, oversampling=2):
    """Compute the optimal frequency grid (**not** angular frequencies)"""
    T = t.max() - t.min()
    N = len(t)
    df = 1. / (oversampling * T)
    fmax = N / (2 * T)
    return np.arange(df, fmax, df)

Nrange = 2 ** np.arange(2, 17)
t_scipy = []
t_astroML = []
t_gatspy1 = []
t_gatspy2 = []

for N in Nrange:
    t, mag, dmag = create_data(N)
    freqs = compute_frequency_grid(t)
    periods = 1 / freqs
    ang_freqs = 2 * np.pi * freqs
    f0, df, Nf = freqs[0], freqs[1] - freqs[0], len(freqs)
    # Don't compute the slow algorithms at very high N
    if N < 2 ** 15:
        t1 = %timeit -oq ls_scipy(t, mag - mag.mean(), ang_freqs)
        t2 = %timeit -oq ls_astroML(t, mag, dmag, ang_freqs)
        t3 = %timeit -oq LombScargle().fit(t, mag, dmag).score_frequency_grid(f0, df, Nf)
    t4 = %timeit -oq LombScargleFast().fit(t, mag, dmag).score_frequency_grid(f0, df, Nf)

When these timings are finished, we can plot the results to get an idea of how the algorithms compare:

In [13]:
fig = plt.figure()
ax = fig.add_subplot(111, xscale='log', yscale='log')
ax.plot(Nrange, t_scipy, label='scipy: lombscargle')
ax.plot(Nrange, t_astroML, label='astroML: lomb_scargle')
ax.plot(Nrange, t_gatspy1, label='gatspy: LombScargle')
ax.plot(Nrange, t_gatspy2, label='gatspy: LombScargleFast')
ax.set(xlabel='N', ylabel='time (seconds)',
       title='Comparison of Lomb-Scargle Implementations')
ax.legend(loc='upper left');

Each model has a characteristic performance curve:

  • The scipy and astroML algorithms show similar behavior: fast \(O[1]\) scaling at the small-\(N\) limit, and clear \(O[N^2]\) scaling at the large-\(N\) limit. SciPy is slightly faster, primarily due to the fact that it computes the simpler noiseless non-floating-mean model.
  • Gatspy's LombScargle also becomes \(O[N^2]\) at large \(N\), but is dominated at small \(N\) by an \(O[N]\) contribution which comes from allocating & building the matrices associated with its linear algebraic approach. As \(N\) grows larger than \(\sim 10^4\), however, gatspy's model begins to beat the performance of the other two \(O[N^2]\) algorithms.
  • Gatspy's LombScargleFast has an upfront \(O[1]\) cost that makes it slower than other approaches at small \(N\), but as \(N\) grows its \(O[N\log N]\) scaling means it dominates the performance of the other approaches by orders of magnitude.

If you'd like to push the speed of the computation even further, there may be some options available. For example, the pynfftls package implements an \(O[N\log N]\) Lomb-Scargle based on the NFFT algorithm, which is similar to the NUFFT that I discussed in a previous post. The pynfftls installation depends on prior installations of the NFFT and FFTW libraries. These libraries are best-in-class implementations of their respective algorithms, and from my past experience with them, I'd expect pynfftls to be around a factor of 10 faster than LombScargleFast with the same \(O[N\log N]\) scaling.

I should mention that I briefly tried installing pynfftls for this post, but ran into difficulties with linking the source to the appropriate C headers and library/shared object files. No doubt with a couple hours of tinkering it could be done, but in a conda world I've found my threshold of tolerance for such installation headaches has gone way down. Package developers take note: in most situations, ease of installation is easily worth a factor of a few in runtime performance. If any readers want to tackle the comparison between LombScargleFast and pynfftls, I'd be intrested to learn whether my factor-of-ten intuition is correct!


If there's anything I want you to take from the above discussion, it's these three points:

  • Naive application of Nyquist-style limits to irregularly-sampled data is 100% wrong. Don't be the next person to make this mistake in the published literature! I've been meaning to write a full rant/post on this subject for a while. Perhaps I will someday.
  • Selection of period/frequency grids for Lomb-Scargle analysis should not be taken lightly. It's very easy to inadvertently use too coarse of a grid, and entirely miss important periodogram peaks!
  • Use gatspy.periodic.LombScargleFast if you want any easy-to-install means of computing a fast, \(O[N\log N]\) Lomb-Scargle periodogram in Python.

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

by Jake Vanderplas at June 13, 2015 09:00 PM

June 11, 2015

Titus Brown

Notes from &quot;How to grow a sustainable software development process (for scientific software)&quot;

I gave a presentation at the BEACON Center's coding group this past Monday; here are my notes and followup links. Thanks to Luiz Irber for scribing!

My short slideshow: here

The khmer project is on github, and we have a tutorial for people who want to try out our development process. khmer has ~5-10 active developers and has had ~60 contributors overall.

Growing a development process

How can you grow a process that meets your needs?

  • use version control and develop on branches;
  • create a checklist to use when merging branches into master;
  • follow the checklist!

(For more checklist motivation, see The Checklist Manifesto by Atul Gawande.)

We do the above as part of our GitHub flow-based development approach.

tl;dr? Grow your checklist slowly, but make everyone adhere to it.

What goes on your checklist?

Ideas for things that could go on your checklist:

  • I ran the tests and they passed!
  • Someone else ran the tests and they passed!
  • A computer ran the tests automatically and they passed! (Continuous Integration)
  • The code formatting guidelines are met. (> 2 people with different coding styles? CHAOS.)
  • The code coverage guidelines are met.
  • A spellchecker was run.
  • Changes were described in a ChangeLog.
  • Commit messages make sense.
  • Code coverage didn't decrease.
  • Checks on specific types of features ("Script parameters should be documented").

I also strongly suggest a two-person merge rule (the primary developer of a feature can not merge that feature); this helps ensure the checklist is followed :)

You can see our checklist for khmer here.


It's important to make the checklist as lightweight as possible, and making sure it addresses useful "pain points" in your developer process; there's a line where people start ignoring the checklist because there's less direct value.

There's no reason to start heavy; you can grow your checklist slowly, as your project accrues experience and developers turn over.


Development process goals

Add features quickly (by using branches) while keeping technical debt manageable!

The concept of technical debt is key - if you let cruft accrue in your codebase, eventually your entire codebase will become unmaintainable and unmanageable.

Other useful links:


by C. Titus Brown at June 11, 2015 10:00 PM

Continuum Analytics

Xray + Dask: Out-of-Core, Labeled Arrays in Python

Xray provides labeled, multi-dimensional arrays. Dask provides a system for parallel computing. Together, they allow for easy analysis of scientific datasets that don’t fit into memory.

by Stephan Hoyer at June 11, 2015 12:00 AM

June 09, 2015

Titus Brown

The challenge for post-publication peer review

On Tuesday, I wrote a draft blog post in response to Michael Eisen's blog post on how Lior Pachter's blog post was a a model for post-publication peer review (PPPR). (My draft post suggested that scientific bloggers aim for inclusivity by adopting a code of conduct and posting explicit site commenting policies).

I asked several people for comments on my post, and a (female) friend who is a non-scientist responded:

My big thing is that if I'm doing something in my spare time, I don't want to be dealing with the same bullshit that I do at work.

While my friend is not a scientist, and her comment spoke specifically to the challenges of women in tech, this comment nails the broader challenge for post-publication peer review: how do we make post-publication peer review a welcoming experience? If it's an optional activity that turns people off, most people won't do it. And I'd love to see more assistant professors, grad students, and women engaging in PPPR. However, if PPPR mirrors the not infrequent assholery of anonymous peer reviews and is only done amongst high-profile tenured faculty (or anonymously), we won't engage with this broader community.

I don't have any one perfect answer, but there are a few things that the tech community has done that nudge things in the right direction. A per-site code of conduct and commenting policies are two obvious idea, and (on Twitter) Joe Pickrell suggested asking people to work on the Principle of Charity which I really like, too.

Michael Eisen had a great comment on a different issue that I think applies here --

it's not really about what "science needs" - it's about not being an asshole

Yep. Any inclusive community building effort should start somewhere near the "don't be an asshole" rule - while a low bar, it should at least be achievable :)

We've had a fairly large amount of high profile aggressive alpha-male behavior in bioinformatics blogging and post-pub peer review, and it's worth thinking hard about whether this is the direction we want to pursue. I hope it's not. Fundamentally, it's our community and we should grow it responsibly. Some ground rules about conduct and commenting would be a good start. There are many interesting questions, e.g. about how to retain intellectual rigor while being welcoming to junior participants who might fear retaliation for critical commentary; I hope we can figure that out. I'm already encouraged by the quiet subculture of preprints and preprint commentary that's sprung up completely apart from Lior's blog.

Other than a general attempt to be more welcoming, I'm not sure what else to do. I'd love to hear people's thoughts on how to grow the practice of post-publication peer review, and I'm happy to post them anonymously for people who don't want to comment themselves.


p.s. There is some irony in referring to Mike and Lior's blogs, which I personally find very unwelcoming in several ways; I encourage them to think about a code of conduct of some sort, along with a commenting policy, so we can know what kind of commentary they intentionally hope to foster on their blogs.

p.p.s. I haven't put a Code of Conduct or commenting policy on my blog yet, I know. I'll get to it soon :)

by C. Titus Brown at June 09, 2015 10:00 PM

Artem Sobolev

Week 1 + 2: Takeoff

The first two weeks have ended, and it's time for a weekly (ahem) report.

Basic implementation outlined in the previous post was rewritten almost from scratch. Now there are 2 implementations of cost function calculation: a fully vectorized (that doesn't scale, but should work fast) and a semi-vectorized (that loops through training samples, but all other operations are vectorized). Meanwhile I work on a large scale version. More on that below.

Also, I wrote a simple benchmark that shows improved accuracy of 1NN with the learned distance, and compares 2 implementations.

There are several issues to solve.

The first and the major one is scalability. It takes $O(N^2 M^2)$ time to compute  NCA's gradient, which is waaay too much even for medium-size datasets. Some ideas I have in mind:

  1. Stochastic Gradient Descent. NCA's loss is a sum of each sample's contribution, so we do stochastic optimization on it reducing computational complexity down to $O(w N M^2)$ where $w$ is a number of iterations.
  2. There's a paper Fast NCA. I briefly skimmed through the paper, but my concern is that they look for $K$ nearest neighbors, which takes them $O(K N^2)$ time — don't look like quite an improvement (Though it's certainly is if you want to project some high-dimensional data to a lower dimensional space).
Another thing to do which is not an issue, but still needs to be done is choosing an optimization algorithm. For now there're 3 methods: gradient descent, gradient descent with AdaGrad and scipy's scipy.optimize.minimize. I don't think it's a good idea to overwhelm a user by the variety of settings with no particular difference in the outcome, so we should get rid of features that are known to be useless.

Also, unit tests and documentation are planned, as well.

by noreply@blogger.com (B@rmaley.exe) at June 09, 2015 01:44 PM

Matthieu Brucher

Book review: C++ Multithreading Cookbook

C++ Multithreading Cookbook in 2014 (publication year), that seems quite interesting, with all the new stuff from the current C++ standard. Is it what the book delivers?

Content and opinions

Unfortunately, when you read the table of contents, there are already orange flags. Chapter 6 is about threads in the .NET framework. What?? This is a book on C++ multithreading, not Windows specific things, right?

OK, let’s start with the beginning. Chapter 1 is a poor presentation of C++. The author says that he will be using Hungarian notation. Actually even Microsoft says do not use it. It predates modern C++, so stop it. Period. I won’t talk about the issues with misunderstanding of a .lib in Windows or what precompiled headers are.

The second chapter is actually interesting. Processes and threads are quite complex beasts, and it is not always properly explained.

The bad gets worse with the third chapter, which is supposed to be about thread. Don’t forget this is a C++ book. C++11 brought native thread management, locks, mutex… But nothing of the sort is here. Only Windows specific C threads (not even C++). Not talking about specific unexplained pragmas. Chapter 4 addresses processes, which is actually not in the C++11 standard. This is where the book could have shined, but no, it has to talk about Message Passing Interface, but this has nothing to do with Message PAssing Interface, which is an official C/Fortran/C++ standard.

By then, I’m fed up with the book, although chapter 7 does pose some good code practices and warnings about concurrency. But then, it goes bad again, even talking about OpenMP (although there is a C++ “version”, no real HPC code actually uses this unusable interface in any properly designed code).


In the end, the book title has nothing to do with the content. It may have been interesting 10 years ago with a title like “Windows C threads and processes”, but it is definitely not worth it with C++11.

by Matt at June 09, 2015 07:45 AM

June 08, 2015

Pierre de Buyl

Understanding the Hilbert curve

The Hilbert curve fills space with good properties for sorting N-dimensional data in a linear fashion. I present an IPython notebook with the complete code to follow the algorithm of C. Hamilton CS-2006-07.

The notebook can be downloaded on my GitHub account: see the notebook | download the notebook (a right-click may help).

Update on 12 june 2015: addition of compact Hilbert indices.

In [17]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import math
from mpl_toolkits.mplot3d import Axes3D
from __future__ import print_function, division

Understanding the Hilbert curve

Author: Pierre de Buyl
This notebook is originally developed in a github repository and presented on my blog. The code is 3-clause BSD and content CC-BY.

In this notebook, I review the algorithm by C. H. Hamilton to compute Hilbert curves and indices. The Hilbert curve is a space-filling curve that has good data locality to map N-dimensional data onto a linear space. It is based on a basis curve that visits the $2^N$ vertices of the N-dimensional binary space and that is replicated recursively $M$ times to map $2^{N M}$ points. The curve is the series of points in N-dimensional space and the index of a point is its place on this curve, computing it is subtle. The algorithm is presented in Compact Hilbert Indices, Dalhousie University Technical Report CS-2006-07, July 2006 (here after [TR]).

My motivations (and yours, maybe) are:

  • Understand the algorithm to implement it in a simulation code. The point is to sort particles spatially for better memory access.
  • Provide a full Python-based implementation that can be used for reference and illustration purposes.
  • Have a clear explanation for later reference.

The notebook is self-contained: it contains all the code for all the intermediate operations defined in [TR]. Hold on for the graphic illustrations, they will come as the algorithm is transformed into code.

Bitwise operations

I review here the bitwise operations in Python, as they will be used throughout. Only integer numbers are used. We define a word length that corresponds to the dimension of the Hilbert curve.

The examples use N=3 so that each single integer in $[0:2^N-1]$ can be mapped to a point in 3 dimension that is one vertex of the unit cube. Bitwise operations act as transformations on the integers and equivalently as geometrical operations in 3 dimensions. The XOR operator represents a reflection with respect to one or several of the symmetry planes of the cube. The rotation operator represents a rotations of a third around the diagonal, i.e. the line joining (0,0,0) and (1,1,1).

The function bin_str helps to visualize bitwise operations by converting an integer to its three bits. The right-most bit represents the x-axis, the middle bit the y-axis and the left-most bit the z-axis. In Python, the AND &, OR | and XOR ^ operators are defined and also the bit shifts (>> to the right, or less significant bits and << to the left or more significant bits). The NOT operator ~ requires further masking: It changes the sign bit of the integer. We wish here to operate on a N-bit space and thus cancel any bit beyond the N less significant bits of the integer. The fact that we remain with N non-zero bits and that NOT is operating properly is checked below. We also need rotation operators (like the shift operators but that transfer the over- and under-flowing bits around the 3-bit space) and a function to extract the i-th bit from a number, bit_component.

In [2]:
N = 3

def bin_str(i):
    """Return a string representation of i with N bits."""
    out = ''
    for j in range(N-1,-1,-1):
        if (i>>j) & 1 == 1:
            out += '1'
            out += '0'
    return out

def rotate_right(x, d):
    """Rotate x by d bits to the right."""
    d = d % N
    out = x >> d
    for i in range(d):
        bit = (x & 2**i)>>i
        out |= bit << (N+i-d)
    return out

def rotate_left(x, d):
    """Rotate x by d bits to the left."""
    d = d % N
    out = x << d
    excess = out 
    out = out & (2**N-1)
    for i in range(d):
        bit = (x & 2**(N-1-d+1+i))>> (N-1-d+1+i)
        out |= bit << i
    return out

def bit_component(x, i):
    """Return i-th bit of x"""
    return (x & 2**i) >> i
In [3]:
print('AND  ', bin_str(1), '&', bin_str(5), '=', bin_str(1 & 5))
print('OR   ', bin_str(1), '|', bin_str(5), '=', bin_str(1 | 5))
print('XOR  ', bin_str(1), '^', bin_str(5), '=', bin_str(1 ^ 5))
print('shift', bin_str(2), '>> 1  =', bin_str(2 >> 1))
print('shift', bin_str(2), '<< 1  =', bin_str(2 << 1))
print('rot  ', bin_str(1), '↻', bin_str(rotate_right(1, 1)), '↻',
      bin_str(rotate_right(1,2)), '↻', bin_str(rotate_right(1,3)))
print('NOT  ', bin_str(1), '      =', bin_str(~1 & 2**N-1))

# verify that '~i & 2**N-1' performs the NOT operation in N-bit space
for i in range(2**N):
    not_i = ~i & 2**N-1
    assert not_i >=0
    assert not_i < 2**N
    assert i & not_i == 0
    assert i | not_i == 2**N-1
AND   001 & 101 = 001
OR    001 | 101 = 101
XOR   001 ^ 101 = 100
shift 010 >> 1  = 001
shift 010 << 1  = 100
rot   001 ↻ 100 ↻ 010 ↻ 001
NOT   001       = 110

The technical reports describes with good details all the intermediate steps of the algorithm using binary operations and sequences. The gray code index of a number i, for instance, is given by gc(i) = i XOR (i ≫ 1).

To understand the building of the Hilbert curve, other quantities are needed. The graphical representation use arrows in each sub-hypercube of the curve to represent the path. The definitions of the gray curve index and of the arrows is given and we use these defintions to illustrate the Hilbert curve afterwards.

The interpretation of gc, e and f below is a binary representation in base N of the vertices of a hypercube (a square for N=2, a cube for N=3, etc). The bits represent the z, y and x coordinates.

In [4]:
# Build all edges of a square and of a cube

edges = [((0,),(1,))]

def add_edges(edges):
    """Extend the list of edges from an hypercube to the list of
    edges of the hypercube of the next dimension."""

    old_edges = list(edges)
    old_points = set( [x[0] for x in old_edges] ) | set( [x[1] for x in old_edges] )
    edges = [ ( (0,)+x[0], (0,)+x[1] ) for x in old_edges ]
    edges.extend( [ ( (1,)+x[0], (1,)+x[1] ) for x in old_edges ] )
    for e in old_points:
        edges.append( ( (0,)+e, (1,)+e) )

    return edges

square_edges = add_edges(edges)
cube_edges = add_edges(square_edges)

cube_xyz = []
for single_edge in cube_edges:
    # The np.inf value breaks the line plot into disconnected parts
cube_xyz = np.array(cube_xyz)

def set_unit_cube(ax, side=1, set_view=(10,-67)):
    """Present the unit cube."""
    ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_zlabel('z');
    ax.set_xticks(range(side+1)); ax.set_yticks(range(side+1)); ax.set_zticks(range(side+1));
    ax.set_xlim(0, side); ax.set_ylim(0,side); ax.set_zlim(0,side);
    if set_view:

fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(111,projection='3d')

ax.plot(*zip(*cube_xyz), color='b')

for i in range(2**N):
    x = bit_component(i, 0)
    y = bit_component(i, 1)
    z = bit_component(i, 2)
    ax.text(x, y+0.05, z+0.05, bin_str(i))

In [5]:
def gc(i):
    """Return the Gray code index of i."""
    return i ^ (i >> 1)

def e(i):
    """Return the entry point of hypercube i."""
    if i==0:
        return 0
        return gc(2*int(math.floor((i-1)//2)))

def f(i):
    """Return the exit point of hypercube i."""
    return e(2**N-1-i) ^ 2**(N-1)

def i_to_p(i):
    """Extract the 3d position from a 3-bit integer."""
    return [bit_component(i,j) for j in (0,1,2)]
In [6]:
print('Gray code', map(lambda i: (gc(i), bin_str(gc(i))), range(2**N)),
      'Entry point', map(lambda i: (e(i), bin_str(e(i))), range(2**N)),
      'Exit point', map(lambda i: (f(i), bin_str(f(i))), range(2**N)),
Gray code
[(0, &apos000&apos), (1, &apos001&apos), (3, &apos011&apos), (2, &apos010&apos), (6, &apos110&apos), (7, &apos111&apos), (5, &apos101&apos), (4, &apos100&apos)]
Entry point
[(0, &apos000&apos), (0, &apos000&apos), (0, &apos000&apos), (3, &apos011&apos), (3, &apos011&apos), (6, &apos110&apos), (6, &apos110&apos), (5, &apos101&apos)]
Exit point
[(1, &apos001&apos), (2, &apos010&apos), (2, &apos010&apos), (7, &apos111&apos), (7, &apos111&apos), (4, &apos100&apos), (4, &apos100&apos), (4, &apos100&apos)]

Visiting the Hilbert curve in the cube

The path of the Hilbert curve without recursion is encoded in the sequence gc(i), whose 3-bit values are interpreted a points in 3D (see above), and that visits the edges of the cube. These coordinates (scaled by 1/2) are stored in xyz.

The first figure displays the curve visiting the centers of the subcubes, one of the typical representations of the Hilbert curve. In [TR], the path of the Hilbert curve proceeds by entering subcube gc(i) at point e(i) and leaving it a point f(i). The entry and exit points e(i) and f(i) are stored in e_xyz and f_xyz and shown below. The last figure represents the full path as the sequences of arrows joining those points.

In [7]:
# Obtain the 3 d coordinates for gc(i), e(i) and f(i) to build the Hilbert curve
xyz = np.array(map(lambda i: i_to_p(gc(i)), range(2**N)))/2
e_xyz = np.array(map(lambda i: i_to_p(e(i)), range(2**N)))/2
f_xyz = np.array(map(lambda i: i_to_p(f(i)), range(2**N)))/2

fig = plt.figure(figsize=(14, 12))

ax1 = fig.add_subplot(221,projection='3d')

for i, (x,y,z) in enumerate((xyz+0.27)):
    ax1.text(x, y, z, str(i))
ax1.set_title('The Hilbert curve, subcube centered')

edges = []
for e_point, f_point, origin in zip(e_xyz, f_xyz, xyz):
    dir = (f_point-e_point)
    # For the 3d quiver plot, the origin of the arrow is shifted
    # which is why we use origin+e_point+dir instead of origin+e_point
    edges.append(np.concatenate((origin+e_point+dir, dir)))

ax2 = fig.add_subplot(222,projection='3d')
ax2.quiver(*zip(*edges), length=0.5)
ax2.set_title('Arrows to visit the Hilbert curve')

# A naive sequence of binary points in 3D
c = 0.25 + np.array(map(i_to_p, range(2**N)))/2

ax3 = fig.add_subplot(223,projection='3d')
ax3.plot(*zip(*c), color='b')
for i, (x,y,z) in enumerate(c):
    ax3.text(x, y, z, str(i))

ax3.set_title('A naive curve to visit the cube')

for ax in [ax1, ax2, ax3]:
    ax.plot(*zip(*cube_xyz), color='k', zorder=1, alpha=0.5)

Define the curve directions and transforms

To write the Hilbert curve algorithm, we still need to define the inter-subcube direction g and the intra-subcube direction d (the direction of the arrow that connect e and f).

In [8]:
def inverse_gc(g):
    """The inverse gray code."""
    i = g
    j = 1
    while j<N:
        i = i ^ (g >> j)
        j = j + 1
    return i

def g(i):
    """The direction between subcube i and the next one"""
    return int(np.log2(gc(i)^gc(i+1)))

def d(i):
    """The direction of the arrow whithin a subcube."""
    if i==0:
        return 0
    elif (i%2)==0:
        return g(i-1) % N
        return g(i) % N

def T(e, d, b):
    """Transform b."""
    out = b ^ e
    return rotate_right(out, d+1)

def T_inv(e, d, b):
    """Inverse transform b."""
    return T(rotate_right(e, d+1), N-d-2, b)

Validating the relations

The algorithm for the Hilbert curve is based on a series of relations between e, f, d and g. We can check those easily now. In order, we verify that:

  • Only the g(i)-th bit is changed between gc(i) and gc(i+1)
  • The tranform T sends an entry point e(i) to zero and the exit point f(i) to $2^{N-1}$: $T_{e(i),d(i)}(e(i))=0$ and $T_{e(i),d(i)}(f(i))=f(i)$
  • The entry point e(i) reflected along the direction d(i) is the exit point f(i): $e(i)$ ^ $d(i) = f(i)$
  • The inverse transform $T^{-1}$ composed with the direct transform $T$ is the identity.
In [9]:
# only g(i)-th bit changes from gc(i) to gc(i+1)
for i in range(2**N-1):
    assert gc(i) ^ (1 << g(i)) == gc(i+1)

# T sends e(i) to 0 and f(i) to 2**(N-1)
for i in range(2**N):
    assert T(e(i), d(i), e(i))==0
    assert T(e(i), d(i), f(i))==2**(N-1)

# e(i) reflected in direction d(i) is f(i)
for i in range(2**N):
    assert e(i) ^ (1<<d(i)) == f(i)

# T_inv composed with T (and vice versa) is the identity operator
for i in range(2**N):
    for b in range(2**N):
        assert T(e(i), d(i), T_inv(e(i), d(i), b)) == b
        assert T_inv(e(i), d(i), T(e(i), d(i), b)) == b

Transform the curve for the subcubes

According to [TR], one can use the transform T to map a subcube into the larger cube. Using the inverse transform, it is thus possible to generate the recursed Hilbert curve by mapping the main curve into each subcube. The coordinates of this refined Hilbert curve lie within the subcubes and must be translated by the corresponding origins to fill space correctly.

We construct below the refined Hilbert curve by this manual procedure and display the individual subcubes. The coordinates are then used to show the directed path followed by the curve.

In [10]:
fig = plt.figure(figsize=(12, 18))
ax = fig.add_subplot(211,projection='3d')

# store the Hilbert curve with no recursion
main_curve = [gc(i) for i in range(2**N)]
# list of the in-subcube curves
sub_curves = []
for i, h_idx in enumerate(main_curve):
    # append points from the subcube (obtained with the inverse transform T_inv
    points = np.array( [ i_to_p(T_inv(e(i), d(i), x)) for x in main_curve ] )
    # Translate the points by the origin of the subcube
    points += np.array(i_to_p(h_idx))*2
    sub_curves.append( points )

# plot all subcurves separately
[ax.plot(*zip(*c)) for c in sub_curves]
ax.set_title('Individual Hilbert "subcubes"')
set_unit_cube(ax, 3,  set_view=False)

ax = fig.add_subplot(212,projection='3d')

# prepare the data to draw arrows with the quiver command
cc = np.concatenate(sub_curves)
U = cc[1:,0]-cc[:-1,0]
X = cc[:-1,0]+U
V = cc[1:,1]-cc[:-1,1]
Y = cc[:-1,1]+V
W = cc[1:,2]-cc[:-1,2]
Z = cc[:-1,2]+W
ax.quiver(X,Y,Z,U,V,W,color=[matplotlib.cm.gnuplot(x) for x in np.linspace(0, 1,len(X)*3)], length=1)
ax.set_title('Connected Hilbert "subcubes"')
set_unit_cube(ax, 3, set_view=False)

Implementation of the algorithms

Now that all the components of the algorithm have been reviewed, it is time for the actual code. In [TR], the algorithm is listed explicitly in pseudo-code.

  • Algorithm 2, implemented in TR_algo2, transforms a set of integer coordinates into its Hilbert index.
  • Algorithm 3, implemented in TR_algo3, transforms a Hilbert index into a set of coordinates.

Algorithm 2 works by computing the Hilbert index at subsequent refinement levels. Starting from the most significant bits of the coordinates p, it computes the subcube to which p belongs as a label l. l encodes in its N bits the subcube of interest as was show in the unit cube earlier. From l, it is easy to obtain the Hilbert index. The final Hilbert index is composed of this data packed in a single integer variable. As it proceeds in the subcube, the algorithm composes the transform T by updating the current entry point ve direction vd according to Lemma 2.13 of [TR].

Algorithm 3 works in a similar manner but extracts the relevant bits of the Hilbert index at each refinement level to construct the coordinates of the point.

In [11]:
M = 2
def TR_algo2(p):
    """Return the Hilbert index of point p"""
    # h will contain the Hilbert index
    h = 0
    # ve and vd contain the entry point and dimension of the current subcube
    # we choose here a main traversal direction N-2 (i.e. z for a cube) to match
    # the illustrations
    ve = 0
    vd = 2
    for i in range(M-1, -1, -1):
        # the cell label is constructed in two steps
        # 1. extract the relevant bits from p
        l = [bit_component(px, i) for px in p]
        # 2. construct a integer whose bits are given by l
        l = sum( [lx*2**j for j, lx in enumerate(l)] )
        # transform l into the current subcube
        l = T(ve, vd, l)
        # obtain the gray code ordering from the label l
        w = inverse_gc(l)
        # compose (see [TR] lemma 2.13) the transform of ve and vd
        # with the data of the subcube
        ve = ve ^ (rotate_left(e(w), vd+1))
        vd = (vd + d(w) + 1) % N
        # move the index to more significant bits and add current value
        h = (h << N) | w
    return h

def TR_algo3(h):
    """Return the coordinates for the Hilbert index h"""
    ve = 0
    vd = 2
    p = [0]*N
    for i in range(M-1, -1, -1):
        w = [bit_component(h, i*N+ii) for ii in range(N)]
        #print(i, w)
        w = sum( [wx*2**j for j, wx in enumerate(w)] )
        #print(i, w, gc(w))
        l = gc(w)
        l = T_inv(ve, vd, l)
        for j in range(N):
            p[j] += bit_component(l, j) << i
        ve = ve ^ rotate_left(e(w), vd+1)
        vd = (vd + d(w) + 1) % N
    return p

Displaying and checking the algorithm

A basic consistency check is to verify that the successive application of TR_algo3 and TR_algo2 returns the same number than the one given.

A visual inspection of the Hilbert curve on which we superimpose the Hilbert indices confirms the consistency of the algorithm.

In [12]:
# Verifying that the algorithm and its inverse agree
for h_idx in range(2**(N*M)):
    assert TR_algo2(TR_algo3(h_idx)) == h_idx
In [13]:
fig = plt.figure(figsize=(12,9))
ax = fig.add_subplot(111,projection='3d')

ax.quiver(X,Y,Z,U,V,W,color=[matplotlib.cm.gnuplot(x) for x in np.linspace(0, 1,len(X)*3)], length=1)
ax.set_title('Hilbert indices in 3D space')

for h in range(2**(N*M)):
    x, y, z = TR_algo3(h)
    ax.text(x, y, z+0.1, str(h))

set_unit_cube(ax, 3, set_view=(15,-55))