July 28, 2014

Manoj Kumar


Hi, It has been a long time since I had posted something on my blog. I had the opportunity to participate in the scikit-learn sprint recently, with the majority of the core-developers. The experience was awesome, but most of the time I had no idea what people were talking about, and I realised I have to learn a lot. I read somewhere that if you need to keep improving in life, you need to make sure the worst person in the job, and if that is meant to be true, I’m well on the right track.

Anyhow on a more positive note, recently one of my biggest pull requests got merged, ( https://github.com/scikit-learn/scikit-learn/pull/2862 ) and we shall have a quick look at the background, what it can do and what it cannot.

1. What is Logistic Regression?
A Logistic Regression is a regression model that uses the logistic sigmoid function to predict classification. The basic idea is to predict the feature vector \omega sucht that it fits the Logistic_log function, \frac{1}{1 + e^{-w'*X}} . A quick look at the graph (taken from wikipedia), when y is one, we need our estimator to predict w'X to be infinity and vice versa.


Now if we want to fit labels [-1, 1] the sigmoid function becomes \frac{1}{1 + e^{-y*w'*X}}. The logistic loss function is given by, log(1 + e^{-y*w*x}. Intuitively this seems correct because when y is 1, we need our estimator to predict w*x to be infinity, to suffer zero loss. Similarly when y is -1, we need out estimator to predict w*x to be -1. Our basic focus is to optimize for loss.

2. How can this be done?
This can be done either using block coordinate descent methods, like lightning does, or use the inbuilt solvers that scipy provides like newton_cg and lbfgs. For the newton-cg solver, we need the hessian, or more simply the double derivative matrix of the loss and for the lbfgs solver we need the gradient vector. If you are too lazy to do the math (like me?), you can obtain the values from here, Hessian

3. Doesn’t scikit-learn have a Logistic Regression already?
Oh well it does, but it is dependent on an external library called Liblinear. There are two major problems with this.
a] Warm start, (one cannot warm start, with liblinear since it does not have a coefficient parameter), unless we patch the shipped liblinear code.
b] Penalization of intercept. Penalization is done so that the estimator does not overfit the data, however the intercept is independent of the data (which can be considered analogous to a column of ones), and so it does not make much sense to penalize it.

4. Things that I learnt
Apart from adding a warm start, (there seems to be a sufficient gain in large datasets), and not penalizing the intercept,
a] refit paramter – generally after cross-validating, we take the average of the scores obtained across all folds, and the final fit is done according to the hyperparameter (in this case C) that corresponds to the perfect score. However Gael suggested that one could take the best hyperparameter across every fold (in terms of score) and average these coefficients and hyperparameters. This would prevent the final refit.
b] Parallel OvA – For each label, we perform a OvA, that is to convert y into 1 for the label in question, and into -1’s for the other labels. There is a Parallel loop across al loops and folds, and this is to supposed to make it faster.
c] Class weight support: The easiest way to do it is to convert to per sample weight and multiply it to the loss for each sample. But we had faced a small problem with the following three conditions together, class weight dict, solver liblinear and a multiclass problem, since liblinear does not support sample weights.

5. Problems that I faced.
a] The fit intercept = True case is found out to be considerably slower than the fit_intercept=False case. Gaels hunch was that it was because the intercept varies differently as compared to the data, We tried different things, such as preconditioning the intercept, i.e dividing the initial coefficient with the square root of the diagonal of the Hessian, but it did not work and it took one and a half days of time.

b] Using liblinear as an optimiser or a solver for the OvA case.

i] If we use liblinear as a solver, it means supplying the multi-label problem directly to liblinear.train. This would affect the parallelism and we are not sure if liblinear internally works the same way as we think we do. So after a hectic day of refactoring code, we finally decided (sigh) using liblinear as an optimiser is better (i.e we convert the labels to 1 and -1). For more details about Gaels comment, you can have a look at this https://github.com/scikit-learn/scikit-learn/pull/2862#issuecomment-49450285

Phew, this was a long post and I’m not sure if I typed everything as I wanted to. This is what I plan to accomplish in the coming month
1. Finish work on Larsmans PR
2. Look at glmnet for further improvements in the cd_fast code.
3. ElasticNet regularisation from Lightning.

by Manoj Kumar at July 28, 2014 11:36 PM

Titus Brown

The Moore DDD Investigator competition: my presentation

Here are my talk notes for the Data Driven Discovery grant competition ("cage match" round). Talk slides are on slideshare

Hello, my name is Titus Brown, and I'm at Michigan State University where I run a biology group whose motto is "better science through superior software". I'm going to tell you about my vision for building infrastructure to support data-intensive biology.

Our research is focused on accelerating sequence-based biology with algorithmic prefilters that help scale downstream sequence analysis. The basic idea is to take the large amounts of data coming from sequencers and squeeze out the information in the data for downstream software to use. In most cases we make things 10 or 100x easier, in some cases we've been able to make analyses possible where they weren't doable before;

In pursuit of this goal, we've built three super awesome computer science advances: we built a low-memory approach for counting sequence elements, we created a new data structure for low-memory graph storage, and developed a streaming lossy compression algorithm that puts much of sequence analysis on a online and streaming basis. Collectively, these are applicable to a wide range of basic sequence analysis problems, including error removal, species sorting, and genome assembly.

We've implemented all three of these approaches in an open source analysis package, khmer, that we developed and maintain using good software engineering approaches. We have primarily focused on using this to drive our own research, but since you can do analyses with it can't be done any other way, we've had some pretty good adoption by others. It's a bit hard to tell how many people are using it because of the many ways people can download it, but we believe it to be in the 1000s and we know we have dozens of citations.

The most important part of our research is this: we have enabled some excellent biology! For a few examples, we've assembled the largest soil metagenome ever with Jim Tiedje and Janet Jansson, we've helped look at deep sea samples of bone eating worms with Shana Goffredi, we're about to publish the largest de novo mRNASeq analysis ever done, and we're enabling evo devo research at the phylogenetic extremes of the Molgula sea squirts. This was really what we set out to do at the beginning but the volume and velocity of data coming from sequencers turned out to be the blocking problem.

Coming from a bit of a physics background, when I started working in bioinformatics 6 years ago, I was surprised at our inability to replicate others' results. One of our explicit strategies now is to try to level up the field by doing high quality, novel, open science. For example, our lamprey analysis is now entirely automated, taking three weeks to go from 3 billion lamprey mRNASeq reads to an assembled, annotated transcriptome that we can interactively analyze in an IPython Notebook, which we will publish with the paper. Camille, who is working on this, is a combination software engineer and scientist, and this has turned out to be a really productive combination.

We've also found that 1000s of people want to do the kinds of things we're doing, but most don't have the expertise or access to computational infrastructure. So, we're also working on open protocols for analyzing sequence data in the cloud - going from raw mRNASeq data to finished analysis for about $100. These protocols are open, versioned, forkable, and highly replicable, and we've got about 20 different groups using them right now.

So that's what I work on now. But looking forward I see a really big problem looming over molecular biology. Soon we will have whatever 'omic data set we want from, to whatever resolution we want, limited only by sampling. But we basically don't have any good way of analyzing these data sets -- most groups don't have the capacity or capability to analyze them themselves, we can't store these data sets in one place and -- perhaps the biggest part of the catastrophe -- people aren't making these data available until after publication, which means that I expect many of them to vanish. We need to incentivise pre-publication sharing by making it useful to share your data. We can do individual analyses now, but we're missing the part that links these analyses to other data sets more broadly.

My proposal, therefore, is to build a distributed graph database system that will allow people to interconnect with open, walled-garden, and private data sets. The idea is that researchers will spin up their own server in the cloud, upload their raw or analyzed data, and have a query interface that lets them explore the data. They'll also have access to other public servers, and be able to opt-in to exploring pre-published data; this opt-in will be in the form of a walled-garden approach where researchers who use results from analyzing other unpublished data sets will be given citation information to those data sets. I hope and expect that this will start to incentivise people to open their data sets up a bit, but to be honest if all it does is make it so that people can analyze their own data in isolation it will already be a major win.

None of this is really a new idea. We published a paper exploring some of these ideas in 2009, but have been unable to find funding. In fact, I think this is the most traditionally unfundable project I have ever proposed, so I hope Moore feels properly honored. In any case, the main point here is that graph queries are a wonderful abstraction that lets you set up an architecture for flexibly querying annotations and data when certain precomputed results already exist. The pygr project showed me the power of this when implemented in a distributed way and it's still the best approach I've ever seen implemented in bioinformatics.

The idea would be to enable basic queries like this across multiple servers, so that we can begin to support the queries necessary for automated data mining and cross-validation.

My larger vision is very buzzwordy. I want to enable frictionless sharing, driven by immediate utility. I want to enable permissionless innovation, so that data mining folk can try out new approaches without first finding a collaborator with an interesting data set, or doing a lot of prep work. By building open, federated infrastructure, and avoiding centralized infrastructure, I am planning for poverty: everything we build will be sustainable and maintainable, so when my funding goes away others can pick it up. And my focus will be on solving people's current problems, which in biology are immense, while remaining agile in terms of what problems I tackle next.

The thing is, everybody needs this. I work across many funding agencies, and many fields, and there is nothing like this currently in existence. I'm even more sure of this because I posted my Moore proposal and requested feedback and discussed it with a number of people on Twitter. NGS has enabled research on non-model organisms but its promise is undermet due to lack of cyberinfrastructure, basically.

How would I start? I would hire two domain postdocs who are tackling challenging data analysis tasks, and support them with my existing lab; this would involve cross-training the postdocs in data intensive methodologies. For example, one pilot project is to work on the data from the DeepDOM cruise, where they did multi-omic sampling across about 20 points in the atlantic, and are trying to connect the dots between microbial activity and dissolved organic matter, with metagenomic and metabolomic data.

Integrated with my research, I would continue and expand my current efforts in training. I already run a number of workshops and generate quite a bit of popular bioinformatics training material; I would continue and expand that effort as part of my research. One thing that I particularly like about this approach is that it's deeply self-interested: I can find out what problems everyone has, and will be having soon, by working with them in workshops.

by C. Titus Brown at July 28, 2014 10:00 PM

Issam Laradji

(GSoC 2014) Progress report for 07/27/14

Great progress! I submitted implementations of multi-layer perceptron (mlp-link) (mlp-pretraining-link) and extreme learning machines (elm-link) and their documentations as well. Yet many improvements could be made through revisions and my mentors' momentous support that they always provided throughout the summer.

Besides many corrections, a lot have been added since the last post - here is an overview,

1)  Documentation

I  wrote, with the help of my mentor, a documentation (link) on extreme learning machines (ELM), which briefly describes ELM's scheme and the main hyperparameters it possesses.  It also explains tips on why adjusting those parameters are important since noisy, small datasets need a different approach than large, clean datasets. Further, a brief tutorial was given to help users set up and train ELM objects. Finally, a mathematical overview was given describing the function developed from training an ELM and the kind of algorithm it uses to solve the unknown coefficients in that function.

I believe the document can be made more necessarily comprehensive by adding details that describe other ELM parameters such as recursive least-square learning, and details that describe how different kernels affect the decision function. I plan to address these fixes before next week.

2) Example

I added an example illustrating the effect of weight_scale and C, two hyperparameters in ELM.

C is a regularization term that constrains the coefficients of the hidden-to-output weights
weight_scale scales the range of values that input-to-hidden weights can take.

Their effects are illustrated in Figure 1 and 2, where the value of the chosen parameter is given above the corresponding decision function.

Figure 1: Effect of varying the regularization term C on variance.

Figure 2: Effect of varying the regularization term weight_scale on variance.

As shown, increasing the value of weight_scale or C makes for a more nonlinear decision function as you may notice the plots corresponding to higher values are of more curvy structure.

I am currently running ELM on the Covertype dataset (link). The results, however, aren't yet promising as ELM achieved a poor performance of 17% error-rate with as many as 1000 hidden neurons. The training error is still high, which means higher number of hidden neurons will likely reduce the error-rate. But even with 2000 hidden neurons, the error-rate was only reduced to 16,8%. The reason is,  Covertype has 54 features, so a much larger representation (produced by the hidden neurons) of the dataset is not adding any significant information. Therefore, I will explore other parameters using grid search in the hopes to significantly reduce that error-rate.

by noreply@blogger.com (Issam Laradji) at July 28, 2014 05:57 AM

July 27, 2014

Hamzeh Alsalhi

Sparse Output Dummy Classifier

The Scikit-learn dummy classifier is a simple way to get naive predictions based only on the target data of your dataset. It has four strategies of operation.

  • constant - always predict a value manually specified by the use
  • uniform - label each example with a label chosen uniformly at random from the target data given
  • stratified  - label the examples with the class distribution seen in the training data
  • most-frequent - always predict the mode of the target data
The dummy classifier has built in support for multilabel-multioutput data. I have made a pull request #3438 this week that has introduced support for sparsely formatted output data. This is useful because memory consumption can be vastly improved when the data is highly sparse. Below a benchmark these changes with two memory consumption results graphed for each of the four strategies, once in with sparsely formatted target data and once with densely formatted data as the control.

Benchmark and Dataset

I used the Eurlex eurovoc dataset available here in libsvm format for use with the following script.  The benchmark script will let you recreate the results in this post easily. When run with the python module memory_profiler it measures the total memory consumed when doing an initialization of a dummy classifier, along with a fit and predict on the Eurlex data.

The dataset used has approximately 17,000 samples and 4000 classes for the training target data, and the test data is similar. They both have sparsity of 0.001.

Results Visualized

Constant Results: Dense 1250 MiB, Sparse 300 MiB

The constants used in the fit have a level of sparsity similar to the data because they were chosen as an arbitrary row from the target data.

Uniform Results: Dense 1350 MiB, Sparse 1200 MiB

Stratified Results: Dense 2300 MiB, Sparse 1350 MiB

Most-Frequent Results: Dense 1300 MiB, Sparse 300 MiB


We can see that in all cases expect for Uniform we get significant memory improvements by supporting sparse matrices. The sparse matrix implementation for uniform is not useful because of the dense nature of the output even when the input shows high levels of sparsity. It is possible this case will be revised to warn the user or even throw an error.

Remaining Work

There is work to be done on this pull request to make the predict function faster in the stratified and uniform cases when using sparse matrices. Although the uniform cases is not important in itself the underlying code for generating sparse random matrices is used in the stratified case. Any improvements to uniform will come for free is the stratified case speed is improved.

Another upcoming focus is to return to the sparse output knn pull request and make some improvements. There will be code written in the sparse output dummy pull request for gathering a class distribution from a sparse target matrix that can be abstracted to a utility function and will be reusable in the knn pull request.

by noreply@blogger.com (Hamzeh) at July 27, 2014 02:04 AM

July 24, 2014

Gaël Varoquaux

The 2014 international scikit-learn sprint

A week ago, the 2014 edition of the scikit-learn sprint was held in Paris. This was the third time that we held an internation sprint and it was hugely productive, and great fun, as always.

Great people and great venues

We had a mix of core contributors and newcomers, which is a great combination, as it enables us to be productive, but also to foster the new generation of core developers. Were present:

  • Laurent Direr
  • Michael Eickenberg
  • Loic Esteve
  • Alexandre Gramfort
  • Olivier Grisel
  • Arnaud Joly
  • Kyle Kastner
  • Manoj Kumar
  • Balazs Kegl
  • Nicolas Le Roux
  • Andreas Mueller
  • Vlad Niculae
  • Fabian Pedregosa
  • Amir Sani
  • Danny Sullivan
  • Gabriel Synnaeve
  • Roland Thiolliere
  • Gael Varoquaux

As the sprint extended through a French bank holiday and the week end, we were hosted in a variety of venues:

  • La paillasse, a Paris bio-hacker space
  • INRIA, the French computer-science national research, and the place where I work :)
  • Criteo, a French company doing word-wide add-banner placement. The venue there was absolutely gorgeous, with a beautiful terrace on the roofs of Paris.
    And they even had a social event with free drinks one evening.
  • Tinyclues, a French startup mining e-commerce data.

I must say that we were treated like kings during the whole stay; each host welcoming us as well they could. Thank you to all of our hosts!

Sponsored by the Paris-Saclay Center for Data Science

Beyond our hosts, we need to thank the Paris-Saclay Center for Data Science. The CDS gave us funding that covered some of the lunches, acomodations, and travel expenses to bring in our
contributors from abroad.

Achievements during the sprint

The first day of the sprint was dedicated to polishing the 0.15 release, which was finally released on the morning of the second day, after 10 months of development.

A large part of the efforts of the sprint were dedicated to improving the coding base, rather than directly adding new features. Some files were reorganized. The input validation code was cleaned up (opening the way for better support of pandas structures in scikit-learn). We hunted dead code, deprecation warnings, numerical instabilities and tests randomly failing. We made the test suite faster, and refactored our common tests that scan all the model.

Some work of our GSOC student, Manoj Kumar, was merged, making some linear models faster.

Our online documentation was improved with the API documentation pointing to examples and source code.

Still work in progress:

  • Faster stochastic gradient descent (with AdaGrad, ASGD, and one day SAG)
  • Calibration of probabilities for models that do not have a ‘predict_proba’ method
  • Warm restart in random forests to add more estimators to an existing ensemble.
  • Infomax ICA algorithm.

by gael at July 24, 2014 10:59 PM

Maheshakya Wijewardena

Testing LSH Forest

There are two types of tests to perform in order to ensure the correct functionality the LSH Forest.

  1. Tests for individual functions of the LSHForest class.
  2. Tests for accuracy variation with parameters of implementation.

scikit-learn provides a nice set testing tools for this task. It is elaborated in the utilities for developers section. I have used following assertions which were imported from sklearn.utils.testing. Note that numpy as imported as np.

  1. assert_array_equal - Compares each element in an array.
  2. assert_equal - Compares two values.
  3. assert_raises - Checks whether the given type of error is raised.

Testing individual functions

Testing fit function

Requirement of this test is to ensure that the fit function does not work without the necessary arguments provision and it produces correct attributes in the class object(in the sense of dimensions of arrays).

Suppose we initialize a LSHForest as lshf = LSHForest()

If the estimator is not fitted with a proper data, it will produce a value error and it is testes as follows:

    X = np.random.rand(samples, dim)
    lshf = LSHForest(n_trees=n_trees)

We define the sample size and the dimension of the dataset as samples and dim respectively and the number of trees in the LSH forest as n_trees.

    # Test whether a value error is raised when X=None
    assert_raises(ValueError, lshf.fit, None)

Then after fitting the estimator, following assertions should hold true.

    # _input_array = X
    assert_array_equal(X, lshf._input_array)
    # A hash function g(p) for each tree
    assert_equal(n_trees, lshf.hash_functions_.shape[0])
    # Hash length = 32
    assert_equal(32, lshf.hash_functions_.shape[1])
    # Number of trees in the forest
    assert_equal(n_trees, lshf._trees.shape[0])
    # Each tree has entries for every data point
    assert_equal(samples, lshf._trees.shape[1])
    # Original indices after sorting the hashes
    assert_equal(n_trees, lshf._original_indices.shape[0])
    # Each set of original indices in a tree has entries for every data point
    assert_equal(samples, lshf._original_indices.shape[1])

All the other tests for functions also contain the tests for valid arguments, therefore I am not going to describe them in those sections.

Testing kneighbors function

kneighbors tests are based on the number of neighbors returned, neighbors for a single data point and multiple data points.

We define the required number of neighbors as n_neighbors and crate a LSHForest.

    n_neighbors = np.random.randint(0, samples)
    point = X[np.random.randint(0, samples)]
    neighbors = lshf.kneighbors(point, n_neighbors=n_neighbors,
    # Desired number of neighbors should be returned.
    assert_equal(neighbors.shape[1], n_neighbors)

For multiple data points, we define number of points as n_points:

    points = X[np.random.randint(0, samples, n_points)]
    neighbors = lshf.kneighbors(points, n_neighbors=1,
    assert_equal(neighbors.shape[0], n_points)

The above tests ensures that the maximum hash length is also exhausted because the data points in the same data set are used in queries. But to ensure that hash lengths less than the maximum hash length also get involved, there is another test.

    # Test random point(not in the data set)
    point = np.random.randn(dim)
    lshf.kneighbors(point, n_neighbors=1,

Testing distances of the kneighbors function

Returned distances should be in sorted order, therefore it is tested as follows: Suppose distances is the returned distances from the kneighbors function when the return_distances parameter is set to True.

    assert_array_equal(distances[0], np.sort(distances[0]))

Testing insert function

Testing insert is somewhat similar to testing fit because what we have to ensure are dimensions and sample sizes. Following assertions should hold true after fitting the LSHFores.

    # Insert wrong dimension
    assert_raises(ValueError, lshf.insert,
    # Insert 2D array
    assert_raises(ValueError, lshf.insert,
                  np.random.randn(dim, 2))


    # size of _input_array = samples + 1 after insertion
    assert_equal(lshf._input_array.shape[0], samples+1)
    # size of _original_indices[1] = samples + 1
    assert_equal(lshf._original_indices.shape[1], samples+1)
    # size of _trees[1] = samples + 1
    assert_equal(lshf._trees.shape[1], samples+1)

Testing accuracy variation with parameters

The accuracy of the results obtained from the queries depends on two major parameters.

  1. c value - c.
  2. number of trees - n_trees.

Separate tests have been written to ensure that the accuracy variation is correct with these parameter variation.

Testing accuracy against c variation

Accuracy should be in an increasing order as the value of c is raised. Here, the average accuracy is calculated for different c values.

    c_values = np.array([10, 50, 250])

Then the calculated accuracy values of each c value is stored in an array. Following assertion should hold true in order to make sure that, higher the c value, higher the accuracy.

    # Sorted accuracies should be equal to original accuracies
    assert_array_equal(accuracies, np.sort(accuracies))

Testing accuracy against n_trees variation

This is as almost same as the above test. First, n_trees are stored in the ascending order.

    n_trees = np.array([1, 10, 100])

After calculating average accuracies, the assertion is performed.

    # Sorted accuracies should be equal to original accuracies
    assert_array_equal(accuracies, np.sort(accuracies))

What is left?

In addition to the above tests, precision should also be tested against c values and n_trees. But it has already been tested in the prototyping stage and those tests consume a reasonably large amount of time which makes them unable to be performed in the scikit-learn test suit. Therefore, a separate benchmark will be done on this topic.

About Nose

As provided in the guidelines in scikit-learn contributors' page, Nose testing framework has been used to automate the testing process.

July 24, 2014 12:00 AM

July 23, 2014

Continuum Analytics

Bokeh 0.5.1 Released!

We are excited to announce the release of version 0.5.1 of Bokeh, an interactive web plotting library for Python!

This release includes many bug fixes and improvements over our last recent 0.5 release:

  • Hover activated by default
  • Boxplot in bokeh.charts
  • Better messages when you forget to start the bokeh-server
  • Fixed some packaging bugs
  • Fixed NBviewer rendering
  • Fixed some Unicodeencodeerror

by Damián Avila at July 23, 2014 01:30 PM

July 22, 2014


Webinar: Work Better, Smarter, and Faster in Python with Enthought Training on Demand

  Join Us For a Webinar We’ll demonstrate how Enthought Training on Demand can help both new Python users and experienced Python developers be better, smarter, and faster at the scientific and analytic computing tasks that directly impact their daily productivity and drive results. Space is limited! Click a webinar session link below to reserve […]

by info at July 22, 2014 11:40 PM

Matthieu Brucher

ATKCompressor and ATKExpander Implementation with Audio ToolKit

I’d like to talk a little bit about the way a compressor and an expander can be written with the Audio Toolkit. Even if both effects use far less filters than the SD1 emulation, they still implement more than just one filter in the pipeline, contrary to a fixed delay line (another audio plugin that I released with the compressor and the expander).

Block diagram

So first the block diagram of a compressor (it is the same for an expander):

Compressor pipeline

Compressor pipeline

The pipeline is quite obvious: start by detecting the power of the signal (a simple square with an AR(1) and a memory of 1ms for these plugins), create from this power an amplitude function through the GainCompressionFilter, and pass it through an attack/release filter (modifying the perception of the signal’s amplitude). Then the amplitude gain is multiplied by the actual gain, and that’s it!


Let’s see how they behave. First this is the formula I used to compute the gain for a compressor:

Power to amplitude formula for a compressor

Power to amplitude formula for a compressor

The idea was to use a C1 function for the signal. First everything is relative to the threshold, so the power divided by the threshold is an absolute curve. Now that this is settled, the power is converted to dB (as all curves are usually in that domain) and then an approximation of the absolute() function is used, with its parameter, the softness. If the power in dB is added to this function and then divided by 0, the resulting function will be of value 0 up to a relative power of 0dB, which means a constant amplitude gain of 0. For values higher than 0dB, the gain can be reduced by a factor depending on the slope. The final amplitude gain can be achieved by converting back the result to amplitudes.

Here are a few curves for the compressor:

Compressor gain curves

Compressor gain curves

And the same can be done for the expander:
Expander gain curves

Expander gain curves

The gain filters work on power values, from threshold to slope, and only returns a gain. The script takes an amplitude as input, squares it to produce the power values, processes them and then applies the gain value to the original amplitude to make the traditional graph.


The profile for the basic compressor is quite obvious:

Compressor profile before optimization

Compressor profile before optimization

After tabulating the gain reduction values in a table (the table has to be properly set up for a compressor and an expander), the compressor can be blazingly fast:
Compressor profile after optimization

Compressor profile after optimization


Creating a compressor when you have all elements is easy. Now it is time to create a stereo compressor/expander from these elements!

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matthieu Brucher at July 22, 2014 07:34 AM

July 19, 2014

Titus Brown

Citing our software and algorithms - a trial

Our lab is part of the ongoing online conversation about how to properly credit software and algorithms; as is my inclination, we're Just Trying Stuff (TM) to see what works. Here's an update on our latest efforts!

A while back (with release 1.0 of khmer) we added a CITATION file to the khmer repository and distribution. This was in response to Robin Wilson's call for CITATION files, and dovetails with some of the efforts of the R and Debian communities.

In the short term, we don't expect many people to pay attention to these kinds of infrastructure efforts, and for much of our work we actually have publications on the algorithms involved. More to the point, our software isn't just software -- it's the instantiation of novel data structures and algorithms, or at least novel applications of data structures. The people who did the research are not necessarily the same people as the developers and maintainers of our software implementation, and we'd like to reward both appropriately with citations.

Additionally, for things like tenure and promotion and grants, often it is the case that only peer reviewed articles count. In this case, having citations accrue to those articles is a good idea!

So, rather than directly citing our tarballs or repositories (see F1000 Research and Mozilla Science Lab's efforts) we have modified our scripts to output the appropriate citation information. For example, if you run 'normalize-by-median.py', you get this output

|| This is the script 'normalize-by-median.py' in khmer.
|| You are running khmer version 1.1-9-g237d5ad
|| You are also using screed version 0.7
|| If you use this script in a publication, please cite EACH of the following:
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
|| Please see the CITATION file for details.

The first citation is the software description, and the second is the normalization algorithm.

Likewise, if you run 'load-graph.py', you will see:

|| If you use this script in a publication, please cite EACH of the following:
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * J Pell et al., PNAS, 2014 (PMID 22847406)
|| Please see the CITATION file for details.

which is our De Bruijn graph paper.

Interestingly, GNU Parallel also provides citation information:

When using programs that use GNU Parallel to process data for publication please cite:

 O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
 ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; and it won't cost you a cent.

which is pretty cool!

Note also that Michael Crusoe, who works with me on khmer (side note: find Michael some completely over-the-top title - "the khmer software maestro"?), has been working with the Galaxy folk to build citation infrastructure into the Galaxy Tool Shed. Mo' infrastructure mo' bettah.

What's next?

Now that we're starting to provide unambiguous and explicit citation information, it's up to individual actors to cite our software appropriately. That's something we can help with (by mentioning it in e.g. reviews) but I'm not sure how much more we can do in the khmer project specifically. (Suggestions welcome here!)

My biggest unanswered concern in this space is now something completely different: it's providing (and getting) credit for the CS research. For example, there are several implementations of the digital normalization idea -- in silico normalization (in Trinity) and also BBnorm (see here and here). Those are implementations of the underlying idea of normalization, and I (perhaps selfishly) think that in most cases people using the BBnorm or Trinity code should be citing our digital normalization preprint.

This concern emerges from the fact that good algorithm development is largely different from good software development. Many bioinformaticians provide basic proof of concept for an algorithm or a data structure, but do not invest much time and energy in software engineering and community engagement. While this is a great service -- we often do need new algorithms and data structures -- we also need good implementations of data structures and algorithms. Academia tends to reward the DS&A and not the implementation folk, but I think we need to do both, and need to do both separately. Shifting to a system where only the implementers get credit doesn't seem like a great improvement to me ;).

So my thought here is that any tool that uses a research algorithm or data structure developed by others should output citation information for that other work. This follows the advice given by Sarah Callaghan to "cite what you use".

A specific example we're planning: someone is porting some abandoned thesisware to khmer. The citation information will specify both khmer (for the software implementation) and the methods publication (already published) for basic validation information.

I'm not sure where to draw the line, though -- there are clearly cases where the data structures and algorithms have been developed further and our work no longer deserves to be cited in the software, and other cases where the DS&A work may be irrelevant. People using Minia, for example, should not need to cite Pell et al., 2012, because the Minia folk extended our work significantly in their paper and implementation. And, for many instances of k-mer counting, our low-memory k-mer counting work (soon to be published in Zhang et al., 2014) is not necessary -- so if they're using khmer because of its convenient Python API, but not depending in any particular way on low-memory k-mer counting, they probably shouldn't bother citing Zhang et al. Or maybe they should, because of accuracy concerns addressed in that paper?

I'd be interested in your thoughts and experiences here. (Note, I haven't sent anything to the Trinity or BBnorm folk because (a) I haven't thought things through, (b) that'd probably be too aggressive anyway, and (c) we should figure out what the community norms should be first...)


p.s. Thanks to Michael Crusoe and Varsha Khodiyar for reading a preview of this blog post and giving me feedback!

by C. Titus Brown at July 19, 2014 10:00 PM

Preprints and double publication - when is some exposure too much?

Recently I had some conversations with Science Magazine about preprints, and when they're counted as double publication (see: Ingelfinger Rule). Now, Science has an enlightened preprint policy:

...we do allow posting of research papers on not-for-profit preprint servers such as arxiv.org. Please contact the editors with questions regarding allowable postings.

but details are not provided. Now, on Facebook and elsewhere I've seen people be confused about whether (for example) posting a preprint and then discussing it counts as "distribution" -- here, Science muddies the waters by saying,

Distribution on the Internet may be considered prior publication and may compromise the originality of the paper as a submission to Science, although...

(followed by the previous quote).

So, spurred by some recent questions along this vein that a friend asked on Facebook, I followed up with Science. I asked the editors a broad set of questions about when preprints could be publicized on blogs, via Twitter, or on social media sites such as Facebook. Here is their response, which I think provides a valuable and specific set of guidelines for us.

Dear Dr. Brown,

thank you for your detailed questions. In our role as guardians of the veracity and impact of scientific literature, we have long been concerned about how to minimize pre-publication dissemination of scientific information, and we are happy to communicate our deliberations and decisions for you.

We do allow posting to preprint servers, as explicitly noted in our policy. This is to preserve scholarly communication while making sure that the public are not exposed to research conclusions unvalidated by peer review. However, as the line between journalism and Internet opining continues to blur, we are concerned that non-scientists may occasionally be exposed to incorrect research conclusions through discussion of these preprints on the World Wide Web and social media.

We have therefore developed the following guidelines:

First, any researcher (any member of the team) may discuss their preprint in a blog post, as long as their blog is not read by more than 1,000 unique visitors in any given 30-day period.

Second, researchers may convey these preprints to others via e-mail or Facebook. We have designated an upper limit of dissemination to 25 researchers (via e-mail) or 150 "friends" (via Facebook).

Third, a blog post and/or a link to the preprint may be publicized via Twitter or other social media Web sites, as long as it is not "reblogged" (via retweet) to more than 5,000 total individuals or acknowledged (e.g. "favorited") by more than 150.

We believe that these numbers provide a fair and equitable balance between protecting the scientific literature from undue dissemination of research results, and allowing scientists to discuss their work with friends. We hope you agree.

Please note that we have contracted with the National Security Agency as part of their new Freedom and Liberty Outreach Program to make sure these dissemination limits are observed. We continue to reserve the right to reject any paper with or without consideration for any reason, including when researchers violate these limits.


[ redacted ]

They've said they welcome feedback at flop-program@sciencemag.org, so please go ahead and send them your thoughts.


by C. Titus Brown at July 19, 2014 10:00 PM

July 18, 2014

Juan Nunez-Iglesias


I just got back home from the SciPy 2014 conference in Austin, TX. Here are my thoughts after this year’s conference.

About SciPy in general

Since my first SciPy in 2012, I’ve decided to prioritise programming conferences over scientific ones, because the value I get for my time is just that much higher. At a scientific conference, I certainly become aware of way-cool stuff going on in other research labs in my area. Once I get home, however, I go back to whatever I was doing. In contrast, at programming conferences, I become aware of new tools and practices that change the way I do science. In his keynote this year, Greg Wilson said of Software Carpentry, “We save researchers a day a week for the rest of their careers.” I feel the same way about SciPy in general.

In the 2012 sprints, I learned about GitHub Pull Requests and code review, the lingua franca of open source development today. I can’t express how useful that’s been. I also started my ongoing collaboration with the scikit-image project, which has enabled my research to reach far more users than I ever could have achieved on my own.

No scientific conference I’ve been to has had such an impact on my science output, nor can I imagine one doing so.

This year’s highlights

This year was no different. Without further ado, here are my top hits from this year’s conference:

  • Michael Droettboom talked about his continuous benchmarking project, Airspeed Velocity. It is hilariously named and incredibly useful. asv checks out code from your Git repo at regular intervals and runs benchmarks (which you define), and plots your code’s performance over time. It’s an incredible guard against feature creep slowing down your code base.
  • IPython recently unveiled their modal version 2.0 interface, sending vimmers worldwide rejoicing. But a few key bindings are just wrong from a vim perspective. Most egregiously, i, which should enter edit mode, interrupts the kernel when pressed twice! That’s just evil. Paul Ivanov goes all in with vim keybindings with his hilarious and life-changing IPython vimception talk. The title is more appropriate than I realised. Must-watch.
  • Damián Avila revealed (heh) his live IPython presentations with Reveal.js, forever changing how Python programmers present their work. I was actually aware of this before the conference and used it in my own talk, but you should definitely watch his talk and get the extension from his repo.
  • Min RK gave an excellent tutorial on IPython parallel (repo, videos 1, 2, 3). It’s remarkable what the IPython team have achieved thanks to their decoupling of the interactive shell and the computation “kernel”. (I still hate that word.)
  • Brian Granger and Jon Frederic gave an excellent tutorial on IPython interactive widgets (notebook here). They provide a simple and fast way to interactively explore your data. I’ve already started using these on my own problems.
  • Aaron Meurer gave the best introduction to the Python packaging problem that I’ve ever seen, and how Continuum’s conda project solves it. I think we still need an in-depth tutorial on how package distributors should use conda, but for users, conda is just awesome, and this talk explains why.
  • Matt Rocklin has a gift for crystal clear speaking, despite his incredible speed, and it was on full display in his (and Mark Wiebe’s) talk on Blaze, Continuum’s new array abstraction library. I’m not sure I’ll be using Blaze immediately but it’s certainly on my radar now!
  • Lightning talks are always fun: days 1, 2, 3. Watch out for Fernando Pérez’s announcement of Project Jupyter, the evolution of the IPython notebook, and for Damon McDougall’s riveting history of waffles. (You think I’m joking.)

Apologies if I’ve missed anyone: with three tracks, an added one with the World Cup matches ;) , and my own talk preparations, “overwhelming” does not begin to describe the conference! I will second Aaron Meurer’s assertion that there were no bad talks. Which brings us to…

On my to-watch

Jake Vanderplas recently wrote a series of blog posts (last one here, with links to earlier posts) comparing frequentist and Bayesian approaches to statistical analysis, in which he makes a compelling argument that we should all relegate frequentism to the dustbin of history. As such, I intend to go over Chris Fonnesbeck’s tutorial (2, 3) and talk about Bayesian analysis in Python using PyMC.

David Sanders also did a Julia tutorial (part 2) that was scheduled at the same time as my own scikit-image tutorial, but I’m interested to see how the Julia ecosystem is progressing, so that should be a good place to start. (Although I’m still bitter that they went with 1-based indexing!)

The reproducible science tutorial (part 2) generated quite a bit of buzz so I would be interested to go over that one as well.

For those interested in computing education or in geoscience, the conference had dedicated tracks for each of those, so you are bound to find something you like (not least, Lorena Barba’s and Greg Wilson’s keynotes). Have a look at the full listing of videos here. These might be easier to navigate by looking at the conference schedule.

The SciPy experience

I want to close this post with a few reflections on the conference itself.

SciPy is broken up into three “stages”: two days of tutorials, three days of talks, and two days of sprints. Above, I covered the tutorials and talks, but to me, the sprints are what truly distinguish SciPy. The spirit of collaboration is unparalleled, and an astonishing amount of value is generated in those two days, either in the shape of code, or in introducing newcomers to new projects and new ways to collaborate in programming.

My biggest regret of the conference was not giving a lightning talk urging people to come to the sprints. I repeatedly asked people whether they were coming to the sprints, and almost invariably the answer was that they didn’t feel they were good enough to contribute. To reiterate my previous statements: (1) when I attended my first sprint in 2012, I had never done a pull request; (2) sprints are an excellent way to introduce newcomers to projects and to the pull request development model. All the buzz around the sprints was how welcoming all of the teams were, but I think there is a massive number of missed opportunities because this is not made obvious to attendees before the sprints.

Lastly, a few notes on diversity. During the conference, April Wright, a student in evolutionary biology at UT Austin, wrote a heartbreaking blog post about how excluded she felt from a conference where only 15% of attendees were women. That particular incident was joyfully resolved, with plenty of SciPyers reaching out to April and inviting her along to sprints and other events. But it highlighted just how poorly we are doing in terms of diversity. Andy Terrel, one of the conference organisers, pointed out that 15% is much better than 2012’s three (women, not percent!), but (a) that is still extremely low, and (b) I was horrified to read this because I was there in 2012… And I did not notice that anything was wrong. How can it be, in 2012, that it can seem normal to be at a professional conference and have effectively zero women around? It doesn’t matter what one says about the background percentage of women in our industry and so on… Maybe SciPy is doing all it can about diversity. (Though I doubt it.) The point is that a scene like that should feel like one of those deserted cityscapes in post-apocalyptic movies. As long as it doesn’t, as long as SciPy feels normal, we will continue to have diversity problems. I hope my fellow SciPyers look at these numbers, feel appalled as I have, and try to improve.

… And on cue, while I was writing this post, Andy Terrel wrote a great post of his own about this very topic:


I still consider SciPy a fantastic conference. Jonathan Eisen (@phylogenomics), whom I admire, would undoubtedly boycott it because of the problems outlined above, but I am heartened that the organising committee is taking this as a serious problem and trying hard fix it. I hope next time it is even better.

by Juan Nunez-Iglesias at July 18, 2014 01:48 AM

July 17, 2014

Titus Brown

Charting the future of data science at the NIH -- topics to discuss?

In September, I will be visiting the NIH to "chart the next 5 years of data science at the NIH." This meeting will use an open space approach, and we were asked to provide some suggested topics. Here are five topics that I suggested, and one that Jeramia Ory suggested (the last one) in response to my posting these on Twitter.

Additions? Thoughts? Suggestions?

  1. Education and training in data science: from undergrad through independence
  2. Fostering the careers of data scientists in biomedical research
  3. Building sustainable cyberinfrastructure to support data intensive research in biomedical sciences
  4. Supporting and incentivizing a transition to more open science
  5. 5 years out: now that we have all the data, what do we do next?
  6. How do we support/train curators or a culture of curation? (Good curation is a bottleneck to insight.)


by C. Titus Brown at July 17, 2014 10:00 PM

Andy Terrel

Diversity at SciPy2014

SciPy2014 logo

SciPy2014 is over, but there is so much more to do. I expect to be reviving my blog with a set of prose on my experience. Today I want to talk about a topic that was dear to my heart when I started this blog, diversity in the scientific python ecosystem. I'm focusing on gender diversity, but we also need to be aware of racial and other diversities, as well. I hope helping one will lead to helping others.

For the record, I am writing this post from my personal perspective. It is not representative of any company or community that has graciously chosen me as a representative.

SciPy, we have a diversity problem and it's not anyone else's job to fix it. We've been trying to fix this problem, but this has uncovered other problems with our methodologies. It is my sincere hope that this amazingly brave post by April Wright and follow up will help shape the future of our community. Thank you, April, your courageous anecdote has been hugely impactful ... if the community listens.

Working on diversity at SciPy

First let me outline some of the efforts that have been taken in the past three years that I have been involved with organizing the SciPy Conference. Even getting these seemingly trivial changes has been a challenge I never expected. I've had community members tell me I'm wasting my time, others tell me not enough is being done, and the larger part of the community a bit uninterested in the topic. Folks, SciPy is a community conference, it is on us to work hard to make it the community we want and I don't want to be in an exclusive white male club any longer.

First, in 2012, Matt Davis noted via twitter that more men had walked on the moon than women attending SciPy. How dare he say such slander! ... wait let's count how many women are here ... I only see 3 ... très ... ¡Qué hostias!

The first thought on how to change was to find more women to be on our organizing committee and keynotes. Becoming chair in 2013, I asked every woman I knew in the community to help out. We increased to 8 out of 58 in our organizers. Jonathan and I asked a half dozen women to keynote, but were roundly rejected. Not only do we have a problem with diverse groups coming to the conference, they have to be heavily persuaded to be a part of the organization at all. I could send one or two emails to community men I knew and get a avid contributor. This is the hill we must climb.

In 2013, we also had a social funded by NumFOCUS and the PSF. The result was about 20 women of a variety of races coming together to talk about the issues. The message the organizers took home was cast a wider net. There is amazing quality from women in scientific computing, but they will not be looking at conferences that are primarily white men. We are also a very small growing conference (150 in 2012 to 460 in 2014) so people outside our clique aren't going to come for prestige. Also the overwhelming opinion was, a social is nice, but less alcohol more food.

This year, 2014, we did similar things as last. Increase women on the organizing committee (19 of 84), added a diversity committee, asked my uber-prof friend to give one of the best SciPy keynotes ever and host an event for diversity. We also worked with NumFOCUS to fund scholarships explicitly for women in technology through a generous donation from JP Morgan. This funding helped 5 more women attend the conference, in addition to the 15 other students (which included 3 women) supported to attend. We estimated 15% of attendees were women, not great but better than that original 1.5% in 2012.

The conference also had several discussions on the topic during the conference.

Additionally, I opened up the mailing list for the conference organization with the hope of encouraging more participation from the community. The efforts to take the conference as open as possible is one other effort to build the community we want. Being open isn’t sufficient to be diverse, it’s a necessary but not even close to sufficient condition.

Needed Changes

First, April hit the nail on the head. Who cares if you do all the things above if your conference is not welcoming to everyone. Below, I outline a few specific actions the SciPy2015 community can take.

  • Welcome to SciPy track in all event categories
  • Better social events to get to know each other
  • Cast wider nets via smaller user groups
  • Teach the community how to run events better
  • Have a welcome committee to help people know who we are.

I've discussed this with many many community members and SciPy2015 will have a "Welcome to SciPy" arc to help folks know who we are, what tools we use, and why. The comment that it is a great conference if you already know all the things to succeed with SciPy has been repeated often.

Additionally, we have discussed quite a bit about having social events that are not alcohol or food related. There have always been smaller events not organized by the conference but we need to encourage this activity more broadly. This year if you watched twitter, you would have found folks swimming, running, and climbing, but having this in a program would be a big help.

As Katy pointed out we need to advertise our activities more. Some folks are chatting about starting meetups around the SciPy community. A small way to make sure folks are seeing local faces more often.

Sheila Miguez pointed out the incredible in-person event handbook from Shauna G. of Open Hatch. I think taking up the principles in this handbook is really needed. We have not made welcoming, goal setting, and clarifying structures a priority at events.

In the past, I have held events for other orgs where we made sure that a small set of folks were explicitly tasked with shaking hands and talking to people who were alone. It seems foolish to some, but making folks break from the clique and talk to others only helps build the community.

We Need You

Finally, SciPy, we need you. Today all these events happen because of a few concerned volunteers. We need more people helping with our events, advocating for the minority, and building the community.

Community building doesn't fit nicely on a resume nor does it always feel good, but it is critical to everyone. You will be depressed after you read a blog showing how far our community has to go. You will have colleagues tell you that you are wasting your time. You will miss deadlines that affect your funding or day job. Without community we will not sustain.

When I sat down with folks to encourage them to join our executive committee, I typically made the argument that if we don't build the community we want, no one else will. There are a lot of challenges we need to face, e.g. promoting code as a real scholarly work and promoting an open sharing culture among coders in scientific computing. We cannot do this without a diverse functional community.

Please go sign up for the mailing list and start making this a better place. http://mail.scipy.org/mailman/listinfo/scipy-organizers


Thank you to: April Wright, Kelsey Jordahl, Anthony Scopatz, Matthew Turk, Dan Katz, Sheila Miguez, and Kyle Mandli for reading an early version and edits to this document.

by Andy R. Terrel at July 17, 2014 07:00 AM

July 16, 2014

Luis Pedro Coelho


Tonight, I’m doing a Python for Image Analysis tutorial in Heidelberg. See the meetup page for information. Materials are at: https://github.com/luispedro/python-image-tutorial


by luispedro at July 16, 2014 12:13 PM

July 15, 2014

Gaël Varoquaux

Scikit-learn 0.15 release: lots of speedups!

We have just released the 0.15 version of scikit-learn. Hurray!! Thanks to all involved.

A long development stretch

It’s been a while since the last release of scikit-learn. So a lot has happened. Exactly 2611 commits according my count.

Quite clearly, we have more and more existing code, more and more features to support. This means that when we modify an algorithm, for instance to make it faster, something else might break due to numerical instability, or exploring some obscure option. The good news is that we have tight continuous integration, mostly thanks to travis (but Windows continuous integration is on its way), and we keep growing our test suite. Thus while it is getting harder and harder to change something in scikit-learn, scikit-learn is also becoming more and more robust.


Quality— Looking at the commit log, there has been a huge amount of work to fix minor annoying issues.

Speed— There has been a huge effort put in making many parts of scikit-learn faster. Little details all over the codebase. We do hope that you’ll find that your applications run faster. For instance, we find that the worst case speed of Ward clustering is 1.5 times faster in 0.15 than 0.14. K-means clustering is often 1.1 times faster. KNN, when used in brute-force mode, got faster by a factor of 2 or 3.

Random Forest and various tree methods— The random forest and various tree methods are much much faster, use parallel computing much better, and use less memory. For instance, the picture on the right shows the scikit-learn random forest running in parallel on a fat Amazon node, and nicely using all the CPUs with little RAM usage.

Hierarchical aglomerative clusteringComplete linkage and average linkage clustering have been added. The benefit of these approach compared to the existing Ward clustering is that they can take an arbitrary distance matrix.

Robust linear models— Scikit-learn now includes RANSAC for robust linear regression.

HMM are deprecated— We have been discussing for a long time removing HMMs, that do not fit in the focus of scikit-learn on predictive modeling. We have created a separate hmmlearn repository for the HMM code. It is looking for maintainers.

And much more— plenty of “minor things”, such as better support for sparse data, better support for multi-label data…

by gael at July 15, 2014 01:25 PM

Matthieu Brucher

Announcement: ATKCompressor 1.0.1, ATKExpander 1.0.1

I’m happy to announce the release of three new audio plugins based on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats.





The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

The files can be downloaded on SourceForge, as well as the source code.

Direct link for ATKCompressor
Direct link for ATKExpander

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matthieu Brucher at July 15, 2014 07:46 AM

July 14, 2014

Titus Brown

How we make our papers replicable, v0.8 (beta)

  1. Create a github repository named something like '2014-paper-xxxx'. Ask me for name suggestions.

    In that github repo, do the following:

  2. Write a Makefile or some other automated way of generating all results from data - see


    or ask Camille (@camille_codon) what she is using. The goal here is to go from raw data to directly interpretable data in as automated a fashion as possible.

    If queue submission or other scripts are involved, they should run the commands in the Makefile. That is, all commands should (as much as possible) be in one place, and how they are executed is a bit less important.

  3. Write a README explaining the above. For example, see:


    this README should describe versions of software and (as much as possible) give instructions that will work on Ubuntu 14.04. It's OK if not all the commands will run on any given amazon machine (e.g. because of memory limitations); what's important is to have them all specified for a specific OS version, not whatever bastardized version of Linux (e.g.) our HPC uses.

  4. Write an IPython Notebook that generates all the figures - see


    for an example. This should take all of the data from the pipeline makefile (#1, above) and turn it into the production ready figures that go into the paper.

  5. Write the paper in LaTeX, if at all possible, and write a Makefile to generate the final output file. This is what we will submit to arXiv, submit to review, etc.



    for an example here.

Specific examples:

Pell et al., PNAS, 2012: https://github.com/ged-lab/2012-paper-kmer-percolation

Brown et al., arXiv, 2012: https://github.com/ged-lab/2012-paper-diginorm

Zhang et al., PLoS One, 2014: https://github.com/ged-lab/2013-khmer-counting

Feel free to deviate from the above but expect questions. Tough questions.


by C. Titus Brown at July 14, 2014 10:00 PM

Richard Tsai

Rewrite scipy.cluster.hierarchy

After rewriting cluster.vq, I am rewriting the underlying implementation of cluster.hierarchy in Cython now.

Some Concerns about Cython

The reason why I use Cython to rewrite these modules is that Cython code is more maintainable, especially when using NumPy’s ndarray. Cython provides some easy-to-use and efficient mechanisms to access Python buffer memory, such as Typed Memoryview. However, if you need to iterate through the whole array, it would is a bit slow. It is because Cython will translate A[i, j] into something like *(A.data + i * A.strides[0] + j * A.strides[1]), i.e. it needs to calculate the offset in the array data buffer on every array access. Consider the following C code.

int i, j;
double *current_row;

/* method 1 */
s = 0;
current_row = (double *)A.data;
for(i = 0; i < A.shape[0]; ++i) {
    for(j = 0; j < A.shape[1]; ++j)
        s += current_row[j];
    current_row += A.shape[1];

/* method 2 */
s = 0;
for(i = 0; i < A.shape[0]; ++i)
    for(j = 0; j < A.shape[1]; ++j)
        s += *(A.data + i * A.shape[1] + j);

The original C implementation uses method 1 shown above, which is much more efficient than method 2, which is similiar to the C code that Cython generates for memoryview accesses. Of course method 1 can be adopted in Cython but the neatness and maintainablity of Cython code will reduce. In fact, that is just how I implemented _vq last month. But _vq has only two public functions while _hierarchy has 14, and the algorithms in _hierarchy are more complicated that those in _vq. It would be unwise to just translate all the C code into Cython with loads of pointer operations. Fortunately, the time complexities of most functions in _hierarchy are \(O(n)\). I think the performance loss of these functions is not a big problem and they can just use memoryview to keep maintainablity.

The New Implementation

The following table is a speed comparision of the original C implementation and the new Cython implementation. With the use of memoryview, most functions have about 30% performance loss. I used some optimization strategies for some functions and they run faster than the original version. The most important function, linkage, has a 2.5x speedup on a dataset with 2000 points.


The new implementation has yet to be finished. All tests pass but there may be still some bugs now. And it lacks documentations now.

by Richard at July 14, 2014 09:19 AM

July 13, 2014


Hi all,

On the behalf of Spyder's development team, I'm pleased to announce that Spyder 2.3 has been released and is available for Windows XP/Vista/7/8, GNU/Linux and MacOS X here: https://bitbucket.org/spyder-ide/spyderlib/downloads

This release represents 14 months of development since version 2.2 and introduces major enhancements and new features:

  • Python 3 support (versions 3.2, 3.3 and 3.4 are supported).
  • Drop Python 2.5 support.
  • Various Editor improvements:
    • Use the Tab key to do code completions
    • Highlight cells
    • First-class support for Enaml files
    • Improve how calltips are shown
    • Better looking Object Inspector
Spyder 2.2 has been a huge success (being downloaded more than 400,000 times!) and we hope 2.3 will be as successful as it. For that we fixed 70 important bugs, merged 30 pull requests from 11 authors and added almost 1000 commits between these two releases.

- Carlos

by Carlos (noreply@blogger.com) at July 13, 2014 10:18 PM

Issam Laradji

(GSoC 2014) Progress report for 07/13/14

I completed the requirements for GSoC 2014, except for the documentation which I am leaving for the remaining duration of the program. Since the mid-term evaluation I implemented the following,

1) Regularized and Weighted Extreme Learning Machines (ELMs);

2) Sequental Extreme Learning Machines;

3) Kernels for ELMs; and

4) Relevant test cases and examples.

I will explain the implementations in more detail below.

1) Regularized and Weighted ELMs

Assuming H is the hidden activations, $\beta$ is the hidden layer outgoing weights, and y is the target vector; regularized ELMs solves the following equation,

$\beta = (HH' + I/C)'Hy$
where I is the identity matrix.

The regularization term C determines the decision boundary degree of linearity. Figure 1 shows how regularization - or reducing C - leads to a linear function.

(Figure 1: non-regularized (left side) vs. regularized (right side) decision boundary)

Weighted ELMs is different from regularized ELMs, in that a diagonal weight matrix $W$ is added to the equation, yielding the following,

$\beta = (HWH' + I/C)'HWy$

Index $(i, i)$ in $W$ corresponds to how much weight is given to sample $i$, depending on the sample's class. This scheme is used to address the problem with imbalanced datasets; where a class is underrepresented by having few samples compared to other classes and therefore ignored by classifiers. Such minority classes are given higher weights such that the decision boundary is pushed away from them. Figure 2 shows the difference between applying weighting schemes for the minority class, the orange samples.

                   (Figure 2: no weights (left side); 0.618/(#samples) weight given to each class                                (middle side); 1000 weight cost given to the orange class (right side))

2) Sequential ELMs

Dealing with million sample datasets is problematic when they have to be in memory all at once for training. Sequential ELMs mitigates this limitation by breaking the dataset into batches and trains on them by per-batch basis using a recursive equation which is but a subtle representation of ELM's original equation. Unfortunately the implementation does not support weights yet.

3) Kernels for ELMs
The standard initialization of ELM input weights is the result of a random kernel. However, other kernels, which are best known for training SVMs, can be used to get new hidden activations - like Radial Basis Function, Linear kernel, and the Polynomial kernel.

For the remaining time of GSoC 2014, I will complete the ELMs documentation and add any necessary changes for the completed work.

by noreply@blogger.com (Issam Laradji) at July 13, 2014 09:07 PM

July 12, 2014

Hamzeh Alsalhi

Sparse Target Data for One vs. Rest Classifiers

Going forward with sparse support for various classifiers I have been working on a pull request for sparse one vs. rest classifiers that will allow for sparse target data formats. This will results in a significant improvement in memory usage when working with large amount of sparse target data, a  benchmark is given bellow to measure the. Ultimately what this means for users is that using the same amount of system memory it will be possible to train and predict with a ovr classifier on a larger target data set. A big thank you to both Arnaud and Joel for the close inspection of my code so far and the suggestions for improving it!


The One vs. Rest classier works by binarizing the target data and fitting an individual classifier for each class. The implementation of sparse target data support improves memory usage because  it uses a sparse binarizer to give a binary target data matrix that is highly space efficient.

By avoiding a dense binarized matrix we can slice the one column at a time required for a classifier and densify only when necessary. At no point will the entire dense matrix be present in memory. The benchmark that follows illustrates this.


A significant part of the work on this pull request has involved devising benchmarks to validate intuition about the improvements provided, Arnaud has contributed the benchmark that is presented here to showcase the memory improvements.

By using the module memory_profiler we can see how the fit and predict functions of the ovr classifier affect the memory consumption. In the following examples we initialize a classifier and fit it to the train dataset we provide in one step, then we predict on a test dataset. We first run a control benchmark which shows the state of one vs. rest classifiers as they are without this pull request. The second benchmark repeats the same steps but instead of using dense target data it passes the target data to the fit function in a sparse format.

The dataset used is generated with scikit-learns make multilabel classification, and is generated with the following call: 
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(sparse=True, return_indicator=True,
n_samples=20000, n_features=100,
n_classes=4000, n_labels=4,

This results in a densely formatted target dataset with a sparsity of about 0.001

Control Benchmark

est = OneVsRestClassifier(MultinomialNB(alpha=1)).fit(X, y)
consumes 179.824 MB
consumes -73.969 MB. The negative value indicates that data has been deleted from memory.

Sparse OvR PR Benchamrk

est = OneVsRestClassifier(MultinomialNB(alpha=1)).fit(X, y)
consumes 27.426 MB 

consumes 0.180 MB 


Considering the memory consumption for each case as 180 MB and 30 MB we see a 6x improvement in peak memory consumption with the data set we benchmarked.

Upcoming Pull Request

The next focus in my summer of code after finalizing the sparse one vs. rest classifier will be to introduce sparse target data support for the knn and dummy classifiers which have built in support for multiclass target data. I have begun the knn pull request here. Implementing a row wise mode calculation for sparse matrices will be the main challenge of the knn PR.

by noreply@blogger.com (Hamzeh) at July 12, 2014 01:53 AM

July 10, 2014

Titus Brown

Talk notes for the Bioinformatics Open Source Conference (2014)

These are the talk notes for my opening talk at the 2014 Bioinformatics Open Source Conference.

Normally my talk notes aren't quite so extensive, but for some reason I thought it would be a good idea to give an "interesting" talk, so my talk title was "A History of Bioinformatics (in 2039)". My thought was that this could be fun way to pretend to "look back" on bioinformatics and make some predictions about what would go wrong (and right). In hindsight, this was stupidly ambitious and likely to fail miserably, but I gave it my best shot and put together a bunch of notes over a period of several months. And... here you are!

You can also see my presentation.

Datapocalypse --

Lots of data

Field optimized for data gathering and hyp driven investigation, not
data exploration.

Computational infrastructure built to service HPC, not HTC.

Queries across data sets needed.

Reproducibility crisis --

Computationally speaking, surprisingly easy to resolve, but willpower and incentives not there.

This started to change in late teens, as more and more papers proved to be irreproducible; broader than computing, of course, but computing led the way.

Ultimately required a shift in how recognition was awarded: we no longer recognize the first to report something until after that report has been reproduced.

Reviewers frown on building off of unverified work.

Grants often start by promising to validate previous work.

Computing in biology --

simply put, there weren't enough people being trained in the teens. Graduate students would arrive at their research labs and be unable to work with existing data or new data they generated. Easy to use tools were created, but it turned out that these tools embodied so many assumptions that they ultimately had to be abandoned, except in well-understood clinical scenarios. As you know, now we spend a lot more time training beginning researchers in basic programming, data analysis, modeling, and building computational protocols.

However, there was a sad period in the late 'teens and early 20s where what we called bioinformatics sweatshops predominated. These sweatshops consisted of low-paid students who ended up with no career path and no significant authorship. In practice anyone with ambition left with a terminal degree and moved to California where they could have a career, leaving only people who could run programs they didn't understand.

(expand) Huge amounts of bio; Somewhat fewer tools; Competence needed to apply tools appropriately; Training gap; completely underappreciated set of expertise.

This led to a several year a "correctness crisis" - at the same time as biology was becoming more and more computational, fewer and fewer reviewers were left to review more and more papers written by poorly trained computational scientists

In recognition of the cultural and career problems, a union of bioinformaticians formed with a few basic principles.

1) All of the data and source code has to be provided for any computational part of a biology paper.

  1. Methods and full methods sections need to included in review.

3) If an unpublished method was used in analysis of new data, a separate and thorough description and analysis must be made available at the time of peer review.

Any papers that did not meet these rules were simply rejected without review. It took several years for these rules to spread, and it will not surprise you that the MS Elsevier, back then known as Elsevier, was the last to buckle. However, we, the bioinformatics union, stood our ground and pointed out that as long as we were doing the reviewing we could set review guidelines in our area of expertise, as the statisticians had done.

It's worth pointing out that in this regard an interesting healthy attitude was created where public peer reviews of papers were examined for poor peer review, and then badly peer reviewed papers were nominated for online workshops where they were re-reviewed and subjected to replication. There were several journals at the time that were devoted to publishing purely computational replications, and I know that many of the more senior scientists in the crowd owe your career in some part to those efforts.

This culture of replication has stood the field in good stead: in addition to considerably raising the quality of peer review, a community of practice has emerged around the concept of replication, and new students are immediately educated in how to write replicable papers.

Finally, this tipping point was exacerbated by the loss of about 80% of the worlds data scientists in the 2021 Great California Disruption. We may never know the details of what happened, given the zonal interdiction and lingering contamination, but I note the irony that so much effort was made to replicate data globally, while so many of the worlds' data scientists remained physically clustered within the former Bay Area.

All of this led to a renaissance in biology, briefly interrupted by the economic crunch of the mid-2020s and the First International Police Action. I would like to highlight four underpinnings of this renaissance: first, the biomedical enterprise rediscovering non-translational research; second, the rise and ultimate triumph of open science; third, the belated recognition that networked science was the future; and fourth, and perhaps most important, a massive investment in people and training.


The rediscovery of basic biology --

sequencing enabled!

ecology, evolution, population genetics, and microbial interactions

vetmed and ag

significant investment in a variety of new model organisms, especially vet

Triumph of open science -- by 2018, bioinformatics and genomics had already figured this out, but not experimental biology. This led to what I think of as the Curious story of exp biology. It took about 10 years, but, the biotech realization that most pure research could not be replicated and most research money was simply burned on experiments that never made it out of the lab, combined with a funding crunch and a wave of retirements, meant that many of the more hyp driven labs simply couldn't compete. Experimentalists had to reinvent themselves in two ways: first, guide their experiments with significant up front data analysis (see: human resources problem); and second, make all their data maximally available. By a mere decade ago, in 2030, this had renovated the field of experimental biology by introducing a wide range of data-driven modeling and model-driven data analysis approaches, which continue to further drive experimental research. In a sense, our hypotheses simply became better, more aware of what was out there.

These transitions were enabled by the collapse of the majority of universities, and the accompanying retirement of the vast majority of full professors. While this also led to a massive and unfortunate brain drain, it enabled the third element of the renaissance: a transition to networked science. No longer was science held back by technologically challenged full professors; instead, a huge variety of collaboration and data sharing tools, as well as approaches to distributed team science, emerged.

My favorite was the walled-garden research collaboration, which is now obsolete but at the time was quite radical. This emerged from pioneering work done by Sage Bionetworks in the early 2010s, where a group of scientists agreed to make data available within the group and published on it simultaneously. This coopetitive model, taken originally from the ocean cruise community and applied to biomedical work, was further enriched in a move inspired by the early Ft Lauderdale data sharing agreements: anyone could have access to the data as long as they agreed to publish after the rest of the group. Nowadays, of course, all data is simply published as soon as it's generated, but at the time this was quite radical and helped drive data sharing.

This is not to say that there have not been major disappointments.

As many of you know, we still have no idea what more than 50% of the gene families discovered in microbes actually do, although biogeochemistry and synthetic biology have characterized the majority of habitat-specific genes.

Career paths still uncertain.

And we now have a problem with glam data sets, that mirrors what some of you may remember as problems with glam mags.

Cancer is at least cured, but: Some of the more complex diseases, esp neurodegenerative, are uncured; there seems to be a complex mixture of genetic redundancy and phenotypic plasticity underlying anything to do with the brain.

Essentially, complex phenotypes are still hard to decipher because of their Rube Goldberg-esque nature & our inability to constrain them with our data gathering skills. This is where more experiments and better model systems will surely help.

Let me now turn to the one of the reasons the organizers invited me. As you no doubt saw, President Rao has announced a new initiative called BRAIN2050. This initiative comes just about 25 years after the first BRAIN initiative, by president Obama, and it is an ambitious series of funding proposals to understand the brain by 2050. The general focus is on neurodegen diseases, brain regeneration, and understanding the development of human intelligence. I have been invited to sit on the advisory board, and I have some advice to offer this nascent field.

This first one will have you all groaning, but as you know from some of the high profile journal rejections recently, correlation does not imply causation! The original BRAIN effort focused on recording activation patterns in as many neurons as possible, but it turns out that the causal linkages between neurons were simply too complicated to decipher via observation. Fine-tuned transcranial magnetic stimulation promised perturbation mechanisms, but as some of you may remember that research had to be discontinued for ethical reasons. So we're still left with deciphering really large data sets. Here, it seems that modeling may finally offer a solution in terms of ever more refined hypothesis generation, but of course this requires significant effort at an experimental level as well as a computational level and it's simply a very hard problem.

This brings me to modeling. Top-down and bottom-up modeling have both proven to be very successful in some circumstances in the brain research community, and I think my main advice would be to continue straight on! If there's one regret that I have in the more organismal/biomedical community it's that modeling has not been pursued thoroughly over the last quarter century.

I also have some advice concerning reproducibility. While it is obvious that results that cannot be independently replicated are not science, we have to allow some leeway in fast moving fields. I feel like we have become to rigid in our review, and it's reduced our ability to follow up on experimentally time consuming research; the requirement that every experiment and computation be replicated completely independently simply takes too long. In light of all the horrible reproducibility problems of the late 'teens, this is understandable, but I would like to suggest that we adapt the now-venerable concept of technical debt: as long as labs can collaborate to replicate research done in one lab in another, we should allow this for exploratory research. However, this should be clearly marked as a situation where completely independent replication has not yet been done. The concept of technical debt here is simply that we can accrue "replication debt"; this debt must be paid off at some point by articulating the protocols and techniques in finer detail, but for the basic publication we can perhaps allow less rigid replication requirements.

As a field that is about to receive a major infrastructure investment, I would strongly suggest investing up front in a range of experimental collaboration approaches. While existing tools do a great job of supporting sharing of data and code, this is still dependent on personal relationships to drive collaboration. There has been little push to discover processes that actually drive collaboration.

25 years ago, Phil Bourne had the idea of automatically connecting researchers based on automated data analysis "connecting the dots" between researcher projects, to find those unexpected linkages that might prove to drive more collaborations. I think that's a really good idea, and one that has been surprisingly ignored. We now have the informatics, the infrastructure, and the people - why not make use of them?

Another suggestion is to explicitly "tool up" early on. As data gathering begins, recruit theoreticians to start building models, and recruit experimentalists to begin developing model systems. Keep building and renewing a core set of common data sets, all around the same well-studied science problem; this will provide a common core for evaluation and comparison of tools. Contests have proven to be stifling of innovation, because of the tendency to do incremental improvement; but if you can build a culture of tool evaluation and comparison early on by funding it and discussing it regularly, you can reap the rewards later on.

In molecular biology, we've also built an excellent pyramid diagram of software, with a set of commercial data analysis interfaces based around well-understood algorithms and test data sets.

The neuroscience field should also follow the same strategy as the more basic biomedical sciences, and invest heavily at the high school and pre-master's level. Just to remind everyone, there are two reasons for this -- one is that yesterday's high school students are tomorrow's researchers. The other is that investment in science teaching turns out to drive public support for science; 15 years ago, we invested heavily in teaching mechanisms of evolution at the high school level, and we now have an unprecedented 80% acceptance rate of evolution among the 30- and-under set. It will take some time for this to spread through the population, but we can already see an increase in popular support for biological sciences, generally.

Ultimately we need a rich interaction between theory, experiment, and data analysis. In the heyday of genomic sequencing, we suffered from too many physicists, who wanted to build reductionist models in the absence of data. Along with genomics, neuroscience has evicted many of these physicists over the last quarter century in favor of home-grown modelers who are deeply embedded in neuroscience. But we also need good interactions between experiments, theory, and data analysis. [ ...... ]

At the end of the day, it's clear that biology more generally and neuroscience specifically share the feature of being "networks" - molecular interactions, tissue interactions, species interactions, and neuronal interactions. Just as the single-molecule/whole-chromosome sequences from the late 20s didn't magically provide understanding, neither will high resolution neuronal sampling. Reductionist approaches are necessary but cannot tell us all that we need to know.


by C. Titus Brown at July 10, 2014 10:00 PM

July 09, 2014

Continuum Analytics

Bokeh 0.5 Released!

We are excited to announce the release of version 0.5 of Bokeh, an interactive web plotting library for Python!

This release includes some major new features: support for Widgets and Dashboards, preliminary integration of Abstract Rendering for big data visualization, and a new schematic bokeh.charts interface for very high level statistical charting. Additionally, tighter Pandas integration makes using Bokeh with the tools we all love even easier, and a new, much simpler bokeh.embed module makes including Bokeh content in your documents a snap. Over forty bugfixes and several new smaller features are also included: log axes, minor ticks, a new plot frame, gears and gauges glyphs, an npm package for BokehJS, and many usability improvements.

by Bryan Van de Ven at July 09, 2014 11:30 AM

July 08, 2014

Matthieu Brucher

Announcement: Audio TK 0.4.0

A lot has happened in two weeks for Audio ToolKit. This release mainly adds tools for Compressor and Delays design. There will be additional plugins release soon.

Download link: ATK 0.4.0


* Added a white noise generator filter
* Fixed the delay line filters
* Fixed second order EQ Python wrappers

* Added a fixed delay line filter (FIR filter) with Python wrappers
* Added a universal fixed delay line filter (FIR, IIR and all-pass filter + combinations) with Python wrappers
* Added variable delay filters (delay is given on an additional port)

* Adding compressor elements and the Python wrappers
* Fixed Python wrappers by adding default number of port/channels
* Added restrict pointer support in all filter that can benefit from this

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matthieu Brucher at July 08, 2014 07:13 AM

Sam Stephens

Starting to Learn Django
Currently working on learning django. It does not seem to tricky but like with any package the devil is in the details. I will let you know how it goes.

by Sam's Place (noreply@blogger.com) at July 08, 2014 07:08 AM

July 07, 2014

Maheshakya Wijewardena

Optimizations on Locality sensitive hashing forest data structure

In order to get LSH forest approximate nearest neighbor search method into a competitive and useful state, optimization was a necessity. These optimizations were based on the portions of the LSH forest algorithm and implementation of that algorithm. There are two general cases of bottlenecks where requirement for optimization arises.

  1. Memory usage of the data structure.
  2. Speed of queries.

Let's discuss these two cases separately in detail. Remember that there will always be a trade off between memory and speed.

Optimizations on memory usage

In the previous post about the implementation of the LSH forest, I have explained how the basic data structure and the functions. Some of the optimizations caused obvious changes in those functions. First of all, I need to mention that this data structure stores the entire data set fitted because it will be required to calculate the actual distances of the candidates. So the memory consumption is usually predominated by the size of the data set. LSH Forest data structure does not have control over this. Only thing that can be controlled from the data structure is the set of hashes of the data points.

Current data structure maintains a fixed length hash(but user can define that length at the time of initialization). Hashes are comprised of binary values (1s and 0s). At the moment, these hashes are stored as binary strings.

One of the optimizations which can be done to maintain a more compact index for hashes is using the equivalent integer of the binary strings. For an example, string 001110 can be represented as 14. For very large data sets with very high dimensions, this technique will hold no effect as the improvement on the memory usage of hash indices is insignificant when compared to the memory required to store the data set.

As I have mentioned earlier, there is always a cost for every optimization. Here we have to trade off speed for improved memory consumption. The LSH algorithm is applied to a single data point up to the number hash length required. In the default case, there will be 32 hashes per a single data point in a single tree and those hashed values have to be concatenated in order to create \(g(p)\)(refer to the LSH Forest paper). Because the hashing is done separately for individual elements of \(g(p)\), they are stored in an array in the first place. In order to create a single hash from these values in the array, they are first concatenated as a string. This is the point where memory speed dilemma occurs. These string hashes can be converted into integer hashes for compact indices but it will cost extra time.

In the next section, I will explain how this extra overhead can be minimized using a trick, but the fact that "you cannot eliminate the memory speed trade-off completely" holds.

Optimizations on query speeds

(I strongly recommend you to go through the LSH forest paper before reading the following section because the description involves terms directly taken from the paper)

I was hapless at the beginning, because the query speeds of the implemented LSH forest structure were not at a competitive state to challenge the other approximate nearest neighbor search implementations. But with the help of my project mentors, I was able to get over with this issue.

Optimizations on the algorithm

In the initial implementation, binary hash: \(g(p)\) for each tree is computed during the synchronous ascending phase when it was required. This causes a great overhead because each time it is computed, the LSH function has to be performed on the query vector. For Random projections, it is a dot product and it is an expensive operation for large vectors. descending phase for a single tree Computing the binary hashes for each tree in advance and storing them is a reliable solution for this issue. The algorithm descend needs to be performed on each tree to find the longest matching hash length. So the above algorithm will iterate over the number of trees in the forest. For each iteration, the binary hash of the query is calculated and stored as follows. This does not consume much memory because the number of trees is small(typically 10).

def descend(query, hash_functions, trees):
    bin_queries = []
    max_depth = 0
    for i in range(number_of_trees):
        binary_query = do_hash(query, hash_functions[i]
        k = find_longest_prefix_match(trees[i], binary_query)
        if k > max_depth:
            max_depth = k
    return max_depth, binary_queries

It is an obvious fact that as the number of candidates for the nearest neighbors grows the number of actual distance calculations grows proportionally. This computation is also an expensive operation when compared with the other operations in the LSH forest structure. So limiting the number of candidates is also a passable optimization which can be done with respect to the algorithm. synchronous ascend phase In the above function, one of the terminating conditions of the candidate search is maximum depth reaching 0 (The while loop runs until x > 0). But it should not necessarily run until maximum depth reaches 0 because we are looking for most viable candidates. It is known fact that for smaller matching hash lengths, it is unlikely to have a large similarity between two data points. Therefore, as the considered hash length is decreased, eligibility of the candidates we get is also decreased. Thus having a lower bound on the maximum depth will set a bound to candidates we collect. It decreases the probability of having irrelevant candidates. So this can be done by setting x > lower_bound in the while loop for some value of lower_bound > 0. But there is a risk of not retrieving the required number of neighbors for very small data sets because the the candidate search may terminate before it acquires the required number of candidates. Therefore user should be aware of a suitable lower_bound for the queries at the time of initialization.

Optimizations on the implementation

I have indicated that bisect operations comes with Python are faster for small lists but as the size of the list grow numpy searchsorted functions becomes faster. In my previous implementation, I have used an alternative version of bisect_right function as it does not fulfill the requirement of finding the right most index for a particular hash length in a sorted array(Please refer to my previous postif things are not clear). But we cannot create an alternative version of numpy searchsorted function, therefore a suitable transformation is required on the hash itself.

Suppose we have a binary hash item = 001110. What we need is the largest number with the first 4 bits being the same as item. 001111 suffices this requirement. So the transformation needed is replacing the last 2 bits with 1s.

def transform_right(item, h, hash_size):
    transformed_item = item[:h] + "".join(['1' for i in range(hash_size - h)])
    return transformed_right

The transformation needed to get the left most index is simple. This is 001100 which is last two bits replaced by 0s. This is same as having 0011. Therefore only a string slicing operation item[:4] will do the the job.

It gets more complicated when it comes to integer hashes. Integer has to be converted to the string, transformed and re-converted to integer.

def integer_transform(int_item, hash_size, h):
    str_item = ('{0:0'+str(hash_size)+'b}').format(int_item)
    transformed_str = transform_right(str_item, h, hash_size):
    transformed_int = int(transformed_str, 2)
    return transformed_int

But this is a very expensive operation for a query. For indexing, only int(binary_hash, 2) is required but this wont make much effect because the LSH algorithms on the data set dominates that operation completely. But for a query this is a significant overhead. Therefore we need an alternative.

Required integer representations for left and right operations can be obtained by performing the bit wise \(AND\) and \(OR\) operations with a suitable mask. Masks are generated by the following function.

def _generate_masks(hash_size):
    left_masks, right_masks = [], []        
    for length in range(hash_size+1):
        left_mask  = int("".join(['1' for i in range(length)])
                         + "".join(['0' for i in range(hash_size-length)]), 2)
        right_mask = int("".join(['0' for i in range(length)])
                         + "".join(['1' for i in range(hash_size-length)]), 2)
    return left_masks, right_masks

These masks will be generated at the indexing time. Then the masks will be applied with the integer hashes.

def apply_masks(item, left_masks, right_masks, h):
    item_left = item & left_masks[h]
    item_right = item | right_masks[h]
    return item_left, item_right

The left most and right most indices of a sorted array can be obtained in the following fashion.

import numpy as np
def find_matching_indices(sorted_array, item_left, item_right):
    left_index = np.searchsorted(sorted_array, item_left)
    right_index = np.searchsorted(sorted_array, item_right, side = 'right')
    return np.arange(left_index, right_index)

As the masks are precomputed, the speed overhead at the query time is minimum in this approach. But still there is a little overhead in this approach because original hashed binary number are stored in an array and those numbers need to be concatenated and converted to obtain the corresponding integers. If the integers are cached this overhead will be eliminated.

Cached hash

This is method which guarantees a significant speed up in the queries with expense of index building speed. At the indexing time, we can create a dictionary with key being arrays of hashed bits and the values being the corresponding integers. The number of items in the dictionary is the number of combinations for a particular hash length.


Suppose the hash length is 3. Then the bit combinations are: hash_bit_table

Then it will be only a matter of a dictionary look-up. Once an integer is required for a particular hash bit array, it can be retrieved directly from the dictionary.

The implementation of this type of hash is a bit tricky. Typically in LSH forests, the maximum hash length will be 32. Then the size of the dictionary will be \(2^n = 4294967296\) where \(n = 32\). This is an extremely infeasible size to hold in the memory(May require tens of Gigabytes). But when \(n = 16\), the size becomes \(65536\). This is a very normal size which can be easily stored in the memory. Therefore we use two caches of size 16.

First a list for digits 0 and 1 is created.

digits = ['0', '1']

The the cache is created as follows.

import itertools

cache_N = 32 / 2
c = 2 ** cache_N # compute once

cache = {
   tuple(x): int("".join([digits[y] for y in x]),2)
   for x in itertools.product((0,1), repeat=cache_N)

Suppose item is a list of length 32 with binary values(0s and 1s). Then we can obtain the corresponding integer for that item as follows:

xx = tuple(item)
int_item = cache[xx[:16]] * c + cache[xx[16:]]

This technique can be used in queries to obtain a significant improvement in speed assuming that the user is ready to sacrifice a portion of the memory.

Performance of all these techniques were measured using Pythons' line_profiler and memory_profiler. In my next post, I will explain concisely how these tools have been used to evaluate these implementations.

July 07, 2014 12:00 AM

July 06, 2014

Vighnesh Birodkar


Humans possess an incredible ability to identify objects in an image. Image processing algorithms are still far behind this ability. Segmentation is the process of dividing an image into meaningful regions. All pixels belonging to a region should get a unique label in an ideal segmentation.

The current segmentation functions in scikit-image are too fine grained and fall closer to superpixel methods, providing a starting point for segmentation. Region Adjacency Graphs (RAGs) are a common data structure for many segmentation algorithms. As part of GSoC this year I am implementing RAGs for scikit-image. The current HEAD of scikit-image’s master branch contains my RAG implementation based on Networkx from my recent Pull Request. In the example below, we will see how Region Adjacency Graphs (RAGs) attempt to solve the segmentation problem.Please note that you need the latest master branch of scikit-image to run the following code.

Getting Started

We define the function show_img in preference to the standard call to imshow to set nice default size parameters.
We start with coffee, a nice fresh image of a coffee cup.

from skimage import graph, data, io, segmentation, color
from matplotlib import pyplot as plt
from skimage.measure import regionprops
from skimage import draw
import numpy as np

def show_img(img):
    width = 10.0
    height = img.shape[0]*width/img.shape[1]
    f = plt.figure(figsize=(width, height))

img = data.coffee()


Over Segmentation

We segment the image using SLIC algorithm. The SLIC algorithm will
assign a unique label to each region. This is a
localized cluster of pixels sharing some similar property, in this case their
color. The label of each pixel is stored in the labels array.

regionprops helps us compute various features of these regions. We will be
sing the centroid, purely for visualization.

labels = segmentation.slic(img, compactness=30, n_segments=400)
labels = labels + 1  # So that no labelled region is 0 and ignored by regionprops
regions = regionprops(labels)

The label2rgb function assigns a specific color to all pixels belonging to one
region (having the same label). In this case, in label_rgb each pixel is
replaces with the average RGB color of its region.

label_rgb = color.label2rgb(labels, img, kind='avg')

Just for clarity, we use mark_boundaries to highlight the region boundaries.
You will notice the the image is divided into more regions than required. This
phenomenon is called over-segmentation.

label_rgb = segmentation.mark_boundaries(label_rgb, labels, (0, 0, 0))


Enter, RAGs

Region Adjacency Graphs, as the name suggests represent adjacency of regions
with a graph. Each region in the image is a node in a graph. There is an edge
between every pair of adjacent regions (regions whose pixels are adjacent). The
weight of between every two nodes can be defined in a variety of ways. For this
example, we will use the difference of average color between two regions as
their edge weight. The more similar the regions, the lesser the weight between
them. Because we are using difference in mean color to compute the edge weight,
the method has been named rag_mean_color.

rag = graph.rag_mean_color(img, labels)

For our visualization, we are also adding an additional property to a node, the
coordinated of its centroid.

for region in regions:
    rag.node[region['label']]['centroid'] = region['centroid']

display_edges is a function to draw the edges of a RAG on its corresponding
image. It draws edges as green lines and centroids as yellow dots.
It also accepts an argument, thresh. We only draw edges with weight below this threshold.

def display_edges(image, g, threshold):
    """Draw edges of a RAG on its image

    Returns a modified image with the edges drawn.Edges are drawn in green
    and nodes are drawn in yellow.

    image : ndarray
        The image to be drawn on.
    g : RAG
        The Region Adjacency Graph.
    threshold : float
        Only edges in `g` below `threshold` are drawn.

    out: ndarray
        Image with the edges drawn.
    image = image.copy()
    for edge in g.edges_iter():
        n1, n2 = edge

        r1, c1 = map(int, rag.node[n1]['centroid'])
        r2, c2 = map(int, rag.node[n2]['centroid'])

        line  = draw.line(r1, c1, r2, c2)
        circle = draw.circle(r1,c1,2)

        if g[n1][n2]['weight'] < threshold :
            image[line] = 0,1,0
        image[circle] = 1,1,0

    return image

We call the function with thresh = infinity so that all edges are drawn. I
myself was surprised with the beauty of the following output.

edges_drawn_all = display_edges(label_rgb, rag, np.inf )


Let’s see what happens by setting thresh to 29, a value I arrived at with
some trial and error.

edges_drawn_29 = display_edges(label_rgb, rag, 29 )


Alas, the graph is cut

As you can see above, the RAG is now divided into disconnected regions. If you
notice, the table above and to the right of the dish is one big connected

Threshold Cut

The function cut_threshold removes edges below a specified threshold and then
labels a connected component as one region. Once the RAG is constructed, many similar
and more sophisticated strategies can improve the initial segmentation.

final_labels = graph.cut_threshold(labels, rag, 29)
final_label_rgb = color.label2rgb(final_labels, img, kind='avg')

Not perfect, but not that bad I’d say. My next steps will be to implement better algorithms to process the RAG after the initial segmentation.These include the merging predicates mention here and N-cut.

by Vighnesh Birodkar at July 06, 2014 03:46 PM

July 04, 2014

Matthew Rocklin

Streaming Analytics

tl;dr: We demonstrate data workflows with Python data structures and PyToolz. We also introduce join, a new operation in toolz.


In my last two posts I show that Python data structures are fast and that CyToolz, an implementation of toolz in Cython, achieves Java speeds on standard Python core data structures like dicts, lists, and tuples. As a reminder, toolz provides functions like groupby

>>> from toolz import groupby

>>> names = ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank']
>>> groupby(len, names)
{3: ['Bob', 'Dan'],
 5: ['Alice', 'Edith', 'Frank'],
 7: ['Charlie']}

I always give this example when talking about toolz. It often spurs the following question:

That looks like GROUP BY from SQL. In what other ways does toolz let me do SQL-like operations in Python?

My answer for this is to go look at Pandas which really does a wonderful job at in-memory data analytics. Toolz targets functional programming more than it targets data analytics. Still this question is common enough to warrant a blogpost. The following is my stock answer on how to use pure Python and toolz (or cytoolz) for streaming data analytic workflows like selections, split-apply-combine, and joins. I’ll note throughout when operations are streaming (can support datasets bigger than memory) or not. This is one of the few ways in which analysis with toolz might be preferred over pandas.

Streaming Analytics

The toolz functions can be composed to analyze large streaming datasets. Toolz supports common analytics patterns like the selection, grouping, reduction, and joining of data through pure composable functions. These functions often have analogs to familiar operations in other data analytics platforms like SQL or Pandas.

Throughout this post we’ll use this simple dataset of accounts.

>>> #           id, name, balance, gender
>>> accounts = [(1, 'Alice', 100, 'F'),
...             (2, 'Bob', 200, 'M'),
...             (3, 'Charlie', 150, 'M'),
...             (4, 'Dennis', 50, 'M'),
...             (5, 'Edith', 300, 'F')]

Selecting with map and filter

Simple projection and linear selection from a sequence is achieved through the standard functions map and filter.

SELECT name, balance
FROM accounts
WHERE balance > 150;

These functions correspond to the SQL commands SELECT and WHERE.

>>> from toolz.curried import pipe, map, filter, get

>>> pipe(accounts, filter(lambda (id, name, balance, gender): balance > 150),
...                map(get([1, 2])),
...                list)

note: this uses the curried versions of map and reduce.

Of course, these operations are also well supported with standard list/generator comprehension syntax. This syntax is more often used and generally considered to be more Pythonic.

>>> [(name, balance) for (id, name, balance, gender) in accounts
...                  if balance > 150]

Split-apply-combine with groupby and reduceby

We separate split-apply-combine operations into the following two concepts

  1. Split the dataset into groups by some property
  2. Reduce each of the groups with some aggregation function

Toolz supports this common workflow with

  1. a simple in-memory solution
  2. a more sophisticated streaming solution.

In Memory Split-Apply-Combine

The in-memory solution depends on the functions groupby to split, and valmap to apply/combine.

SELECT gender, SUM(balance)
FROM accounts
GROUP BY gender;

We first show groupby and valmap separately to show the intermediate groups.

>>> from toolz import groupby, valmap, compose
>>> from toolz.curried import get, pluck

>>> groupby(get(3), accounts)
{'F': [(1, 'Alice', 100, 'F'), (5, 'Edith', 300, 'F')],
 'M': [(2, 'Bob', 200, 'M'), (3, 'Charlie', 150, 'M'), (4, 'Dennis', 50, 'M')]}

>>> valmap(compose(sum, pluck(2)),
...        _)
{'F': 400, 'M': 400}

Then we chain them together into a single computation

>>> pipe(accounts, groupby(get(3)),
...                valmap(compose(sum, pluck(2))))
{'F': 400, 'M': 400}

Streaming Split-Apply-Combine

The groupby function collects the entire dataset in memory into a dictionary. While convenient, the groupby operation is not streaming and so this approach is limited to datasets that can fit comfortably into memory.

Toolz achieves streaming split-apply-combine with reduceby, a function that performs a simultaneous reduction on each group as the elements stream in. To understand this section you should first be familiar with the builtin function reduce.

The reduceby operation takes a key function, like get(3) or lambda x: x[3], and a binary operator like add or lesser = lambda acc, x: acc if acc < x else x. It successively applies the key function to each item in succession, accumulating running totals for each key by combining each new value with the previous total using the binary operator. It can’t accept full reduction operations like sum or min as these require access to the entire group at once. Here is a simple example:

>>> from toolz import reduceby

>>> def iseven(n):
...     return n % 2 == 0

>>> def add(x, y):
...     return x + y

>>> reduceby(iseven, add, [1, 2, 3, 4])
{True: 6, False: 4}

The even numbers are added together (2 + 4 = 6) into group True, and the odd numbers are added together (1 + 3 = 4) into group False.

Note that we have to replace the reduction sum with the binary operator add. The incremental nature of add allows us to do the summation work as new data comes in. The use of binary operators like add over full reductions like sum enables computation on very large streaming datasets.

The challenge to using reduceby often lies in the construction of a suitable binary operator. Here is the solution for our accounts example that adds up the balances for each group:

>>> binop = lambda total, (id, name, bal, gend): total + bal

>>> reduceby(get(3), binop, accounts)
{'F': 400, 'M': 400}

This construction supports datasets that are much larger than available memory. Only the output must be able to fit comfortably in memory and this is rarely an issue, even for very large split-apply-combine computations.

Semi-Streaming join

We register multiple datasets together with join. Consider a second dataset that stores addresses by ID:

>>> addresses = [(1, '123 Main Street'),  # id, address
...              (2, '5 Adams Way'),
...              (5, '34 Rue St Michel')]

We can join this dataset against our accounts dataset by specifying attributes which register different elements with each other; in this case they share a common first column, id.

SELECT accounts.name, addresses.address
FROM accounts, addresses
WHERE accounts.id = addresses.id;
>>> from toolz import join, first, second

>>> result = join(first, accounts,
...               first, addresses)

>>> for ((id, name, bal, gender), (id, address)) in result:
...     print((name, address))
('Alice', '123 Main Street')
('Bob', '5 Adams Way')
('Edith', '34 Rue St Michel')

Join takes four main arguments, a left and right key function and a left and right sequence. It returns a sequence of pairs of matching items. In our case the return value of join is a sequence of pairs of tuples such that the first element of each tuple (the ID) is the same. In the example above we unpack this pair of tuples to get the fields that we want (name and address) from the result.

Join on arbitrary functions / data

Those familiar with SQL are accustomed to this kind of join on columns. However a functional join is more general than this; it doesn’t need to operate on tuples, and key functions do not need to get particular columns. In the example below we match numbers from two collections so that exactly one is even and one is odd.

>>> def iseven(n):
...     return n % 2 == 0
>>> def isodd(n):
...     return n % 2 == 1

>>> list(join(iseven, [1, 2, 3, 4],
...           isodd, [7, 8, 9]))
[(2, 7), (4, 7), (1, 8), (3, 8), (2, 9), (4, 9)]

Semi-Streaming Join

The Toolz Join operation fully evaluates the left sequence and streams the right sequence through memory. Thus, if streaming support is desired the larger of the two sequences should always occupy the right side of the join.

Algorithmic Details

The semi-streaming join operation in toolz is asymptotically optimal. Computationally it is linear in the size of the input + output. In terms of storage the left sequence must fit in memory but the right sequence is free to stream.

The results are not normalized, as in SQL, in that they permit repeated values. If normalization is desired, consider composing with the function unique (note that unique is not fully streaming.)

More Complex Example

The accounts example above connects two one-to-one relationships, accounts and addresses; there was exactly one name per ID and one address per ID. This need not be the case. The join abstraction is sufficiently flexible to join one-to-many or even many-to-many relationships. The following example finds city/person pairs where that person has a friend who has a residence in that city. This is an example of joining two many-to-many relationships because a person may have many friends and because a friend may have many residences.

>>> friends = [('Alice', 'Edith'),
...            ('Alice', 'Zhao'),
...            ('Edith', 'Alice'),
...            ('Zhao', 'Alice'),
...            ('Zhao', 'Edith')]

>>> cities = [('Alice', 'NYC'),
...           ('Alice', 'Chicago'),
...           ('Dan', 'Syndey'),
...           ('Edith', 'Paris'),
...           ('Edith', 'Berlin'),
...           ('Zhao', 'Shanghai')]

>>> # Vacation opportunities
>>> # In what cities do people have friends?
>>> result = join(second, friends,
...               first, cities)
>>> for ((name, friend), (friend, city)) in sorted(unique(result)):
...     print((name, city))
('Alice', 'Berlin')
('Alice', 'Paris')
('Alice', 'Shanghai')
('Edith', 'Chicago')
('Edith', 'NYC')
('Zhao', 'Chicago')
('Zhao', 'NYC')
('Zhao', 'Berlin')
('Zhao', 'Paris')

Join is computationally powerful:

  • It is expressive enough to cover a wide set of analytics operations
  • It runs in linear time relative to the size of the input and output
  • Only the left sequence must fit in memory


Toolz gives a compact set of primitives for data analysis on plain Python data structures. CyToolz accelerates those workflows through Cython. This approach is both low-tech and supports uncomfortably big data through streaming.

At the same time, Toolz is a general purpose functional standard library, and is not specifically dedicated to data analysis. While there are obvious benefits (streaming, composition, etc.) users interested in data analysis might be better served by using projects dedicated projects like Pandas or SQLAlchemy.

This post is also part of the toolz docs. Thanks to John Jacobsen, Clark Fitzgerald, and Erik Welch for their help with this post.

July 04, 2014 07:00 AM

July 01, 2014

Matthieu Brucher

ATK SD1 Implementation

I just released my SD1 simulation, and now it is time to explain why is inside this plugin. Not everything in the original pedal was simulated, and different schemes were used to tackle the different stages.

Block diagram

Let’s start with the block diagram.

SD1 annotated schematic

SD1 annotated schematic

In blue, I displayed what I actually implemented:

  • the drive section, what creates the SD1 specific tone
  • the tone section, with its specific filter
  • the level section, simple enough

In violet, what I didn’t implement:

  • the input buffer stage, because it is mainly resistors, so no spectral differences
  • the output buffer stage, for the same reasons, except for the HP filter that I implemented differently
  • the latch to switch the effect on/bypassed

Drive section

As all drive sections, there are different ways of handling the non linearities. As for me, I like to implement it as an ODE and solve it with a trapezoidal approximation and a Newton-Raphson optimization. The equations are a little bit different from the Simple Overdrive, but they are quite close. The main difference is that the clipper is not symmetric, which means it will introduce not only odd harmonics, but also even harmonics.

Let’s start with a visual comparison of the result, compared to the simple overdrive:

Simple overdrive for a 100Hz sinus

Simple overdrive for a 100Hz sinus

SD1 overdrive for a 100Hz sinus

SD1 overdrive for a 100Hz sinus

The two signals look quite different, and for a good reason: the SD1 simulation adds the distorted signal to the original one. But still, the simple overdrive had one sharp edge, and then a slow one, whereas the SD1 has two sharp edges. It also translates in the numerical behavior of the filter: if the simple overdrive behaved nicely on a sin sweep with an oversampling of x4, without going over 10 iterations, SD1 doesn’t converge in less than 50 iterations for frequencies higher than 7 kHz!

Let’s now compare the spectrum of those two overdrives:

Sine sweep behavior for the simple overdrive  section

Sine sweep behavior for the simple overdrive section

Sine sweep behavior for the SD1 overdrive section

Sine sweep behavior for the SD1 overdrive section

It is clear that the SD1 section is noisier than the simple overdrive (oversampling x8). You can still see the small even harmonics on the SD1 spectrogram. If the drive level is turned down, the even harmonics are actually more pronounced than with drive level turned up.

Tone section

It’s now time for the tone section. Let’s just display the different spectrums:

Spectrum of the SD1 tone section

Spectrum of the SD1 tone section

In all cases, the high frequencies are attenuated a little bit, but the main effect of the tone section is handling the medium frequencies.


The SD1 plugin consists of several ATK plugins:

  • Drive section
    1. Oversampling filter (6 points, order 5) x8
    2. SD1 overdrive
    3. Butterworth low-pass filter (Fc=20kHz, order 6)
    4. Decimation filter
  • Tone section
    1. SD1 tone filter
    2. High-pass filter (Chamberlin filter, Fc=20Hz)
  • Level section (simple volume filter)

With the small oversampling, the computation cost is manageable for a DAW usage (IMHO).

Thanks to Florent Keller for the GUI design and Oli Larkin for WDL-OL (VST2/VST3/AU support)

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matthieu Brucher at July 01, 2014 07:08 AM


John Hunter Technology Fellowship awarded

Olga Botvinnik

We’re thrilled to announce that the 2014 John Hunter Fellowship has been awarded to Olga Botvinnik, UC San Diego. The focus of her Fellowship will be to create open source analysis software for the single-cell and biology communities, and to pioneer data, code sharing, and computational reproducibility within the single-cell and RNA biology communities. She will take up this Fellowship in the second half of 2014, and collaborate with her academic mentor Gene Yeo and software mentor C. Titus Brown. You can read more about Olga’s work at her site.

The selection process was not an easy one - there were a number of high quality applicants that we would have loved to be able to award a fellowship. Honorable Mentions go to Damien Irving (University of Melbourne) and Connie Gao (MIT).

This first Numfocus Fellowship Program has gotten off to a promising start. We’re aiming to expand the program in the coming years to 5-10 fellowships each year.

by Ralf Gommers at July 01, 2014 05:00 AM

June 28, 2014

Titus Brown

Replication and quality in science - another interview

Nik Sultana, a postdoc in Cambridge, asked me some questions via e-mail, and I asked him if it would be OK for me to publish them on my blog. He said yes, so here you go!

  1. How is the quality of scientific software measured? Is there a "bug index", where software loses points if it's found to contain serious bugs in, say, a 6-month period? If not a "bug index" (with negative points) then something like an "openness index" perhaps, with points gained by more better-quality code being available?

There is no formal measurement. I've put some points here -- http://ivory.idyll.org/blog/ladder-of-academic-software-notsuck.html. Happy to discuss more, but there's nothing formal ;)

  1. The attendee list for the CW14 workshop seemed to include more bio places represented. Are there any subjects/communities within science that are more conscious of the need to share code and replicate computations than others, in your experience?

Hmm, genomics and bioinformatics are actually pretty good. I think the code quality sucks, but the point of sharing is obvious to genomicists -- so much of what we know about biology comes not from de novo analysis but rather from comparative analysis across species and experiments.

Chemists seem to be worse.

  1. I only came across initiatives such as SSI, myexperiment, figshare, datadryad, etc after I submitted my thesis. Do you think I should ask the uni (and previous places where I studied) for my money back? (No just kidding, please skip to the next question.)

Yes. :)

  1. What reassurance would you give to members of the public regarding the state of reproducibility in science? In recent writing on scientific reproducibility it is often pointed out that there have been a spate of rather high-profile retractions and corrections. How do we know there isn't some research-equivalent of Bernie Madoff somewhere?

I've seen many assurances that most unreproducible stuff seems to be unintentional - the result of sloppy work habits, etc. Moreover, in my experience, most scientists are clearly aware of the goal of reproducibility, they just don't know how to do it (esp in computational work). That having been said, anyone who fabricates data or results is going to keep very quiet about it, and I've certainly heard a number of stories.

A Bernie Madoff-style scandal is tough in a field where no one really has that much money or that high a profile. That having been said, look up Hendrik Schoen and Diederik Stapel...

  1. You've done a PhD, and have worked on various initiatives on establishing good practices in science since then. What one-sentence piece of advice would you give to young researchers starting out in science, on the theme of reproducibility of their experiments?

Just do it; your life and career will be richer for it.

  1. Do you think that the currently established system in academia or industrial research takes openness + sharing into account for career advancement? What incentives are in place, or what incentives should be made in place?


No. This is a longer discussion -- I can send you another discussion I had with a Science magazine journalist if you're interested (since posted -- see my response to Eli Kintisch) -- but everything is indirect. My openness has been great but only in an indirect fashion, in the sense that my primary metrics (grants and papers) have benefitted.

For true change, the funding agencies and journal article reviewers need to provide the incentives. NIH is starting to step up. Also see the new REF guidelines in the UK re open access -- you can see how the incentives are working there.

  1. Even if we had stable and painless-to-use technology and procedures for sharing code + data, there might be hold-back for two reasons:

    • Commercial interests might be at conflict with openness, since the latter can disadvantage commercial exploitation.
    • Scientists might fear giving other scientists an advantage, or having a mistake found in their work.

    Are these fundamental limitations of the "human science system", as it were, or are there ways around them do you think?

For (a), openness does not conflict with intellectual property rights. So I think this is simply not a concern.

For (b), those scientists seem to miss the point of science, which is to build upon others' work. This is where incentives to be open can be most useful.

  1. Do you track, or have you come across or heard of, similar infrastructure needs in humanities subjects? Presumably these subjects do need storage and collaboration technology, but maybe don't need computation as much as scientific subjects.

Digital Humanities. Talk to Ethan Watrall.


by C. Titus Brown at June 28, 2014 10:00 PM

Randy Olson

How to make beautiful data visualizations in Python with matplotlib

It’s been well over a year since I wrote my last tutorial, so I figure I’m overdue. This time, I’m going to focus on how you can make beautiful data visualizations in Python with matplotlib.

There are already tons of tutorials on how to make basic plots in matplotlib. There’s even a huge example plot gallery right on the matplotlib web site, so I’m not going to bother covering the basics here. However, one aspect that’s missing in all of these tutorials and examples is how to make a nice-looking plot.

Below, I’m going to outline the basics of effective graphic design and show you how it’s done in matplotlib. I’ll note that these tips aren’t limited to matplotlib; they apply just as much in R/ggplot2, matlab, Excel, and any other graphing tool you use.

Less is more

The most important tip to learn here is that when it comes to plotting, less is more. Novice graphical designers often make the mistake of thinking that adding a cute semi-related picture to the background of a data visualization will make it more visually appealing. (Yes, that graphic was an official release from the CDC.) Or perhaps they’ll fall prey to more subtle graphic design flaws, such as using an excess of chartjunk that their graphing tool includes by default.

At the end of the day, data looks better naked. Spend more time stripping your data down than dressing it up. Darkhorse Analytics made an excellent GIF to explain the point:


(click on the GIF for a gfycat version that allows you to move through it at your own pace)

Antoine de Saint-Exupery put it best:

Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.

You’ll see this in the spirit of all of my plots below.

Color matters

The default color scheme in matplotlib is pretty ugly. Die-hard matlab/matplotlib fans may stand by their color scheme to the end, but it’s undeniable that Tableau’s default color scheme is orders of magnitude better than matplotlib’s.

Use established default color schemes from software that is well-known for producing beautiful plots. Tableau has an excellent set of color schemes to use, ranging from grayscale to colored to color blind-friendly. Which brings me to my next point…

Many graphic designers completely forget about color blindness, which affects over 5% of the viewers of their graphics. For example, a plot using red and green to differentiate two categories of data is going to be completely incomprehensible for anyone with red-green color blindness. Whenever possible, stick to using color blind-friendly color schemes, such as Tableau’s “Color Blind 10.”

Required libraries

You’ll need the following Python libraries installed to run this code, and all of the code should be run inside an IPython Notebook:

  • IPython
  • matplotlib
  • pandas

The Anaconda Python distribution provides an easy double-click installer that includes all of the libraries you’ll need.

Blah, blah, blah… let’s get to the code

Now that we’ve covered the basics of graphic design, let’s dive into the code. I’ll explain the “what” and “why” of each line of code with inline comments.

Line plots


%pylab inline
from pandas import read_csv

# Read the data into a pandas DataFrame.
gender_degree_data = read_csv("http://www.randalolson.com/wp-content/uploads/percent-bachelors-degrees-women-usa.csv")

# These are the "Tableau 20" colors as RGB.
tableau20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),
             (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),
             (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),
             (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),
             (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]

# Scale the RGB values to the [0, 1] range, which is the format matplotlib accepts.
for i in range(len(tableau20)):
    r, g, b = tableau20[i]
    tableau20[i] = (r / 255., g / 255., b / 255.)

# You typically want your plot to be ~1.33x wider than tall. This plot is a rare
# exception because of the number of lines being plotted on it.
# Common sizes: (10, 7.5) and (12, 9)
figure(figsize=(12, 14))

# Remove the plot frame lines. They are unnecessary chartjunk.
ax = subplot(111)

# Ensure that the axis ticks only show up on the bottom and left of the plot.
# Ticks on the right and top of the plot are generally unnecessary chartjunk.

# Limit the range of the plot to only where the data is.
# Avoid unnecessary whitespace.
ylim(0, 90)
xlim(1968, 2014)

# Make sure your axis ticks are large enough to be easily read.
# You don't want your viewers squinting to read your plot.
yticks(range(0, 91, 10), [str(x) + "%" for x in range(0, 91, 10)], fontsize=14)

# Provide tick lines across the plot to help your viewers trace along
# the axis ticks. Make sure that the lines are light and small so they
# don't obscure the primary data lines.
for y in range(10, 91, 10):
    plot(range(1968, 2012), [y] * len(range(1968, 2012)), "--", lw=0.5, color="black", alpha=0.3)

# Remove the tick marks; they are unnecessary with the tick lines we just plotted.
plt.tick_params(axis="both", which="both", bottom="off", top="off",
                labelbottom="on", left="off", right="off", labelleft="on")

# Now that the plot is prepared, it's time to actually plot the data!
# Note that I plotted the majors in order of the highest % in the final year.
majors = ['Health Professions', 'Public Administration', 'Education', 'Psychology',
          'Foreign Languages', 'English', 'Communications\nand Journalism',
          'Art and Performance', 'Biology', 'Agriculture',
          'Social Sciences and History', 'Business', 'Math and Statistics',
          'Architecture', 'Physical Sciences', 'Computer Science',

for rank, column in enumerate(majors):
    # Plot each line separately with its own color, using the Tableau 20
    # color set in order.
            gender_degree_data[column.replace("\n", " ")].values,
            lw=2.5, color=tableau20[rank])
    # Add a text label to the right end of every line. Most of the code below
    # is adding specific offsets y position because some labels overlapped.
    y_pos = gender_degree_data[column.replace("\n", " ")].values[-1] - 0.5
    if column == "Foreign Languages":
        y_pos += 0.5
    elif column == "English":
        y_pos -= 0.5
    elif column == "Communications\nand Journalism":
        y_pos += 0.75
    elif column == "Art and Performance":
        y_pos -= 0.25
    elif column == "Agriculture":
        y_pos += 1.25
    elif column == "Social Sciences and History":
        y_pos += 0.25
    elif column == "Business":
        y_pos -= 0.75
    elif column == "Math and Statistics":
        y_pos += 0.75
    elif column == "Architecture":
        y_pos -= 0.75
    elif column == "Computer Science":
        y_pos += 0.75
    elif column == "Engineering":
        y_pos -= 0.25
    # Again, make sure that all labels are large enough to be easily read
    # by the viewer.
    text(2011.5, y_pos, column, fontsize=14, color=tableau20[rank])
# matplotlib's title() call centers the title on the plot, but not the graph,
# so I used the text() call to customize where the title goes.

# Make the title big enough so it spans the entire plot, but don't make it
# so big that it requires two lines to show.

# Note that if the title is descriptive enough, it is unnecessary to include
# axis labels; they are self-evident, in this plot's case.
text(1995, 93, "Percentage of Bachelor's degrees conferred to women in the U.S.A."
       ", by major (1970-2012)", fontsize=17, ha="center")

# Always include your data source(s) and copyright notice! And for your
# data sources, tell your viewers exactly where the data came from,
# preferably with a direct link to the data. Just telling your viewers
# that you used data from the "U.S. Census Bureau" is completely useless:
# the U.S. Census Bureau provides all kinds of data, so how are your
# viewers supposed to know which data set you used?
text(1966, -8, "Data source: nces.ed.gov/programs/digest/2013menu_tables.asp"
       "\nAuthor: Randy Olson (randalolson.com / @randal_olson)"
       "\nNote: Some majors are missing because the historical data "
       "is not available for them", fontsize=10)

# Finally, save the figure as a PNG.
# You can also save it as a PDF, JPEG, etc.
# Just change the file extension in this call.
# bbox_inches="tight" removes all the extra whitespace on the edges of your plot.
savefig("percent-bachelors-degrees-women-usa.png", bbox_inches="tight");

Line plots with error bars


%pylab inline
from pandas import read_csv
from scipy.stats import sem

# This function takes an array of numbers and smoothes them out.
# Smoothing is useful for making plots a little easier to read.
def sliding_mean(data_array, window=5):
    data_array = array(data_array)
    new_list = []
    for i in range(len(data_array)):
        indices = range(max(i - window + 1, 0),
                        min(i + window + 1, len(data_array)))
        avg = 0
        for j in indices:
            avg += data_array[j]
        avg /= float(len(indices))
    return array(new_list)

# Due to an agreement with the ChessGames.com admin, I cannot make the data
# for this plot publicly available. This function reads in and parses the
# chess data set into a tabulated pandas DataFrame.
chess_data = read_chess_data()

# These variables are where we put the years (x-axis), means (y-axis), and error bar values.
# We could just as easily replace the means with medians,
# and standard errors (SEMs) with standard deviations (STDs).
years = chess_data.groupby("Year").PlyCount.mean().keys()
mean_PlyCount = sliding_mean(chess_data.groupby("Year").PlyCount.mean().values,
sem_PlyCount = sliding_mean(chess_data.groupby("Year").PlyCount.apply(sem).mul(1.96).values,

# You typically want your plot to be ~1.33x wider than tall.
# Common sizes: (10, 7.5) and (12, 9)
figure(figsize=(12, 9))

# Remove the plot frame lines. They are unnecessary chartjunk.
ax = subplot(111)

# Ensure that the axis ticks only show up on the bottom and left of the plot.
# Ticks on the right and top of the plot are generally unnecessary chartjunk.

# Limit the range of the plot to only where the data is.
# Avoid unnecessary whitespace.
ylim(63, 85)

# Make sure your axis ticks are large enough to be easily read.
# You don't want your viewers squinting to read your plot.
xticks(range(1850, 2011, 20), fontsize=14)
yticks(range(65, 86, 5), fontsize=14)

# Along the same vein, make sure your axis labels are large
# enough to be easily read as well. Make them slightly larger
# than your axis tick labels so they stand out.
ylabel("Ply per Game", fontsize=16)

# Use matplotlib's fill_between() call to create error bars.
# Use the dark blue "#3F5D7D" as a nice fill color.
fill_between(years, mean_PlyCount - sem_PlyCount,
                mean_PlyCount + sem_PlyCount, color="#3F5D7D")

# Plot the means as a white line in between the error bars. 
# White stands out best against the dark blue.
plot(years, mean_PlyCount, color="white", lw=2)

# Make the title big enough so it spans the entire plot, but don't make it
# so big that it requires two lines to show.
title("Chess games are getting longer", fontsize=22)

# Always include your data source(s) and copyright notice! And for your
# data sources, tell your viewers exactly where the data came from,
# preferably with a direct link to the data. Just telling your viewers
# that you used data from the "U.S. Census Bureau" is completely useless:
# the U.S. Census Bureau provides all kinds of data, so how are your
# viewers supposed to know which data set you used?
xlabel("\nData source: www.ChessGames.com | "
        "Author: Randy Olson (randalolson.com / @randal_olson)", fontsize=10)

# Finally, save the figure as a PNG.
# You can also save it as a PDF, JPEG, etc.
# Just change the file extension in this call.
# bbox_inches="tight" removes all the extra whitespace on the edges of your plot.
savefig("evolution-of-chess/chess-number-ply-over-time.png", bbox_inches="tight");



%pylab inline
from pandas import read_csv

# Due to an agreement with the ChessGames.com admin, I cannot make the data
# for this plot publicly available. This function reads in and parses the
# chess data set into a tabulated pandas DataFrame.
chess_data = read_chess_data()

# You typically want your plot to be ~1.33x wider than tall.
# Common sizes: (10, 7.5) and (12, 9)
figure(figsize=(12, 9))

# Remove the plot frame lines. They are unnecessary chartjunk.
ax = subplot(111)

# Ensure that the axis ticks only show up on the bottom and left of the plot.
# Ticks on the right and top of the plot are generally unnecessary chartjunk.

# Make sure your axis ticks are large enough to be easily read.
# You don't want your viewers squinting to read your plot.
yticks(range(5000, 30001, 5000), fontsize=14)

# Along the same vein, make sure your axis labels are large
# enough to be easily read as well. Make them slightly larger
# than your axis tick labels so they stand out.
xlabel("Elo Rating", fontsize=16)
ylabel("Count", fontsize=16)

# Plot the histogram. Note that all I'm passing here is a list of numbers.
# matplotlib automatically counts and bins the frequencies for us.
# "#3F5D7D" is the nice dark blue color.
# Make sure the data is sorted into enough bins so you can see the distribution.
hist(list(chess_data.WhiteElo.values) + list(chess_data.BlackElo.values),
        color="#3F5D7D", bins=100)

# Always include your data source(s) and copyright notice! And for your
# data sources, tell your viewers exactly where the data came from,
# preferably with a direct link to the data. Just telling your viewers
# that you used data from the "U.S. Census Bureau" is completely useless:
# the U.S. Census Bureau provides all kinds of data, so how are your
# viewers supposed to know which data set you used?
text(1300, -5000, "Data source: www.ChessGames.com | "
       "Author: Randy Olson (randalolson.com / @randal_olson)", fontsize=10)

# Finally, save the figure as a PNG.
# You can also save it as a PDF, JPEG, etc.
# Just change the file extension in this call.
# bbox_inches="tight" removes all the extra whitespace on the edges of your plot.
savefig("chess-elo-rating-distribution.png", bbox_inches="tight");

Easy interactives

As an added bonus, thanks to plot.ly, it only takes one more line of code to turn your matplotlib plot into an interactive.

More Python plotting libraries

In this tutorial, I focused on making data visualizations with only Python’s basic matplotlib library. If you don’t feel like tweaking the plots yourself and want the library to produce better-looking plots on its own, check out the following libraries.

Recommended reading

Edward Tufte has been a pioneer of the “simple, effective plots” approach. Most of the graphic design of my visualizations has been inspired by reading his books.

The Visual Display of Quantitative Information is a classic book filled with plenty of graphical examples that everyone who wants to create beautiful data visualizations should read.

Envisioning Information is an excellent follow-up to the first book, again with a plethora of beautiful graphical examples.

There are plenty of other books out there about beautiful graphical design, but the two books above are the ones I found the most educational and inspiring.

Want to know how I made any of my other plots? Leave a comment and put in a request.

by Randy Olson at June 28, 2014 06:13 PM

Maheshakya Wijewardena

An illustration of the functionality of the LSH Forest

Before digging into more technical detail on the implementation of LSH forest, I thought it would be useful to provide an insightful illustration on how the best candidates are selected from the fitted data points.

For this task, I decided to use a randomly generated dataset of the sample size of 10000 in the two dimensional space. The reason of using two dimensions is that it is convenient to depict the vectors in 2D space and the a normal human being can easily grasp the idea of convergence in a 2D space.

Following configuration of the LSH forest has been selected for this illustration.

  • Number of trees = 1
  • Hash length= 32
  • c = 1
  • Lower bound of hash length at termination = 4
  • Expecting number of neighbors = 10

You can get an idea of these parameters from my previous article on the LSH forest

The following illustrations show the convergence of data points towards the query point as the considered hash length is increased. The important fact I want to emphasize here is that candidates chosen by matching the most significant hash bits converge into the actual data point we are interested in. This happens because of the amplification property of Locality sensitive hashing algorithm.

(Beware! The query point is in RED) hash length = 0 hash length = 1 hash length = 3 hash length = 4 hash length = 5 hash length = 7 hash length = 8 hash length = 24 hash length = 32

Only if the required number of candidates are not reached during the early sweeps, the algorithm will search for more candidates in smaller matching hash lengths. The the best neighbors are selected from that set of candidates by calculating the actual distance.

In my next post, I will enlighten you about the subtle optimizations done on the LSH forest data structure to find the best candidates at a maximum speed.

June 28, 2014 12:00 AM

June 27, 2014

Philip Herron


Hey so one thing that bugged me and probably others if you join a company on a huge legacy code base and the code formatting or style is all over the place. Example tabs vs spaces, gnu vs linux vs something else.And you open it in vim or in gedit or emacs and it all looks different in each editor.

So i decided to learn a little lisp:

(defun format-me ()
   (c-set-style "linux") ;; preferred c style
   (indent-region (point-min) (point-max) nil) ;; format it
   (untabify (point-min) (point-max)) ;; untabify
   (save-buffer) ;;save

So this emacs lisp function will do the formatting i want on a specified file only trouble i had was, i was unable to figure out how to in emacs lisp recursively walk the directory tree to find all the *.c *.cc *.cpp *.h *.hh to run this on, so i ended up using bash so example to run this code from bash on a file i simple run:

$ emacs -batch <file-to-format> -l ~/format-lisp-code.el -f format-me

This runs and does it in 1 go so all i had to do now was find the files i wanted to reformat:

$ find -type f -regex ".*/.*\.\(c\|cpp\|cc\|h\|hh\)"

This returned me all i cared about in our code base then all i had to do was run:

for i in `find -type f -regex ".*/.*\.\(c\|cpp\|cc\|h\|hh\)"`; do emacs -batch $i -l ~/format-lisp-code.el -f format-me; done

And voila magic emacs you are so amazing!

by Philip Herron at June 27, 2014 03:34 PM

June 26, 2014

Titus Brown

What about all those genes of unknown function?

I'm at the 2014 Marine Microbes Gordon Conference right now, and at the end of my talk, I brought up the point that the function of most genes is unknown. It's not a controversial point in any community that does environmental sequencing, but I feel it should be mentioned at least once during every session on metagenome sequencing :).

The lack of functional information for the vast majority of genes is, in my view, the broadest and biggest challenge facing environmental microbiology. Known colloquially as "microbial dark matter" (ref and Nature News), it is fair to say that we have virtually no window into what the majority of genes do. This is particularly a problem now that we can readily access them with sequencing, and several fields are racking up hundreds or thousands of data sets that are largely uninterpretable from a functional perspective.

So what are our options? What can we do to characterize new genes? There seem to be two poles of opinions: many experimentalists argue that we need to devote significant efforts to doing more microbial physiology, which is, after all, how we know most of what we already know. People at the other pole seems to think that if we do enough sequencing, eventually meaning will emerge - enough correlations will turn into causation. (While these are obviously caricatures, I think they capture most of the range of opinions I've heard, so I like 'em ;).

Neither course seems likely to me. Nobody is going to fund hundreds or thousands of graduate student projects to characterize the physiology of individual microbial species, which is more or less the scale of effort needed. Similarly, while the sequencing folk have clearly been "winning" (in the sense that there's a lot of sequencing being done!) there's a growing backlash against large-scale sequencing without a fairly pointed set of hypotheses behind them. This backlash can be understand as a natural development -- the so-called trough of disillusionment in the adoption and understanding of new technologies -- but that makes it no less real.

Over the past few years, I've had the opportunity to discuss and debate a variety of approaches to characterizing gene function in microbes. Since I'm thinking about it a lot during this meeting, I thought I'd take the time to write down as many of the ideas as I can remember. There are two purposes to this -- first, I'm trawling for new ideas, and second, maybe I can help inspire people to tackle these challenges!

Without further ado, here is my list, broken down into somewhat arbitrary categories.

Experimental exploration and determination of gene function

  1. Finding genetic systems and doing the hard work.

    This is essentially the notion that we should focus in on a few model systems that are genetically tractable (culturable, transformable, and maybe even possessed of genome editing techniques) and explore, explore, explore. I'm not sure which microbes are tractable, or could be made tractable, but I gather we are lacking model systems representative of a broad swath of marine microbes, at least.

    The upsides to this approach are that we know how to do it, and all of the modern -omics tools can be brought to bear to accelerate progress: genome, transcriptome, and proteome sequencing, as well as metabolomics.

    The downsides are that this approach is slow to start, time consuming, and not particularly scalable. Because of that I'm not sure there's much support for funding.

  2. Transcriptome assisted culture.

    A persistent challenge for studying microbes is that many of them cannot be easily cultured, which is a soft prerequisite for studying them in the lab. We can't culture them because often we don't know what the right culture conditions are -- what do they eat? Who do they like to hang out with?

    One of the neater approaches to resolving this is the concept of transcriptome assisted culture, which Irene Newton (@chicaScientific) pointed out to me in this neat PNAS paper on culturing Coxiella. Essentially, Omsland et al. used transcriptome sequencing in conjunction with repeated modifications to the culture medium to figure out what the right growth medium was. In addition to converting an important biomedical pathogen into something more easily culturable, the authors gained important insights into its basic metabolism and the interaction of Coxiella with its host cells.

    Upsides? It's an approach that's readily enabled by modern -omics tools, and it should be broadly applicable.

    Downsides? Time consuming and probably not that scalable. However, it's a rather sexy approach to the hard slog of understanding organisms (and you can argue that it's basically the same as the model organism approach) so it's marginally more fundable than just straight up physiology studies.

  3. Enrichment cultures.

    Another culture-based approach is the enrichment culture, in which a complex microbial community (presumably capable of driving many different biogeochemical processes) is grown in a more controlled environment, usually one enriched for a particular kind of precursor. This can be done with a flow reactor approach where you feed in precursors and monitor the composition of the outflow, or just by adding specific components to a culture mix and seeing what grows.

    For one example of this approach, see Oremland et al., 2005, in which the authors isolated a microbe, Halarsenatibacter silvermanii, which metabolized arsenic. They did this by serial transfer of the cells into a fresh medium and then purifying the organism that persistently grew through serial dilution at the end.

    This is a bit of a reverse to the previous methods, where the focus was on a specific organism and figuring out how it worked; here, you can pick a condition that you're interested in and figure out what grows in it. You can get both simplified communities and potentially even isolates that function in specific conditions. (Also see Winogradsky columns for a similar environment that you could study.) You still need to figure out what the organisms do and how they do it, but you start with quite a bit more information and technology than you would otherwise - most importantly, the ability to maintain a culture!

    Pros: this is actually a lot more scalable than the model-organism or culture-focused techniques above. You could imagine doing this on a large scale with a fairly automated setup for serial transfer, and the various -omics techniques could yield a lot of information for relatively little per-organism investment. Someone would still need to chase down the genes and functions involved, but I feel like this could be a smallish part of a PhD at this point.

    Cons: it's not terribly hypothesis driven, which grant agencies don't always like; and you might find that you don't get that much biological novelty out of the cultures.

  4. Functional metagenomics

    You can also understand what genes do by putting them into tractable model organisms. For example, one of the ways that Ed DeLong's group showed that proteorhodopsin probably actually engaged in photosynthesis was by putting the gene in E. coli. At the time, there was no way to investigate the gene (from an uncultured SAR86) in its host organism, so this was the only way they could "poke" at it.

    A significant and important extension of this idea is to transfer random fragments from metagenomic fosmid or BAC libraries into large populations of (say) E. coli, and then do a selection experiment to enrich for those critters that can now grow in new conditions. For example, see this paper on identifying the gene behind the production of certain antibiotics (hat tip to Juan Ugalde (@JuanUgaldeC for the reference). Also see the "heterologous expression" paragraph in Handelsman (2004), or this other antibiotic resistance paper from Riesenfeld et al. (2004) (hat tips to Pat Schloss (@Pat Schloss), Jeff Gralnick (@bacteriality), and Daan Speth (@daanspeth) for the refs!).

    Pros: when it works, it's awesome!

    Cons: most genes function in pathways, and unless you transfer in the whole pathway, an individual gene might not do anything. This has been addressed by transferring entire fosmids with whole operons on them between microbes, and I think this is still worth trying, but (to me) it seems like a low-probability path to success. I could be wrong.

  5. Synthetic biology

    Why not just build a new critter genome using synthetic biology approaches, and see how it works? This is a radical extension of the previous idea of transferring genes between different organisms. Since we can now print long stretches of DNA on demand, why not engineer our own pathways and put them into tractable organisms to study in more detail?

    I think this is one of the more likely ideas to ultimately work out, but it has a host of problems. For one thing, you need to have strong and reliable predictions of gene function. For another, not all microbes will be able to execute all pathways, for various biochemical reasons. So I expect the failure rate of this approach to be quite high, at least at first.

    Pros: when it works, it'll be awesome! And, unlike the functional metagenomics approach, you can really engineer anything you want - you don't need to find all your components already assembled in a PCR product or fosmid.

    Cons: expensive at first, and likely to have a high failure rate. Unknown scalability, but probably can be heavily automated, especially if you use selection approaches to enrich for organisms that work (see previous item).

Computational exploration and determination of gene function

  1. Metabolic modeling

    Look at the genome, feed it into a model of metabolism, and try to understand what genes are doing and what genes are missing. Metabolic flux analysis provides one way to quickly identify whether a given gene complement is sufficient to "explain" observed metabolism, but I'm unsure of how well it works for badly annotated genomes (my guess? badly ;).

    You can marry this kind of metabolic analysis with the kind of nifty fill-in-the-blank work that Valerie de Crecy-Lagard does -- I met Valerie a few years back on a visit to UFL, and thought, hey, we need hundreds of people like her! Valerie tracks down "missing" pathway genes in bacterial genomes, using a mixture of bioinformatics and experimental techniques. This is going to be important if you're predicting metabolic activity based on the presence/absence of annotated genes.

    In practice, this is going to be much easier in organisms that are phylogenetically closer to model systems, where we can make better use of homology to identify likely mis-annotated or un-annotated genes. It also doesn't help us identify completely new functions except by locating missing energy budgets.

    Pros: completely or largely computational and hence potentially quite scalable.

    Cons: completely or largely computational, so unlikely to work that well :). Critically dependent on prior information, which we already know is lacking. And hard or impossible to validate; until you get to the point where on balance the predictions are not wrong, it will be hard to get people to consider the expense of validation.

  2. Gene-centric metabolic modeling

    Rather than trying to understand how a complete microbe works, you can take your cue from geochemistry and try to understand how a set of genes (and transcripts, and proteins) all cooperate to execute the given biogeochemistry. The main example I know of this is from Reed et al. 2013, with Julie Huber (@JulesDeep) and Greg Dick.

    Pros: completely or largely computational and hence potentially quite scalable.

    Cons: requires a fair bit of prior information. But perhaps easier to validate, because you get predictions that are tied closely to a particular biogeochemistry that someone already cares about.

  3. Sequence everything and look for correlations.

    This is the quintessential Big Data approach: if we sequence everything, and then correlate gene presence/absence/abundance with metadata and (perhaps) a smattering of hypotheses and models, then we might be able to guess at what genes are doing.

    Aaron Garoutte (@AaronGoroutte) made the excellent point that we could use these correlations as a starting point to decide which genes to invest more time and energy in analyzing. When confronted with 100s of thousands of genes -- where do you start? Maybe with the ones that correlate best with the environmental features you're most interested in ...

    Pros: we're doing the sequencing anyway (although it's not clear to me that the metadata is sufficient to follow through, and data availability is a problem). Does not rely on prior information at all.

    Cons: super unlikely to give very specific predictions; much more likely to provide a broad range of hypotheses, and we don't have the technology or scientific culture to do this kind of work.

  4. Look for signatures of positive selection across different communities.

    This is an approach suggested by Tom Schmidt and Barry Williams, for which there is a paper soon to be submitted by Bjorn Ostman and Tracy Teal et al. The basic idea is to look for signatures of adaptive pressures on genes in complex metagenomes, in situations where you believe you know what the overall selection pressure is. For example, in nitrogen-abundant situations you would expect different adaptive pressures on genes than in more nitrogen-limited circumstances, so comparisons between fertilized and unfertilized soils might yield something interesting.

    Pros: can suggest gene function without relying on any functional information at all.

    Cons: unproven, and the multiple-comparison problem with statistics might get you. Also, needs experimental confirmation!

My favorite idea - a forward evolutionary screen

  1. Here's an idea that I've been kicking around for a while with (primarily) Rich Lenski (@RELenski), based on some Campylobacter work with JP Jerome and Linda Mansfield.

    Take fast evolving organisms (say, pathogens), and evolve them in massive replicate on a variety of different carbon sources or other conditions (plates vs liquid; different host genotypes; etc.) and wait until they can't cross-grow. Then, sequence their genomes and figure out what genes have been lost. You can now assume that genes that are lost are not important for growing in those other conditions, and put them in a database for people to query when they want to know what a gene might not be important for.

    We saw just this behavior in Campylobacter when we did serial transfer in broth, and then plated it on motility assay plates: Campy lost its motility genes, first reversibly (regulation) and then irreversibly (conversion to pseudogene).

    Harriet Alexander (@nekton4plankton) pointed out to me that this bears some similarity to the kinds of transposon mutagenesis experiments that were done in many model organisms in the 90s - basically, forward genetics. Absolutely! I have to think through how useful forward genetics would be in this field a bit more thoroughly, though.

    Pros: can be automated and can scale; takes advantage of massive sequencing; should find lots of genes.

    Cons: potentially quite expensive; unlikely to discover genes specific to particular conditions of interest; requires a lot of effort for things to come together.

So that's my list.

Can't we all get along? A need for complementary approaches.

I doubt there's a single magical approach, a silver bullet, that will solve the overall problem quickly. Years, probably decades, of blood, sweat, and tears will be needed. I think the best hope, though, is to find ways to take advantage of all the tools at our disposal -- the -omics tools, in particular -- to tackle this problem with reasonably close coordination between computational and experimental and theoretical researchers. The most valuable approaches are going to be the ones that accelerate experimental work by utilizing hypothesis generation from large data sets, targeted data gathering in pursuit of a particular question, and pointed molecular biology and biochemistry experiments looking at what specific genes and pathways do.

How much would this all cost?

Suppose I was a program manager and somebody gave me \$5m a year for 10 years to make this happen. What would be my Fantasy Grants Agency split? (Note that, to date, no one has offered to give me that much money, and I'm not sure I'd want the gig. But it's a fun brainstorming approach!)

I would devote roughly a third of the money to culture-based efforts (#1-#3), a third to building computational tools to support analysis and modeling (#6-#9), and a third to developing out the crazy ideas (#4, #5, and #10). I'd probably start by asking for a mix of 3 and 5 year grant proposals: 3 years of lots of money for the stuff that needs targeted development, 5 years of steady money for the crazier approaches. Then I'd repeat as needed, trying to balance the craziness with results.

More importantly, I'd insist on pre-publication sharing of all the data within a walled garden of all the grantees, together with regular meetings at which all the grad students and postdocs could mix to talk about how to make use of the data. (This is an approach that Sage Biosciences has been pioneering for biomedical research.) I'd probably also try to fund one or two groups to facilitate the data storage and analysis -- maybe at \$250k a year or so? -- so that all of the technical details could be dealt with.

Is \$50m a lot of money? I don't think so, given the scale of the problem. I note that a few years back, the NIH NIAID proposed to devote 1-3 R01s (so \$2-4m total) to centers devoted to exploring the function of 10-20 pathogen genes each, so that's in line with what I'm proposing for tackling a much larger problem.


by C. Titus Brown at June 26, 2014 10:00 PM

Matthieu Brucher

Announcement: Audio TK 0.3.0

Just after the release of ATK SD1, I updated my audio toolkit. I added a few optimizations on overdrive computations, and also for the base filter array exchange.

Future releases should encompass compressors and amplifiers.

Download link: ATK 0.3.0


* Enhanced the ScalarNewtonRaphson algorithm to accept different precision. Used a more relaxed one by default
* Fixed a bug in SD1OverdriveFilter
* Optimized array copies between filters
* Added documentation

* Fix some issues with deploying ATK
* Tone stacks (Bassman and JCM800) with Python wrappers
* Updated filters C++ interface when then can process several channels at the same time

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matthieu Brucher at June 26, 2014 07:19 AM

June 25, 2014

Janani Padmanabhan

With those mid-sem bells chiming, it is time for another update.
The following checkpoints have been reached:

1. The implementation of ellipsoidal harmonic function (also known as Lame's function): The first kind.
The following is the link to the pull request:
The implementation is in Cython and calls LAPACK subroutine. It is based on the python implementation by Knepley and Bardhan given here
Further the absence of Lame's function implementation by any many other libraries there is a challenge in preparation of an extensive test-suite. At present the output of the function for certain range of inputs is tested. The immediate next plan is to try improving the test-suite.
This will be followed by the implementation of Ellipsoidal harmonic function: The second kind.

2. Before this the spherical harmonic functions were improved by reimplementing them in Cython rather than python thus improving the speed.
The details having been elaborately touched in the previous post, are omitted here, saving people from the boredom due to redundancy. The pull request can be accessed from https://github.com/scipy/scipy/pull/3692

Thanks to the constant support of my brilliant mentors Pauli, Ralf and Stefan, (and I suppose, few miracles!) the progress is as per schedule. Hoping that this pace will be maintained or even better, quickened; post mid-sem evaluation!

Signing off till next time,

by noreply@blogger.com (janani padmanabhan) at June 25, 2014 05:59 PM

June 24, 2014

Juan Nunez-Iglesias


This year I am privileged to be a mentor in the Google Summer of Code for the scikit-image project, as part of the Python Software Foundation organisation. Our student, Vighnesh Birodkar, recently came up with a clever use of SciPy’s ndimage.generic_filter that is certainly worth sharing widely.

Vighnesh is tasked with implementing region adjacency graphs and graph based methods for image segmentation. He initially wrote specific functions for 2D and 3D images, and I suggested that he should merge them: either with n-dimensional code, or, at the very least, by making 2D a special case of 3D. He chose the former, and produced extremely elegant code. Three nested for loops and a large number of neighbour computations were replaced by a function call and a simple loop. Read on to find out how.

Iterating over an array of unknown dimension is not trivial a priori, but thankfully, someone else has already solved that problem: NumPy’s nditer and ndindex functions allow one to efficiently iterate through every point of an n-dimensional array. However, that still leaves the problem of finding neighbors, to determine which regions are adjacent to each other. Again, this is not trivial to do in nD.

scipy.ndimage provides a suitable function, generic_filter. Typically, a filter is used to iterate a “selector” (called a structuring element) over an array, compute some function of all the values covered by the structuring element, and replace the central value by the output of the function. For example, using the structuring element:

fp = np.array([[0, 1, 0],
               [1, 1, 1],
               [0, 1, 0]], np.uint8)

and the function np.median on a 2D image produces a median filter over a pixel’s immediate neighbors. That is,

import functools
median_filter = functools.partial(generic_filter,

Here, we don’t want to create an output array, but an output graph. What to do? As it turns out, Python’s pass-by-reference allowed Vighnesh to do this quite easily using the “extra_arguments” keyword to generic_filter: we can write a filter function that receives the graph and updates it when two distinct values are adjacent! generic_filter passes all values covered by a structuring element as a flat array, in the array order of the structuring element. So Vighnesh wrote the following function:

def _add_edge_filter(values, g):
    """Add an edge between first element in `values` and
    all other elements of `values` in the graph `g`.
    `values[0]` is expected to be the central value of
    the footprint used.

    values : array
        The array to process.
    g : RAG
        The graph to add edges in.

    0.0 : float
        Always returns 0.

    values = values.astype(int)
    current = values[0]
    for value in values[1:]:
        g.add_edge(current, value)
    return 0.0

Then, using the footprint:

fp = np.array([[0, 0, 0],
               [0, 1, 1],
               [0, 1, 0]], np.uint8)

(or its n-dimensional analog), this filter is called as follows on labels, the image containing the region labels:


This is a rather unconventional use of generic_filter, which is normally used for its output array. Note how the return value of the filter function, _add_edge_filter, is just 0! In our case, the output array contains all 0s, but we use the filter exclusively for its side-effect: adding an edge to the graph g when there is more than one unique value in the footprint. That’s cool.

Continuing, in this first RAG implementation, Vighnesh wanted to segment according to average color, so he further needed to iterate over each pixel/voxel/hypervoxel and keep a running total of the color and the pixel count. He used elements in the graph node dictionary for this and updated them using ndindex:

for index in np.ndindex(labels.shape):
    current = labels[index]
    g.node[current]['pixel count'] += 1
    g.node[current]['total color'] += image[index]

Thus, together, numpy’s nditer, ndindex, and scipy.ndimage’s generic_filter provide a powerful way to perform a large variety of operations on n-dimensional arrays… Much larger than I’d realised!

You can see Vighnesh’s complete pull request here and follow his blog here.

If you use NumPy arrays and their massive bag of tricks, please cite the paper below!

Stefan Van Der Walt, S. Chris Colbert, & Gaël Varoquaux (2011). The NumPy array: a structure for efficient numerical computation Computing in Science and Engineering 13, 2 (2011) 22-30 arXiv: 1102.1523v1

by Juan Nunez-Iglesias at June 24, 2014 08:48 AM

Matthieu Brucher

Announcement: ATK SD1 1.0.0

I’m happy to announce the release of my third audio plugin, the first based on the Audio Toolkit. It is available on Windows and OS X in different formats.

The UI was designed by Florent Keller, many thanks to him!



The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

The files can be downloaded on SourceForge, as well as the source code.


I upgraded the Audio Toolkit to fix some 32bits cracks, so please download 1.0.1 (link is up to date).

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matthieu Brucher at June 24, 2014 08:00 AM

June 23, 2014

Issam Laradji

Mid-term Summary 
GSoC 2014: Extending Neural Networks Module for Scikit-Learn

The objective is to implement neural network algorithms in a clean, well-tested code using the scikit-learn API. The algorithms are meant to be user-friendly and easy to edit and scale for those who wish to extend, or debug them if necessary.

Since the start of GSoC 2014 until now, I completed two modules, multi-layer perceptron (mlp) #3204 and mlp with pre-training #3281, which are pending final review for merging. I also implemented the extreme learning machine (elm) algorithm #3306 which hasn't been reviewed yet and more components such as test files, examples, and documentations are required. However, I am confident that I will complete it by the deadline I set in the proposal.

In the following three sections, I will explain the modules in more detail.

1) Multi-layer perceptron  #3204
Multi-layer perceptron (MLP) supports more than one hidden layer allowing it to construct highly non-linear functions. Figure 1 displays an MLP with 1 hidden layer.

Figure 1

To define the number of hidden layers and their neurons, one can simply run the following statement.

The list '[150, 100]' means that two hidden layers are constructed with 150 and 100, neurons respectively.

Further, MLP can be used for reinforcement learning where each time step makes a new training sample. It can use the `partial_fit` method to update its weights on per sample basis in real-time (stochastic update).

MLP also consists of a regularization term `alpha` as part of its parameters, whose value determines the degree of non-linearity the function is meant to have. Therefore, if the algorithm is overfitting, it is desirable to increase `alpha` to have a more linear function. Figure 2 demonstrates  the decision boundaries learnt by mlp with different alpha values.

Figure 2

Figure 2 shows that the higher the value of `alpha`, the less curves the decision boundary will have.

The implementation has passed through various test cases to prevent unexpected behavior. One of the test cases involves comparing between the algorithm's analytic computation of the gradient and its numerical computation. Since the difference between the values was found to be at most a very small value means the backpropagation algorithm is working as expected.

2) MLP with pre-training #3281

One issue with MLP is that it involves random weights' initialization. The weights could land in a poor position in the optimization (see Figure 3) whose final solutions are not as good as they could be.

Figure 3

Pre-training is one scheme to have the initial weights land in a better start. Restricted boltzmann machines (RBMs) can find such initial weights. Figure 4 shows the process of pre-training.

Figure 4
For each layer in the network, there is an RBM that trains on the inputs given for that layer. The final weights of the RBM are given as the initial weights of the corresponding layer in the network.

Running an example of pre-training has showed that RBMs can improve the final performance. For instance, on the digits the dataset, the following results were obtained.

1) Testing accuracy of mlp without pretraining: 0.964
2) Testing accuracy of mlp with pretraining: 0.978

3) Extreme learning machine (elm) #3306 

Much of the criticism towards MLP is in its long training time. MLP uses the slow gradient descent to updates its weights iteratively, involving many demanding computations.

Extreme learning machines (ELMs) [1], on the other hand, can train single hidden layer feedforward networks (SLFNs) using least square solutions instead of gradient descent. This scheme requires only few matrix operations, making it much faster than gradient descent. It also has a strong generalization power since it uses least-squares to find its solutions.

The algorithm has been implemented and it passed the travis tests. But it still awaits more thorough review and test files to anticipate errors.

I believe I will finalize the module by 29 June as per the proposal.

Remaining work

In the remaining weeks my tasks are broken down as follows.

Week 7, 8 (June 30 - July 13)

I will implement and revise regularized ELMs [3] and weighted ELMs [4], and extend the ELMs documentation.

Week 9, 10  (July 14- July 27)

I will implement and revise Sequential ELMs [2], and extend the ELMs documentation.

Week 11, 12 (July 28- August 10)

I will implement and revise Kernel-Based ELMs, and extend the ELMs documentation.

Week 13 - Wrap-up


I would like to thank my mentors and reviewers including @ogrisel, @larsmans @jnothman, @kasternkyle, @AlexanderFabisch for dedicating their time in providing useful feedback and comments, making sure the work meets high-quality standards. I sincerely appreciate the time PSF admins take to oversee the contributers as it encourages us to set a higher bar for quality work. I would also like to thank GSoC 2014, as this wouldn't have been possible if it hadn't been for their support and motivation.


[1]    http://www.di.unito.it/~cancelli/retineu11_12/ELM-NC-2006.pdf

[2]    http://www.ntu.edu.sg/home/egbhuang/pdf/OS-ELM-TNN.pdf

[3]    http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-Unified-Learning.pdf

[4]   Zong, Weiwei, Guang-Bin Huang, and Yiqiang Chen. "Weighted extreme learning machine for imbalance learning." Neurocomputing 101 (2013): 229-242.

by noreply@blogger.com (Issam Laradji) at June 23, 2014 02:09 AM

June 22, 2014

Vighnesh Birodkar


So far

For the story so far see –  the Introduction, 2nd week update, Graph Data Structures Comparison .

After much debate on the mailing list with many scikit-image developers we finally decided to use the NetworkX Graph Class for our Region Adjacency Graph ( RAG ). It comes with a lot of well-tested functionality and would speed up the GSoC progress. It is also pure Python, and shares a lot of its dependencies with scikit-image.

Constructing the RAG

To construct a RAG, we need to iterate over the entire labeled image, looking for adjacent pixels with distinct labels. Initially I wrote a special case Cython loop for 2D and 3D, much like this one. But to scikit-image developers suggested, and rightly so, a more genaral n-dimensional approach. I looked for a function, which would iterate over an entire array and call a given function. As it turns out generic_filter does exactly that. I have used it here  with the callable defined here.

The footprint is created such that only the elements which are 2nd and 3rd along all axes are set according to the connectivity. Rest all are forced to zero. In the 2D case with connectivity 2 it would be the bottom right elements ( a[1,1], a[1,2] , a[2,1], a[2,2] ) which are set . This ensures that one pair of adjacent pixels is not processed again when the filter window moves ahead.

The footprint ensures that the first element in the array passed to the callable is the central element in the footprint. All other elements are adjacent to the central element and an edge is created between them and the central element.

Pruning the RAG

I implemented a skeletal RAG algorithm following Juan’s suggestion. It takes the labeled image as input. This is typically an over-segmented image obtained by algorithms like SLIC or watershed. Each region in the labeled image is represented by a node in the graph. Nodes of adjacent regions are joined by an edge. The weight of this edge is the magnitude of difference in mean color. Once the graph is constructed, nodes joined by edges with weights lesser than a given threshold are combined into one region.

You can view the Pull Request here.


Below are the test results of executing the example code on two scikit-image sample images. The threshold value for both the results is different and is found by trial and error. Typically, a higher threshold value, gives fewer regions in the final image.

Coffee Image

Original Image


SLIC Segmentation


Threshold Graph Cut




Original Image


SLIC Segmentation


Threshold Graph Cut



The mechanism for RAGs is in place now. The segmentation is pretty satisfactory, considering the simple logic. The major drawback however is that it’s not fully automatic. Over the next few weeks I will implement more sophisticated merging predicates, including N-cut.

by Vighnesh Birodkar at June 22, 2014 02:14 PM

Hamzeh Alsalhi

Google Summer of Code Midterm Progress Summary

The first half of my summer of code has resulted in the implementation of just under half of my goals, one of the planned six pull requests has been completely finalized, two are on their way to being finalized, and two of the remaining three will be started shortly. The final pull request is a more independent feature which I aim to start as soon as I am confident the others are close to being wrapped up.

Thank you Arnaud, Joel, Oliver, Noel, and Lars for the time taken to give the constructive criticism that has vastly improved my implementation of these changes to the code base.

Sparse Input for Ensemble Methods

Gradient Boosted Regression Trees is the one remaining ensemble method that needs work before all of scikit-learns ensemble classifiers support sparse input. The first pull request I made this summer was for sparse input on the AdaBoost ensemble method. The AdaBoost pull request was merged, after AdaBoost sparse input support was completed I have skipped to latter goals with the intention to come back and pickup work on GBRT when a pending pull request for sparse input decision trees is merged which will make it easier to continue work on the sparse input for the ensemble method.

PR #3161 - Sparse Input for AdaBoost
Status: Completed and Merged
Summary of the work done: The ensemble/weighted_boosting class was edited to avoid densifying the input data and to simply pass along sparse data to the base classifiers to allow them to proceed with training and prediction on sparse data. Tests were written to validate correctness of the AdaBoost classifier and AdaBoost regressor when using sparse data by making sure training and prediction on sparse and dense formats of the data gave identical results, as well verifying the data remained in sparse format when the base classifier supported it. Go to the AdaBoost blog post to see the results of sparse input with AdaBoost visualized.

PR - Sparse input Gradient Boosted Regression Trees (GBRT)
Status: To be started
Summary of the work to be done: Very similar to sparse input support for AdaBoost, the classifier will need modification to support passing sparse data to its base classifiers and similar tests will be written to ensure correctness of the implementation. The usefulness of this functionality depends heavily on the sparse support for decision trees which is a pending mature pull request here PR #3173.

Sparse Output Support

The Sparse Label Binarizer pull request has gone through numerous revisions after being based of existing code written in PR and it contains a large part of the work necessary to support sparse output for One vs. Rest classification. With this support in place many of the binary classifiers in scikit-learn can be used in a one vs. all fashion on sparse target data. Support for sparse target data in the multilabel metrics will be implemented to provide users with metrics while avoiding the need to densify the target data. Finally in attempt to push support for sparse target data past one vs. rest methods I will work on spare target data support for decision trees .

PR #3203 - Sparse Label Binarizer
Status: Nearing Completion
Summary of the work done: The label binarizing function in scikit-learns label code was modified to support conversion from sparse formats and helper functions to this function from the utils module were modified to be able to detect the representation type of the target data when it is in sparse format. Read about the workings of the label binarizer.

PR #3276 - Sparse Output One vs. Rest
Status: Work In Progress
Summary of the work done: The fit and predict functions for one vs. rest classifiers modified to detect sparse target data and handle it without densifying the entire matrix at once, instead the fit function iterates over densified columns of the target data and fits an individual classifier for each column and the predict uses binarizaion on the results from each classifier individually before combining the results into a sparse representation. A test was written to ensure that classifier accuracy was within a suitable range when using sparse target data.

PR - Sparse Metrics
StatusTo be started
Summary of the work done: Modify the metrics and some misc tools to support sparse target data so sparsity can be maintained throughout the entire learning cycle. The tools to be modified  include precision score, accuracy score, parameter search, and other metrics listed on scikit-learns model evaluation documentation under the classification metrics header.

PR - Decision Tree and Random Forest Sparse Output
StatusTo be started
Summary of the work done: Make revisions in the tree code to support sparsely formatted target data and update the random forest ensemble method to use the new sparse target data support.

Plan for the Coming Weeks

In hopes that the sparse label binarizer will be merged soon after making final revisions, early next week I will begin to respond to the reviews of the sparse One vs. Rest pull request and we will also see the beginnings of the sparse metrics pull request which should be wrapped up and ready for reviews in the same week. Following that the next focus will be rewinding to sparse input for ensemble methods and putting a week of work into sparse support for GBRT. Finally the sparse output decision tree pull request will be started when the remaining goals are nearing completion.

by noreply@blogger.com (Hamzeh) at June 22, 2014 03:48 AM

June 20, 2014


The Latest and Greatest Pandas Features (since v 0.11)

On May 28, 2014 Phillip Cloud, core contributor for the Pandas data analytics Python library, spoke at a joint meetup of the New York Quantitative Python User’s Group (NY QPUG) and the NY Finance PUG. Enthought hosted and about 60 people joined us to listen to Phillip present some of the less-well-known, but really useful […]

by info at June 20, 2014 09:36 PM

Richard Tsai

GSoC2014: The First Month

The first month of my GSoC has passed and it is about time to make a summary.

The PRs

I’ve made 4 Pull-Requests for my GSoC project, 3 of which have been merged.

  • #3636 MAINT: cluster: some clean up Some cleanups. Mainly for docs.

  • #3683 ENH: cluster: rewrite and optimize vq in Cython The most challenging part. Discussed in this post already.

  • #3702 BUG: cluster: _vq unable to handle large features When I started to work on #3712, I noticed that the new version of _vq I wrote in the previous PR(#3683) may give wrong results when the values of the input features are very large. I though it might be an overflow issue since there was a matrix multiplication in the new algorithm. But I was wrong. It was caused by an initialization issue. I used the C macro NPY_INFINITY to initialize the output distances vector. However, ndarray.fill would regard NPY_INFINITY as the “integral infinity” and fill the float (or double) vector with \(2 ^ {31} - 1\). When the actual minimum distance was greater than such an “infinity”, the result would be an uninitialized value. The currect way to initialize the output distances vector should be outdists.fill(np.inf).

  • #3712 ENH: cluster: reimplement the update-step of K-means in Cython This PR is my recent work and hasn’t been merged yet.

    An iteration of K-means requires two steps, the assignment step and the update step. The assignment step, implemented as vq.vq, has been optimized in the previous PRs. But the update step is still time-consuming, especially for low-dimensional data. The current implementation of the update step in SciPy looks like:

    for j in range(nc):
        mbs = np.where(label == j)
        code[j] = np.mean(data[mbs], axis=0)

    However, np.where is slow. A much more efficient algorithm would be:

    for j in range(nobs):
        code[labels[j], :] += data[j, :]
        counts[labels[j]] += 1
    for j in range(nc):
        code[j] /= counts[j]

    But of course looping through the observation matrix in pure python would be extremely slow. So I implemented it in Cython and archieved decent speedups especially in low-dimensional data.


First of all, I rewrote the underlying part of cluster.vq in Cython and it should be easier to maintain than the original C implementation.

Then, the performance. The new algorithms and optimization tricks work well. I built SciPy with ATLAS and tested vq.kmeans(obs, k) on a i5-3470 machine, and I got the following performance statistics.

30000 x 100, k = 10, float6428636.259416950.388412687.4252.26
30000 x 100, k = 10, float3213575.05326483.47345272.22.57
10000 x 30, k = 10, float641923.39461146.7348752.1932.56
10000 x 30, k = 10, float321491.879989.209658.10642.27
100000 x 3, k = 4, float641663.1431181.5912529.09923.14
100000 x 3, k = 4, float321278.033967.4396431.02482.97
10000 x 1, k = 3, float64117.690894.062241.61862.83
10000 x 1, k = 3, float32110.287498.682641.99042.63
1000 x 3, k = 4, float6429.2328.6979.85422.97
1000 x 3, k = 4, float3227.524229.248812.48942.20

What’s Next

There’re still about two months to go. First, I will continue to work on #3712. This PR still has some issues. There’re some build warnings which seem to be caused by Cython fused types on some machine. Then, I will start to work on cluster.hierarchy. The hierarchy module will be more complicated and challenging then vq and it may take me about a month to rewrite it in Cython. And then I’ll try to implement the SLINK and CLINK algorithm.

At last, I want to thank my mentors Ralf, Charles and David, and of course everyone else in the SciPy community, who have helpped me a lot. I’m very glad to be working with you!

by Richard at June 20, 2014 05:07 PM


GSoC Open Source Brain: Retinal Filter II

LGN-Retinal Filter II

Now that we know how to crate a filter is time to use it to calculate how an LGN neuron would react to an incoming stimulus. In this entry we will create a white noise stimulus in order to see how an LGN neuron reacts to it, this approach has the advantage that we can then recover the filter by reverse correlation methods as a sanity check.

In the same spirit of the last post, we will define the spatial and time parameters that determine the lengths and resolutions in those dimensions:

#Time parameters
dt = 1.0 # resolution of the response (in milliseconds)
dt_kernel = 5.0 # resolution of the kernel (in milliseconds)
dt_stimuli = 10.0 # resolution of the stimuli (in milliseconds)

kernel_size = 25 # The size of the kernel

T_simulation = 2 * 10 ** 2.0 # Total time of the simulation in ms
Nt_simulation = int(T_simulation / dt) #Simulation points
N_stimuli = int(T_simulation / dt_stimuli) #Number of stimuli

# Space parameters
dx = 1.0
Lx = 20.0
Nx = int(Lx / dx)
dy = 1.0
Ly = 20.0
Ny = int(Ly / dy )

Now, we call our kernel which we have wrapped-up as a function from the work in the last post:

# Call the kernel
# Size of center area in the center-surround profile
sigma_center = 15
# Size of surround area in the center-surround profile
sigma_surround = 3
kernel = create_kernel(dx, Lx, dy, Ly, sigma_surround,
sigma_center, dt_kernel, kernel_size)

With this in our hand we can use the numpy random functions to create our white noise stimuli, we use here the realization of white noise call ternary noise which consists on values of -1, 0 and 1 assigned randomly to each pixel in our stimuli:

# Call the stimuli
stimuli = np.random.randint(-1, 2, size=(N_stimuli, Nx, Ny))

Before we can proceed to calculate the convolution we need to do some preliminary work. The convolution problem involves three time scales with different resolutions. We have first the resolution of the response dt , the resolution of the kernel dt_kernel and finally the resolution of the stimulus dt_stimuli.Operations with the kernel involve jumping from one scale to another constantly so we need a mechanism to keep track of that. In short, we would like to have a mechanism that transforms from some coordinates to the others in one specific place and not scatter all over the place.

Furthermore, in the convolution the kernel is multiplied by a specific point of images for each point in time. For the sake of efficiency we would like to have a mechanism that does this for once. With this in mind I have built a set of indexes for each scale that allow us to associate each element on the indexes of the response to its respective set of images. Also, we have a vector that associates every possible delay time in the kernel to the set of indexes in the response. We illustrate the mechanisms in the next figure

We can appreciate the three different times scales in the image. Furthermore, we have a set of indexes called delay indexes that maps each response to its respective image and also other set of indexes called delay indexes that map each of the delays to his respective response. We can create this set of indexes with the following code:

# Scale factors
input_to_image = dt / dt_stimuli # Transforms input to image
kernel_to_input = dt_kernel / dt # Transforms kernel to input
input_to_kernel = dt / dt_kernel # Transforms input to kernel

working_indexes = np.arange(Nt_simulation).astype(int)
# From here we remove the start at put the ones
remove_start = int(kernel_size * kernel_to_input)
signal_indexes = np.arange(remove_start,

# Calculate kernel
kernel_times = np.arange(kernel_size)
kernel_times = kernel_times.astype(int)

# Delay indexes
delay_indexes = np.floor(kernel_times * kernel_to_input)
delay_indexes = delay_indexes.astype(int)

# Image Indexes
stimuli_indexes = np.zeros(working_indexes.size)
stimuli_indexes = np.floor(working_indexes * input_to_image)
stimuli_indexes = stimuli_indexes.astype(int)

Now, we can calculate the response of a neuron with a center-surround receptive field by performing the convolution between its filter and the stimuli. We also plot the stimuli to see how it looks:

for index in signal_indexes:
delay = stimuli_indexes[index - delay_indexes]
# Do the calculation
signal[index] = np.sum(kernel[kernel_times,...]
* stimuli[delay,...])

t = np.arange(remove_start*dt, T_simulation, dt)
plt.plot(t, signal[signal_indexes], '-',
label='Kernel convoluted with noise')
plt.xlabel('Time (ms)')

We can see that the resolution of the response is as good as the resolution of the filter and this explains the discontinuities in the figure above.

As a sanity check we can calculate a voltage triggered average to recover the sta:

## Calculate the STA
kernel_size = kernel_times.size
Nside = np.shape(stimuli)[2]
sta = np.zeros((kernel_size ,Nside, Nside))

for tau, delay_index in zip(kernel_times, delay_indexes):
# For every tau we calculate the possible delay
# and take the appropriate image index
delay = stimuli_indexes[signal_indexes - delay_index]
# Now we multiply the voltage for the appropriate images
weighted_stimuli = np.sum( signal[signal_indexes, np.newaxis, np.newaxis] * stimuli[delay,...], axis=0)
# Finally we divide for the sample size
sta[tau,...] = weighted_stimuli / signal_indexes.size

Which we can plot in a convenient way with the following set of instructions:

## Visualize the STA
closest_square_to_kernel = int(np.sqrt(kernel_size)) ** 2

# Define the color map
cdict1 = {'red': ((0.0, 0.0, 0.0),
(0.5, 0.0, 0.1),
(1.0, 1.0, 1.0)),

'green': ((0.0, 0.0, 0.0),
(1.0, 0.0, 0.0)),

'blue': ((0.0, 0.0, 1.0),
(0.5, 0.1, 0.0),
(1.0, 0.0, 0.0))

from matplotlib.colors import LinearSegmentedColormap
blue_red1 = LinearSegmentedColormap('BlueRed1', cdict1)

n = int( np.sqrt(closest_square_to_kernel))
# Plot the filters
for i in range(closest_square_to_kernel):
plt.subplot(n,n,i + 1)
plt.imshow(sta[i,:,:], interpolation='bilinear',


by H (noreply@blogger.com) at June 20, 2014 05:15 PM

Philip Herron


TLDR: Rant post about maven, java and jenkins plugin development

So recently i done some work on the Coverity Plugin for Jenkins.
Note: I think coverity is a complete waste of money by the way for C/C++ development the latest GCC/LLVM shows you as much as you need.

I’ve got some good experience with Maven from hadoop development. I understand the problem its trying to solve huge multi-module projects have so many dependencies. The java ecosystem is just HUGE, i mean to build this plugin from a clean maven it pulls down hundreds of jar’s as dependencies. Its kind of insane.

I want as you read this blog post to remember that, all this plugin actually does is runs your jenkins build as you would expepect but runs some extra coverity tools which you need to manually have installed on your slave automatically.

But working with jenkins plugins with maven is sooo horrible. I mean because the pom.xml had issues and warnings i couldn’t figure out how to use it with intellij, the thing would never build but just give really cryptic errors. Then i just go back to emacs, commandline and run mvn compile to add in a MacroExpansionToken then to test it i have a local tomcat instance, so i have the jenkins.war already running there with a helloworld build to test the code. I run mvn clean compile install -DskipTests to create the jenkins .hpi then manually go to jenkins manage plugins and upload it. Then restart jenkins run the build and then find a bug go back to the begining.

Its so slow and painful and plugin development is sooooo awful with jenkins. I mean its the problem with java overall, making everything i mean everything so generic, yeah sure you can really easily reuse _any_ piece of code or embed some cool service into an application. But because everything has such generic interfaces programming with them you usually have really obscure abstract interfaces to work with and you need to wrap your head around before you can actually solve the problem.

I mean maven isn’t any kind of useful build tool it doesn’t even build software well, its a lump of horrible xml where you don’t even declare what you want it just automatically treats anything in src/main/{java/groovy/test/…} as a .jar unless you want a jar. This xml is really is awful to work with no documentation and a plugin for everything.

I mean if you don’t have an IDE like eclipse of intellij java development is almost just entirely impossible these days. There is just so much crap and boilerplate to fill out you won’t remember all of it to do it correctly.

It instantly forces tonnes of convention before it actually solves anything. Ant and ivy was so nice to work with before. I dont understand how maven made anything easier i mean i can sort of see it with hadoop it works pretty ok there. But even then does it?

With C/C++ or even Python or Node everything is so simple and if you need more complex systems there is alternatives for that. Everything about those languages they are very focused on solving an issue. Golang is about the only language that exists that almost reminds me of maven like a revel project or even just a simple library the dependency management is nice but it forces a fair amount of convention, but i think its bearable because its been like this always not like maven just poping along after a while and everything is quite fragmented.

I mean Java or any JVM or any .NET software is the only languages where you can spend literally weeks declaring boilerplate data-flow and object models for weeks. Before you actually write code to do what you need.

by Philip Herron at June 20, 2014 11:34 AM

June 19, 2014

Fabian Pedregosa

Surrogate Loss Functions in Machine Learning

TL; DR These are some notes on calibration of surrogate loss functions in the context of machine learning. But mostly it is an excuse to post some images I made.

In the binary-class classification setting we are given $n$ training samples $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where $X_i$ belongs to some sample space $\mathcal{X}$, usually $\mathbb{R}^p$ but for the purpose of this post we can keep i abstract, and $y_i \in \{-1, 1\}$ is an integer representing the class label.

We are also given a loss function $\ell: \{-1, 1\} \times \{-1, 1\} \to \mathbb{R}$ that measures the error of a given prediction. The value of the loss function $\ell$ at an arbitrary point $(y, \hat{y})$ is interpreted as the cost incurred by predicting $\hat{y}$ when the true label is $y$. In classification this function is often the zero-one loss, that is, $\ell(y, \hat{y})$ is zero when $y = \hat{y}$ and one otherwise.

The goal is to find a function $h: \mathcal{X} \to [k]$, the classifier, with the smallest expected loss on a new sample. In other words, we seek to find a function $h$ that minimizes the expected $\ell$-risk, given by $$ \mathcal{R}_{\ell}(h) = \mathbb{E}_{XY}[\ell(Y, h(X))] $$

In theory, we could directly minimize the $\ell$-risk and we would have the optimal classifier, also known as Bayes predictor. However, there are several problems associated with this approach. One is that the probability distribution of $XY$ is unknown, thus computing the exact expected value is not feasible. It must be approximated by the empirical risk. Another issue is that this quantity is difficult to optimize because the function $\ell$ is discontinuous. Take for example a problem in which $\mathcal{X} = \mathbb{R}^2, k=2$, and we seek to find the linear function $f(X) = \text{sign}(X w), w \in \mathbb{R}^2$ and that minimizes the $\ell$-risk. As a function of the parameter $w$ this function looks something like

loss as function of w

This function is discontinuous with large, flat regions and is thus extremely hard to optimize using gradient-based methods. For this reason it is usual to consider a proxy to the loss called a surrogate loss function. For computational reasons this is usually convex function $\Psi: \mathbb{R} \to \mathbb{R}_+$. An example of such surrogate loss functions is the hinge loss, $\Psi(t) = \max(1-t, 0)$, which is the loss used by Support Vector Machines (SVMs). Another example is the logistic loss, $\Psi(t) = 1/(1 + \exp(-t))$, used by the logistic regression model. If we consider the logistic loss, minimizing the $\Psi$-risk, given by $\mathbb{E}_{XY}[\Psi(Y, f(X))]$, of the function $f(X) = X w$ becomes a much more more tractable optimization problem:

In short, we have replaced the $\ell$-risk which is computationally difficult to optimize with the $\Psi$-risk which has more advantageous properties. A natural questions to ask is how much have we lost by this change. The property of whether minimizing the $\Psi$-risk leads to a function that also minimizes the $\ell$-risk is often referred to as consistency or calibration. For a more formal definition see [1] and [2]. This property will depend on the surrogate function $\Psi$: for some functions $\Psi$ it will be verified the consistency property and for some not. One of the most useful characterizations was given in [1] and states that if $\Psi$ is convex then it is consistent if and only if it is differentiable at zero and $\Psi'(0) < 0$. This includes most of the commonly used surrogate loss functions, including hinge, logistic regression and Huber loss functions.

  1. P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity , Classification , and Risk Bounds,” J. Am. Stat. Assoc., pp. 1–36, 2003. 

  2. A. Tewari and P. L. Bartlett, “On the Consistency of Multiclass Classification Methods,” J. Mach. Learn. Res., vol. 8, pp. 1007–1025, 2007. 

by Fabian Pedregosa at June 19, 2014 10:00 PM

June 18, 2014


GSoC Open Source Brain: Retinal Filter I

LGN-Retinal filter

Treating the retina and the LGN together as a spatio-temporal filter is a common trait in the models that we are going to work with in this project. In this series of articles I am going to develop and explain the mechanism that I have developed to model this particular stage of processing.

Structure of the filter

In this particular example we are dealing with a separable spatio-temporal filter. That is, we can write our filter as a product of the spatial part and the temporal part. In the following sections, we will describe them in this given order.

Spatial filter

In order to calculate the spatial filter we need two things. First, how wide our filters are going to be and then the resolution level. So lx and ly will stand for the size of the receptive filters (in degrees) for the x and y direction respectively. The same will go for dx and dy with respect to the resolution. With this in our hands we then can proceed to create a 2 dimensional structure for our functions with the help of the mesh functions:

## Spatial parameters
x = np.arange(-lx/2, lx/2, dx)
y = np.arange(-ly/2, ly/2, dy)

X, Y = np.meshgrid(x, y) # Create the 2D dimensional coordinates

Now that we have the function it is just a matter of calculating the function with the appropiate formula:

R = np.sqrt(X**2 + Y**2) # Distance
center = (17.0 / sigma_center**2) * np.exp(-(R / sigma_center)**2)
surround = (16.0 / sigma_surround**2) * np.exp(-(R / sigma_surround)**2)
Z = surround - center

With the formula in our hands we can plot it as a countour line to have an idea of how it looks like in space:
# Plot countour map
plt.contourf(X, Y, Z, 50, alpha=0.75, cmap=plt.cm.hot)

C = plt.contour(X, Y, Z, 10, colors='black', linewidth=.5)
plt.clabel(C, inline=10, fontsize=10)

We can also show a view when y=0 in order to offer a side view of the plot in order to gain further insight in the structure of it:

Temporal filter

The behaviour of the temporal filters in time is usually called bi-phasic response. This consists in a response that initially grows in time for the mean level but then it goes bellow to the mean level to finally return at it in time window of the order 200ms approximately. The particular function and parameters that we are going to use to illustrate this behaviour were taken from the paper by Cai et Al (Reference bellow). But first, we have to define our kernel size and resolution as in the code down here:

# First the kernel size and resolution
kernel_size = 200
dt_kernel = 1.0
t = np.arange(0, kernel_size * dt_kernel, dt_kernel) # Time vector

Now, we construct the function in the following way:
## Temporal parameters
K1 = 1.05
K2 = 0.7
c1 = 0.14
c2 = 0.12
n1 = 7.0
n2 = 8.0
t1 = -6.0
t2 = -6.0
td = 6.0

p1 = K1 * ((c1*(t - t1))**n1 * np.exp(-c1*(t - t1))) / ((n1**n1) * np.exp(-n1))
p2 = K2 * ((c2*(t - t2))**n2 * np.exp(-c2*(t - t2))) / ((n2**n2) * np.exp(-n2))
p3 = p1 - p2

We plot the function to have an idea of how it looks now:

Spatio-Temporal Filter

Finally, in order to have build the kernel, given that our filter is separable, we just have to multiply the temporal the temporal part by the spatial part at each point in space:

# Initialize and fill the spatio-temporal kernel
kernel = np.zeros((kernel_size, int(lx/dx), int(ly/dy)))

for k, p3 in enumerate(p3):
kernel[k,...] = p3 * Z

Which we can show now in the following plot:


  • Cai, Daqing, Gregory C. Deangelis, and Ralph D. Freeman. "Spatiotemporal receptive field organization in the lateral geniculate nucleus of cats and kittens." Journal of Neurophysiology 78.2 (1997): 1045-1061.

by H (noreply@blogger.com) at June 18, 2014 07:24 PM

Continuum Analytics

Binstar is Out of Beta, Now Available for Everyone

I am very excited to announce the public availability of Binstar. Binstar is a free service for hosting public packages for pip and conda with support for R, npm, and other formats in the near future. Private packages can also be hosted for a nominal monthly subscription. An on-premise version of Binstar is also available as part of Anaconda Server.

by Travis Oliphant at June 18, 2014 12:30 PM

June 16, 2014

Titus Brown

Being a release manager for khmer

We just released khmer v1.1, a minor version update from khmer v1.0.1 (minor version update: 220 commits, 370 files changed).

Cancel that -- _I_ just released khmer, because I'm the release manager for v1.1!

As part of an effort to find holes in our documentation, "surface" any problematic assumptions we're making, and generally increase the bus factor of the khmer project, we have been switching up the release manager for every 1.0+ release. The first release manager (for v1.0) was Michael Crusoe (@biocrusoe); the second (for v1.0.1) was (@luizirber); and I'm the third, for v1.1.

Each time through, we make our release process docs more explicit and more correct. There's really no better way to do it; each RM comes with a different background and a different skillset, so by the time four or five people have cut a release, we should have ironed out any wrinkles.

Roughly speaking, our release process consists of:

  1. Building a release candidate
  2. Testing the release candidate in multiple ways using virtualenvs on a Linux box.
  3. Pushing to the test PyPI server and doing install tests in a virtualenv.
  4. Doing multi-system tests on the BaTLab, and running our acceptance tests on Amazon Web Services.
  5. If that all passes for a release candidate, cutting the final release!

There are about a dozen steps in all, with 40-50 command line steps. It's a bit complicated but it's all codified in a somewhat cockamamie semi-automated copy-and-paste doc that actually works pretty well.

For me, the big challenge was getting the BaTLab multi-platform install tests to run; this required about 30 minutes of handholding by Michael. Once they ran we discovered a few problems, the biggest of which was a breakage of Python 2.6 compatibility -- which we simply wouldn't have found otherwise.

For the next release manager, the challenge will be to get through the acceptance tests; for v1.1, we added acceptance tests based on our metagenome protocol, so we now test about 90% of our software's command line interface on real data before releasing it. But since I'm the person in charge of the acceptance testing system, there are a few tricks and tips that I haven't completely codified, so I'll have to work on pushing those out the door.

The main lesson I've learned from all of this is that you catch a lot more bugs with all this testing than you would any other way. It will hopefully result in a more pleasant user experience as people find fewer and fewer "dumb" bugs, and it certainly is giving us more confidence in our software processes' robustness as we contemplate some bigger changes to khmer down the line.

Something that's also rather exciting is that we have three people who aren't part of the lab contributing code -- @chuckpr, @accaldwell, and @znruss. As our process becomes more explicit, I think we're attracting people who want to contribute but (6 months ago) wouldn't have known how.

What's up next on the khmer timeline? We're participating in the July Mozilla Science Labs hackathon by offering a khmer contributathon so people can try out our workflow, and we're going to revamp the docs for that. We'll let you know how it goes.


by C. Titus Brown at June 16, 2014 10:00 PM

June 14, 2014

Jake Vanderplas

Frequentism and Bayesianism IV: How to be a Bayesian in Python

I've been spending a lot of time recently writing about frequentism and Bayesianism.

Here I want to back away from the philosophical debate and go back to more practical issues: in particular, demonstrating how you can apply these Bayesian ideas in Python. The workhorse of modern Bayesianism is the Markov Chain Monte Carlo (MCMC), a class of algorithms used to efficiently sample posterior distributions.

Below I'll explore three mature Python packages for performing Bayesian analysis via MCMC:

  • emcee: the MCMC Hammer
  • pymc: Bayesian Statistical Modeling in Python
  • pystan: The Python Interface to Stan

I won't be so much concerned with speed benchmarks between the three, as much as a comparison of their respective APIs. This post is not meant to be a tutorial in any of the three; each of them is well documented and the links above include introductory tutorials for that purpose. Rather, what I want to do here is a side-by-side comparison which will give a feel for how each package is used. I'll propose a single relatively non-trivial test problem, and show the implementation and results of this problem using all three packages. Hopefully by seeing the three approaches side-by-side, you can choose which package might be best suited for your particular application.

Test Problem: Line of Best Fit

For our test problem, we'll do a three-parameter model which fits a straight line to data. The parameters will be the slope, the intercept, and the scatter about the line; the scatter in this case will be treated as a nuisance parameter.

The Data

Let's define some data that we'll work with:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

theta_true = (25, 0.5)
xdata = 100 * np.random.random(20)
ydata = theta_true[0] + theta_true[1] * xdata

# add scatter to points
xdata = np.random.normal(xdata, 10)
ydata = np.random.normal(ydata, 10)
In [2]:
plt.plot(xdata, ydata, 'ok')

The data are clearly correlated, and we'll assume that we don't know the errors. Let's construct a linear model to fit this data.

The Model: Slope, Intercept, & Unknown Scatter

Recall that Bayes' theorem gives

\[ P(\theta~|~D) \propto P(D~|~\theta) P(\theta) \]

Where \(D\) represents the observed data, and \(\theta\) represents the model.

We'll assume a linear model for our data, parametrized by a slope \(\beta\) and a y-intercept \(\alpha\):

\[ \hat{y}(x_i~|~\alpha,\beta) = \alpha + \beta x_i \]

Assuming gaussian errors on the observed y values, the probability for any data point under this model is given by:

\[ P(x_i, y_i~|~\alpha, \beta, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[\frac{-[y_i - \hat{y}(x_i~|~\alpha, \beta)]^2}{2\sigma^2}\right] \]

where \(\sigma\) here is an unknown measurement error, which we'll treat as a nuisance parameter

Multiplying these for all \(i\) gives the likelihood: \[ P(\{x_i\}, \{y_i\}~|~\alpha, \beta, \sigma) \propto (2\pi\sigma^2)^{-N/2} \exp\left[- \frac{1}{2\sigma^2} \sum_{i-1}^N [y_i - \hat{y}(x_i~|~\alpha,\beta)]^2\right] \]

The Prior

Here we're going to be a bit more careful about the choice of prior than we've been in the previous posts. We could simply choose flat priors on \(\alpha\), \(\beta\), and \(\sigma\), but we must keep in mind that flat priors are not always uninformative priors! A better choice is to follow Jeffreys and use symmetry and/or maximum entropy to choose maximally noninformative priors. This is the kind of exercise in choosing priors is one that frequentists often complain about as overly subjective, but the approach is well-founded and very well-supported from an information theoretic standpoint.

Why might a flat prior a bad choice? This is perhaps easiest to see in the case of slope. Let's visualize this by plotting some lines with slopes evenly-spaced between 0 and 10:

In [3]:
fig, ax = plt.subplots(subplot_kw=dict(aspect='equal'))
x = np.linspace(-1, 1)

for slope in np.arange(0, 10, 0.1):
    plt.plot(x, slope * x, '-k')

ax.axis([-1, 1, -1, 1], aspect='equal');

These lines have evenly-spaced slopes in units of 0.1, yet the higher slopes are bunched together. With a flat prior, you're essentially saying that any one of these slopes is just as likely as another. Due to this bunching seen above, it's clear that a flat prior on slope will highly favor very steep slopes! A flat prior on slope is not a minimally informative prior, and may end up biasing your result (though with enough data the effect is almost zero).

You might imagine coming up with a better scheme by-hand (perhaps use a flat prior on the angle \(\theta\) between the line and the x-axis) but we can be even more rigorous. The following problem has been well-explored in the Bayesian literature; the best resource I've found is a paper by Jaynes: Straight Line Fitting: A Bayesian Solution (pdf)

Prior on Slope and Intercept

If our model is given by

\[ y = \alpha + \beta x \]

then we can construct a parameter-space probability element \(P(\alpha, \beta) ~d\alpha ~d\beta\).

Because \(x\) and \(y\) are symmetric, we could just as easily use another set of parameters

\[ x = \alpha^\prime + \beta^\prime y \]

with probability element \(Q(\alpha^\prime, \beta^\prime)d\alpha^\prime d\beta^\prime\), where it's easy to show that

\[ (\alpha^\prime,~\beta^\prime) = (- \beta^{-1}\alpha,~\beta^{-1}). \]

From the Jacobian of the transformation, we can show that

\[ Q(\alpha^\prime, \beta^\prime) = \beta^3 P(\alpha, \beta). \]

Maintaining the symmetry of the problem requires that this change of variables should not affect the prior probability, so we can write:

\[ \beta^3 P(\alpha, \beta) = P(- \beta^{-1}\alpha, \beta^{-1}). \]

This is a functional equation which is satisfied by

\[ P(\alpha, \beta) \propto (1 + \beta^2)^{-3/2}. \]

which is equivalent to saying that \(\alpha\) is uniformly distributed, and \(\beta\) is distributed uniformly in \(\sin\theta\) where \(\theta = \tan^{-1}\beta\).

This might surprise you that the slopes are distributed according to \(\sin\theta\) rather than uniformly in \(\theta\). This \(\sin\theta\) term, though, can actually be thought of as coming from the intercept! If we change variables from \(\alpha\) to \(\alpha_\perp = \alpha\cos\theta\), then it's straightforward to show that our variables are uniformly distributed in \((\alpha_\perp,~\theta)\). We'll make use of this fact in the PyStan solution below.

Prior on \(\sigma\)

Similarly, we want the prior on \(\sigma\) to be invariant to rescalings of the problem (i.e. changing units). So our probability must satisfy

\[ P(\sigma)d\sigma = P(\sigma / c)d\sigma / c. \]

This is a functional equation satisfied by

\[ P(\sigma) \propto 1 / \sigma. \]

This is known as the Jeffreys Prior, after Harold Jeffreys.

Putting our Priors Together

Putting these together, we see that symmetry arguments have led to the following minimally informative prior on our model:

\[ P(\alpha, \beta, \sigma) \propto \frac{1}{\sigma}(1 + \beta^2)^{-3/2} \]

As alluded to above, you could equivalently address this by using flat priors on transformed parameters, namely \((\alpha, \beta, \sigma) \to (\alpha_\perp, \theta, \log\sigma)\), but I personally think the symmetry/maximum entropy approach is cleaner and clearer – it also gives us a chance to demonstrate the definition of nontrivial custom priors within the three packages.

Solving This Problem in Python

Now that we have the data, the likelihood, and the prior, let's show how to solve this problem in Python using emcee, PyMC, and PyStan. First, though, let's quickly define some convenience routines which will help us visualize the results:

In [4]:
# Create some convenience routines for plotting

def compute_sigma_level(trace1, trace2, nbins=20):
    """From a set of traces, bin by number of standard deviations"""
    L, xbins, ybins = np.histogram2d(trace1, trace2, nbins)
    L[L == 0] = 1E-16
    logL = np.log(L)

    shape = L.shape
    L = L.ravel()

    # obtain the indices to sort and unsort the flattened array
    i_sort = np.argsort(L)[::-1]
    i_unsort = np.argsort(i_sort)

    L_cumsum = L[i_sort].cumsum()
    L_cumsum /= L_cumsum[-1]
    xbins = 0.5 * (xbins[1:] + xbins[:-1])
    ybins = 0.5 * (ybins[1:] + ybins[:-1])

    return xbins, ybins, L_cumsum[i_unsort].reshape(shape)

def plot_MCMC_trace(ax, xdata, ydata, trace, scatter=False, **kwargs):
    """Plot traces and contours"""
    xbins, ybins, sigma = compute_sigma_level(trace[0], trace[1])
    ax.contour(xbins, ybins, sigma.T, levels=[0.683, 0.955], **kwargs)
    if scatter:
        ax.plot(trace[0], trace[1], ',k', alpha=0.1)
def plot_MCMC_model(ax, xdata, ydata, trace):
    """Plot the linear model and 2sigma contours"""
    ax.plot(xdata, ydata, 'ok')

    alpha, beta = trace[:2]
    xfit = np.linspace(-20, 120, 10)
    yfit = alpha[:, None] + beta[:, None] * xfit
    mu = yfit.mean(0)
    sig = 2 * yfit.std(0)

    ax.plot(xfit, mu, '-k')
    ax.fill_between(xfit, mu - sig, mu + sig, color='lightgray')


def plot_MCMC_results(xdata, ydata, trace, colors='k'):
    """Plot both the trace and the model together"""
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    plot_MCMC_trace(ax[0], xdata, ydata, trace, True, colors=colors)
    plot_MCMC_model(ax[1], xdata, ydata, trace)

Without further ado, let's do some MCMC.


The emcee package (also known as MCMC Hammer, which is in the running for best Python package name in history) is a Pure Python package written by Astronomer Dan Foreman-Mackey. It is a lightweight package which implements a fairly sophisticated Affine-invariant Hamiltonian MCMC. Because the package is pure Python (i.e. it contains no compiled extensions) it is extremely easy to install; with pip, simply type at the command-line

[~]$ pip install emcee

Emcee does not have much specific boilerplate code; it simply requires you to pass it a Python function which returns a value proportional to the log-posterior probability, and returns samples from that posterior. Here's how we solve the above problem with emcee:

In [5]:
import emcee

In [6]:
# Define our posterior using Python functions
# for clarity, I've separated-out the prior and likelihood
# but this is not necessary. Note that emcee requires log-posterior

def log_prior(theta):
    alpha, beta, sigma = theta
    if sigma < 0:
        return -np.inf  # log(0)
        return -1.5 * np.log(1 + beta ** 2) - np.log(sigma)

def log_likelihood(theta, x, y):
    alpha, beta, sigma = theta
    y_model = alpha + beta * x
    return -0.5 * np.sum(np.log(2 * np.pi * sigma ** 2) + (y - y_model) ** 2 / sigma ** 2)

def log_posterior(theta, x, y):
    return log_prior(theta) + log_likelihood(theta, x, y)
In [7]:
# Here we'll set up the computation. emcee combines multiple "walkers",
# each of which is its own MCMC chain. The number of trace results will
# be nwalkers * nsteps

ndim = 3  # number of parameters in the model
nwalkers = 50  # number of MCMC walkers
nburn = 1000  # "burn-in" period to let chains stabilize
nsteps = 2000  # number of MCMC steps to take

# set theta near the maximum likelihood, with 
starting_guesses = np.random.random((nwalkers, ndim))
In [8]:
# Here's the function call where all the work happens:
# we'll time it using IPython's %time magic

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[xdata, ydata])
%time sampler.run_mcmc(starting_guesses, nsteps)
CPU times: user 5.73 s, sys: 13 ms, total: 5.75 s
Wall time: 5.76 s

In [9]:
# sampler.chain is of shape (nwalkers, nsteps, ndim)
# we'll throw-out the burn-in points and reshape:
emcee_trace = sampler.chain[:, nburn:, :].reshape(-1, ndim).T
plot_MCMC_results(xdata, ydata, emcee_trace)

On the left we show the resulting traces marginalized over the nuisance parameter \(\sigma\). On the right, we show the line of best-fit along with the 2-\(\sigma\) uncertainty region. This is exactly the type of result we expect from MCMC: marginalized uncertainty contours around a model which provides a good by-eye fit to the data.


The PyMC package has many more features than emcee, including built-in support for efficient sampling of common prior distributions. PyMC by default uses the classic Metropolis-Hastings sampler, one of the earliest MCMC algorithms. For performance, it uses compiled fortran libraries, so it is less trivial to install using tools like pip. My machine has a working fortran compiler, so pip install pymc worked without a problem (but from working with students, colleagues, and tutorial attendees, I can tell you that few scientists today have a system setup such that this will work out-of-the-box). For folks who don't have a fortran compiler installed, PyMC binaries for many systems can be quite easily installed with conda.

I should mention that the future PyMC version 3 removes fortran dependence and makes the installation much more streamlined; I've also been told that the API of PyMC 3 is much cleaner, and that performance is much better. Because PyMC 3 is still listed as an alpha release, I've decided to stick with the current supported release for this post:

In [10]:
import pymc

In [11]:
# Define the variables needed for the routine, with their prior distributions
alpha = pymc.Uniform('alpha', -100, 100)

def beta(value=0):
    return -1.5 * np.log(1 + value ** 2)

def sigma(value=1):
    return -np.log(abs(value))

# Define the form of the model and likelihood
def y_model(x=xdata, alpha=alpha, beta=beta):
    return alpha + beta * x

y = pymc.Normal('y', mu=y_model, tau=1. / sigma ** 2, observed=True, value=ydata)

# package the full model in a dictionary
model1 = dict(alpha=alpha, beta=beta, sigma=sigma,
              y_model=y_model, y=y)
In [12]:
# run the basic MCMC: we'll do 100000 iterations to match emcee above
S = pymc.MCMC(model1)
S.sample(iter=100000, burn=50000)

 [-                 3%                  ] 3506 of 100000 complete in 0.5 sec
 [--                6%                  ] 6801 of 100000 complete in 1.0 sec
 [---              10%                  ] 10108 of 100000 complete in 1.5 sec
 [-----            13%                  ] 13510 of 100000 complete in 2.0 sec
 [------           16%                  ] 16684 of 100000 complete in 2.5 sec
 [-------          20%                  ] 20018 of 100000 complete in 3.0 sec
 [--------         23%                  ] 23384 of 100000 complete in 3.5 sec
 [----------       26%                  ] 26671 of 100000 complete in 4.0 sec
 [-----------      30%                  ] 30000 of 100000 complete in 4.5 sec
 [------------     33%                  ] 33252 of 100000 complete in 5.0 sec
 [-------------    36%                  ] 36635 of 100000 complete in 5.5 sec
 [---------------  40%                  ] 40046 of 100000 complete in 6.0 sec
 [---------------- 43%                  ] 43430 of 100000 complete in 6.5 sec
 [-----------------46%                  ] 46732 of 100000 complete in 7.0 sec
 [-----------------50%                  ] 50059 of 100000 complete in 7.5 sec
 [-----------------52%                  ] 52746 of 100000 complete in 8.0 sec
 [-----------------55%-                 ] 55446 of 100000 complete in 8.5 sec
 [-----------------58%--                ] 58070 of 100000 complete in 9.0 sec
 [-----------------60%--                ] 60486 of 100000 complete in 9.5 sec
 [-----------------63%---               ] 63079 of 100000 complete in 10.0 sec
 [-----------------65%----              ] 65668 of 100000 complete in 10.5 sec
 [-----------------68%-----             ] 68369 of 100000 complete in 11.0 sec
 [-----------------71%------            ] 71008 of 100000 complete in 11.5 sec
 [-----------------73%-------           ] 73672 of 100000 complete in 12.0 sec
 [-----------------76%--------          ] 76262 of 100000 complete in 12.5 sec
 [-----------------78%---------         ] 78660 of 100000 complete in 13.0 sec
 [-----------------81%----------        ] 81231 of 100000 complete in 13.5 sec
 [-----------------83%-----------       ] 83790 of 100000 complete in 14.0 sec
 [-----------------86%------------      ] 86407 of 100000 complete in 14.5 sec
 [-----------------89%-------------     ] 89101 of 100000 complete in 15.0 sec
 [-----------------91%--------------    ] 91789 of 100000 complete in 15.5 sec
 [-----------------94%---------------   ] 94391 of 100000 complete in 16.0 sec
 [-----------------97%----------------  ] 97081 of 100000 complete in 16.5 sec
 [-----------------99%----------------- ] 99649 of 100000 complete in 17.0 sec
 [-----------------100%-----------------] 100000 of 100000 complete in 17.1 sec
In [13]:
# extract the traces and plot the results
pymc_trace = [S.trace('alpha')[:],

plot_MCMC_results(xdata, ydata, pymc_trace)

We get traces very similar to those provided by emcee.


The PyStan project is the official Python wrapper of the Stan Probabilistic programming language, which is implemented in C++. It uses a No U-Turn Sampler, which is more sophisticated than classic Metropolis-Hastings or Gibbs sampling. As far as API goes, the important difference between PyStan as compared to emcee and PyMC is that it requires you to write and compile non-Python code within your Python script when defining your model.

Because PyStan depends on the Stan package, installation can be difficult. In particular, you need the full Stan library to be installed in order to use the python wrapper. If you have a well-set-up environment for compiling C/C++ code, this shouldn't be a problem – using pip install pystan downloaded, compiled, and installed everything necessary on my system.

From what I could find online, it doesn't appear that the Stan team provides pre-built binaries, including for conda. Because of this, if you don't have your system set up for C/C++ compilation, installation will probably be more difficult.

I'm using version 2.2 of PyStan:

In [14]:
import pystan

There is one more hiccup with this package: for some reason, PyStan runs extremely slowly and crashes my computer when I run it in the IPython notebook itself, but works without issues when I run it as an executable Python file. I'm not sure why this is but it may have something to do with this issue. To get around this, I'll save the PyStan script as a standalone Python file and execute it from the command-line, writing the resulting traces to disk:

In [15]:
%%file pystan_example.py

import numpy as np
import pystan

# Generate data (same as used in the notebook)

theta_true = (25, 0.5)
xdata = 100 * np.random.random(20)
ydata = theta_true[0] + theta_true[1] * xdata

# add scatter to points
xdata = np.random.normal(xdata, 10)
ydata = np.random.normal(ydata, 10)

# Create the Stan model
#  this is done by defining a string of Stan code.

fit_code = """
data {
    int<lower=0> N; // number of points
    real x[N]; // x values
    real y[N]; // y values

parameters {
    real alpha_perp;
    real<lower=-pi()/2, upper=pi()/2> theta;
    real log_sigma;

transformed parameters {
    real alpha;
    real beta;
    real sigma;
    real ymodel[N];
    alpha <- alpha_perp / cos(theta);
    beta <- sin(theta);
    sigma <- exp(log_sigma);
    for (j in 1:N)
    ymodel[j] <- alpha + beta * x[j];

model {
    y ~ normal(ymodel, sigma);

# perform the fit
fit_data = {'N': len(xdata), 'x': xdata, 'y': ydata}
fit = pystan.stan(model_code=fit_code, data=fit_data, iter=25000, chains=4)

# extract the traces
traces = fit.extract()
pystan_trace = [traces['alpha'], traces['beta'], traces['sigma']]

# save the traces with numpy
np.save("pystan_trace.npy", pystan_trace)
Overwriting pystan_example.py

In [16]:
# run the code we've created on the command-line
!python pystan_example.py
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_c1dba2ed7f485b674d7ce5eb738ffe05 NOW.

Iteration:     1 / 25000 [  0%]  (Warmup)
Iteration:     1 / 25000 [  0%]  (Warmup)
Iteration:     1 / 25000 [  0%]  (Warmup)
Informational Message: The current Metropolis proposal is about to be rejected becuase of the following issue:
Error in function stan::prob::normal_log(N4stan5agrad3varE): Scale parameter is 0:0, but must be > 0!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Iteration:     1 / 25000 [  0%]  (Warmup)
Iteration:  2500 / 25000 [ 10%]  (Warmup)
Iteration:  2500 / 25000 [ 10%]  (Warmup)
Iteration:  2500 / 25000 [ 10%]  (Warmup)
Iteration:  2500 / 25000 [ 10%]  (Warmup)
Iteration:  5000 / 25000 [ 20%]  (Warmup)
Iteration:  5000 / 25000 [ 20%]  (Warmup)
Iteration:  5000 / 25000 [ 20%]  (Warmup)
Iteration:  5000 / 25000 [ 20%]  (Warmup)
Iteration:  7500 / 25000 [ 30%]  (Warmup)
Iteration:  7500 / 25000 [ 30%]  (Warmup)
Iteration:  7500 / 25000 [ 30%]  (Warmup)
Iteration:  7500 / 25000 [ 30%]  (Warmup)
Iteration: 10000 / 25000 [ 40%]  (Warmup)
Iteration: 10000 / 25000 [ 40%]  (Warmup)
Iteration: 10000 / 25000 [ 40%]  (Warmup)
Iteration: 10000 / 25000 [ 40%]  (Warmup)
Iteration: 12500 / 25000 [ 50%]  (Warmup)
Iteration: 12500 / 25000 [ 50%]  (Warmup)
Iteration: 12500 / 25000 [ 50%]  (Warmup)
Iteration: 12500 / 25000 [ 50%]  (Warmup)
Iteration: 15000 / 25000 [ 60%]  (Sampling)
Iteration: 15000 / 25000 [ 60%]  (Sampling)
Iteration: 15000 / 25000 [ 60%]  (Sampling)
Iteration: 15000 / 25000 [ 60%]  (Sampling)
Iteration: 17500 / 25000 [ 70%]  (Sampling)
Iteration: 17500 / 25000 [ 70%]  (Sampling)
Iteration: 17500 / 25000 [ 70%]  (Sampling)
Iteration: 17500 / 25000 [ 70%]  (Sampling)
Iteration: 20000 / 25000 [ 80%]  (Sampling)
Iteration: 20000 / 25000 [ 80%]  (Sampling)
Iteration: 20000 / 25000 [ 80%]  (Sampling)
Iteration: 20000 / 25000 [ 80%]  (Sampling)
Iteration: 22500 / 25000 [ 90%]  (Sampling)
Iteration: 22500 / 25000 [ 90%]  (Sampling)
Iteration: 22500 / 25000 [ 90%]  (Sampling)
Iteration: 22500 / 25000 [ 90%]  (Sampling)
Iteration: 25000 / 25000 [100%]  (Sampling)
Elapsed Time: 0.664267 seconds (Warm-up)
              0.724006 seconds (Sampling)
              1.38827 seconds (Total)

Iteration: 25000 / 25000 [100%]  (Sampling)
Elapsed Time: 0.6626 seconds (Warm-up)
              0.781297 seconds (Sampling)
              1.4439 seconds (Total)

Iteration: 25000 / 25000 [100%]  (Sampling)
Elapsed Time: 0.669715 seconds (Warm-up)
              0.799694 seconds (Sampling)
              1.46941 seconds (Total)

Iteration: 25000 / 25000 [100%]  (Sampling)
Elapsed Time: 0.667423 seconds (Warm-up)
              0.795613 seconds (Sampling)
              1.46304 seconds (Total)

As we can see, the execution takes ~6 sec in serial to draw 100,000 samples. Additionally, on my machine, the compilation phase takes about ~20 seconds before the model is run.

In [17]:
# load the results from file; plot as above
pystan_trace = np.load('pystan_trace.npy')
plot_MCMC_results(xdata, ydata, pystan_trace)

Again, the results look similar to what we've seen above.

Summary: Comparing the Three Methods

As a more direct comparison of the results, let's quickly over-plot the contours derived by the three methods:

In [18]:
fig, ax = plt.subplots(figsize=(8, 8))
plot_MCMC_trace(ax, xdata, ydata, emcee_trace, True,
                colors='blue', linewidths=2)
plot_MCMC_trace(ax, xdata, ydata, pymc_trace,
                colors='red', linewidths=2)
plot_MCMC_trace(ax, xdata, ydata, pystan_trace,
                colors='green', linewidths=2)
ax.legend(ax.collections[::2], ['emcee', 'pymc', 'pystan'], fontsize=16);

As we would expect, the results agree! This indicates that we've defined the models consistently in all three packages. Additionally, we see that the "true" values used to generate the distribution of points (25, 0.5) fall within the 1-\(\sigma\) error ellipse.

Evaluating the Packages

So which package is the best? The answer to that will likely depend on what your problem is. Here's a quick table summarizing my own impressions of the three packages:

Complexity Execution time (100,000 samples; includes burn-in) Ease of Installation Learning Curve/Ease of Use Number of Features
emcee Very lightweight ~6 sec Pure python; installs easily with pip Straightforward & Pythonic not much built-in beyond basic MCMC sampling
pymc2 Lots of functionality & options ~17 sec Requires fortran compiler; binaries available via conda Pythonic, but lots of pymc-specific boilerplate Lots of built-in functionality in Python
pystan Large package; requires coding in Stan language ~20 sec compilation + ~6 sec computation Requires C compiler + Stan installation; no binaries available Not pure Python; must learn Stan model specification language Lots of built-in functionality in Stan-specific language

More verbosely:

emcee is extremely lightweight, and that gives it a lot of power. All you need to do is define your log-posterior (in Python) and emcee will sample from that distribution. Because it's pure-Python and does not have specially-defined objects for various common distributions (i.e. uniform, normal, etc.) I thought it might be slower than the others, but its performance on this problem is impressive. This is perhaps due to the more sophisticated default sampling scheme, so the benchmark may not be a fair comparison.

pymc is more full-featured, and once you get the hang of the decorator syntax for defining variables, distributions, and derived quantities, it is very easy to use. Its performance lagged the other two: the same query took several times longer, despite having optimized objects for sampling from various priors. From what I hear, though, the still-in-alpha PyMC version 3 – a complete rewrite of the package – blows PyMC version 2 out of the water.

pystan is the most difficult of the three to use, but that's because it's not really a Python package. Models are specified not in Python, but in a custom statistical expression language. This language is very powerful, but has a steep learning curve. The ~20sec compilation time for each model is annoying, but I suspect that as models get more complicated and sample sizes grow, this won't seem like such an issue. The fact that Stan is specifically designed for this type of operation, and the fact that all its models compile directly to byte code, makes it seem like a reasonable choice for large, specialized computations.

I wouldn't put much weight on the timings here; as mentioned above, the packages use very different default sampling schemes (Metropolis-Hastings for PyMC, no U-Turn sampling for PyStan, and affine-invariant ensemble sampling for emcee) so the times are not directly comparable. Each of these sampling schemes has its advantages and disadvantages, which you can read about in the links in above sections.

I hope that I've given a fair comparison of these packages here. Note that I'm not an expert or contributor to any of these packages: I've simply read the online documentation and adapted the tutorial examples to the specific problem I proposed here. If any power users or package developers are reading this and have feedback about how to better use or represent any of these packages, I'd love to hear your response! I'd say two things:

  1. Please write a blog comment and let me know how to use your package more effectively!
  2. Please realize that the average user likely uses your package as I did here: by reading the intro tutorials, and adapting those examples to their own problems. If that has led me to do something silly, it's likely that many of your users are making the same silly mistake!

Thanks for reading!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view on nbviewer.

by Jake Vanderplas at June 14, 2014 04:30 PM

Paul Ivanov

starting my job search

I am starting to look for a job in the San Francisco Bay Area.

Since many recruiters ask for and presumably look at GitHub profiles, I decided to give mine a little facelift:

Smart and Gets Things Done Github Contribution

In case you aren't familiar, that banner was motivated by Joel Spolsky's Smart and Gets Things Done, which is a book about hiring good developers . So I decided to tweet it out, mentioning @spolsky and he favorited it!

Yesterday, I decided to tweet out an image that's at the top of my resume as a standalone tweet- mentioning Joel Spolsky again, and he liked it well enough to retweet it to his 90 thousand followers, so it's been getting plenty of love.

Paul Ivanov's Visual Resume

Perhaps unsurprisingly, the only person to contact me as a result of this so far is a reporter from Business Insider :

My editor would like to post it on our site as an example of a creative way to format a resume... I'm wondering if we can get your permission to do this?

So that's what prompted this post: I simply added my name and a Creative Commons Attribution Licence (CC-BY) to the two images, and then sent my permission along.

Outside of that, no prospective employers have gotten in touch. But like I always say: you can't win the lottery if you don't buy a ticket. And since I also enjoy mixing metaphors, I'll just keep on fishing!

by Paul Ivanov at June 14, 2014 07:00 AM

Maheshakya Wijewardena

Performance evaluation of Approximate Nearest Neighbor search implementations - Part 2

This continues from the post Performance evaluation of Approximate Nearest Neighbor search implementations - Part 1. The evaluation about the memory consumption is already completed in that post. In this post, the next two aspects of the evaluation framework, precision and query speed will be discussed.

When measuring the performance of Approximate nearest neighbor search methods, expressing precision and the query speed independently is less useful since the entire idea of approximating is to obtain the desired precision within a better time period. In order to evaluate precision, an ANN implementation should be able to provide a multiple number of neighbors(rather than just the nearest neighbor). After obtaining all data points in the data set from the ANN method, first few entires in the that neighbors list(ten neighbors in my evaluation tests) are taken as the neighbors of ground truth of that particular ANN method. These set of data points are compared against neighbors retrieved when the number of queried neighbors is varied. Precision tests are adapted from the tests performed for ANNOY.

This precision measure eliminates some of our candidate ANN implementations because those are not capable of producing a multiple number of neighbors. Obtaining multiple neighbors is an essential requirement of for precision tests described above as well the general applications of nearest neighbor search. Therefore, for the precision tests, only ANNOY, FLANN, KGraph and LSH Forest are taken into consideration. All evaluation tests, LSH forest implementation and the plots can be found in this Github repository.

Before jumping into comparisons, I thought it is imperative to get a notion on the characteristics of the LSH forest implementation. Unlike other ANN implementations, LSH forest provides some user controlled parameters to tune the forest and queries in different scenarios.

For the entire forest:

  • Number of trees
  • Maximum hash length

For queries:

  • c : A value which determines the number of candidates chosen into the neighbors set. This acts with the number of trees.
  • A lower bound for the maximum depth of hashes considered.

In these precision tests, all the those factors but c are fixed to constant values as follows:

  • Number of trees = 10
  • Maximum hash length = 32
  • Lower bound = 4

The same data set which has been prepared using singular value decomposition on movielens data is used in all of these tests. Following are the resulting graphs of the performance of LSH forest. Time is measured in seconds. precision vs c LSHF precision vs time LSHF precision vs number of candidates LSHF number of candidates vs c LSHF

The next section of this post illustrates the performance comparisons among ANNOY, FLANN, LSH forest and KGraph. Precision vs. time graphs are used for this comparison. precision vs time LSHF and ANNOY precision vs time FLANN & LSHF Comparing ANNOY, FLANN and LSHF in one place: precision vs time LSHF, ANNOY & FLANN

KGraph has a significantly higher precision rate than other ANN implementations.(Rather than approximating, it gives the actual nearest neighbors according the KGraph documentation ) precision vs time LSHF & KGraph

One of the main considerations of these evaluations is the maintainability of the code. Therefore, any implementation which goes into scikit-learn should have reasonably less complex code. Both FLANN and KGraph use complex data structures and algorithms to achieve higher speeds. ANNOY has a reasonably passable precision-query speed combination with a less complex implementation. Our current implementation of LSH forest has been able to beat ANNOY in precision-query speed comparison.

Indexing speeds of ANN implementations

In addition to precision and query speed, a measure of indexing speed has a major importance. Therefore as a final step for this evaluation, I will provide you a description on indexing speed for the same data set used above for each ANN implementation.

  • KGraph : 65.1286380291 seconds
  • ANNOY (metric='Angular') : 47.5299789906 seconds
  • FLANN : 0.314314842224 seconds
  • LSHF : 3.73900985718 seconds

In my next post, I will discuss about the implementation of LSH forest in detail and how ANN methods will be implemented in scikit-learn.


  1. ANN : Approximate Nearest Neighbors
  2. LSH : Locality Sensitive Hashing

June 14, 2014 12:00 AM

June 13, 2014

Manoj Kumar


Hi, The first part of the last few days of have been the most productive and the last part well not so much.

1. Fixing precompute for ElasticNetCV
The function argument precompute=”auto” was being unused, in ElasticNetCV as mentioned in my previous post. Setting precompute equals auto uses a gram variant of the input matrix which according to the documentation is np.dot(X.T, X) . This theoritically helps the descent algorithm to converge faster. (However at the time of writing, I do not know exactly how). Practically however, (and after testing using the line profiler) it seems to be a bit slower since the computation of the gram matrix takes quite a bit of time. So with the advice of ogrisel, I split it across three Pull Requests. All three are essentially easy to fix.

1. https://github.com/scikit-learn/scikit-learn/pull/3247 – This ensures that the Gram variant is used if precompute is said to True or (auto and if n_samples > n_features)
2. https://github.com/scikit-learn/scikit-learn/pull/3248 – Remove precompute from Multi Task models since it is being unused,
3. https://github.com/scikit-learn/scikit-learn/pull/3249 – This is a WIP, and changes default precompute from auto to False

2. Threading backend for Linear Models.
I have successfully changed the backend from muli-processing to threading after releasing the GIL for all four variants. After a final round of review it can be merged,
a] Simple coordinate descent
b] Sparse coordinate descent
c] Gram variant
d] MultiTask variant
There is a huge memory gain and speed is almost the same (if not slightly higher by this Pull Request). https://github.com/scikit-learn/scikit-learn/pull/3102

3. Logistic Regression CV
Reading someone else’s magically vectorised NumPy code isn’t an easy task and I somehow crawled my way through it (which explains the more productive first part).

I fixed a bug in the code to compute the Hessian when the intercept is true. I’ve also fixed sparse matrix support and added multiple tests to it and confirmted that the newton-cg and lbfgs solvers give the exact same result, The liblinear has a slight change due to the penalisation of the intercept.

However benchmarking gives ambiguous results. On standard datasets such as the newsgroup and digits data, almost always the lib-linear solver is the fastest. However in datasets using make_classification, lbfgs seems to be the faster solver.

Right now, my job is just to wait for comments from Alex and Olivier and make the necessary changes. I shall come up with a more detailed description on Log Reg CV next week.

by Manoj Kumar at June 13, 2014 08:05 AM

Issam Laradji

(Week 3) GSoC 2014: Extending Neural Networks Module for Scikit learn

This week, with the help of many reviewers I completed a user friendly multi-layer perceptron algorithm in Python. While it is still a Pull Request , the algorithm can be downloaded by following these steps,

1) git clone https://github.com/scikit-learn/scikit-learn
2) cd scikit-learn/
3) git fetch origin refs/pull/3204/head:mlp
4) git checkout mlp

Creating an MLP classifier is easy. First, import the scikit-learn library; then, initialize an MLP classifier by executing these statements,

  from sklearn.neural_network import MultilayerPerceptronClassifier
  clf = MultilayerPerceptronClassifier()

If you'd like to have 3 hidden layers of sizes 150-100-50, create an MLP object using this statement,

   clf = MultilayerPerceptronClassifier(n_hidden=[150, 100, 50])

Training and testing are done the same way as any learning algorithm in scikit-learn.

In addition, you can tune many parameters for your classifier. Some of the interesting ones are,

1) the algorithm parameter, which allows users to select the type of algorithm for optimizing the neural network weights, which is either stochastic gradient descent SGD and l-bfgs; and

2) the max_iter parameter, which allows users to set the number of maximum iterations the network can run.

After training mlp, you can view the minimum cost achieved by printing mlp.cost_. This gives an idea of how well the algorithm has trained.

The implementation passed high standard tests and achieved expected performance for the MNIST dataset. MLP with one hidden layer of 200 neurons, and 400 iterations achieved great results in the MNIST benchmark compared to other algorithms shown below,

 Classification performance                                                         
 Classifier                         train-time                          test-time                       error-rate
  Multi layer Perceptron         655.5s                            0.30s                            0.0169
  nystroem_approx_svm         125.0s                            0.91s                            0.0239
  ExtraTrees                             79.9s                           0.34s                            0.0272
  fourier_approx_svm             148.9s                            0.60s                            0.0488
  LogisticRegression                68.9s                            0.14s                            0.0799  

For next week, I will implement extreme learning machine algorithms. These use the least-square solution approach for training neural networks, and therefore they are both quick and efficient with interesting applications.

by noreply@blogger.com (Issam Laradji) at June 13, 2014 03:52 AM

June 12, 2014

Jake Vanderplas

Frequentism and Bayesianism III: Confidence, Credibility, and why Frequentism and Science do not Mix

In Douglas Adams' classic Hitchhiker's Guide to the Galaxy, hyper-intelligent pan-dimensional beings build a computer named Deep Thought in order to calculate "the Answer to the Ultimate Question of Life, the Universe, and Everything". After seven and a half million years spinning its hyper-dimensional gears, before an excited crowd, Deep Thought finally outputs the answer:


The disappointed technicians, who trained a lifetime for this moment, are stupefied. They probe Deep Though for more information, and after some back-and-forth, the computer responds: "once you do know what the question actually is, you'll know what the answer means."

An answer does you no good if you don't know the question.

I find this story be an apt metaphor for statistics as sometimes used in the scientific literature. When trying to estimate the value of an unknown parameter, the frequentist approach generally relies on a confidence interval (CI), while the Bayesian approach relies on a credible region (CR). While these concepts sound and look very similar, their subtle difference can be extremely important, as they answer essentially different questions.

Like the poor souls hoping for enlightenment in Douglas Adams' universe, scientists often turn the crank of frequentism hoping for useful answers, but in the process overlook the fact that in science, frequentism is generally answering the wrong question. This is far from simple philosophical navel-gazing: as I'll show, it can have real consequences for the conclusions we draw from observed data.

Confidence vs. Credibility

In part I of this series, we discussed the basic philosophical difference between frequentism and Bayesianism: frequentists consider probability a measure of the frequency of (perhaps hypothetical) repeated events; Bayesians consider probability as a measure of the degree of certainty about values. As a result of this, speaking broadly, frequentists consider model parameters to be fixed and data to be random, while Bayesians consider model parameters to be random and data to be fixed.

These philosophies fundamenally affect the way that each approach seeks bounds on the value of a model parameter. Because the differences here are subtle, I'll go right into a simple example to illustrate the difference between a frequentist confidence interval and a Bayesian credible region.

Example 1: The Mean of a Gaussian

Let's start by again examining an extremely simple problem; this is the same problem we saw in part I of this series: finding the mean of a Gaussian distribution. Previously we simply looked at the (frequentist) maximum likelihood and (Bayesian) maximum a posteriori estimates; here we'll extend this and look at confidence intervals and credibile regions.

Here is the problem: imagine you're observing a star that you assume has a constant brightness. Simplistically, we can think of this brightness as the number of photons reaching our telescope in one second. Any given measurement of this number will be subject to measurement errors: the source of those errors is not important right now, but let's assume the observations \(x_i\) are drawn from a normal distribution about the true brightness value with a known standard deviation \(\sigma_x\).

Given a series of measurements, what are the 95% (i.e. \(2\sigma\)) limits that we would place on the brightness of the star?

1. The Frequentist Approach

The frequentist approach to this problem is well-known, and is as follows:

For any set of \(N\) values \(D = \{x_i\}_{i=1}^N\), an unbiased estimate of the mean \(\mu\) of the distribution is given by

\[ \bar{x} = \frac{1}{N}\sum_{i=1}^N x_i \]

The sampling distribution describes the observed frequency of the estimate of the mean; by the central limit theorem we can show that the sampling distribution is normal; i.e.

\[ f(\bar{x}~||~\mu) \propto \exp\left[\frac{-(\bar{x} - \mu)^2}{2\sigma_\mu^2}\right] \]

where we've used the standard error of the mean,

\[ \sigma_\mu = \sigma_x / \sqrt{N} \]

The central limit theorem tells us that this is a reasonable approximation for any generating distribution if \(N\) is large; if our generating distribution happens to be Gaussian, it also holds for \(N\) as small as 2.

Let's quickly check this empirically, by looking at \(10^6\) samples of the mean of 5 numbers:

In [20]:
import numpy as np

N = 5
Nsamp = 10 ** 6
sigma_x = 2

x = np.random.normal(0, sigma_x, size=(Nsamp, N))
mu_samp = x.mean(1)
sig_samp = sigma_x * N ** -0.5

print("{0:.3f} should equal {1:.3f}".format(np.std(mu_samp), sig_samp))
0.894 should equal 0.894

It checks out: the standard deviation of the observed means is equal to \(\sigma_x N^{-1/2}\), as expected.

From this normal sampling distribution, we can quickly write the 95% confidence interval by recalling that two standard deviations is roughly equivalent to 95% of the area under the curve. So our confidence interval is

\[ CI_{\mu} = \left(\bar{x} - 2\sigma_\mu,~\bar{x} + 2\sigma_\mu\right) \]

Let's try this with a quick example: say we have three observations with an error (i.e. \(\sigma_x\)) of 10. What is our 95% confidence interval on the mean?

We'll generate our observations assuming a true value of 100:

In [21]:
true_B = 100
sigma_x = 10

D = np.random.normal(true_B, sigma_x, size=3)
[ 116.24345364   93.88243586   94.71828248]

Next let's create a function which will compute the confidence interval:

In [22]:
from scipy.special import erfinv

def freq_CI_mu(D, sigma, frac=0.95):
    """Compute the confidence interval on the mean"""
    # we'll compute Nsigma from the desired percentage
    Nsigma = np.sqrt(2) * erfinv(frac)
    mu = D.mean()
    sigma_mu = sigma * D.size ** -0.5
    return mu - Nsigma * sigma_mu, mu + Nsigma * sigma_mu

print("95% Confidence Interval: [{0:.0f}, {1:.0f}]".format(*freq_CI_mu(D, 10)))
95% Confidence Interval: [90, 113]

Note here that we've assumed \(\sigma_x\) is a known quantity; this could also be estimated from the data along with \(\mu\), but here we kept things simple for sake of example.

2. The Bayesian Approach

For the Bayesian approach, we start with Bayes' theorem:

\[ P(\mu~|~D) = \frac{P(D~|~\mu)P(\mu)}{P(D)} \]

We'll use a flat prior on \(\mu\) (i.e. \(P(\mu) \propto 1\) over the region of interest) and use the likelihood

\[ P(D~|~\mu) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma_x^2}}\exp\left[\frac{(\mu - x_i)^2}{2\sigma_x^2}\right] \]

Computing this product and manipulating the terms, it's straightforward to show that this gives

\[ P(\mu~|~D) \propto \exp\left[\frac{-(\mu - \bar{x})^2}{2\sigma_\mu^2}\right] \]

which is recognizable as a normal distribution with mean \(\bar{x}\) and standard deviation \(\sigma_\mu\). That is, the Bayesian posterior on \(\mu\) in this case is exactly equal to the frequentist sampling distribution for \(\mu\).

From this posterior, we can compute the Bayesian credible region, which is the shortest interval that contains 95% of the probability. Here, it looks exactly like the frequentist confidence interval:

\[ CR_{\mu} = \left(\bar{x} - 2\sigma_\mu,~\bar{x} + 2\sigma_\mu\right) \]

For completeness, we'll also create a function to compute the Bayesian credible region:

In [23]:
def bayes_CR_mu(D, sigma, frac=0.95):
    """Compute the credible region on the mean"""
    Nsigma = np.sqrt(2) * erfinv(frac)
    mu = D.mean()
    sigma_mu = sigma * D.size ** -0.5
    return mu - Nsigma * sigma_mu, mu + Nsigma * sigma_mu

print("95% Credible Region: [{0:.0f}, {1:.0f}]".format(*bayes_CR_mu(D, 10)))
95% Credible Region: [90, 113]

So What's the Difference?

The above derivation is one reason why the frequentist confidence interval and the Bayesian credible region are so often confused. In many simple problems, they correspond exactly. But we must be clear that even though the two are numerically equivalent, their interpretation is very different.

Recall that in Bayesianism, the probability distributions reflect our degree of belief. So when we computed the credible region above, it's equivalent to saying

"Given our observed data, there is a 95% probability that the true value of \(\mu\) falls within \(CR_\mu\)" - Bayesians

In frequentism, on the other hand, \(\mu\) is considered a fixed value and the data (and all quantities derived from the data, including the bounds of the confidence interval) are random variables. So the frequentist confidence interval is equivalent to saying

"There is a 95% probability that when I compute \(CI_\mu\) from data of this sort, the true mean will fall within \(CI_\mu\)." - Frequentists

Note the difference: the Bayesian solution is a statement of probability about the parameter value given fixed bounds. The frequentist solution is a probability about the bounds given a fixed parameter value. This follows directly from the philosophical definitions of probability that the two approaches are based on.

The difference is subtle, but, as I'll discuss below, it has drastic consequences. First, let's further clarify these notions by running some simulations to confirm the interpretation.

Confirming the Bayesian Credible Region

To confirm what the Bayesian credible region is claiming, we must do the following:

  1. sample random \(\mu\) values from the prior
  2. sample random sets of points given each \(\mu\)
  3. select the sets of points which match our observed data
  4. ask what fraction of these \(\mu\) values are within the credible region we've constructed.

In code, that looks like this:

In [24]:
# first define some quantities that we need 
Nsamples = 2E7
N = len(D)
sigma_x = 10

# if someone changes N, this could easily cause a memory error
if N * Nsamples > 1E8:
    raise ValueError("Are you sure you want this many samples?")
# eps tells us how close to D we need to be to consider
# it a matching sample. The value encodes the tradeoff
# between bias and variance of our simulation
eps = 0.5

# Generate some mean values from the (flat) prior in a reasonable range
mu = 80 + 40 * np.random.random(Nsamples)

# Generate data for each of these mean values
x = np.random.normal(mu, sigma_x, (N, Nsamples)).T

# find data which matches matches our "observed" data
i = np.all(abs(x - D) < eps, 1)
print("number of suitable samples: {0}".format(i.sum()))
number of suitable samples: 528

In [25]:
# Now we ask how many of these mu values fall in our credible region
mu_good = mu[i]
CR = bayes_CR_mu(D, 10)
within_CR = (CR[0] < mu_good) & (mu_good < CR[1])
print "Fraction of means in Credible Region: {0:.3f}".format(within_CR.sum() * 1. / within_CR.size)
Fraction of means in Credible Region: 0.949

We see that, as predicted, roughly 95% of \(\mu\) values with data matching ours lie in the Credible Region.

The important thing to note here is which of the variables is random, and which are fixed. In the Bayesian approach, we compute a single credible region from our observed data, and we consider it in terms of multiple random draws of \(\mu\).

Confirming the frequentist Confidence Interval

Confirmation of the interpretation of the frequentist confidence interval is a bit less involved. We do the following:

  1. draw sets of values from the distribution defined by the single true value of \(\mu\).
  2. for each set of values, compute a new confidence interval.
  3. determine what fraction of these confidence intervals contain \(\mu\).

In code, it looks like this:

In [26]:
# define some quantities we need
N = len(D)
Nsamples = 1E4
mu = 100
sigma_x = 10

# Draw datasets from the true distribution
x = np.random.normal(mu, sigma_x, (Nsamples, N))

# Compute a confidence interval from each dataset
CIs = np.array([freq_CI_mu(Di, sigma_x) for Di in x])

# find which confidence intervals contain the mean
contains_mu = (CIs[:, 0] < mu) & (mu < CIs[:, 1])
print "Fraction of Confidence Intervals containing the mean: {0:.3f}".format(contains_mu.sum() * 1. / contains_mu.size)
Fraction of Confidence Intervals containing the mean: 0.951

We see that, as predicted, 95% of the confidence intervals contain the true value of \(\mu\).

Again, the important thing to note here is which of the variables is random. We use a single value of \(\mu\), and consider it in relation to multiple confidence intervals constructed from multiple random data samples.


We should remind ourselves again of the difference between the two types of constraints:

  • The Bayesian approach fixes the credible region, and guarantees 95% of possible values of \(\mu\) will fall within it.
  • The frequentist approach fixes the parameter, and guarantees that 95% of possible confidence intervals will contain it.

Comparing the frequentist confirmation and the Bayesian confirmation above, we see that the distinctions which stem from the very definition of probability mentioned above:

  • Bayesianism treats parameters (e.g. \(\mu\)) as random variables, while frequentism treats parameters as fixed.
  • Bayesianism treats observed data (e.g. \(D\)) as fixed, while frequentism treats data as random variables.
  • Bayesianism treats its parameter constraints (e.g. \(CR_\mu\)) as fixed, while frequentism treats its constraints (e.g. \(CI_\mu\)) as random variables.

In the above example, as in many simple problems, the confidence interval and the credibility region overlap exactly, so the distinction is not especially important. But scientific analysis is rarely this simple; next we'll consider an example in which the choice of approach makes a big difference.

Example 2: Jaynes' Truncated Exponential

For an example of a situation in which the frequentist confidence interval and the Bayesian credibility region do not overlap, I'm going to turn to an example given by E.T. Jaynes, a 20th century physicist who wrote extensively on statistical inference in Physics. In the fifth example of his Confidence Intervals vs. Bayesian Intervals (pdf), he considers a truncated exponential model. Here is the problem, in his words:

A device will operate without failure for a time \(\theta\) because of a protective chemical inhibitor injected into it; but at time \(\theta\) the supply of the chemical is exhausted, and failures then commence, following the exponential failure law. It is not feasible to observe the depletion of this inhibitor directly; one can observe only the resulting failures. From data on actual failure times, estimate the time \(\theta\) of guaranteed safe operation...

Essentially, we have data \(D\) drawn from the following model:

\[ p(x~|~\theta) = \left\{ \begin{array}{lll} \exp(\theta - x) &,& x > \theta\\ 0 &,& x < \theta \end{array} \right\} \]

where \(p(x~|~\theta)\) gives the probability of failure at time \(x\), given an inhibitor which lasts for a time \(\theta\). Given some observed data \(D = \{x_i\}\), we want to estimate \(\theta\).

Let's start by plotting this model for a particular value of \(\theta\), so we can see what we're working with:

In [27]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def p(x, theta):
    return (x > theta) * np.exp(theta - x)

x = np.linspace(5, 18, 1000)
plt.fill(x, p(x, 10), alpha=0.3)
plt.ylim(0, 1.2)

Imagine now that we've observed some data, \(D = \{10, 12, 15\}\), and we want to infer the value of \(\theta\) from this data. We'll explore four approaches to this below.

1. Common Sense Approach

One general tip that I'd always recommend: in any problem, before computing anything, think about what you're computing and guess what a reasonable solution might be. We'll start with that here. Thinking about the problem, the hard cutoff in the probability distribution leads to one simple observation: \(\theta\) must be smaller than the smallest observed value.

This is immediately obvious on examination: the probability of seeing a value less than \(\theta\) is zero. Thus, a model with \(\theta\) greater than any observed value is impossible, assuming our model specification is correct. Our fundamental assumption in both Bayesianism and frequentism is that the model is correct, so in this case, we can immediately write our common sense condition:

\[ \theta < \min(D) \]

or, in the particular case of \(D = \{10, 12, 15\}\),

\[ \theta < 10 \]

Any reasonable constraint on \(\theta\) given this data should meet this criterion. With this in mind, let's go on to some quantitative approaches based on Frequentism and Bayesianism.

2. Frequentist approach #1: Sampling Distribution via the Normal Approximation

In the frequentist paradigm, we'd like to compute a confidence interval on the value of \(\theta\). We can start by observing that the population mean is given by

\[ \begin{array}{ll} E(x) &= \int_0^\infty xp(x)dx\\ &= \theta + 1 \end{array} \]

So, using the sample mean as the point estimate of \(E(x)\), we have an unbiased estimator for \(\theta\) given by

\[ \hat{\theta} = \frac{1}{N} \sum_{i=1}^N x_i - 1 \]

The exponential distribution has a standard deviation of 1, so in the limit of large \(N\), we can use the standard error of the mean (as above) to show that the sampling distribution of \(\hat{\theta}\) will approach normal with variance \(\sigma^2 = 1 / N\). Given this, we can write our 95% (i.e. 2\(\sigma\)) confidence interval as

\[ CI_{\rm large~N} = \left(\hat{\theta} - 2 N^{-1/2},~\hat{\theta} + 2 N^{-1/2}\right) \]

Let's write a function which will compute this, and evaluate it for our data:

In [28]:
from scipy.special import erfinv

def approx_CI(D, sig=0.95):
    """Approximate truncated exponential confidence interval"""
    # use erfinv to convert percentage to number of sigma
    Nsigma = np.sqrt(2) * erfinv(sig)
    D = np.asarray(D)
    N = D.size
    theta_hat = np.mean(D) - 1
    return [theta_hat - Nsigma / np.sqrt(N),
            theta_hat + Nsigma / np.sqrt(N)]
In [29]:
D = [10, 12, 15]
print("approximate CI: ({0:.1f}, {1:.1f})".format(*approx_CI(D)))
approximate CI: (10.2, 12.5)

We immediately see an issue. By our simple common sense argument, we've determined that it is impossible for \(\theta\) to be greater than 10, yet the entirety of the 95% confidence interval is above this range! Perhaps this issue is due to the small sample size: the above computation is based on a large-\(N\) approximation, and we have a relatively paltry \(N = 3\). Maybe this will be improved if we do the more computationally intensive exact approach. Let's try it:

3. Frequentist approach #2: Exact Sampling Distribution

Computing the confidence interval from the exact sampling distribution takes a bit more work.

For small \(N\), the normal approximation will not apply, and we must instead compute the confidence integral from the actual sampling distribution, which is the distribution of the mean of \(N\) variables each distributed according to \(p(\theta)\). The sum of random variables is distributed according to the convolution of the distributions for individual variables, so we can exploit the convolution theorem and use the method of [characteristic functions](http://en.wikipedia.org/wiki/Characteristic_function_(probability_theory) to find the following sampling distribution for the sum of \(N\) variables distributed according to our particular \(p(x~|~\theta)\):

\[ f(\theta~|~D) \propto \left\{ \begin{array}{lll} z^{N - 1}\exp(-z) &,& z > 0\\ 0 &,& z < 0 \end{array} \right\} ;~ z = N(\hat{\theta} + 1 - \theta) \]

To compute the 95% confidence interval, we can start by computing the cumulative distribution: we integrate \(f(\theta~|~D)\) from \(0\) to \(\theta\). This is relatively painless if we make use of the expression for the incomplete gamma function:

\[ \Gamma(a, x) = \int_x^\infty t^{a - 1}e^{-t} dt \]

which looks strikingly similar to our \(f(\theta)\).

Using this to perform the integral, we find that the cumulative distribution is given by

\[ F(\theta~|~D) = \frac{1}{\Gamma(N)}\left[ \Gamma\left(N, \max[0, N(\hat{\theta} + 1 - \theta)]\right) - \Gamma\left(N,~N(\hat{\theta} + 1)\right)\right] \]

A contiguous 95% confidence interval \((\theta_1, \theta_2)\) satisfies the following equation:

\[ F(\theta_2~|~D) - F(\theta_1~|~D) = 0.95 \]

There are in fact an infinite set of solutions to this; what we want is the shortest of these. We'll add the constraint that the probability density is equal at either side of the interval:

\[ f(\theta_2~|~D) = f(\theta_1~|~D) \]

(Jaynes claims that this criterion ensures the shortest possible interval, but I'm not sure how to prove that). Solving this system of two nonlinear equations will give us the desired confidence interval. Let's compute this numerically:

In [30]:
from scipy.special import gammaincc
from scipy import optimize

def exact_CI(D, frac=0.95):
    """Exact truncated exponential confidence interval"""
    D = np.asarray(D)
    N = D.size
    theta_hat = np.mean(D) - 1

    def f(theta, D):
        z = theta_hat + 1 - theta
        return (z > 0) * z ** (N - 1) * np.exp(-N * z)

    def F(theta, D):
        return gammaincc(N, np.maximum(0, N * (theta_hat + 1 - theta))) - gammaincc(N, N * (theta_hat + 1))
    def eqns(CI, D):
        """Equations which should be equal to zero"""
        theta1, theta2 = CI
        return (F(theta2, D) - F(theta1, D) - frac,
                f(theta2, D) - f(theta1, D))
    guess = approx_CI(D, 0.68) # use 1-sigma interval as a guess
    result = optimize.root(eqns, guess, args=(D,))
    if not result.success:
        print "warning: CI result did not converge!"
    return result.x

As a sanity check, let's make sure that the exact and approximate confidence intervals match for a large number of points:

In [31]:
Dlarge = 10 + np.random.random(500)
print "approx: ({0:.3f}, {1:.3f})".format(*approx_CI(Dlarge))
print "exact: ({0:.3f}, {1:.3f})".format(*exact_CI(Dlarge))
approx: (9.409, 9.584)
exact: (9.408, 9.584)

As expected, the approximate solution is very close to the exact solution for large \(N\), which gives us confidence that we're computing the right thing.

Let's return to our 3-point dataset and see the results:

In [32]:
print("approximate CI: ({0:.1f}, {1:.1f})".format(*approx_CI(D)))
print("exact CI:       ({0:.1f}, {1:.1f})".format(*exact_CI(D)))
approximate CI: (10.2, 12.5)
exact CI:       (10.2, 12.2)

The exact confidence interval is slightly different than the approximate one, but still reflects the same problem: we know from common-sense reasoning that \(\theta\) can't be greater than 10, yet the 95% confidence interval is entirely in this forbidden region! The confidence interval seems to be giving us unreliable results.

We'll discuss this in more depth further below, but first let's see if Bayes can do better.

4. Bayesian Credibility Interval

For the Bayesian solution, we start by writing Bayes' rule:

\[ p(\theta~|~D) = \frac{p(D~|~\theta)p(\theta)}{P(D)} \]

Using a constant prior \(p(\theta)\), and with the likelihood

\[ p(D~|~\theta) = \prod_{i=1}^N p(x~|~\theta) \]

we find

\[ p(\theta~|~D) \propto \left\{ \begin{array}{lll} N\exp\left[N(\theta - \min(D))\right] &,& \theta < \min(D)\\ 0 &,& \theta > \min(D) \end{array} \right\} \]

where \(\min(D)\) is the smallest value in the data \(D\), which enters because of the truncation of \(p(x~|~\theta)\). Because \(p(\theta~|~D)\) increases exponentially up to the cutoff, the shortest 95% credibility interval \((\theta_1, \theta_2)\) will be given by

\[ \theta_2 = \min(D) \]

and \(\theta_1\) given by the solution to the equation

\[ \int_{\theta_1}^{\theta_2} N\exp[N(\theta - \theta_2)]d\theta = f \]

this can be solved analytically by evaluating the integral, which gives

\[ \theta_1 = \theta_2 + \frac{\log(1 - f)}{N} \]

Let's write a function which computes this:

In [33]:
def bayes_CR(D, frac=0.95):
    """Bayesian Credibility Region"""
    D = np.asarray(D)
    N = float(D.size)
    theta2 = D.min()
    theta1 = theta2 + np.log(1. - frac) / N
    return theta1, theta2

Now that we have this Bayesian method, we can compare the results of the four methods:

In [34]:
print("common sense:         theta < {0:.1f}".format(np.min(D)))
print("frequentism (approx): 95% CI = ({0:.1f}, {1:.1f})".format(*approx_CI(D)))
print("frequentism (exact):  95% CI = ({0:.1f}, {1:.1f})".format(*exact_CI(D)))
print("Bayesian:             95% CR = ({0:.1f}, {1:.1f})".format(*bayes_CR(D)))
common sense:         theta < 10.0
frequentism (approx): 95% CI = (10.2, 12.5)
frequentism (exact):  95% CI = (10.2, 12.2)
Bayesian:             95% CR = (9.0, 10.0)

What we find is that the Bayesian result agrees with our common sense, while the frequentist approach does not. The problem is that frequentism is answering the wrong question. I'll discuss this more below, but first let's do some simulations to make sure the CI and CR in this case are correct.

Numerical Confirmation

To try to quell any doubts about the math here, I want to repeat the exercise we did above and show that the confidence interval derived above is, in fact, correct. We'll use the same approach as before, assuming a "true" value for \(\theta\) and sampling data from the associated distribution:

In [35]:
from scipy.stats import expon

Nsamples = 1000
N = 3
theta = 10

data = expon(theta).rvs((Nsamples, N))
CIs = np.array([exact_CI(Di) for Di in data])

# find which confidence intervals contain the mean
contains_theta = (CIs[:, 0] < theta) & (theta < CIs[:, 1])
print "Fraction of Confidence Intervals containing theta: {0:.3f}".format(contains_theta.sum() * 1. / contains_theta.size)
Fraction of Confidence Intervals containing theta: 0.953

As is promised by frequentism, 95% of the computed confidence intervals contain the true value. The procedure we used to compute the confidence intervals is, in fact, correct: our data just happened to be among the 5% where the method breaks down. But here's the thing: we know from the data themselves that we are in the 5% where the CI fails. The fact that the standard frequentist confidence interval ignores this common-sense information should give you pause about blind reliance on the confidence interval for any nontrivial problem.

For good measure, let's check that the Bayesian credible region also passes its test:

In [38]:
N = 1E7
eps = 0.1

theta = 9 + 2 * np.random.random(N)
data = (theta + expon().rvs((3, N))).T
i_good = np.all(abs(data - D) < eps, 1)

print("Number of good samples: {0}".format(i_good.sum()))
Number of good samples: 65

In [39]:
theta_good = theta[i_good]
theta1, theta2 = bayes_CR(D)

within_CR = (theta1 < theta_good) & (theta_good < theta2)
print("Fraction of thetas in Credible Region: {0:.3f}".format(within_CR.sum() * 1. / within_CR.size))
Fraction of thetas in Credible Region: 0.954

Again, we have confirmed that, as promised, ~95% of the suitable values of \(\theta\) fall in the credible region we computed from our single observed sample.

Frequentism Answers the Wrong Question

We've shown that the frequentist approach in the second example is technically correct, but it disagrees with our common sense. What are we to take from this?

Here's the crux of the problem: The frequentist confidence interval, while giving the correct answer, is usually answering the wrong question. And this wrong-question approach is the result of a probability definition which is fundamental to the frequentist paradigm!

Recall the statements about confidence intervals and credible regions that I made above. From the Bayesians:

"Given our observed data, there is a 95% probability that the true value of \(\theta\) falls within the credible region" - Bayesians

And from the frequentists:

"There is a 95% probability that when I compute a confidence interval from data of this sort, the true value of \(\theta\) will fall within it." - Frequentists

Now think about what this means. Suppose you've measured three failure times of your device, and you want to estimate \(\theta\). I would assert that "data of this sort" is not your primary concern: you should be concerned with what you can learn from those particular three observations, not the entire hypothetical space of observations like them. As we saw above, if you follow the frequentists in considering "data of this sort", you are in danger at arriving at an answer that tells you nothing meaningful about the particular data you have measured.

Suppose you attempt to change the question and ask what the frequentist confidence interval can tell you given the particular data that you've observed. Here's what it has to say:

"Given this observed data, the true value of \(\theta\) is either in our confidence interval or it isn't" - Frequentists

That's all the confidence interval means – and all it can mean! – for this particular data that you have observed. Really. I'm not making this up. You might notice that this is simply a tautology, and can be put more succinctly:

"Given this observed data, I can put no constraint on the value of \(\theta\)" - Frequentists

If you're interested in what your particular, observed data are telling you, frequentism is useless.

Hold on... isn't that a bit harsh?

This might be a harsh conclusion for some to swallow, but I want to emphasize that it is not simply a matter of opinion or idealogy; it's an undeniable fact based on the very philosophical stance underlying frequentism and the very definition of the confidence interval. If what you're interested in are conclusions drawn from the particular data you observed, frequentism's standard answers (i.e. the confidence interval and the closely-related \(p\)-values) are entirely useless.

Unfortunately, most people using frequentist principles in practice don't seem to realize this. I'd point out specific examples from the astronomical literature, but I'm not in the business of embarassing people directly (and they're easy enough to find now that you know what to look for). Many scientists operate as if the confidence interval is a Bayesian credible region, but it demonstrably is not. This oversight can perhaps be forgiven for the statistical layperson, as even trained statisticians will often mistake the interpretation of the confidence interval.

I think the reason this mistake is so common is that in many simple cases (as I showed in the first example above) the confidence interval and the credible region happen to coincide. Frequentism, in this case, correctly answers the question you ask, but only because of the happy accident that Bayesianism gives the same result for that problem.

Now, I should point out that I am certainly not the first person to state things this way, or even this strongly. The Physicist E.T. Jaynes was known as an ardent defender of Bayesianism in science; one of my primary inspirations for this post was his 1976 paper, Confidence Intervals vs. Bayesian Intervals (pdf). More recently, statistician and blogger W.M. Briggs posted a diatribe on arXiv called It's Time To Stop Teaching Frequentism to Non-Statisticians which brings up this same point. It's in the same vein of argument that Savage, Cornfield, and other outspoken 20th-century Bayesian practitioners made throughout their writings, talks, and correspondance.

So should you ever use confidence intervals at all? Perhaps in situations (such as analyzing gambling odds) where multiple data realizations are the reality, frequentism makes sense. But in most scientific applications where you're concerned with what one particular observed set of data is telling you, frequentism simply answers the wrong question.

The Moral of the Story: Frequentism and Science Do Not Mix

The moral of the story is that frequentism and Science do not mix. Let me say it directly: you should be suspicious of the use of frequentist confidence intervals and p-values in science. In a scientific setting, confidence intervals, and closely-related p-values, provide the correct answer to the wrong question. In particular, if you ever find someone stating or implying that a 95% confidence interval is 95% certain to contain a parameter of interest, do not trust their interpretation or their results. If you happen to be peer-reviewing the paper, reject it. Their data do not back-up their conclusion.

If you have made it this far, I thank you for reading! I hope that this exploration succeeded in clarifying that the philosophical assumptions of frequentism and Bayesianism lead to real, practical consequences in the analysis and interpretation of data. If this post leads just one researcher to stop using frequentist confidence intervals in their scientific research, I will consider it a success.

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view on nbviewer. Many thanks to David W. Hogg for some helpful critiques on an early version of this post.

by Jake Vanderplas at June 12, 2014 07:00 PM

Philip Herron


Assembly programming is a black hole for new programmers these days, its almost impossible to find really easy to follow tutorial on the subject but some people in work were talking about it and i felt i could shed some light on it.

Consider a really simple program in C

int main (void)
  return 2;

You can compile this:

$ gcc -g -O2 -Wall test.c -o test

And whoop we compiled a very useless program but when we run it we get:

$ ./test; echo $?

So we done something! But how does all of that work i mean how was this able to return a value to the shell. We could write this in assembly (i386/x86) and i can show you what happens here to return this value its call the c-decl ABI (http://en.wikipedia.org/wiki/X86_calling_conventions)

.global	main
	mov $2, %eax

To compile this you simply:

$ as fib2.s -o fib2.o
$ gcc -o test fib2.o
$ ./test; echo $?

So what does this mean, well i am sure it looks some what clear whats going on, we declared a global symbol main and its code simply puts the value 2 into the register eax and returns. The ret opcode is a fairly useful one, it is responsible for restoring the instruction pointer for the program back to the caller. You can do all of this manually by messing with the instruction pointer yourself but thats for another day see (http://pdos.csail.mit.edu/6.893/2009/readings/i386/RET.htm)

In the next post i will write how we can call into libc functions such as printf for a hello world from assembly in i386. This will introduce the stack.

by Philip Herron at June 12, 2014 03:37 PM


Hey all reading this, its a fairly personal post and probably controversial for me to make.

About a year ago i was working for a great wee company WANdisco i had such a great set of people to work with, we were all so close and work was such a joy to go to. Exciting and vibrant when i was working there for bit over a year and i started to yearn for a more ‘low-level’ challenge such as a low-latency C/C++ application. I felt i had so much to learn in this area and its where i wanted my career to go towards.

So i joined where i am now (not going to write to avoid google picking it up) and i began to realise the pain of legacy systems programming and in my opinion i found it a challenge to some extent but it was never the code base i had issues with i mean with my experience in open source and GCC which is older and bigger than most legacy code-bases you will ever find. The issues i had were more high-level, we were never given freedom to be pro-active about maintaining code (fixing things before they become an issue). This lead to much time being very idle and frustrated at being quite literately un-challenged. And this was seen to be ‘busy’ with everyone embellishing their job and work to sound better to hopefully eventually gain some kind of promotion or some kind of management role.

And though this job i found such frustration i became very depressed to the point i though almost that Software Engineering isn’t a career path that i would want to pursue any longer….

But now its all fixed and for the first time in probably 2 years i feel settled, i mean i went to lunch and everything seemed brighter and better than i have felt in so long and i have been smiling all day. I never realised how depressed i had become, its so dangerous when i bottled things up so much that it would personify as physical illness such as bad colds and coughs on and off. I would bottle up so much because my Significant Other i really didn’t want for her to have to deal with my problems i love her so much and i just want to make her happy but my career was bringing us both down.

But you know what i finally have a new job where i will be starting in about a month (will reveal closer to the time). And you know what the people were so nice to me and challenged me and made me work in the interview i felt at peace again. Like i knew what i was doing again being proactive challenging myself and had the desire to push myself. And more importantly understood why i participate in open-source, at this job very very frequently you will be forced to deal with this question ‘sure, why would you do that? Were you paid? No? What…?’.

Its probably important to note, their reasons were i guess to some extent legitimate such as project approvals from such a large corporate entity and product stability. But regardless of this i would still argue that failure to fix and update is kind of like buying a house that really needs the plumbing and electrics re-done, sure you can live in it and use it but its not that pleasant and has a very high likely hood of costing you a lot in the long run.

Though this year i have talked to many many people on-line and off-line; strangers who become friends on a bus from different walks of life and different careers. And the universal notion we all have had is that, career will only take you so far because in the end it is we who have to live our lives and we can’t let circumstances get in the way of enjoying life.

My depression at this job was so great i believe if it went on for another year or 2 it probably was getting close to costing me my relationship with my Significant Other and maybe even my life. I probably shouldn’t go into the nitty details of what it was like here but if your in a similar position i would love to hear from you via email or if your local to Northern Ireland i would like to share a coffee with you.

Take care of your mental health i mean seriously the stigma it has in some countries still scares me such as Northern Ireland. Your likely to hear people equate Mental Health with the ‘luney bin’.


by Philip Herron at June 12, 2014 12:46 PM

William Stein

SageMathCloud Task Lists

I've added task list functionality to SageMathCloud (SMC), so you can keep track of a list of things to do related to a project, paper, etc. Task lists are saved as files on the filesystem, so they can be backed up in the usual way, automatically generated, etc. I doubt anybody is going to use SMC just for the tasks lists, but for people already using SMC in order to use Sage, write LaTeX documents, use IPython notebooks, etc., having a convenient integrated task list should come in handy.
To create a task list, in a project, click "+New", name the task list, then click the "Task List" button.  Here's what a task list looks like:

The Design

I used the task list quite a bit when implementing the task list, and significantly modified the interface many, many times. I also tried out numerous other todo list programs for inspiration. I eventually settled on the following key design choices, some of which are different than anything I've seen. In particular, the design targets highly technical users, which is not something I saw with any other todo list programs I tried.
  • Markdown editor: The task description is formatted using client-side rendered Github flavored markdown (using marked), including [ ] for checkboxes. I also include full MathJax support, and spent quite a while working around various subtleties of mixing mathjax and markdown. I'll be rewriting Sage worksheets to use this code. The editor itself is Codemirror 4 in markdown mode, so it respects whatever theme you choose, has nice features like multiple cursors, paren matching, vim/emacs modes, etc. Multiple people can edit the same task at once and everybody will see the changes as they are made (note: I haven't implemented showing other cursors.)
  • Relative dates and times: All dates and times are shown relative to right now. If something is due in 20 hours, it says "due about 20 hours from now". I also included a sortable column with the last time when a task was edited, also shown relative to the current time. This display uses the timeago jquery plugin. You can of course click on the due date to see the actual date.
  • Hashtags: As I was implementing (and removing) features such as priorities, a way to indicate which task you are currently working on, folders, etc, I decided that hashtags can provide every feature. Moreover, they are very text editor/programmer friendly. When editing a task, if you put #foo, then it is interpreted as a hashtag, which you can click on to show only tasks containing that tag. To disambiguate with markdown headings, to make a heading you have to include a whitespace, so # foo. I haven't added autocomplete for hashtags yet, but will. You can easily click on hashtags anywhere to select them, or use the bar at the top.
  • User-specified task order: The default sort order for tasks is custom. There is a drag handle so you can explicitly move tasks up and down in order to indicate how important they are to you (or whatever else). You can also click an up hand or down hand to knock the currently selected task to the bottom of the list of displayed tasks.
Of course, I still have an enormous list of functionality I would like to add, but that will have to wait. For example, I need to enable a chat associated to each task list, like the chats associated to Sage worksheets and other files. I also want to make it so one can select a range of tasks and copy them, move them, paste them into another list, etc. It would also be nice to be able to export task lists to HTML, PDF, etc., which should be fairly easy using pandoc.  I'm also considering making a note list, which is like a task list but without the due date or "done" box.  Because of all the infrastructure already there, it would be easy to add code evaluation functionality, thus creating something like worksheets, but from a different point of view (with maybe hashtags determining the Python process).

Databases and Differential Synchronization

One interesting thing I noticed when implementing task lists is that there are many similarities with the original sagenb.org design (and also IPython notebook), in which a worksheet is a list of cells that get evaluated, can be manually reordered, etc. Similarly, a task list is a list of "cells" that you edit, manually reorder, etc. With sagenb we had longstanding issues involving the order of each cell and assigning an integer numerical id (0, 1, 2, ...) to the cells, which resulted in things like cells seemingly getting randomly reordered, etc. Also, having multiple simultaneous viewers with automatic synchronization is very difficult with that model. For task lists, I've introduced some new models to address these issues.

A natural way to store a task list is in a database, and I initially spent some time coming up with a good database schema and implementing basic lists using Cassandra for the backend. However, I couldn't pursue this approach further, since I was concerned about how to implement realtime synchronization, and also about the lack of easily backing up complete task lists via snapshots, in git, etc. So instead I created an "object database" API built on top of a file that is synchronized across clients (and the server) using differential synchronization. The underlying file format for the database is straightforward -- there is one line in JSON format for each record in the database. When objects are changed, the file is suitably modified, synchronized to other clients, and events are triggered.

Since differential synchronization has no trouble with files that have "a few thousand lines", this approach works well for our purposes (since personal or shared todo lists are typically fairly short). Also, having one line per record is at least more git friendly than something like a sqlite database. I'm considering rewriting my implementation of IPython notebook sync on top of this abstraction.
Since I view the task list as a database, each task has a globally unique uuid. Also, instead of viewing the task order as being defined by an integer 0,1,2,3, which leads to all manner of bugs and programming misery, race conditions, etc., instead we view the order as being determined by floating point positions. So to insert a task between tasks with positions 1 and 2, we just give the task position 1.5.

by William Stein (noreply@blogger.com) at June 12, 2014 01:31 PM