July 09, 2014

Continuum Analytics

Bokeh 0.5 Released!

We are excited to announce the release of version 0.5 of Bokeh, an interactive web plotting library for Python!

This release includes some major new features: support for Widgets and Dashboards, preliminary integration of Abstract Rendering for big data visualization, and a new schematic bokeh.charts interface for very high level statistical charting. Additionally, tighter Pandas integration makes using Bokeh with the tools we all love even easier, and a new, much simpler bokeh.embed module makes including Bokeh content in your documents a snap. Over forty bugfixes and several new smaller features are also included: log axes, minor ticks, a new plot frame, gears and gauges glyphs, an npm package for BokehJS, and many usability improvements.

July 08, 2014

Matthieu Brucher

Announcement: Audio TK 0.4.0

A lot has happened in two weeks for Audio ToolKit. This release mainly adds tools for Compressor and Delays design. There will be additional plugins release soon.

Changelog:

0.4.0
* Added a white noise generator filter
* Fixed the delay line filters
* Fixed second order EQ Python wrappers

0.3.2
* Added a fixed delay line filter (FIR filter) with Python wrappers
* Added a universal fixed delay line filter (FIR, IIR and all-pass filter + combinations) with Python wrappers
* Added variable delay filters (delay is given on an additional port)

0.3.1
* Adding compressor elements and the Python wrappers
* Fixed Python wrappers by adding default number of port/channels
* Added restrict pointer support in all filter that can benefit from this

Buy Me a Coffee!

Other Amount:

Sam Stephens

Starting to Learn Django
Currently working on learning django. It does not seem to tricky but like with any package the devil is in the details. I will let you know how it goes.

July 07, 2014

Maheshakya Wijewardena

Optimizations on Locality sensitive hashing forest data structure

In order to get LSH forest approximate nearest neighbor search method into a competitive and useful state, optimization was a necessity. These optimizations were based on the portions of the LSH forest algorithm and implementation of that algorithm. There are two general cases of bottlenecks where requirement for optimization arises.

1. Memory usage of the data structure.
2. Speed of queries.

Let's discuss these two cases separately in detail. Remember that there will always be a trade off between memory and speed.

Optimizations on memory usage

In the previous post about the implementation of the LSH forest, I have explained how the basic data structure and the functions. Some of the optimizations caused obvious changes in those functions. First of all, I need to mention that this data structure stores the entire data set fitted because it will be required to calculate the actual distances of the candidates. So the memory consumption is usually predominated by the size of the data set. LSH Forest data structure does not have control over this. Only thing that can be controlled from the data structure is the set of hashes of the data points.

Current data structure maintains a fixed length hash(but user can define that length at the time of initialization). Hashes are comprised of binary values (1s and 0s). At the moment, these hashes are stored as binary strings.

One of the optimizations which can be done to maintain a more compact index for hashes is using the equivalent integer of the binary strings. For an example, string 001110 can be represented as 14. For very large data sets with very high dimensions, this technique will hold no effect as the improvement on the memory usage of hash indices is insignificant when compared to the memory required to store the data set.

As I have mentioned earlier, there is always a cost for every optimization. Here we have to trade off speed for improved memory consumption. The LSH algorithm is applied to a single data point up to the number hash length required. In the default case, there will be 32 hashes per a single data point in a single tree and those hashed values have to be concatenated in order to create $$g(p)$$(refer to the LSH Forest paper). Because the hashing is done separately for individual elements of $$g(p)$$, they are stored in an array in the first place. In order to create a single hash from these values in the array, they are first concatenated as a string. This is the point where memory speed dilemma occurs. These string hashes can be converted into integer hashes for compact indices but it will cost extra time.

In the next section, I will explain how this extra overhead can be minimized using a trick, but the fact that "you cannot eliminate the memory speed trade-off completely" holds.

Optimizations on query speeds

(I strongly recommend you to go through the LSH forest paper before reading the following section because the description involves terms directly taken from the paper)

I was hapless at the beginning, because the query speeds of the implemented LSH forest structure were not at a competitive state to challenge the other approximate nearest neighbor search implementations. But with the help of my project mentors, I was able to get over with this issue.

Optimizations on the algorithm

In the initial implementation, binary hash: $$g(p)$$ for each tree is computed during the synchronous ascending phase when it was required. This causes a great overhead because each time it is computed, the LSH function has to be performed on the query vector. For Random projections, it is a dot product and it is an expensive operation for large vectors. Computing the binary hashes for each tree in advance and storing them is a reliable solution for this issue. The algorithm descend needs to be performed on each tree to find the longest matching hash length. So the above algorithm will iterate over the number of trees in the forest. For each iteration, the binary hash of the query is calculated and stored as follows. This does not consume much memory because the number of trees is small(typically 10).

def descend(query, hash_functions, trees):
bin_queries = []
max_depth = 0
for i in range(number_of_trees):
binary_query = do_hash(query, hash_functions[i]
k = find_longest_prefix_match(trees[i], binary_query)
if k > max_depth:
max_depth = k
bin_queries.append(bin_query)
return max_depth, binary_queries


It is an obvious fact that as the number of candidates for the nearest neighbors grows the number of actual distance calculations grows proportionally. This computation is also an expensive operation when compared with the other operations in the LSH forest structure. So limiting the number of candidates is also a passable optimization which can be done with respect to the algorithm. In the above function, one of the terminating conditions of the candidate search is maximum depth reaching 0 (The while loop runs until x > 0). But it should not necessarily run until maximum depth reaches 0 because we are looking for most viable candidates. It is known fact that for smaller matching hash lengths, it is unlikely to have a large similarity between two data points. Therefore, as the considered hash length is decreased, eligibility of the candidates we get is also decreased. Thus having a lower bound on the maximum depth will set a bound to candidates we collect. It decreases the probability of having irrelevant candidates. So this can be done by setting x > lower_bound in the while loop for some value of lower_bound > 0. But there is a risk of not retrieving the required number of neighbors for very small data sets because the the candidate search may terminate before it acquires the required number of candidates. Therefore user should be aware of a suitable lower_bound for the queries at the time of initialization.

Optimizations on the implementation

I have indicated that bisect operations comes with Python are faster for small lists but as the size of the list grow numpy searchsorted functions becomes faster. In my previous implementation, I have used an alternative version of bisect_right function as it does not fulfill the requirement of finding the right most index for a particular hash length in a sorted array(Please refer to my previous postif things are not clear). But we cannot create an alternative version of numpy searchsorted function, therefore a suitable transformation is required on the hash itself.

Suppose we have a binary hash item = 001110. What we need is the largest number with the first 4 bits being the same as item. 001111 suffices this requirement. So the transformation needed is replacing the last 2 bits with 1s.

def transform_right(item, h, hash_size):
transformed_item = item[:h] + "".join(['1' for i in range(hash_size - h)])
return transformed_right


The transformation needed to get the left most index is simple. This is 001100 which is last two bits replaced by 0s. This is same as having 0011. Therefore only a string slicing operation item[:4] will do the the job.

It gets more complicated when it comes to integer hashes. Integer has to be converted to the string, transformed and re-converted to integer.

def integer_transform(int_item, hash_size, h):
str_item = ('{0:0'+str(hash_size)+'b}').format(int_item)
transformed_str = transform_right(str_item, h, hash_size):
transformed_int = int(transformed_str, 2)
return transformed_int


But this is a very expensive operation for a query. For indexing, only int(binary_hash, 2) is required but this wont make much effect because the LSH algorithms on the data set dominates that operation completely. But for a query this is a significant overhead. Therefore we need an alternative.

Required integer representations for left and right operations can be obtained by performing the bit wise $$AND$$ and $$OR$$ operations with a suitable mask. Masks are generated by the following function.

def _generate_masks(hash_size):
for length in range(hash_size+1):
left_mask  = int("".join(['1' for i in range(length)])
+ "".join(['0' for i in range(hash_size-length)]), 2)
right_mask = int("".join(['0' for i in range(length)])
+ "".join(['1' for i in range(hash_size-length)]), 2)


These masks will be generated at the indexing time. Then the masks will be applied with the integer hashes.

def apply_masks(item, left_masks, right_masks, h):
item_left = item & left_masks[h]
item_right = item | right_masks[h]
return item_left, item_right


The left most and right most indices of a sorted array can be obtained in the following fashion.

import numpy as np
def find_matching_indices(sorted_array, item_left, item_right):
left_index = np.searchsorted(sorted_array, item_left)
right_index = np.searchsorted(sorted_array, item_right, side = 'right')
return np.arange(left_index, right_index)


As the masks are precomputed, the speed overhead at the query time is minimum in this approach. But still there is a little overhead in this approach because original hashed binary number are stored in an array and those numbers need to be concatenated and converted to obtain the corresponding integers. If the integers are cached this overhead will be eliminated.

Cached hash

This is method which guarantees a significant speed up in the queries with expense of index building speed. At the indexing time, we can create a dictionary with key being arrays of hashed bits and the values being the corresponding integers. The number of items in the dictionary is the number of combinations for a particular hash length.

Example:

Suppose the hash length is 3. Then the bit combinations are:

Then it will be only a matter of a dictionary look-up. Once an integer is required for a particular hash bit array, it can be retrieved directly from the dictionary.

The implementation of this type of hash is a bit tricky. Typically in LSH forests, the maximum hash length will be 32. Then the size of the dictionary will be $$2^n = 4294967296$$ where $$n = 32$$. This is an extremely infeasible size to hold in the memory(May require tens of Gigabytes). But when $$n = 16$$, the size becomes $$65536$$. This is a very normal size which can be easily stored in the memory. Therefore we use two caches of size 16.

First a list for digits 0 and 1 is created.

digits = ['0', '1']


The the cache is created as follows.

import itertools

cache_N = 32 / 2
c = 2 ** cache_N # compute once

cache = {
tuple(x): int("".join([digits[y] for y in x]),2)
for x in itertools.product((0,1), repeat=cache_N)
}


Suppose item is a list of length 32 with binary values(0s and 1s). Then we can obtain the corresponding integer for that item as follows:

xx = tuple(item)
int_item = cache[xx[:16]] * c + cache[xx[16:]]


This technique can be used in queries to obtain a significant improvement in speed assuming that the user is ready to sacrifice a portion of the memory.

Performance of all these techniques were measured using Pythons' line_profiler and memory_profiler. In my next post, I will explain concisely how these tools have been used to evaluate these implementations.

July 04, 2014

Matthew Rocklin

Streaming Analytics

tl;dr: We demonstrate data workflows with Python data structures and PyToolz. We also introduce join, a new operation in toolz.

Prelude

In my last two posts I show that Python data structures are fast and that CyToolz, an implementation of toolz in Cython, achieves Java speeds on standard Python core data structures like dicts, lists, and tuples. As a reminder, toolz provides functions like groupby

>>> from toolz import groupby

>>> names = ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank']
>>> groupby(len, names)
{3: ['Bob', 'Dan'],
5: ['Alice', 'Edith', 'Frank'],
7: ['Charlie']}


I always give this example when talking about toolz. It often spurs the following question:

That looks like GROUP BY from SQL. In what other ways does toolz let me do SQL-like operations in Python?

My answer for this is to go look at Pandas which really does a wonderful job at in-memory data analytics. Toolz targets functional programming more than it targets data analytics. Still this question is common enough to warrant a blogpost. The following is my stock answer on how to use pure Python and toolz (or cytoolz) for streaming data analytic workflows like selections, split-apply-combine, and joins. I’ll note throughout when operations are streaming (can support datasets bigger than memory) or not. This is one of the few ways in which analysis with toolz might be preferred over pandas.

Streaming Analytics

The toolz functions can be composed to analyze large streaming datasets. Toolz supports common analytics patterns like the selection, grouping, reduction, and joining of data through pure composable functions. These functions often have analogs to familiar operations in other data analytics platforms like SQL or Pandas.

Throughout this post we’ll use this simple dataset of accounts.

>>> #           id, name, balance, gender
>>> accounts = [(1, 'Alice', 100, 'F'),
...             (2, 'Bob', 200, 'M'),
...             (3, 'Charlie', 150, 'M'),
...             (4, 'Dennis', 50, 'M'),
...             (5, 'Edith', 300, 'F')]


Selecting with map and filter

Simple projection and linear selection from a sequence is achieved through the standard functions map and filter.

SELECT name, balance
FROM accounts
WHERE balance > 150;


These functions correspond to the SQL commands SELECT and WHERE.

>>> from toolz.curried import pipe, map, filter, get

>>> pipe(accounts, filter(lambda (id, name, balance, gender): balance > 150),
...                map(get([1, 2])),
...                list)


note: this uses the curried versions of map and reduce.

Of course, these operations are also well supported with standard list/generator comprehension syntax. This syntax is more often used and generally considered to be more Pythonic.

>>> [(name, balance) for (id, name, balance, gender) in accounts
...                  if balance > 150]


Split-apply-combine with groupby and reduceby

We separate split-apply-combine operations into the following two concepts

1. Split the dataset into groups by some property
2. Reduce each of the groups with some aggregation function

Toolz supports this common workflow with

1. a simple in-memory solution
2. a more sophisticated streaming solution.

In Memory Split-Apply-Combine

The in-memory solution depends on the functions groupby to split, and valmap to apply/combine.

SELECT gender, SUM(balance)
FROM accounts
GROUP BY gender;


We first show groupby and valmap separately to show the intermediate groups.

>>> from toolz import groupby, valmap, compose
>>> from toolz.curried import get, pluck

>>> groupby(get(3), accounts)
{'F': [(1, 'Alice', 100, 'F'), (5, 'Edith', 300, 'F')],
'M': [(2, 'Bob', 200, 'M'), (3, 'Charlie', 150, 'M'), (4, 'Dennis', 50, 'M')]}

>>> valmap(compose(sum, pluck(2)),
...        _)
{'F': 400, 'M': 400}


Then we chain them together into a single computation

>>> pipe(accounts, groupby(get(3)),
...                valmap(compose(sum, pluck(2))))
{'F': 400, 'M': 400}


Streaming Split-Apply-Combine

The groupby function collects the entire dataset in memory into a dictionary. While convenient, the groupby operation is not streaming and so this approach is limited to datasets that can fit comfortably into memory.

Toolz achieves streaming split-apply-combine with reduceby, a function that performs a simultaneous reduction on each group as the elements stream in. To understand this section you should first be familiar with the builtin function reduce.

The reduceby operation takes a key function, like get(3) or lambda x: x[3], and a binary operator like add or lesser = lambda acc, x: acc if acc < x else x. It successively applies the key function to each item in succession, accumulating running totals for each key by combining each new value with the previous total using the binary operator. It can’t accept full reduction operations like sum or min as these require access to the entire group at once. Here is a simple example:

>>> from toolz import reduceby

>>> def iseven(n):
...     return n % 2 == 0

>>> def add(x, y):
...     return x + y

>>> reduceby(iseven, add, [1, 2, 3, 4])
{True: 6, False: 4}


The even numbers are added together (2 + 4 = 6) into group True, and the odd numbers are added together (1 + 3 = 4) into group False.

Note that we have to replace the reduction sum with the binary operator add. The incremental nature of add allows us to do the summation work as new data comes in. The use of binary operators like add over full reductions like sum enables computation on very large streaming datasets.

The challenge to using reduceby often lies in the construction of a suitable binary operator. Here is the solution for our accounts example that adds up the balances for each group:

>>> binop = lambda total, (id, name, bal, gend): total + bal

>>> reduceby(get(3), binop, accounts)
{'F': 400, 'M': 400}


This construction supports datasets that are much larger than available memory. Only the output must be able to fit comfortably in memory and this is rarely an issue, even for very large split-apply-combine computations.

Semi-Streaming join

We register multiple datasets together with join. Consider a second dataset that stores addresses by ID:

>>> addresses = [(1, '123 Main Street'),  # id, address
...              (2, '5 Adams Way'),
...              (5, '34 Rue St Michel')]


We can join this dataset against our accounts dataset by specifying attributes which register different elements with each other; in this case they share a common first column, id.

SELECT accounts.name, addresses.address
WHERE accounts.id = addresses.id;

>>> from toolz import join, first, second

>>> result = join(first, accounts,

>>> for ((id, name, bal, gender), (id, address)) in result:
('Alice', '123 Main Street')
('Bob', '5 Adams Way')
('Edith', '34 Rue St Michel')


Join takes four main arguments, a left and right key function and a left and right sequence. It returns a sequence of pairs of matching items. In our case the return value of join is a sequence of pairs of tuples such that the first element of each tuple (the ID) is the same. In the example above we unpack this pair of tuples to get the fields that we want (name and address) from the result.

Join on arbitrary functions / data

Those familiar with SQL are accustomed to this kind of join on columns. However a functional join is more general than this; it doesn’t need to operate on tuples, and key functions do not need to get particular columns. In the example below we match numbers from two collections so that exactly one is even and one is odd.

>>> def iseven(n):
...     return n % 2 == 0
>>> def isodd(n):
...     return n % 2 == 1

>>> list(join(iseven, [1, 2, 3, 4],
...           isodd, [7, 8, 9]))
[(2, 7), (4, 7), (1, 8), (3, 8), (2, 9), (4, 9)]


Semi-Streaming Join

The Toolz Join operation fully evaluates the left sequence and streams the right sequence through memory. Thus, if streaming support is desired the larger of the two sequences should always occupy the right side of the join.

Algorithmic Details

The semi-streaming join operation in toolz is asymptotically optimal. Computationally it is linear in the size of the input + output. In terms of storage the left sequence must fit in memory but the right sequence is free to stream.

The results are not normalized, as in SQL, in that they permit repeated values. If normalization is desired, consider composing with the function unique (note that unique is not fully streaming.)

More Complex Example

The accounts example above connects two one-to-one relationships, accounts and addresses; there was exactly one name per ID and one address per ID. This need not be the case. The join abstraction is sufficiently flexible to join one-to-many or even many-to-many relationships. The following example finds city/person pairs where that person has a friend who has a residence in that city. This is an example of joining two many-to-many relationships because a person may have many friends and because a friend may have many residences.

>>> friends = [('Alice', 'Edith'),
...            ('Alice', 'Zhao'),
...            ('Edith', 'Alice'),
...            ('Zhao', 'Alice'),
...            ('Zhao', 'Edith')]

>>> cities = [('Alice', 'NYC'),
...           ('Alice', 'Chicago'),
...           ('Dan', 'Syndey'),
...           ('Edith', 'Paris'),
...           ('Edith', 'Berlin'),
...           ('Zhao', 'Shanghai')]

>>> # Vacation opportunities
>>> # In what cities do people have friends?
>>> result = join(second, friends,
...               first, cities)
>>> for ((name, friend), (friend, city)) in sorted(unique(result)):
...     print((name, city))
('Alice', 'Berlin')
('Alice', 'Paris')
('Alice', 'Shanghai')
('Edith', 'Chicago')
('Edith', 'NYC')
('Zhao', 'Chicago')
('Zhao', 'NYC')
('Zhao', 'Berlin')
('Zhao', 'Paris')


Join is computationally powerful:

• It is expressive enough to cover a wide set of analytics operations
• It runs in linear time relative to the size of the input and output
• Only the left sequence must fit in memory

Conclusion

Toolz gives a compact set of primitives for data analysis on plain Python data structures. CyToolz accelerates those workflows through Cython. This approach is both low-tech and supports uncomfortably big data through streaming.

At the same time, Toolz is a general purpose functional standard library, and is not specifically dedicated to data analysis. While there are obvious benefits (streaming, composition, etc.) users interested in data analysis might be better served by using projects dedicated projects like Pandas or SQLAlchemy.

This post is also part of the toolz docs. Thanks to John Jacobsen, Clark Fitzgerald, and Erik Welch for their help with this post.

July 01, 2014

Matthieu Brucher

ATK SD1 implementation

I just released my SD1 simulation, and now it is time to explain why is inside this plugin. Not everything in the original pedal was simulated, and different schemes were used to tackle the different stages.

Block diagram

SD1 annotated schematic

In blue, I displayed what I actually implemented:

• the drive section, what creates the SD1 specific tone
• the tone section, with its specific filter
• the level section, simple enough

In violet, what I didn’t implement:

• the input buffer stage, because it is mainly resistors, so no spectral differences
• the output buffer stage, for the same reasons, except for the HP filter that I implemented differently
• the latch to switch the effect on/bypassed

Drive section

As all drive sections, there are different ways of handling the non linearities. As for me, I like to implement it as an ODE and solve it with a trapezoidal approximation and a Newton-Raphson optimization. The equations are a little bit different from the Simple Overdrive, but they are quite close. The main difference is that the clipper is not symmetric, which means it will introduce not only odd harmonics, but also even harmonics.

Let’s start with a visual comparison of the result, compared to the simple overdrive:

 Simple overdrive for a 100Hz sinus SD1 overdrive for a 100Hz sinus

The two signals look quite different, and for a good reason: the SD1 simulation adds the distorted signal to the original one. But still, the simple overdrive had one sharp edge, and then a slow one, whereas the SD1 has two sharp edges. It also translates in the numerical behavior of the filter: if the simple overdrive behaved nicely on a sin sweep with an oversampling of x4, without going over 10 iterations, SD1 doesn’t converge in less than 50 iterations for frequencies higher than 7 kHz!

Let’s now compare the spectrum of those two overdrives:

Sine sweep behavior for the simple overdrive section

Sine sweep behavior for the SD1 overdrive section

It is clear that the SD1 section is noisier than the simple overdrive (oversampling x8). You can still see the small even harmonics on the SD1 spectrogram. If the drive level is turned down, the even harmonics are actually more pronounced than with drive level turned up.

Tone section

It’s now time for the tone section. Let’s just display the different spectrums:

Spectrum of the SD1 tone section

In all cases, the high frequencies are attenuated a little bit, but the main effect of the tone section is handling the medium frequencies.

Wrap-up

The SD1 plugin consists of several ATK plugins:

• Drive section
1. Oversampling filter (6 points, order 5) x8
2. SD1 overdrive
3. Butterworth low-pass filter (Fc=20kHz, order 6)
4. Decimation filter
• Tone section
1. SD1 tone filter
2. High-pass filter (Chamberlin filter, Fc=20Hz)
• Level section (simple volume filter)

With the small oversampling, the computation cost is manageable for a DAW usage (IMHO).

Thanks to Florent Keller for the GUI design and Oli Larkin for WDL-OL (VST2/VST3/AU support)

Buy Me a Coffee!

Other Amount:

numfocus

John Hunter Technology Fellowship awarded

We’re thrilled to announce that the 2014 John Hunter Fellowship has been awarded to Olga Botvinnik, UC San Diego. The focus of her Fellowship will be to create open source analysis software for the single-cell and biology communities, and to pioneer data, code sharing, and computational reproducibility within the single-cell and RNA biology communities. She will take up this Fellowship in the second half of 2014, and collaborate with her academic mentor Gene Yeo and software mentor C. Titus Brown. You can read more about Olga’s work at her site.

The selection process was not an easy one - there were a number of high quality applicants that we would have loved to be able to award a fellowship. Honorable Mentions go to Damien Irving (University of Melbourne) and Connie Gao (MIT).

This first Numfocus Fellowship Program has gotten off to a promising start. We’re aiming to expand the program in the coming years to 5-10 fellowships each year.

June 28, 2014

Titus Brown

Replication and quality in science - another interview

Nik Sultana, a postdoc in Cambridge, asked me some questions via e-mail, and I asked him if it would be OK for me to publish them on my blog. He said yes, so here you go!

1. How is the quality of scientific software measured? Is there a "bug index", where software loses points if it's found to contain serious bugs in, say, a 6-month period? If not a "bug index" (with negative points) then something like an "openness index" perhaps, with points gained by more better-quality code being available?

There is no formal measurement. I've put some points here -- http://ivory.idyll.org/blog/ladder-of-academic-software-notsuck.html. Happy to discuss more, but there's nothing formal ;)

1. The attendee list for the CW14 workshop seemed to include more bio places represented. Are there any subjects/communities within science that are more conscious of the need to share code and replicate computations than others, in your experience?

Hmm, genomics and bioinformatics are actually pretty good. I think the code quality sucks, but the point of sharing is obvious to genomicists -- so much of what we know about biology comes not from de novo analysis but rather from comparative analysis across species and experiments.

Chemists seem to be worse.

1. I only came across initiatives such as SSI, myexperiment, figshare, datadryad, etc after I submitted my thesis. Do you think I should ask the uni (and previous places where I studied) for my money back? (No just kidding, please skip to the next question.)

Yes. :)

1. What reassurance would you give to members of the public regarding the state of reproducibility in science? In recent writing on scientific reproducibility it is often pointed out that there have been a spate of rather high-profile retractions and corrections. How do we know there isn't some research-equivalent of Bernie Madoff somewhere?

I've seen many assurances that most unreproducible stuff seems to be unintentional - the result of sloppy work habits, etc. Moreover, in my experience, most scientists are clearly aware of the goal of reproducibility, they just don't know how to do it (esp in computational work). That having been said, anyone who fabricates data or results is going to keep very quiet about it, and I've certainly heard a number of stories.

A Bernie Madoff-style scandal is tough in a field where no one really has that much money or that high a profile. That having been said, look up Hendrik Schoen and Diederik Stapel...

1. You've done a PhD, and have worked on various initiatives on establishing good practices in science since then. What one-sentence piece of advice would you give to young researchers starting out in science, on the theme of reproducibility of their experiments?

Just do it; your life and career will be richer for it.

1. Do you think that the currently established system in academia or industrial research takes openness + sharing into account for career advancement? What incentives are in place, or what incentives should be made in place?

Hahahahahahaha!

No. This is a longer discussion -- I can send you another discussion I had with a Science magazine journalist if you're interested (since posted -- see my response to Eli Kintisch) -- but everything is indirect. My openness has been great but only in an indirect fashion, in the sense that my primary metrics (grants and papers) have benefitted.

For true change, the funding agencies and journal article reviewers need to provide the incentives. NIH is starting to step up. Also see the new REF guidelines in the UK re open access -- you can see how the incentives are working there.

1. Even if we had stable and painless-to-use technology and procedures for sharing code + data, there might be hold-back for two reasons:

• Commercial interests might be at conflict with openness, since the latter can disadvantage commercial exploitation.
• Scientists might fear giving other scientists an advantage, or having a mistake found in their work.

Are these fundamental limitations of the "human science system", as it were, or are there ways around them do you think?

For (a), openness does not conflict with intellectual property rights. So I think this is simply not a concern.

For (b), those scientists seem to miss the point of science, which is to build upon others' work. This is where incentives to be open can be most useful.

1. Do you track, or have you come across or heard of, similar infrastructure needs in humanities subjects? Presumably these subjects do need storage and collaboration technology, but maybe don't need computation as much as scientific subjects.

Digital Humanities. Talk to Ethan Watrall.

--titus

Randy Olson

How to make beautiful data visualizations in Python with matplotlib

It’s been well over a year since I wrote my last tutorial, so I figure I’m overdue. This time, I’m going to focus on how you can make beautiful data visualizations in Python with matplotlib.

There are already tons of tutorials on how to make basic plots in matplotlib. There’s even a huge example plot gallery right on the matplotlib web site, so I’m not going to bother covering the basics here. However, one aspect that’s missing in all of these tutorials and examples is how to make a nice-looking plot.

Below, I’m going to outline the basics of effective graphic design and show you how it’s done in matplotlib. I’ll note that these tips aren’t limited to matplotlib; they apply just as much in R/ggplot2, matlab, Excel, and any other graphing tool you use.

Less is more

The most important tip to learn here is that when it comes to plotting, less is more. Novice graphical designers often make the mistake of thinking that adding a cute semi-related picture to the background of a data visualization will make it more visually appealing. (Yes, that graphic was an official release from the CDC.) Or perhaps they’ll fall prey to more subtle graphic design flaws, such as using an excess of chartjunk that their graphing tool includes by default.

At the end of the day, data looks better naked. Spend more time stripping your data down than dressing it up. Darkhorse Analytics made an excellent GIF to explain the point:

(click on the GIF for a gfycat version that allows you to move through it at your own pace)

Antoine de Saint-Exupery put it best:

Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.

You’ll see this in the spirit of all of my plots below.

Color matters

The default color scheme in matplotlib is pretty ugly. Die-hard matlab/matplotlib fans may stand by their color scheme to the end, but it’s undeniable that Tableau’s default color scheme is orders of magnitude better than matplotlib’s.

Use established default color schemes from software that is well-known for producing beautiful plots. Tableau has an excellent set of color schemes to use, ranging from grayscale to colored to color blind-friendly. Which brings me to my next point…

Many graphic designers completely forget about color blindness, which affects over 5% of the viewers of their graphics. For example, a plot using red and green to differentiate two categories of data is going to be completely incomprehensible for anyone with red-green color blindness. Whenever possible, stick to using color blind-friendly color schemes, such as Tableau’s “Color Blind 10.”

Required libraries

You’ll need the following Python libraries installed to run this code, and all of the code should be run inside an IPython Notebook:

• IPython
• matplotlib
• pandas

The Anaconda Python distribution provides an easy double-click installer that includes all of the libraries you’ll need.

Blah, blah, blah… let’s get to the code

Now that we’ve covered the basics of graphic design, let’s dive into the code. I’ll explain the “what” and “why” of each line of code with inline comments.

Line plots

%pylab inline
from pandas import read_csv

# Read the data into a pandas DataFrame.

# These are the "Tableau 20" colors as RGB.
tableau20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),
(44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),
(148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),
(227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),
(188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]

# Scale the RGB values to the [0, 1] range, which is the format matplotlib accepts.
for i in range(len(tableau20)):
r, g, b = tableau20[i]
tableau20[i] = (r / 255., g / 255., b / 255.)

# You typically want your plot to be ~1.33x wider than tall. This plot is a rare
# exception because of the number of lines being plotted on it.
# Common sizes: (10, 7.5) and (12, 9)
figure(figsize=(12, 14))

# Remove the plot frame lines. They are unnecessary chartjunk.
ax = subplot(111)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)

# Ensure that the axis ticks only show up on the bottom and left of the plot.
# Ticks on the right and top of the plot are generally unnecessary chartjunk.
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()

# Limit the range of the plot to only where the data is.
# Avoid unnecessary whitespace.
ylim(0, 90)
xlim(1968, 2014)

# Make sure your axis ticks are large enough to be easily read.
# You don't want your viewers squinting to read your plot.
yticks(range(0, 91, 10), [str(x) + "%" for x in range(0, 91, 10)], fontsize=14)
xticks(fontsize=14)

# Provide tick lines across the plot to help your viewers trace along
# the axis ticks. Make sure that the lines are light and small so they
# don't obscure the primary data lines.
for y in range(10, 91, 10):
plot(range(1968, 2012), [y] * len(range(1968, 2012)), "--", lw=0.5, color="black", alpha=0.3)

# Remove the tick marks; they are unnecessary with the tick lines we just plotted.
plt.tick_params(axis="both", which="both", bottom="off", top="off",
labelbottom="on", left="off", right="off", labelleft="on")

# Now that the plot is prepared, it's time to actually plot the data!
# Note that I plotted the majors in order of the highest % in the final year.
majors = ['Health Professions', 'Public Administration', 'Education', 'Psychology',
'Foreign Languages', 'English', 'Communications\nand Journalism',
'Art and Performance', 'Biology', 'Agriculture',
'Social Sciences and History', 'Business', 'Math and Statistics',
'Architecture', 'Physical Sciences', 'Computer Science',
'Engineering']

for rank, column in enumerate(majors):
# Plot each line separately with its own color, using the Tableau 20
# color set in order.
plot(gender_degree_data.Year.values,
gender_degree_data[column.replace("\n", " ")].values,
lw=2.5, color=tableau20[rank])

# Add a text label to the right end of every line. Most of the code below
# is adding specific offsets y position because some labels overlapped.
y_pos = gender_degree_data[column.replace("\n", " ")].values[-1] - 0.5
if column == "Foreign Languages":
y_pos += 0.5
elif column == "English":
y_pos -= 0.5
elif column == "Communications\nand Journalism":
y_pos += 0.75
elif column == "Art and Performance":
y_pos -= 0.25
elif column == "Agriculture":
y_pos += 1.25
elif column == "Social Sciences and History":
y_pos += 0.25
elif column == "Business":
y_pos -= 0.75
elif column == "Math and Statistics":
y_pos += 0.75
elif column == "Architecture":
y_pos -= 0.75
elif column == "Computer Science":
y_pos += 0.75
elif column == "Engineering":
y_pos -= 0.25

# Again, make sure that all labels are large enough to be easily read
# by the viewer.
text(2011.5, y_pos, column, fontsize=14, color=tableau20[rank])

# matplotlib's title() call centers the title on the plot, but not the graph,
# so I used the text() call to customize where the title goes.

# Make the title big enough so it spans the entire plot, but don't make it
# so big that it requires two lines to show.

# Note that if the title is descriptive enough, it is unnecessary to include
# axis labels; they are self-evident, in this plot's case.
text(1995, 93, "Percentage of Bachelor's degrees conferred to women in the U.S.A."
", by major (1970-2012)", fontsize=17, ha="center")

# Always include your data source(s) and copyright notice! And for your
# data sources, tell your viewers exactly where the data came from,
# preferably with a direct link to the data. Just telling your viewers
# that you used data from the "U.S. Census Bureau" is completely useless:
# the U.S. Census Bureau provides all kinds of data, so how are your
# viewers supposed to know which data set you used?
text(1966, -8, "Data source: nces.ed.gov/programs/digest/2013menu_tables.asp"
"\nAuthor: Randy Olson (randalolson.com / @randal_olson)"
"\nNote: Some majors are missing because the historical data "
"is not available for them", fontsize=10)

# Finally, save the figure as a PNG.
# You can also save it as a PDF, JPEG, etc.
# Just change the file extension in this call.
# bbox_inches="tight" removes all the extra whitespace on the edges of your plot.
savefig("percent-bachelors-degrees-women-usa.png", bbox_inches="tight");


Line plots with error bars

%pylab inline
from pandas import read_csv
from scipy.stats import sem

# This function takes an array of numbers and smoothes them out.
# Smoothing is useful for making plots a little easier to read.
def sliding_mean(data_array, window=5):
data_array = array(data_array)
new_list = []
for i in range(len(data_array)):
indices = range(max(i - window + 1, 0),
min(i + window + 1, len(data_array)))
avg = 0
for j in indices:
avg += data_array[j]
avg /= float(len(indices))
new_list.append(avg)

return array(new_list)

# Due to an agreement with the ChessGames.com admin, I cannot make the data
# for this plot publicly available. This function reads in and parses the
# chess data set into a tabulated pandas DataFrame.

# These variables are where we put the years (x-axis), means (y-axis), and error bar values.
# We could just as easily replace the means with medians,
# and standard errors (SEMs) with standard deviations (STDs).
years = chess_data.groupby("Year").PlyCount.mean().keys()
mean_PlyCount = sliding_mean(chess_data.groupby("Year").PlyCount.mean().values,
window=10)
sem_PlyCount = sliding_mean(chess_data.groupby("Year").PlyCount.apply(sem).mul(1.96).values,
window=10)

# You typically want your plot to be ~1.33x wider than tall.
# Common sizes: (10, 7.5) and (12, 9)
figure(figsize=(12, 9))

# Remove the plot frame lines. They are unnecessary chartjunk.
ax = subplot(111)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

# Ensure that the axis ticks only show up on the bottom and left of the plot.
# Ticks on the right and top of the plot are generally unnecessary chartjunk.
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()

# Limit the range of the plot to only where the data is.
# Avoid unnecessary whitespace.
ylim(63, 85)

# Make sure your axis ticks are large enough to be easily read.
# You don't want your viewers squinting to read your plot.
xticks(range(1850, 2011, 20), fontsize=14)
yticks(range(65, 86, 5), fontsize=14)

# Along the same vein, make sure your axis labels are large
# enough to be easily read as well. Make them slightly larger
# than your axis tick labels so they stand out.
ylabel("Ply per Game", fontsize=16)

# Use matplotlib's fill_between() call to create error bars.
# Use the dark blue "#3F5D7D" as a nice fill color.
fill_between(years, mean_PlyCount - sem_PlyCount,
mean_PlyCount + sem_PlyCount, color="#3F5D7D")

# Plot the means as a white line in between the error bars.
# White stands out best against the dark blue.
plot(years, mean_PlyCount, color="white", lw=2)

# Make the title big enough so it spans the entire plot, but don't make it
# so big that it requires two lines to show.
title("Chess games are getting longer", fontsize=22)

# Always include your data source(s) and copyright notice! And for your
# data sources, tell your viewers exactly where the data came from,
# preferably with a direct link to the data. Just telling your viewers
# that you used data from the "U.S. Census Bureau" is completely useless:
# the U.S. Census Bureau provides all kinds of data, so how are your
# viewers supposed to know which data set you used?
xlabel("\nData source: www.ChessGames.com | "
"Author: Randy Olson (randalolson.com / @randal_olson)", fontsize=10)

# Finally, save the figure as a PNG.
# You can also save it as a PDF, JPEG, etc.
# Just change the file extension in this call.
# bbox_inches="tight" removes all the extra whitespace on the edges of your plot.
savefig("evolution-of-chess/chess-number-ply-over-time.png", bbox_inches="tight");


Histograms

%pylab inline
from pandas import read_csv

# Due to an agreement with the ChessGames.com admin, I cannot make the data
# for this plot publicly available. This function reads in and parses the
# chess data set into a tabulated pandas DataFrame.

# You typically want your plot to be ~1.33x wider than tall.
# Common sizes: (10, 7.5) and (12, 9)
figure(figsize=(12, 9))

# Remove the plot frame lines. They are unnecessary chartjunk.
ax = subplot(111)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

# Ensure that the axis ticks only show up on the bottom and left of the plot.
# Ticks on the right and top of the plot are generally unnecessary chartjunk.
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()

# Make sure your axis ticks are large enough to be easily read.
# You don't want your viewers squinting to read your plot.
xticks(fontsize=14)
yticks(range(5000, 30001, 5000), fontsize=14)

# Along the same vein, make sure your axis labels are large
# enough to be easily read as well. Make them slightly larger
# than your axis tick labels so they stand out.
xlabel("Elo Rating", fontsize=16)
ylabel("Count", fontsize=16)

# Plot the histogram. Note that all I'm passing here is a list of numbers.
# matplotlib automatically counts and bins the frequencies for us.
# "#3F5D7D" is the nice dark blue color.
# Make sure the data is sorted into enough bins so you can see the distribution.
hist(list(chess_data.WhiteElo.values) + list(chess_data.BlackElo.values),
color="#3F5D7D", bins=100)

# Always include your data source(s) and copyright notice! And for your
# data sources, tell your viewers exactly where the data came from,
# preferably with a direct link to the data. Just telling your viewers
# that you used data from the "U.S. Census Bureau" is completely useless:
# the U.S. Census Bureau provides all kinds of data, so how are your
# viewers supposed to know which data set you used?
text(1300, -5000, "Data source: www.ChessGames.com | "
"Author: Randy Olson (randalolson.com / @randal_olson)", fontsize=10)

# Finally, save the figure as a PNG.
# You can also save it as a PDF, JPEG, etc.
# Just change the file extension in this call.
# bbox_inches="tight" removes all the extra whitespace on the edges of your plot.
savefig("chess-elo-rating-distribution.png", bbox_inches="tight");


Easy interactives

As an added bonus, thanks to plot.ly, it only takes one more line of code to turn your matplotlib plot into an interactive.

More Python plotting libraries

In this tutorial, I focused on making data visualizations with only Python’s basic matplotlib library. If you don’t feel like tweaking the plots yourself and want the library to produce better-looking plots on its own, check out the following libraries.

Edward Tufte has been a pioneer of the “simple, effective plots” approach. Most of the graphic design of my visualizations has been inspired by reading his books.

The Visual Display of Quantitative Information is a classic book filled with plenty of graphical examples that everyone who wants to create beautiful data visualizations should read.

Envisioning Information is an excellent follow-up to the first book, again with a plethora of beautiful graphical examples.

There are plenty of other books out there about beautiful graphical design, but the two books above are the ones I found the most educational and inspiring.

Want to know how I made any of my other plots? Leave a comment and put in a request.

Maheshakya Wijewardena

An illustration of the functionality of the LSH Forest

Before digging into more technical detail on the implementation of LSH forest, I thought it would be useful to provide an insightful illustration on how the best candidates are selected from the fitted data points.

For this task, I decided to use a randomly generated dataset of the sample size of 10000 in the two dimensional space. The reason of using two dimensions is that it is convenient to depict the vectors in 2D space and the a normal human being can easily grasp the idea of convergence in a 2D space.

Following configuration of the LSH forest has been selected for this illustration.

• Number of trees = 1
• Hash length= 32
• c = 1
• Lower bound of hash length at termination = 4
• Expecting number of neighbors = 10

You can get an idea of these parameters from my previous article on the LSH forest

The following illustrations show the convergence of data points towards the query point as the considered hash length is increased. The important fact I want to emphasize here is that candidates chosen by matching the most significant hash bits converge into the actual data point we are interested in. This happens because of the amplification property of Locality sensitive hashing algorithm.

(Beware! The query point is in RED)

Only if the required number of candidates are not reached during the early sweeps, the algorithm will search for more candidates in smaller matching hash lengths. The the best neighbors are selected from that set of candidates by calculating the actual distance.

In my next post, I will enlighten you about the subtle optimizations done on the LSH forest data structure to find the best candidates at a maximum speed.

June 27, 2014

Philip Herron

redbrain812

Hey so one thing that bugged me and probably others if you join a company on a huge legacy code base and the code formatting or style is all over the place. Example tabs vs spaces, gnu vs linux vs something else.And you open it in vim or in gedit or emacs and it all looks different in each editor.

So i decided to learn a little lisp:

(defun format-me ()
(c-set-style "linux") ;; preferred c style
(indent-region (point-min) (point-max) nil) ;; format it
(untabify (point-min) (point-max)) ;; untabify
(save-buffer) ;;save
)


So this emacs lisp function will do the formatting i want on a specified file only trouble i had was, i was unable to figure out how to in emacs lisp recursively walk the directory tree to find all the *.c *.cc *.cpp *.h *.hh to run this on, so i ended up using bash so example to run this code from bash on a file i simple run:

$emacs -batch <file-to-format> -l ~/format-lisp-code.el -f format-me  This runs and does it in 1 go so all i had to do now was find the files i wanted to reformat: $ find -type f -regex ".*/.*\.$$c\|cpp\|cc\|h\|hh$$"


This returned me all i cared about in our code base then all i had to do was run:

for i in find -type f -regex ".*/.*\.$$c\|cpp\|cc\|h\|hh$$"; do emacs -batch $i -l ~/format-lisp-code.el -f format-me; done  And voila magic emacs you are so amazing! June 26, 2014 Titus Brown What about all those genes of unknown function? I'm at the 2014 Marine Microbes Gordon Conference right now, and at the end of my talk, I brought up the point that the function of most genes is unknown. It's not a controversial point in any community that does environmental sequencing, but I feel it should be mentioned at least once during every session on metagenome sequencing :). The lack of functional information for the vast majority of genes is, in my view, the broadest and biggest challenge facing environmental microbiology. Known colloquially as "microbial dark matter" (ref and Nature News), it is fair to say that we have virtually no window into what the majority of genes do. This is particularly a problem now that we can readily access them with sequencing, and several fields are racking up hundreds or thousands of data sets that are largely uninterpretable from a functional perspective. So what are our options? What can we do to characterize new genes? There seem to be two poles of opinions: many experimentalists argue that we need to devote significant efforts to doing more microbial physiology, which is, after all, how we know most of what we already know. People at the other pole seems to think that if we do enough sequencing, eventually meaning will emerge - enough correlations will turn into causation. (While these are obviously caricatures, I think they capture most of the range of opinions I've heard, so I like 'em ;). Neither course seems likely to me. Nobody is going to fund hundreds or thousands of graduate student projects to characterize the physiology of individual microbial species, which is more or less the scale of effort needed. Similarly, while the sequencing folk have clearly been "winning" (in the sense that there's a lot of sequencing being done!) there's a growing backlash against large-scale sequencing without a fairly pointed set of hypotheses behind them. This backlash can be understand as a natural development -- the so-called trough of disillusionment in the adoption and understanding of new technologies -- but that makes it no less real. Over the past few years, I've had the opportunity to discuss and debate a variety of approaches to characterizing gene function in microbes. Since I'm thinking about it a lot during this meeting, I thought I'd take the time to write down as many of the ideas as I can remember. There are two purposes to this -- first, I'm trawling for new ideas, and second, maybe I can help inspire people to tackle these challenges! Without further ado, here is my list, broken down into somewhat arbitrary categories. Experimental exploration and determination of gene function 1. Finding genetic systems and doing the hard work. This is essentially the notion that we should focus in on a few model systems that are genetically tractable (culturable, transformable, and maybe even possessed of genome editing techniques) and explore, explore, explore. I'm not sure which microbes are tractable, or could be made tractable, but I gather we are lacking model systems representative of a broad swath of marine microbes, at least. The upsides to this approach are that we know how to do it, and all of the modern -omics tools can be brought to bear to accelerate progress: genome, transcriptome, and proteome sequencing, as well as metabolomics. The downsides are that this approach is slow to start, time consuming, and not particularly scalable. Because of that I'm not sure there's much support for funding. 2. Transcriptome assisted culture. A persistent challenge for studying microbes is that many of them cannot be easily cultured, which is a soft prerequisite for studying them in the lab. We can't culture them because often we don't know what the right culture conditions are -- what do they eat? Who do they like to hang out with? One of the neater approaches to resolving this is the concept of transcriptome assisted culture, which Irene Newton (@chicaScientific) pointed out to me in this neat PNAS paper on culturing Coxiella. Essentially, Omsland et al. used transcriptome sequencing in conjunction with repeated modifications to the culture medium to figure out what the right growth medium was. In addition to converting an important biomedical pathogen into something more easily culturable, the authors gained important insights into its basic metabolism and the interaction of Coxiella with its host cells. Upsides? It's an approach that's readily enabled by modern -omics tools, and it should be broadly applicable. Downsides? Time consuming and probably not that scalable. However, it's a rather sexy approach to the hard slog of understanding organisms (and you can argue that it's basically the same as the model organism approach) so it's marginally more fundable than just straight up physiology studies. 3. Enrichment cultures. Another culture-based approach is the enrichment culture, in which a complex microbial community (presumably capable of driving many different biogeochemical processes) is grown in a more controlled environment, usually one enriched for a particular kind of precursor. This can be done with a flow reactor approach where you feed in precursors and monitor the composition of the outflow, or just by adding specific components to a culture mix and seeing what grows. For one example of this approach, see Oremland et al., 2005, in which the authors isolated a microbe, Halarsenatibacter silvermanii, which metabolized arsenic. They did this by serial transfer of the cells into a fresh medium and then purifying the organism that persistently grew through serial dilution at the end. This is a bit of a reverse to the previous methods, where the focus was on a specific organism and figuring out how it worked; here, you can pick a condition that you're interested in and figure out what grows in it. You can get both simplified communities and potentially even isolates that function in specific conditions. (Also see Winogradsky columns for a similar environment that you could study.) You still need to figure out what the organisms do and how they do it, but you start with quite a bit more information and technology than you would otherwise - most importantly, the ability to maintain a culture! Pros: this is actually a lot more scalable than the model-organism or culture-focused techniques above. You could imagine doing this on a large scale with a fairly automated setup for serial transfer, and the various -omics techniques could yield a lot of information for relatively little per-organism investment. Someone would still need to chase down the genes and functions involved, but I feel like this could be a smallish part of a PhD at this point. Cons: it's not terribly hypothesis driven, which grant agencies don't always like; and you might find that you don't get that much biological novelty out of the cultures. 4. Functional metagenomics You can also understand what genes do by putting them into tractable model organisms. For example, one of the ways that Ed DeLong's group showed that proteorhodopsin probably actually engaged in photosynthesis was by putting the gene in E. coli. At the time, there was no way to investigate the gene (from an uncultured SAR86) in its host organism, so this was the only way they could "poke" at it. A significant and important extension of this idea is to transfer random fragments from metagenomic fosmid or BAC libraries into large populations of (say) E. coli, and then do a selection experiment to enrich for those critters that can now grow in new conditions. For example, see this paper on identifying the gene behind the production of certain antibiotics (hat tip to Juan Ugalde (@JuanUgaldeC for the reference). Also see the "heterologous expression" paragraph in Handelsman (2004), or this other antibiotic resistance paper from Riesenfeld et al. (2004) (hat tips to Pat Schloss (@Pat Schloss), Jeff Gralnick (@bacteriality), and Daan Speth (@daanspeth) for the refs!). Pros: when it works, it's awesome! Cons: most genes function in pathways, and unless you transfer in the whole pathway, an individual gene might not do anything. This has been addressed by transferring entire fosmids with whole operons on them between microbes, and I think this is still worth trying, but (to me) it seems like a low-probability path to success. I could be wrong. 5. Synthetic biology Why not just build a new critter genome using synthetic biology approaches, and see how it works? This is a radical extension of the previous idea of transferring genes between different organisms. Since we can now print long stretches of DNA on demand, why not engineer our own pathways and put them into tractable organisms to study in more detail? I think this is one of the more likely ideas to ultimately work out, but it has a host of problems. For one thing, you need to have strong and reliable predictions of gene function. For another, not all microbes will be able to execute all pathways, for various biochemical reasons. So I expect the failure rate of this approach to be quite high, at least at first. Pros: when it works, it'll be awesome! And, unlike the functional metagenomics approach, you can really engineer anything you want - you don't need to find all your components already assembled in a PCR product or fosmid. Cons: expensive at first, and likely to have a high failure rate. Unknown scalability, but probably can be heavily automated, especially if you use selection approaches to enrich for organisms that work (see previous item). Computational exploration and determination of gene function 1. Metabolic modeling Look at the genome, feed it into a model of metabolism, and try to understand what genes are doing and what genes are missing. Metabolic flux analysis provides one way to quickly identify whether a given gene complement is sufficient to "explain" observed metabolism, but I'm unsure of how well it works for badly annotated genomes (my guess? badly ;). You can marry this kind of metabolic analysis with the kind of nifty fill-in-the-blank work that Valerie de Crecy-Lagard does -- I met Valerie a few years back on a visit to UFL, and thought, hey, we need hundreds of people like her! Valerie tracks down "missing" pathway genes in bacterial genomes, using a mixture of bioinformatics and experimental techniques. This is going to be important if you're predicting metabolic activity based on the presence/absence of annotated genes. In practice, this is going to be much easier in organisms that are phylogenetically closer to model systems, where we can make better use of homology to identify likely mis-annotated or un-annotated genes. It also doesn't help us identify completely new functions except by locating missing energy budgets. Pros: completely or largely computational and hence potentially quite scalable. Cons: completely or largely computational, so unlikely to work that well :). Critically dependent on prior information, which we already know is lacking. And hard or impossible to validate; until you get to the point where on balance the predictions are not wrong, it will be hard to get people to consider the expense of validation. 2. Gene-centric metabolic modeling Rather than trying to understand how a complete microbe works, you can take your cue from geochemistry and try to understand how a set of genes (and transcripts, and proteins) all cooperate to execute the given biogeochemistry. The main example I know of this is from Reed et al. 2013, with Julie Huber (@JulesDeep) and Greg Dick. Pros: completely or largely computational and hence potentially quite scalable. Cons: requires a fair bit of prior information. But perhaps easier to validate, because you get predictions that are tied closely to a particular biogeochemistry that someone already cares about. 3. Sequence everything and look for correlations. This is the quintessential Big Data approach: if we sequence everything, and then correlate gene presence/absence/abundance with metadata and (perhaps) a smattering of hypotheses and models, then we might be able to guess at what genes are doing. Aaron Garoutte (@AaronGoroutte) made the excellent point that we could use these correlations as a starting point to decide which genes to invest more time and energy in analyzing. When confronted with 100s of thousands of genes -- where do you start? Maybe with the ones that correlate best with the environmental features you're most interested in ... Pros: we're doing the sequencing anyway (although it's not clear to me that the metadata is sufficient to follow through, and data availability is a problem). Does not rely on prior information at all. Cons: super unlikely to give very specific predictions; much more likely to provide a broad range of hypotheses, and we don't have the technology or scientific culture to do this kind of work. 4. Look for signatures of positive selection across different communities. This is an approach suggested by Tom Schmidt and Barry Williams, for which there is a paper soon to be submitted by Bjorn Ostman and Tracy Teal et al. The basic idea is to look for signatures of adaptive pressures on genes in complex metagenomes, in situations where you believe you know what the overall selection pressure is. For example, in nitrogen-abundant situations you would expect different adaptive pressures on genes than in more nitrogen-limited circumstances, so comparisons between fertilized and unfertilized soils might yield something interesting. Pros: can suggest gene function without relying on any functional information at all. Cons: unproven, and the multiple-comparison problem with statistics might get you. Also, needs experimental confirmation! My favorite idea - a forward evolutionary screen 1. Here's an idea that I've been kicking around for a while with (primarily) Rich Lenski (@RELenski), based on some Campylobacter work with JP Jerome and Linda Mansfield. Take fast evolving organisms (say, pathogens), and evolve them in massive replicate on a variety of different carbon sources or other conditions (plates vs liquid; different host genotypes; etc.) and wait until they can't cross-grow. Then, sequence their genomes and figure out what genes have been lost. You can now assume that genes that are lost are not important for growing in those other conditions, and put them in a database for people to query when they want to know what a gene might not be important for. We saw just this behavior in Campylobacter when we did serial transfer in broth, and then plated it on motility assay plates: Campy lost its motility genes, first reversibly (regulation) and then irreversibly (conversion to pseudogene). Harriet Alexander (@nekton4plankton) pointed out to me that this bears some similarity to the kinds of transposon mutagenesis experiments that were done in many model organisms in the 90s - basically, forward genetics. Absolutely! I have to think through how useful forward genetics would be in this field a bit more thoroughly, though. Pros: can be automated and can scale; takes advantage of massive sequencing; should find lots of genes. Cons: potentially quite expensive; unlikely to discover genes specific to particular conditions of interest; requires a lot of effort for things to come together. So that's my list. Can't we all get along? A need for complementary approaches. I doubt there's a single magical approach, a silver bullet, that will solve the overall problem quickly. Years, probably decades, of blood, sweat, and tears will be needed. I think the best hope, though, is to find ways to take advantage of all the tools at our disposal -- the -omics tools, in particular -- to tackle this problem with reasonably close coordination between computational and experimental and theoretical researchers. The most valuable approaches are going to be the ones that accelerate experimental work by utilizing hypothesis generation from large data sets, targeted data gathering in pursuit of a particular question, and pointed molecular biology and biochemistry experiments looking at what specific genes and pathways do. How much would this all cost? Suppose I was a program manager and somebody gave me \$5m a year for 10 years to make this happen. What would be my Fantasy Grants Agency split? (Note that, to date, no one has offered to give me that much money, and I'm not sure I'd want the gig. But it's a fun brainstorming approach!)

I would devote roughly a third of the money to culture-based efforts (#1-#3), a third to building computational tools to support analysis and modeling (#6-#9), and a third to developing out the crazy ideas (#4, #5, and #10). I'd probably start by asking for a mix of 3 and 5 year grant proposals: 3 years of lots of money for the stuff that needs targeted development, 5 years of steady money for the crazier approaches. Then I'd repeat as needed, trying to balance the craziness with results.

More importantly, I'd insist on pre-publication sharing of all the data within a walled garden of all the grantees, together with regular meetings at which all the grad students and postdocs could mix to talk about how to make use of the data. (This is an approach that Sage Biosciences has been pioneering for biomedical research.) I'd probably also try to fund one or two groups to facilitate the data storage and analysis -- maybe at \$250k a year or so? -- so that all of the technical details could be dealt with. Is \$50m a lot of money? I don't think so, given the scale of the problem. I note that a few years back, the NIH NIAID proposed to devote 1-3 R01s (so \$2-4m total) to centers devoted to exploring the function of 10-20 pathogen genes each, so that's in line with what I'm proposing for tackling a much larger problem. --titus Matthieu Brucher Announcement: Audio TK 0.3.0 Just after the release of ATK SD1, I updated my audio toolkit. I added a few optimizations on overdrive computations, and also for the base filter array exchange. Future releases should encompass compressors and amplifiers. Download link: ATK 0.3.0 Changelog: 0.3.0 * Enhanced the ScalarNewtonRaphson algorithm to accept different precision. Used a more relaxed one by default * Fixed a bug in SD1OverdriveFilter * Optimized array copies between filters * Added documentation 0.2.1 * Fix some issues with deploying ATK * Tone stacks (Bassman and JCM800) with Python wrappers * Updated filters C++ interface when then can process several channels at the same time Buy Me a Coffee! Other Amount: Your Email Address : June 25, 2014 Janani Padmanabhan With those mid-sem bells chiming, it is time for another update. The following checkpoints have been reached: 1. The implementation of ellipsoidal harmonic function (also known as Lame's function): The first kind. The following is the link to the pull request: https://github.com/scipy/scipy/pull/3753 The implementation is in Cython and calls LAPACK subroutine. It is based on the python implementation by Knepley and Bardhan given here Further the absence of Lame's function implementation by any many other libraries there is a challenge in preparation of an extensive test-suite. At present the output of the function for certain range of inputs is tested. The immediate next plan is to try improving the test-suite. This will be followed by the implementation of Ellipsoidal harmonic function: The second kind. 2. Before this the spherical harmonic functions were improved by reimplementing them in Cython rather than python thus improving the speed. The details having been elaborately touched in the previous post, are omitted here, saving people from the boredom due to redundancy. The pull request can be accessed from https://github.com/scipy/scipy/pull/3692 Thanks to the constant support of my brilliant mentors Pauli, Ralf and Stefan, (and I suppose, few miracles!) the progress is as per schedule. Hoping that this pace will be maintained or even better, quickened; post mid-sem evaluation! Signing off till next time, Janani June 24, 2014 Juan Nunez-Iglesias jnuneziglesias This year I am privileged to be a mentor in the Google Summer of Code for the scikit-image project, as part of the Python Software Foundation organisation. Our student, Vighnesh Birodkar, recently came up with a clever use of SciPy’s ndimage.generic_filter that is certainly worth sharing widely. Vighnesh is tasked with implementing region adjacency graphs and graph based methods for image segmentation. He initially wrote specific functions for 2D and 3D images, and I suggested that he should merge them: either with n-dimensional code, or, at the very least, by making 2D a special case of 3D. He chose the former, and produced extremely elegant code. Three nested for loops and a large number of neighbour computations were replaced by a function call and a simple loop. Read on to find out how. Iterating over an array of unknown dimension is not trivial a priori, but thankfully, someone else has already solved that problem: NumPy’s nditer and ndindex functions allow one to efficiently iterate through every point of an n-dimensional array. However, that still leaves the problem of finding neighbors, to determine which regions are adjacent to each other. Again, this is not trivial to do in nD. scipy.ndimage provides a suitable function, generic_filter. Typically, a filter is used to iterate a “selector” (called a structuring element) over an array, compute some function of all the values covered by the structuring element, and replace the central value by the output of the function. For example, using the structuring element:  fp = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], np.uint8)  and the function np.median on a 2D image produces a median filter over a pixel’s immediate neighbors. That is,  import functools median_filter = functools.partial(generic_filter, function=np.median, footprint=fp)  Here, we don’t want to create an output array, but an output graph. What to do? As it turns out, Python’s pass-by-reference allowed Vighnesh to do this quite easily using the “extra_arguments” keyword to generic_filter: we can write a filter function that receives the graph and updates it when two distinct values are adjacent! generic_filter passes all values covered by a structuring element as a flat array, in the array order of the structuring element. So Vighnesh wrote the following function:  def _add_edge_filter(values, g): """Add an edge between first element in values and all other elements of values in the graph g. values[0] is expected to be the central value of the footprint used. Parameters ---------- values : array The array to process. g : RAG The graph to add edges in. Returns ------- 0.0 : float Always returns 0. """ values = values.astype(int) current = values[0] for value in values[1:]: g.add_edge(current, value) return 0.0   Then, using the footprint:  fp = np.array([[0, 0, 0], [0, 1, 1], [0, 1, 0]], np.uint8)   (or its n-dimensional analog), this filter is called as follows on labels, the image containing the region labels:  filters.generic_filter(labels, function=_add_edge_filter, footprint=fp, mode='nearest', extra_arguments=(g,))   This is a rather unconventional use of generic_filter, which is normally used for its output array. Note how the return value of the filter function, _add_edge_filter, is just 0! In our case, the output array contains all 0s, but we use the filter exclusively for its side-effect: adding an edge to the graph g when there is more than one unique value in the footprint. That’s cool. Continuing, in this first RAG implementation, Vighnesh wanted to segment according to average color, so he further needed to iterate over each pixel/voxel/hypervoxel and keep a running total of the color and the pixel count. He used elements in the graph node dictionary for this and updated them using ndindex:  for index in np.ndindex(labels.shape): current = labels[index] g.node[current]['pixel count'] += 1 g.node[current]['total color'] += image[index]   Thus, together, numpy’s nditer, ndindex, and scipy.ndimage’s generic_filter provide a powerful way to perform a large variety of operations on n-dimensional arrays… Much larger than I’d realised! You can see Vighnesh’s complete pull request here and follow his blog here. If you use NumPy arrays and their massive bag of tricks, please cite the paper below! Stefan Van Der Walt, S. Chris Colbert, & Gaël Varoquaux (2011). The NumPy array: a structure for efficient numerical computation Computing in Science and Engineering 13, 2 (2011) 22-30 arXiv: 1102.1523v1 Matthieu Brucher Announcement: ATK SD1 1.0.0 I’m happy to announce the release of my third audio plugin, the first based on the Audio Toolkit. It is available on Windows and OS X in different formats. The UI was designed by Florent Keller, many thanks to him! ATKSD1 UI The supported formats are: • VST2 (32bits/64bits on Windows, 64bits on OS X) • VST3 (32bits/64bits on Windows, 64bits on OS X) • Audio Unit (64bits, OS X) The files can be downloaded on SourceForge, as well as the source code. Update I upgraded the Audio Toolkit to fix some 32bits cracks, so please download 1.0.1 (link is up to date). Buy Me a Coffee! Other Amount: Your Email Address : June 23, 2014 Issam Laradji Mid-term Summary GSoC 2014: Extending Neural Networks Module for Scikit-Learn The objective is to implement neural network algorithms in a clean, well-tested code using the scikit-learn API. The algorithms are meant to be user-friendly and easy to edit and scale for those who wish to extend, or debug them if necessary. Since the start of GSoC 2014 until now, I completed two modules, multi-layer perceptron (mlp) #3204 and mlp with pre-training #3281, which are pending final review for merging. I also implemented the extreme learning machine (elm) algorithm #3306 which hasn't been reviewed yet and more components such as test files, examples, and documentations are required. However, I am confident that I will complete it by the deadline I set in the proposal. In the following three sections, I will explain the modules in more detail. 1) Multi-layer perceptron #3204 Multi-layer perceptron (MLP) supports more than one hidden layer allowing it to construct highly non-linear functions. Figure 1 displays an MLP with 1 hidden layer. Figure 1 To define the number of hidden layers and their neurons, one can simply run the following statement. The list '[150, 100]' means that two hidden layers are constructed with 150 and 100, neurons respectively. Further, MLP can be used for reinforcement learning where each time step makes a new training sample. It can use the partial_fit method to update its weights on per sample basis in real-time (stochastic update). MLP also consists of a regularization term alpha as part of its parameters, whose value determines the degree of non-linearity the function is meant to have. Therefore, if the algorithm is overfitting, it is desirable to increase alpha to have a more linear function. Figure 2 demonstrates the decision boundaries learnt by mlp with different alpha values. Figure 2 Figure 2 shows that the higher the value of alpha, the less curves the decision boundary will have. The implementation has passed through various test cases to prevent unexpected behavior. One of the test cases involves comparing between the algorithm's analytic computation of the gradient and its numerical computation. Since the difference between the values was found to be at most a very small value means the backpropagation algorithm is working as expected. 2) MLP with pre-training #3281 One issue with MLP is that it involves random weights' initialization. The weights could land in a poor position in the optimization (see Figure 3) whose final solutions are not as good as they could be. Figure 3 Pre-training is one scheme to have the initial weights land in a better start. Restricted boltzmann machines (RBMs) can find such initial weights. Figure 4 shows the process of pre-training. Figure 4 For each layer in the network, there is an RBM that trains on the inputs given for that layer. The final weights of the RBM are given as the initial weights of the corresponding layer in the network. Running an example of pre-training has showed that RBMs can improve the final performance. For instance, on the digits the dataset, the following results were obtained. 1) Testing accuracy of mlp without pretraining: 0.964 2) Testing accuracy of mlp with pretraining: 0.978 3) Extreme learning machine (elm) #3306 Much of the criticism towards MLP is in its long training time. MLP uses the slow gradient descent to updates its weights iteratively, involving many demanding computations. Extreme learning machines (ELMs) [1], on the other hand, can train single hidden layer feedforward networks (SLFNs) using least square solutions instead of gradient descent. This scheme requires only few matrix operations, making it much faster than gradient descent. It also has a strong generalization power since it uses least-squares to find its solutions. The algorithm has been implemented and it passed the travis tests. But it still awaits more thorough review and test files to anticipate errors. I believe I will finalize the module by 29 June as per the proposal. Remaining work In the remaining weeks my tasks are broken down as follows. Week 7, 8 (June 30 - July 13) I will implement and revise regularized ELMs [3] and weighted ELMs [4], and extend the ELMs documentation. Week 9, 10 (July 14- July 27) I will implement and revise Sequential ELMs [2], and extend the ELMs documentation. Week 11, 12 (July 28- August 10) I will implement and revise Kernel-Based ELMs, and extend the ELMs documentation. Week 13 - Wrap-up Acknowledgements I would like to thank my mentors and reviewers including @ogrisel, @larsmans @jnothman, @kasternkyle, @AlexanderFabisch for dedicating their time in providing useful feedback and comments, making sure the work meets high-quality standards. I sincerely appreciate the time PSF admins take to oversee the contributers as it encourages us to set a higher bar for quality work. I would also like to thank GSoC 2014, as this wouldn't have been possible if it hadn't been for their support and motivation. References [1] http://www.di.unito.it/~cancelli/retineu11_12/ELM-NC-2006.pdf [2] http://www.ntu.edu.sg/home/egbhuang/pdf/OS-ELM-TNN.pdf [3] http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-Unified-Learning.pdf [4] Zong, Weiwei, Guang-Bin Huang, and Yiqiang Chen. "Weighted extreme learning machine for imbalance learning." Neurocomputing 101 (2013): 229-242. June 22, 2014 Hamzeh Alsalhi Google Summer of Code Midterm Progress Summary The first half of my summer of code has resulted in the implementation of just under half of my goals, one of the planned six pull requests has been completely finalized, two are on their way to being finalized, and two of the remaining three will be started shortly. The final pull request is a more independent feature which I aim to start as soon as I am confident the others are close to being wrapped up. Thank you Arnaud, Joel, Oliver, Noel, and Lars for the time taken to give the constructive criticism that has vastly improved my implementation of these changes to the code base. Sparse Input for Ensemble Methods Gradient Boosted Regression Trees is the one remaining ensemble method that needs work before all of scikit-learns ensemble classifiers support sparse input. The first pull request I made this summer was for sparse input on the AdaBoost ensemble method. The AdaBoost pull request was merged, after AdaBoost sparse input support was completed I have skipped to latter goals with the intention to come back and pickup work on GBRT when a pending pull request for sparse input decision trees is merged which will make it easier to continue work on the sparse input for the ensemble method. PR #3161 - Sparse Input for AdaBoost Status: Completed and Merged Summary of the work done: The ensemble/weighted_boosting class was edited to avoid densifying the input data and to simply pass along sparse data to the base classifiers to allow them to proceed with training and prediction on sparse data. Tests were written to validate correctness of the AdaBoost classifier and AdaBoost regressor when using sparse data by making sure training and prediction on sparse and dense formats of the data gave identical results, as well verifying the data remained in sparse format when the base classifier supported it. Go to the AdaBoost blog post to see the results of sparse input with AdaBoost visualized. PR - Sparse input Gradient Boosted Regression Trees (GBRT) Status: To be started Summary of the work to be done: Very similar to sparse input support for AdaBoost, the classifier will need modification to support passing sparse data to its base classifiers and similar tests will be written to ensure correctness of the implementation. The usefulness of this functionality depends heavily on the sparse support for decision trees which is a pending mature pull request here PR #3173. Sparse Output Support The Sparse Label Binarizer pull request has gone through numerous revisions after being based of existing code written in PR and it contains a large part of the work necessary to support sparse output for One vs. Rest classification. With this support in place many of the binary classifiers in scikit-learn can be used in a one vs. all fashion on sparse target data. Support for sparse target data in the multilabel metrics will be implemented to provide users with metrics while avoiding the need to densify the target data. Finally in attempt to push support for sparse target data past one vs. rest methods I will work on spare target data support for decision trees . PR #3203 - Sparse Label Binarizer Status: Nearing Completion Summary of the work done: The label binarizing function in scikit-learns label code was modified to support conversion from sparse formats and helper functions to this function from the utils module were modified to be able to detect the representation type of the target data when it is in sparse format. Read about the workings of the label binarizer. PR #3276 - Sparse Output One vs. Rest Status: Work In Progress Summary of the work done: The fit and predict functions for one vs. rest classifiers modified to detect sparse target data and handle it without densifying the entire matrix at once, instead the fit function iterates over densified columns of the target data and fits an individual classifier for each column and the predict uses binarizaion on the results from each classifier individually before combining the results into a sparse representation. A test was written to ensure that classifier accuracy was within a suitable range when using sparse target data. PR - Sparse Metrics StatusTo be started Summary of the work done: Modify the metrics and some misc tools to support sparse target data so sparsity can be maintained throughout the entire learning cycle. The tools to be modified include precision score, accuracy score, parameter search, and other metrics listed on scikit-learns model evaluation documentation under the classification metrics header. PR - Decision Tree and Random Forest Sparse Output StatusTo be started Summary of the work done: Make revisions in the tree code to support sparsely formatted target data and update the random forest ensemble method to use the new sparse target data support. Plan for the Coming Weeks In hopes that the sparse label binarizer will be merged soon after making final revisions, early next week I will begin to respond to the reviews of the sparse One vs. Rest pull request and we will also see the beginnings of the sparse metrics pull request which should be wrapped up and ready for reviews in the same week. Following that the next focus will be rewinding to sparse input for ensemble methods and putting a week of work into sparse support for GBRT. Finally the sparse output decision tree pull request will be started when the remaining goals are nearing completion. June 20, 2014 Richard Tsai GSoC2014: The First Month The first month of my GSoC has passed and it is about time to make a summary. The PRs I’ve made 4 Pull-Requests for my GSoC project, 3 of which have been merged. • #3636 MAINT: cluster: some clean up Some cleanups. Mainly for docs. • #3683 ENH: cluster: rewrite and optimize vq in Cython The most challenging part. Discussed in this post already. • #3702 BUG: cluster: _vq unable to handle large features When I started to work on #3712, I noticed that the new version of _vq I wrote in the previous PR(#3683) may give wrong results when the values of the input features are very large. I though it might be an overflow issue since there was a matrix multiplication in the new algorithm. But I was wrong. It was caused by an initialization issue. I used the C macro NPY_INFINITY to initialize the output distances vector. However, ndarray.fill would regard NPY_INFINITY as the “integral infinity” and fill the float (or double) vector with $$2 ^ {31} - 1$$. When the actual minimum distance was greater than such an “infinity”, the result would be an uninitialized value. The currect way to initialize the output distances vector should be outdists.fill(np.inf). • #3712 ENH: cluster: reimplement the update-step of K-means in Cython This PR is my recent work and hasn’t been merged yet. An iteration of K-means requires two steps, the assignment step and the update step. The assignment step, implemented as vq.vq, has been optimized in the previous PRs. But the update step is still time-consuming, especially for low-dimensional data. The current implementation of the update step in SciPy looks like: for j in range(nc): mbs = np.where(label == j) code[j] = np.mean(data[mbs], axis=0)  However, np.where is slow. A much more efficient algorithm would be: for j in range(nobs): code[labels[j], :] += data[j, :] counts[labels[j]] += 1 for j in range(nc): code[j] /= counts[j]  But of course looping through the observation matrix in pure python would be extremely slow. So I implemented it in Cython and archieved decent speedups especially in low-dimensional data. Achievement First of all, I rewrote the underlying part of cluster.vq in Cython and it should be easier to maintain than the original C implementation. Then, the performance. The new algorithms and optimization tricks work well. I built SciPy with ATLAS and tested vq.kmeans(obs, k) on a i5-3470 machine, and I got the following performance statistics.  dataset origin #3683 #3712 speedup 30000 x 100, k = 10, float64 28636.2594 16950.3884 12687.425 2.26 30000 x 100, k = 10, float32 13575.0532 6483.4734 5272.2 2.57 10000 x 30, k = 10, float64 1923.3946 1146.7348 752.193 2.56 10000 x 30, k = 10, float32 1491.879 989.209 658.1064 2.27 100000 x 3, k = 4, float64 1663.143 1181.5912 529.0992 3.14 100000 x 3, k = 4, float32 1278.033 967.4396 431.0248 2.97 10000 x 1, k = 3, float64 117.6908 94.0622 41.6186 2.83 10000 x 1, k = 3, float32 110.2874 98.6826 41.9904 2.63 1000 x 3, k = 4, float64 29.23 28.697 9.8542 2.97 1000 x 3, k = 4, float32 27.5242 29.2488 12.4894 2.20 What’s Next There’re still about two months to go. First, I will continue to work on #3712. This PR still has some issues. There’re some build warnings which seem to be caused by Cython fused types on some machine. Then, I will start to work on cluster.hierarchy. The hierarchy module will be more complicated and challenging then vq and it may take me about a month to rewrite it in Cython. And then I’ll try to implement the SLINK and CLINK algorithm. At last, I want to thank my mentors Ralf, Charles and David, and of course everyone else in the SciPy community, who have helpped me a lot. I’m very glad to be working with you! NeuralEnsemble GSoC Open Source Brain: Retinal Filter II LGN-Retinal Filter II Now that we know how to crate a filter is time to use it to calculate how an LGN neuron would react to an incoming stimulus. In this entry we will create a white noise stimulus in order to see how an LGN neuron reacts to it, this approach has the advantage that we can then recover the filter by reverse correlation methods as a sanity check. In the same spirit of the last post, we will define the spatial and time parameters that determine the lengths and resolutions in those dimensions:  #Time parameters dt = 1.0 # resolution of the response (in milliseconds) dt_kernel = 5.0 # resolution of the kernel (in milliseconds) dt_stimuli = 10.0 # resolution of the stimuli (in milliseconds) kernel_size = 25 # The size of the kernel T_simulation = 2 * 10 ** 2.0 # Total time of the simulation in ms Nt_simulation = int(T_simulation / dt) #Simulation points N_stimuli = int(T_simulation / dt_stimuli) #Number of stimuli # Space parameters dx = 1.0 Lx = 20.0 Nx = int(Lx / dx) dy = 1.0 Ly = 20.0 Ny = int(Ly / dy )  Now, we call our kernel which we have wrapped-up as a function from the work in the last post:  # Call the kernel # Size of center area in the center-surround profile sigma_center = 15 # Size of surround area in the center-surround profile sigma_surround = 3 kernel = create_kernel(dx, Lx, dy, Ly, sigma_surround, sigma_center, dt_kernel, kernel_size)  With this in our hand we can use the numpy random functions to create our white noise stimuli, we use here the realization of white noise call ternary noise which consists on values of -1, 0 and 1 assigned randomly to each pixel in our stimuli:  # Call the stimuli stimuli = np.random.randint(-1, 2, size=(N_stimuli, Nx, Ny))  Before we can proceed to calculate the convolution we need to do some preliminary work. The convolution problem involves three time scales with different resolutions. We have first the resolution of the response dt , the resolution of the kernel dt_kernel and finally the resolution of the stimulus dt_stimuli.Operations with the kernel involve jumping from one scale to another constantly so we need a mechanism to keep track of that. In short, we would like to have a mechanism that transforms from some coordinates to the others in one specific place and not scatter all over the place. Furthermore, in the convolution the kernel is multiplied by a specific point of images for each point in time. For the sake of efficiency we would like to have a mechanism that does this for once. With this in mind I have built a set of indexes for each scale that allow us to associate each element on the indexes of the response to its respective set of images. Also, we have a vector that associates every possible delay time in the kernel to the set of indexes in the response. We illustrate the mechanisms in the next figure We can appreciate the three different times scales in the image. Furthermore, we have a set of indexes called delay indexes that maps each response to its respective image and also other set of indexes called delay indexes that map each of the delays to his respective response. We can create this set of indexes with the following code:  # Scale factors input_to_image = dt / dt_stimuli # Transforms input to image kernel_to_input = dt_kernel / dt # Transforms kernel to input input_to_kernel = dt / dt_kernel # Transforms input to kernel working_indexes = np.arange(Nt_simulation).astype(int) # From here we remove the start at put the ones remove_start = int(kernel_size * kernel_to_input) signal_indexes = np.arange(remove_start, Nt_simulation).astype(int) # Calculate kernel kernel_times = np.arange(kernel_size) kernel_times = kernel_times.astype(int) # Delay indexes delay_indexes = np.floor(kernel_times * kernel_to_input) delay_indexes = delay_indexes.astype(int) # Image Indexes stimuli_indexes = np.zeros(working_indexes.size) stimuli_indexes = np.floor(working_indexes * input_to_image) stimuli_indexes = stimuli_indexes.astype(int)  Now, we can calculate the response of a neuron with a center-surround receptive field by performing the convolution between its filter and the stimuli. We also plot the stimuli to see how it looks:  for index in signal_indexes: delay = stimuli_indexes[index - delay_indexes] # Do the calculation signal[index] = np.sum(kernel[kernel_times,...] * stimuli[delay,...]) t = np.arange(remove_start*dt, T_simulation, dt) plt.plot(t, signal[signal_indexes], '-', label='Kernel convoluted with noise') plt.legend() plt.xlabel('Time (ms)') plt.ylabel('Convolution') plt.grid() plt.show()  We can see that the resolution of the response is as good as the resolution of the filter and this explains the discontinuities in the figure above. As a sanity check we can calculate a voltage triggered average to recover the sta:  ## Calculate the STA kernel_size = kernel_times.size Nside = np.shape(stimuli)[2] sta = np.zeros((kernel_size ,Nside, Nside)) for tau, delay_index in zip(kernel_times, delay_indexes): # For every tau we calculate the possible delay # and take the appropriate image index delay = stimuli_indexes[signal_indexes - delay_index] # Now we multiply the voltage for the appropriate images weighted_stimuli = np.sum( signal[signal_indexes, np.newaxis, np.newaxis] * stimuli[delay,...], axis=0) # Finally we divide for the sample size sta[tau,...] = weighted_stimuli / signal_indexes.size  Which we can plot in a convenient way with the following set of instructions:  ## Visualize the STA closest_square_to_kernel = int(np.sqrt(kernel_size)) ** 2 # Define the color map cdict1 = {'red': ((0.0, 0.0, 0.0), (0.5, 0.0, 0.1), (1.0, 1.0, 1.0)), 'green': ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)), 'blue': ((0.0, 0.0, 1.0), (0.5, 0.1, 0.0), (1.0, 0.0, 0.0)) } from matplotlib.colors import LinearSegmentedColormap blue_red1 = LinearSegmentedColormap('BlueRed1', cdict1) n = int( np.sqrt(closest_square_to_kernel)) # Plot the filters for i in range(closest_square_to_kernel): plt.subplot(n,n,i + 1) plt.imshow(sta[i,:,:], interpolation='bilinear', cmap=blue_red1) plt.colorbar() plt.show()  Philip Herron redbrain812 TLDR: Rant post about maven, java and jenkins plugin development So recently i done some work on the Coverity Plugin for Jenkins. Note: I think coverity is a complete waste of money by the way for C/C++ development the latest GCC/LLVM shows you as much as you need. I’ve got some good experience with Maven from hadoop development. I understand the problem its trying to solve huge multi-module projects have so many dependencies. The java ecosystem is just HUGE, i mean to build this plugin from a clean maven it pulls down hundreds of jar’s as dependencies. Its kind of insane. I want as you read this blog post to remember that, all this plugin actually does is runs your jenkins build as you would expepect but runs some extra coverity tools which you need to manually have installed on your slave automatically. But working with jenkins plugins with maven is sooo horrible. I mean because the pom.xml had issues and warnings i couldn’t figure out how to use it with intellij, the thing would never build but just give really cryptic errors. Then i just go back to emacs, commandline and run mvn compile to add in a MacroExpansionToken then to test it i have a local tomcat instance, so i have the jenkins.war already running there with a helloworld build to test the code. I run mvn clean compile install -DskipTests to create the jenkins .hpi then manually go to jenkins manage plugins and upload it. Then restart jenkins run the build and then find a bug go back to the begining. Its so slow and painful and plugin development is sooooo awful with jenkins. I mean its the problem with java overall, making everything i mean everything so generic, yeah sure you can really easily reuse _any_ piece of code or embed some cool service into an application. But because everything has such generic interfaces programming with them you usually have really obscure abstract interfaces to work with and you need to wrap your head around before you can actually solve the problem. I mean maven isn’t any kind of useful build tool it doesn’t even build software well, its a lump of horrible xml where you don’t even declare what you want it just automatically treats anything in src/main/{java/groovy/test/…} as a .jar unless you want a jar. This xml is really is awful to work with no documentation and a plugin for everything. I mean if you don’t have an IDE like eclipse of intellij java development is almost just entirely impossible these days. There is just so much crap and boilerplate to fill out you won’t remember all of it to do it correctly. It instantly forces tonnes of convention before it actually solves anything. Ant and ivy was so nice to work with before. I dont understand how maven made anything easier i mean i can sort of see it with hadoop it works pretty ok there. But even then does it? With C/C++ or even Python or Node everything is so simple and if you need more complex systems there is alternatives for that. Everything about those languages they are very focused on solving an issue. Golang is about the only language that exists that almost reminds me of maven like a revel project or even just a simple library the dependency management is nice but it forces a fair amount of convention, but i think its bearable because its been like this always not like maven just poping along after a while and everything is quite fragmented. I mean Java or any JVM or any .NET software is the only languages where you can spend literally weeks declaring boilerplate data-flow and object models for weeks. Before you actually write code to do what you need. June 19, 2014 Fabian Pedregosa Surrogate Loss Functions in Machine Learning TL; DR These are some notes on calibration of surrogate loss functions in the context of machine learning. But mostly it is an excuse to post some images I made. In the binary-class classification setting we are given$n$training samples$\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where$X_i$belongs to some sample space$\mathcal{X}$, usually$\mathbb{R}^p$but for the purpose of this post we can keep i abstract, and$y_i \in \{-1, 1\}$is an integer representing the class label. We are also given a loss function$\ell: \{-1, 1\} \times \{-1, 1\} \to \mathbb{R}$that measures the error of a given prediction. The value of the loss function$\ell$at an arbitrary point$(y, \hat{y})$is interpreted as the cost incurred by predicting$\hat{y}$when the true label is$y$. In classification this function is often the zero-one loss, that is,$\ell(y, \hat{y})$is zero when$y = \hat{y}$and one otherwise. The goal is to find a function$h: \mathcal{X} \to [k]$, the classifier, with the smallest expected loss on a new sample. In other words, we seek to find a function$h$that minimizes the expected$\ell$-risk, given by $$\mathcal{R}_{\ell}(h) = \mathbb{E}_{XY}[\ell(Y, h(X))]$$ In theory, we could directly minimize the$\ell$-risk and we would have the optimal classifier, also known as Bayes predictor. However, there are several problems associated with this approach. One is that the probability distribution of$XY$is unknown, thus computing the exact expected value is not feasible. It must be approximated by the empirical risk. Another issue is that this quantity is difficult to optimize because the function$\ell$is discontinuous. Take for example a problem in which$\mathcal{X} = \mathbb{R}^2, k=2$, and we seek to find the linear function$f(X) = \text{sign}(X w), w \in \mathbb{R}^2$and that minimizes the$\ell$-risk. As a function of the parameter$w$this function looks something like This function is discontinuous with large, flat regions and is thus extremely hard to optimize using gradient-based methods. For this reason it is usual to consider a proxy to the loss called a surrogate loss function. For computational reasons this is usually convex function$\Psi: \mathbb{R} \to \mathbb{R}_+$. An example of such surrogate loss functions is the hinge loss,$\Psi(t) = \max(1-t, 0)$, which is the loss used by Support Vector Machines (SVMs). Another example is the logistic loss,$\Psi(t) = 1/(1 + \exp(-t))$, used by the logistic regression model. If we consider the logistic loss, minimizing the$\Psi$-risk, given by$\mathbb{E}_{XY}[\Psi(Y, f(X))]$, of the function$f(X) = X w$becomes a much more more tractable optimization problem: In short, we have replaced the$\ell$-risk which is computationally difficult to optimize with the$\Psi$-risk which has more advantageous properties. A natural questions to ask is how much have we lost by this change. The property of whether minimizing the$\Psi$-risk leads to a function that also minimizes the$\ell$-risk is often referred to as consistency or calibration. For a more formal definition see [1] and [2]. This property will depend on the surrogate function$\Psi$: for some functions$\Psi$it will be verified the consistency property and for some not. One of the most useful characterizations was given in [1] and states that if$\Psi$is convex then it is consistent if and only if it is differentiable at zero and$\Psi'(0) < 0$. This includes most of the commonly used surrogate loss functions, including hinge, logistic regression and Huber loss functions. 1. P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity , Classification , and Risk Bounds,” J. Am. Stat. Assoc., pp. 1–36, 2003. 2. A. Tewari and P. L. Bartlett, “On the Consistency of Multiclass Classification Methods,” J. Mach. Learn. Res., vol. 8, pp. 1007–1025, 2007. June 18, 2014 NeuralEnsemble GSoC Open Source Brain: Retinal Filter I LGN-Retinal filter Treating the retina and the LGN together as a spatio-temporal filter is a common trait in the models that we are going to work with in this project. In this series of articles I am going to develop and explain the mechanism that I have developed to model this particular stage of processing. Structure of the filter In this particular example we are dealing with a separable spatio-temporal filter. That is, we can write our filter as a product of the spatial part and the temporal part. In the following sections, we will describe them in this given order. Spatial filter In order to calculate the spatial filter we need two things. First, how wide our filters are going to be and then the resolution level. So lx and ly will stand for the size of the receptive filters (in degrees) for the x and y direction respectively. The same will go for dx and dy with respect to the resolution. With this in our hands we then can proceed to create a 2 dimensional structure for our functions with the help of the mesh functions:  ## Spatial parameters x = np.arange(-lx/2, lx/2, dx) y = np.arange(-ly/2, ly/2, dy) X, Y = np.meshgrid(x, y) # Create the 2D dimensional coordinates  Now that we have the function it is just a matter of calculating the function with the appropiate formula:  R = np.sqrt(X**2 + Y**2) # Distance center = (17.0 / sigma_center**2) * np.exp(-(R / sigma_center)**2) surround = (16.0 / sigma_surround**2) * np.exp(-(R / sigma_surround)**2) Z = surround - center  With the formula in our hands we can plot it as a countour line to have an idea of how it looks like in space:  # Plot countour map plt.contourf(X, Y, Z, 50, alpha=0.75, cmap=plt.cm.hot) plt.colorbar() C = plt.contour(X, Y, Z, 10, colors='black', linewidth=.5) plt.clabel(C, inline=10, fontsize=10) plt.show()  We can also show a view when y=0 in order to offer a side view of the plot in order to gain further insight in the structure of it: Temporal filter The behaviour of the temporal filters in time is usually called bi-phasic response. This consists in a response that initially grows in time for the mean level but then it goes bellow to the mean level to finally return at it in time window of the order 200ms approximately. The particular function and parameters that we are going to use to illustrate this behaviour were taken from the paper by Cai et Al (Reference bellow). But first, we have to define our kernel size and resolution as in the code down here:  # First the kernel size and resolution kernel_size = 200 dt_kernel = 1.0 t = np.arange(0, kernel_size * dt_kernel, dt_kernel) # Time vector  Now, we construct the function in the following way:  ## Temporal parameters K1 = 1.05 K2 = 0.7 c1 = 0.14 c2 = 0.12 n1 = 7.0 n2 = 8.0 t1 = -6.0 t2 = -6.0 td = 6.0 p1 = K1 * ((c1*(t - t1))**n1 * np.exp(-c1*(t - t1))) / ((n1**n1) * np.exp(-n1)) p2 = K2 * ((c2*(t - t2))**n2 * np.exp(-c2*(t - t2))) / ((n2**n2) * np.exp(-n2)) p3 = p1 - p2  We plot the function to have an idea of how it looks now: Spatio-Temporal Filter Finally, in order to have build the kernel, given that our filter is separable, we just have to multiply the temporal the temporal part by the spatial part at each point in space:  # Initialize and fill the spatio-temporal kernel kernel = np.zeros((kernel_size, int(lx/dx), int(ly/dy))) for k, p3 in enumerate(p3): kernel[k,...] = p3 * Z  Which we can show now in the following plot: References • Cai, Daqing, Gregory C. Deangelis, and Ralph D. Freeman. "Spatiotemporal receptive field organization in the lateral geniculate nucleus of cats and kittens." Journal of Neurophysiology 78.2 (1997): 1045-1061. Continuum Analytics Binstar is Out of Beta, Now Available for Everyone I am very excited to announce the public availability of Binstar. Binstar is a free service for hosting public packages for pip and conda with support for R, npm, and other formats in the near future. Private packages can also be hosted for a nominal monthly subscription. An on-premise version of Binstar is also available as part of Anaconda Server. June 16, 2014 Titus Brown Being a release manager for khmer We just released khmer v1.1, a minor version update from khmer v1.0.1 (minor version update: 220 commits, 370 files changed). Cancel that -- _I_ just released khmer, because I'm the release manager for v1.1! As part of an effort to find holes in our documentation, "surface" any problematic assumptions we're making, and generally increase the bus factor of the khmer project, we have been switching up the release manager for every 1.0+ release. The first release manager (for v1.0) was Michael Crusoe (@biocrusoe); the second (for v1.0.1) was (@luizirber); and I'm the third, for v1.1. Each time through, we make our release process docs more explicit and more correct. There's really no better way to do it; each RM comes with a different background and a different skillset, so by the time four or five people have cut a release, we should have ironed out any wrinkles. Roughly speaking, our release process consists of: 1. Building a release candidate 2. Testing the release candidate in multiple ways using virtualenvs on a Linux box. 3. Pushing to the test PyPI server and doing install tests in a virtualenv. 4. Doing multi-system tests on the BaTLab, and running our acceptance tests on Amazon Web Services. 5. If that all passes for a release candidate, cutting the final release! There are about a dozen steps in all, with 40-50 command line steps. It's a bit complicated but it's all codified in a somewhat cockamamie semi-automated copy-and-paste doc that actually works pretty well. For me, the big challenge was getting the BaTLab multi-platform install tests to run; this required about 30 minutes of handholding by Michael. Once they ran we discovered a few problems, the biggest of which was a breakage of Python 2.6 compatibility -- which we simply wouldn't have found otherwise. For the next release manager, the challenge will be to get through the acceptance tests; for v1.1, we added acceptance tests based on our metagenome protocol, so we now test about 90% of our software's command line interface on real data before releasing it. But since I'm the person in charge of the acceptance testing system, there are a few tricks and tips that I haven't completely codified, so I'll have to work on pushing those out the door. The main lesson I've learned from all of this is that you catch a lot more bugs with all this testing than you would any other way. It will hopefully result in a more pleasant user experience as people find fewer and fewer "dumb" bugs, and it certainly is giving us more confidence in our software processes' robustness as we contemplate some bigger changes to khmer down the line. Something that's also rather exciting is that we have three people who aren't part of the lab contributing code -- @chuckpr, @accaldwell, and @znruss. As our process becomes more explicit, I think we're attracting people who want to contribute but (6 months ago) wouldn't have known how. What's up next on the khmer timeline? We're participating in the July Mozilla Science Labs hackathon by offering a khmer contributathon so people can try out our workflow, and we're going to revamp the docs for that. We'll let you know how it goes. --titus June 14, 2014 Jake Vanderplas Frequentism and Bayesianism IV: How to be a Bayesian in Python I've been spending a lot of time recently writing about frequentism and Bayesianism. Here I want to back away from the philosophical debate and go back to more practical issues: in particular, demonstrating how you can apply these Bayesian ideas in Python. The workhorse of modern Bayesianism is the Markov Chain Monte Carlo (MCMC), a class of algorithms used to efficiently sample posterior distributions. Below I'll explore three mature Python packages for performing Bayesian analysis via MCMC: • emcee: the MCMC Hammer • pymc: Bayesian Statistical Modeling in Python • pystan: The Python Interface to Stan I won't be so much concerned with speed benchmarks between the three, as much as a comparison of their respective APIs. This post is not meant to be a tutorial in any of the three; each of them is well documented and the links above include introductory tutorials for that purpose. Rather, what I want to do here is a side-by-side comparison which will give a feel for how each package is used. I'll propose a single relatively non-trivial test problem, and show the implementation and results of this problem using all three packages. Hopefully by seeing the three approaches side-by-side, you can choose which package might be best suited for your particular application. Test Problem: Line of Best Fit¶ For our test problem, we'll do a three-parameter model which fits a straight line to data. The parameters will be the slope, the intercept, and the scatter about the line; the scatter in this case will be treated as a nuisance parameter. The Data¶ Let's define some data that we'll work with: In [1]: %matplotlib inline import matplotlib.pyplot as plt import numpy as np np.random.seed(42) theta_true = (25, 0.5) xdata = 100 * np.random.random(20) ydata = theta_true[0] + theta_true[1] * xdata # add scatter to points xdata = np.random.normal(xdata, 10) ydata = np.random.normal(ydata, 10)  In [2]: plt.plot(xdata, ydata, 'ok') plt.xlabel('x') plt.ylabel('y');  The data are clearly correlated, and we'll assume that we don't know the errors. Let's construct a linear model to fit this data. The Model: Slope, Intercept, & Unknown Scatter¶ Recall that Bayes' theorem gives $P(\theta~|~D) \propto P(D~|~\theta) P(\theta)$ Where $$D$$ represents the observed data, and $$\theta$$ represents the model. We'll assume a linear model for our data, parametrized by a slope $$\beta$$ and a y-intercept $$\alpha$$: $\hat{y}(x_i~|~\alpha,\beta) = \alpha + \beta x_i$ Assuming gaussian errors on the observed y values, the probability for any data point under this model is given by: $P(x_i, y_i~|~\alpha, \beta, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[\frac{-[y_i - \hat{y}(x_i~|~\alpha, \beta)]^2}{2\sigma^2}\right]$ where $$\sigma$$ here is an unknown measurement error, which we'll treat as a nuisance parameter Multiplying these for all $$i$$ gives the likelihood: $P(\{x_i\}, \{y_i\}~|~\alpha, \beta, \sigma) \propto (2\pi\sigma^2)^{-N/2} \exp\left[- \frac{1}{2\sigma^2} \sum_{i-1}^N [y_i - \hat{y}(x_i~|~\alpha,\beta)]^2\right]$ The Prior¶ Here we're going to be a bit more careful about the choice of prior than we've been in the previous posts. We could simply choose flat priors on $$\alpha$$, $$\beta$$, and $$\sigma$$, but we must keep in mind that flat priors are not always uninformative priors! A better choice is to follow Jeffreys and use symmetry and/or maximum entropy to choose maximally noninformative priors. This is the kind of exercise in choosing priors is one that frequentists often complain about as overly subjective, but the approach is well-founded and very well-supported from an information theoretic standpoint. Why might a flat prior a bad choice? This is perhaps easiest to see in the case of slope. Let's visualize this by plotting some lines with slopes evenly-spaced between 0 and 10: In [3]: fig, ax = plt.subplots(subplot_kw=dict(aspect='equal')) x = np.linspace(-1, 1) for slope in np.arange(0, 10, 0.1): plt.plot(x, slope * x, '-k') ax.axis([-1, 1, -1, 1], aspect='equal');  These lines have evenly-spaced slopes in units of 0.1, yet the higher slopes are bunched together. With a flat prior, you're essentially saying that any one of these slopes is just as likely as another. Due to this bunching seen above, it's clear that a flat prior on slope will highly favor very steep slopes! A flat prior on slope is not a minimally informative prior, and may end up biasing your result (though with enough data the effect is almost zero). You might imagine coming up with a better scheme by-hand (perhaps use a flat prior on the angle $$\theta$$ between the line and the x-axis) but we can be even more rigorous. The following problem has been well-explored in the Bayesian literature; the best resource I've found is a paper by Jaynes: Straight Line Fitting: A Bayesian Solution (pdf) Prior on Slope and Intercept¶ If our model is given by $y = \alpha + \beta x$ then we can construct a parameter-space probability element $$P(\alpha, \beta) ~d\alpha ~d\beta$$. Because $$x$$ and $$y$$ are symmetric, we could just as easily use another set of parameters $x = \alpha^\prime + \beta^\prime y$ with probability element $$Q(\alpha^\prime, \beta^\prime)d\alpha^\prime d\beta^\prime$$, where it's easy to show that $(\alpha^\prime,~\beta^\prime) = (- \beta^{-1}\alpha,~\beta^{-1}).$ From the Jacobian of the transformation, we can show that $Q(\alpha^\prime, \beta^\prime) = \beta^3 P(\alpha, \beta).$ Maintaining the symmetry of the problem requires that this change of variables should not affect the prior probability, so we can write: $\beta^3 P(\alpha, \beta) = P(- \beta^{-1}\alpha, \beta^{-1}).$ This is a functional equation which is satisfied by $P(\alpha, \beta) \propto (1 + \beta^2)^{-3/2}.$ which is equivalent to saying that $$\alpha$$ is uniformly distributed, and $$\beta$$ is distributed uniformly in $$\sin\theta$$ where $$\theta = \tan^{-1}\beta$$. This might surprise you that the slopes are distributed according to $$\sin\theta$$ rather than uniformly in $$\theta$$. This $$\sin\theta$$ term, though, can actually be thought of as coming from the intercept! If we change variables from $$\alpha$$ to $$\alpha_\perp = \alpha\cos\theta$$, then it's straightforward to show that our variables are uniformly distributed in $$(\alpha_\perp,~\theta)$$. We'll make use of this fact in the PyStan solution below. Prior on $$\sigma$$¶ Similarly, we want the prior on $$\sigma$$ to be invariant to rescalings of the problem (i.e. changing units). So our probability must satisfy $P(\sigma)d\sigma = P(\sigma / c)d\sigma / c.$ This is a functional equation satisfied by $P(\sigma) \propto 1 / \sigma.$ This is known as the Jeffreys Prior, after Harold Jeffreys. Putting our Priors Together¶ Putting these together, we see that symmetry arguments have led to the following minimally informative prior on our model: $P(\alpha, \beta, \sigma) \propto \frac{1}{\sigma}(1 + \beta^2)^{-3/2}$ As alluded to above, you could equivalently address this by using flat priors on transformed parameters, namely $$(\alpha, \beta, \sigma) \to (\alpha_\perp, \theta, \log\sigma)$$, but I personally think the symmetry/maximum entropy approach is cleaner and clearer – it also gives us a chance to demonstrate the definition of nontrivial custom priors within the three packages. Solving This Problem in Python¶ Now that we have the data, the likelihood, and the prior, let's show how to solve this problem in Python using emcee, PyMC, and PyStan. First, though, let's quickly define some convenience routines which will help us visualize the results: In [4]: # Create some convenience routines for plotting def compute_sigma_level(trace1, trace2, nbins=20): """From a set of traces, bin by number of standard deviations""" L, xbins, ybins = np.histogram2d(trace1, trace2, nbins) L[L == 0] = 1E-16 logL = np.log(L) shape = L.shape L = L.ravel() # obtain the indices to sort and unsort the flattened array i_sort = np.argsort(L)[::-1] i_unsort = np.argsort(i_sort) L_cumsum = L[i_sort].cumsum() L_cumsum /= L_cumsum[-1] xbins = 0.5 * (xbins[1:] + xbins[:-1]) ybins = 0.5 * (ybins[1:] + ybins[:-1]) return xbins, ybins, L_cumsum[i_unsort].reshape(shape) def plot_MCMC_trace(ax, xdata, ydata, trace, scatter=False, **kwargs): """Plot traces and contours""" xbins, ybins, sigma = compute_sigma_level(trace[0], trace[1]) ax.contour(xbins, ybins, sigma.T, levels=[0.683, 0.955], **kwargs) if scatter: ax.plot(trace[0], trace[1], ',k', alpha=0.1) ax.set_xlabel(r'$\alpha$') ax.set_ylabel(r'$\beta$') def plot_MCMC_model(ax, xdata, ydata, trace): """Plot the linear model and 2sigma contours""" ax.plot(xdata, ydata, 'ok') alpha, beta = trace[:2] xfit = np.linspace(-20, 120, 10) yfit = alpha[:, None] + beta[:, None] * xfit mu = yfit.mean(0) sig = 2 * yfit.std(0) ax.plot(xfit, mu, '-k') ax.fill_between(xfit, mu - sig, mu + sig, color='lightgray') ax.set_xlabel('x') ax.set_ylabel('y') def plot_MCMC_results(xdata, ydata, trace, colors='k'): """Plot both the trace and the model together""" fig, ax = plt.subplots(1, 2, figsize=(10, 4)) plot_MCMC_trace(ax[0], xdata, ydata, trace, True, colors=colors) plot_MCMC_model(ax[1], xdata, ydata, trace)  Without further ado, let's do some MCMC. Emcee¶ The emcee package (also known as MCMC Hammer, which is in the running for best Python package name in history) is a Pure Python package written by Astronomer Dan Foreman-Mackey. It is a lightweight package which implements a fairly sophisticated Affine-invariant Hamiltonian MCMC. Because the package is pure Python (i.e. it contains no compiled extensions) it is extremely easy to install; with pip, simply type at the command-line [~]$ pip install emcee

Emcee does not have much specific boilerplate code; it simply requires you to pass it a Python function which returns a value proportional to the log-posterior probability, and returns samples from that posterior. Here's how we solve the above problem with emcee:

In [5]:
import emcee
print(emcee.__version__)

2.0.0


In [6]:
# Define our posterior using Python functions
# for clarity, I've separated-out the prior and likelihood
# but this is not necessary. Note that emcee requires log-posterior

def log_prior(theta):
alpha, beta, sigma = theta
if sigma < 0:
return -np.inf  # log(0)
else:
return -1.5 * np.log(1 + beta ** 2) - np.log(sigma)

def log_likelihood(theta, x, y):
alpha, beta, sigma = theta
y_model = alpha + beta * x
return -0.5 * np.sum(np.log(2 * np.pi * sigma ** 2) + (y - y_model) ** 2 / sigma ** 2)

def log_posterior(theta, x, y):
return log_prior(theta) + log_likelihood(theta, x, y)

In [7]:
# Here we'll set up the computation. emcee combines multiple "walkers",
# each of which is its own MCMC chain. The number of trace results will
# be nwalkers * nsteps

ndim = 3  # number of parameters in the model
nwalkers = 50  # number of MCMC walkers
nburn = 1000  # "burn-in" period to let chains stabilize
nsteps = 2000  # number of MCMC steps to take

# set theta near the maximum likelihood, with
np.random.seed(0)
starting_guesses = np.random.random((nwalkers, ndim))

In [8]:
# Here's the function call where all the work happens:
# we'll time it using IPython's %time magic

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[xdata, ydata])
%time sampler.run_mcmc(starting_guesses, nsteps)
print("done")

CPU times: user 5.73 s, sys: 13 ms, total: 5.75 s
Wall time: 5.76 s
done


In [9]:
# sampler.chain is of shape (nwalkers, nsteps, ndim)
# we'll throw-out the burn-in points and reshape:
emcee_trace = sampler.chain[:, nburn:, :].reshape(-1, ndim).T
plot_MCMC_results(xdata, ydata, emcee_trace)


On the left we show the resulting traces marginalized over the nuisance parameter $$\sigma$$. On the right, we show the line of best-fit along with the 2-$$\sigma$$ uncertainty region. This is exactly the type of result we expect from MCMC: marginalized uncertainty contours around a model which provides a good by-eye fit to the data.

PyMC¶

The PyMC package has many more features than emcee, including built-in support for efficient sampling of common prior distributions. PyMC by default uses the classic Metropolis-Hastings sampler, one of the earliest MCMC algorithms. For performance, it uses compiled fortran libraries, so it is less trivial to install using tools like pip. My machine has a working fortran compiler, so pip install pymc worked without a problem (but from working with students, colleagues, and tutorial attendees, I can tell you that few scientists today have a system setup such that this will work out-of-the-box). For folks who don't have a fortran compiler installed, PyMC binaries for many systems can be quite easily installed with conda.

I should mention that the future PyMC version 3 removes fortran dependence and makes the installation much more streamlined; I've also been told that the API of PyMC 3 is much cleaner, and that performance is much better. Because PyMC 3 is still listed as an alpha release, I've decided to stick with the current supported release for this post:

In [10]:
import pymc
print(pymc.__version__)

2.3


In [11]:
# Define the variables needed for the routine, with their prior distributions
alpha = pymc.Uniform('alpha', -100, 100)

@pymc.stochastic(observed=False)
def beta(value=0):
return -1.5 * np.log(1 + value ** 2)

@pymc.stochastic(observed=False)
def sigma(value=1):
return -np.log(abs(value))

# Define the form of the model and likelihood
@pymc.deterministic
def y_model(x=xdata, alpha=alpha, beta=beta):
return alpha + beta * x

y = pymc.Normal('y', mu=y_model, tau=1. / sigma ** 2, observed=True, value=ydata)

# package the full model in a dictionary
model1 = dict(alpha=alpha, beta=beta, sigma=sigma,
y_model=y_model, y=y)

In [12]:
# run the basic MCMC: we'll do 100000 iterations to match emcee above
S = pymc.MCMC(model1)
S.sample(iter=100000, burn=50000)


[-                 3%                  ] 3506 of 100000 complete in 0.5 sec
[--                6%                  ] 6801 of 100000 complete in 1.0 sec
[---              10%                  ] 10108 of 100000 complete in 1.5 sec
[-----            13%                  ] 13510 of 100000 complete in 2.0 sec
[------           16%                  ] 16684 of 100000 complete in 2.5 sec
[-------          20%                  ] 20018 of 100000 complete in 3.0 sec
[--------         23%                  ] 23384 of 100000 complete in 3.5 sec
[----------       26%                  ] 26671 of 100000 complete in 4.0 sec
[-----------      30%                  ] 30000 of 100000 complete in 4.5 sec
[------------     33%                  ] 33252 of 100000 complete in 5.0 sec
[-------------    36%                  ] 36635 of 100000 complete in 5.5 sec
[---------------  40%                  ] 40046 of 100000 complete in 6.0 sec
[---------------- 43%                  ] 43430 of 100000 complete in 6.5 sec
[-----------------46%                  ] 46732 of 100000 complete in 7.0 sec
[-----------------50%                  ] 50059 of 100000 complete in 7.5 sec
[-----------------52%                  ] 52746 of 100000 complete in 8.0 sec
[-----------------55%-                 ] 55446 of 100000 complete in 8.5 sec
[-----------------58%--                ] 58070 of 100000 complete in 9.0 sec
[-----------------60%--                ] 60486 of 100000 complete in 9.5 sec
[-----------------63%---               ] 63079 of 100000 complete in 10.0 sec
[-----------------65%----              ] 65668 of 100000 complete in 10.5 sec
[-----------------68%-----             ] 68369 of 100000 complete in 11.0 sec
[-----------------71%------            ] 71008 of 100000 complete in 11.5 sec
[-----------------73%-------           ] 73672 of 100000 complete in 12.0 sec
[-----------------76%--------          ] 76262 of 100000 complete in 12.5 sec
[-----------------78%---------         ] 78660 of 100000 complete in 13.0 sec
[-----------------81%----------        ] 81231 of 100000 complete in 13.5 sec
[-----------------83%-----------       ] 83790 of 100000 complete in 14.0 sec
[-----------------86%------------      ] 86407 of 100000 complete in 14.5 sec
[-----------------89%-------------     ] 89101 of 100000 complete in 15.0 sec
[-----------------91%--------------    ] 91789 of 100000 complete in 15.5 sec
[-----------------94%---------------   ] 94391 of 100000 complete in 16.0 sec
[-----------------97%----------------  ] 97081 of 100000 complete in 16.5 sec
[-----------------99%----------------- ] 99649 of 100000 complete in 17.0 sec
[-----------------100%-----------------] 100000 of 100000 complete in 17.1 sec

In [13]:
# extract the traces and plot the results
pymc_trace = [S.trace('alpha')[:],
S.trace('beta')[:],
S.trace('sigma')[:]]

plot_MCMC_results(xdata, ydata, pymc_trace)


We get traces very similar to those provided by emcee.

PyStan¶

The PyStan project is the official Python wrapper of the Stan Probabilistic programming language, which is implemented in C++. It uses a No U-Turn Sampler, which is more sophisticated than classic Metropolis-Hastings or Gibbs sampling. As far as API goes, the important difference between PyStan as compared to emcee and PyMC is that it requires you to write and compile non-Python code within your Python script when defining your model.

Because PyStan depends on the Stan package, installation can be difficult. In particular, you need the full Stan library to be installed in order to use the python wrapper. If you have a well-set-up environment for compiling C/C++ code, this shouldn't be a problem – using pip install pystan downloaded, compiled, and installed everything necessary on my system.

From what I could find online, it doesn't appear that the Stan team provides pre-built binaries, including for conda. Because of this, if you don't have your system set up for C/C++ compilation, installation will probably be more difficult.

I'm using version 2.2 of PyStan:

In [14]:
import pystan
print(pystan.__version__)

2.2.0.1



There is one more hiccup with this package: for some reason, PyStan runs extremely slowly and crashes my computer when I run it in the IPython notebook itself, but works without issues when I run it as an executable Python file. I'm not sure why this is but it may have something to do with this issue. To get around this, I'll save the PyStan script as a standalone Python file and execute it from the command-line, writing the resulting traces to disk:

In [15]:
%%file pystan_example.py

import numpy as np
import pystan

#---------------------------------------------
# Generate data (same as used in the notebook)

np.random.seed(42)
theta_true = (25, 0.5)
xdata = 100 * np.random.random(20)
ydata = theta_true[0] + theta_true[1] * xdata

# add scatter to points
xdata = np.random.normal(xdata, 10)
ydata = np.random.normal(ydata, 10)

#----------------------------------------------
# Create the Stan model
#  this is done by defining a string of Stan code.

fit_code = """
data {
int<lower=0> N; // number of points
real x[N]; // x values
real y[N]; // y values
}

parameters {
real alpha_perp;
real<lower=-pi()/2, upper=pi()/2> theta;
real log_sigma;
}

transformed parameters {
real alpha;
real beta;
real sigma;
real ymodel[N];

alpha <- alpha_perp / cos(theta);
beta <- sin(theta);
sigma <- exp(log_sigma);
for (j in 1:N)
ymodel[j] <- alpha + beta * x[j];
}

model {
y ~ normal(ymodel, sigma);
}
"""

# perform the fit
fit_data = {'N': len(xdata), 'x': xdata, 'y': ydata}
fit = pystan.stan(model_code=fit_code, data=fit_data, iter=25000, chains=4)

# extract the traces
traces = fit.extract()
pystan_trace = [traces['alpha'], traces['beta'], traces['sigma']]

# save the traces with numpy
np.save("pystan_trace.npy", pystan_trace)

Overwriting pystan_example.py


In [16]:
# run the code we've created on the command-line
!python pystan_example.py

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_c1dba2ed7f485b674d7ce5eb738ffe05 NOW.
INFO:pystan:NOW ON CHAIN 0
INFO:pystan:NOW ON CHAIN 1
INFO:pystan:NOW ON CHAIN 2
INFO:pystan:NOW ON CHAIN 3

Iteration:     1 / 25000 [  0%]  (Warmup)
Iteration:     1 / 25000 [  0%]  (Warmup)
Iteration:     1 / 25000 [  0%]  (Warmup)
Informational Message: The current Metropolis proposal is about to be rejected becuase of the following issue:
Error in function stan::prob::normal_log(N4stan5agrad3varE): Scale parameter is 0:0, but must be > 0!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Iteration:     1 / 25000 [  0%]  (Warmup)
Iteration:  2500 / 25000 [ 10%]  (Warmup)
Iteration:  2500 / 25000 [ 10%]  (Warmup)
Iteration:  2500 / 25000 [ 10%]  (Warmup)
Iteration:  2500 / 25000 [ 10%]  (Warmup)
Iteration:  5000 / 25000 [ 20%]  (Warmup)
Iteration:  5000 / 25000 [ 20%]  (Warmup)
Iteration:  5000 / 25000 [ 20%]  (Warmup)
Iteration:  5000 / 25000 [ 20%]  (Warmup)
Iteration:  7500 / 25000 [ 30%]  (Warmup)
Iteration:  7500 / 25000 [ 30%]  (Warmup)
Iteration:  7500 / 25000 [ 30%]  (Warmup)
Iteration:  7500 / 25000 [ 30%]  (Warmup)
Iteration: 10000 / 25000 [ 40%]  (Warmup)
Iteration: 10000 / 25000 [ 40%]  (Warmup)
Iteration: 10000 / 25000 [ 40%]  (Warmup)
Iteration: 10000 / 25000 [ 40%]  (Warmup)
Iteration: 12500 / 25000 [ 50%]  (Warmup)
Iteration: 12500 / 25000 [ 50%]  (Warmup)
Iteration: 12500 / 25000 [ 50%]  (Warmup)
Iteration: 12500 / 25000 [ 50%]  (Warmup)
Iteration: 15000 / 25000 [ 60%]  (Sampling)
Iteration: 15000 / 25000 [ 60%]  (Sampling)
Iteration: 15000 / 25000 [ 60%]  (Sampling)
Iteration: 15000 / 25000 [ 60%]  (Sampling)
Iteration: 17500 / 25000 [ 70%]  (Sampling)
Iteration: 17500 / 25000 [ 70%]  (Sampling)
Iteration: 17500 / 25000 [ 70%]  (Sampling)
Iteration: 17500 / 25000 [ 70%]  (Sampling)
Iteration: 20000 / 25000 [ 80%]  (Sampling)
Iteration: 20000 / 25000 [ 80%]  (Sampling)
Iteration: 20000 / 25000 [ 80%]  (Sampling)
Iteration: 20000 / 25000 [ 80%]  (Sampling)
Iteration: 22500 / 25000 [ 90%]  (Sampling)
Iteration: 22500 / 25000 [ 90%]  (Sampling)
Iteration: 22500 / 25000 [ 90%]  (Sampling)
Iteration: 22500 / 25000 [ 90%]  (Sampling)
Iteration: 25000 / 25000 [100%]  (Sampling)
Elapsed Time: 0.664267 seconds (Warm-up)
0.724006 seconds (Sampling)
1.38827 seconds (Total)

Iteration: 25000 / 25000 [100%]  (Sampling)
Elapsed Time: 0.6626 seconds (Warm-up)
0.781297 seconds (Sampling)
1.4439 seconds (Total)

Iteration: 25000 / 25000 [100%]  (Sampling)
Elapsed Time: 0.669715 seconds (Warm-up)
0.799694 seconds (Sampling)
1.46941 seconds (Total)

Iteration: 25000 / 25000 [100%]  (Sampling)
Elapsed Time: 0.667423 seconds (Warm-up)
0.795613 seconds (Sampling)
1.46304 seconds (Total)



As we can see, the execution takes ~6 sec in serial to draw 100,000 samples. Additionally, on my machine, the compilation phase takes about ~20 seconds before the model is run.

In [17]:
# load the results from file; plot as above
plot_MCMC_results(xdata, ydata, pystan_trace)


Again, the results look similar to what we've seen above.

Summary: Comparing the Three Methods¶

As a more direct comparison of the results, let's quickly over-plot the contours derived by the three methods:

In [18]:
fig, ax = plt.subplots(figsize=(8, 8))
plot_MCMC_trace(ax, xdata, ydata, emcee_trace, True,
colors='blue', linewidths=2)
plot_MCMC_trace(ax, xdata, ydata, pymc_trace,
colors='red', linewidths=2)
plot_MCMC_trace(ax, xdata, ydata, pystan_trace,
colors='green', linewidths=2)
ax.legend(ax.collections[::2], ['emcee', 'pymc', 'pystan'], fontsize=16);


As we would expect, the results agree! This indicates that we've defined the models consistently in all three packages. Additionally, we see that the "true" values used to generate the distribution of points (25, 0.5) fall within the 1-$$\sigma$$ error ellipse.

Evaluating the Packages¶

So which package is the best? The answer to that will likely depend on what your problem is. Here's a quick table summarizing my own impressions of the three packages:

Complexity Execution time (100,000 samples; includes burn-in) Ease of Installation Learning Curve/Ease of Use Number of Features
emcee Very lightweight ~6 sec Pure python; installs easily with pip Straightforward & Pythonic not much built-in beyond basic MCMC sampling
pymc2 Lots of functionality & options ~17 sec Requires fortran compiler; binaries available via conda Pythonic, but lots of pymc-specific boilerplate Lots of built-in functionality in Python
pystan Large package; requires coding in Stan language ~20 sec compilation + ~6 sec computation Requires C compiler + Stan installation; no binaries available Not pure Python; must learn Stan model specification language Lots of built-in functionality in Stan-specific language

More verbosely:

emcee is extremely lightweight, and that gives it a lot of power. All you need to do is define your log-posterior (in Python) and emcee will sample from that distribution. Because it's pure-Python and does not have specially-defined objects for various common distributions (i.e. uniform, normal, etc.) I thought it might be slower than the others, but its performance on this problem is impressive. This is perhaps due to the more sophisticated default sampling scheme, so the benchmark may not be a fair comparison.

pymc is more full-featured, and once you get the hang of the decorator syntax for defining variables, distributions, and derived quantities, it is very easy to use. Its performance lagged the other two: the same query took several times longer, despite having optimized objects for sampling from various priors. From what I hear, though, the still-in-alpha PyMC version 3 – a complete rewrite of the package – blows PyMC version 2 out of the water.

pystan is the most difficult of the three to use, but that's because it's not really a Python package. Models are specified not in Python, but in a custom statistical expression language. This language is very powerful, but has a steep learning curve. The ~20sec compilation time for each model is annoying, but I suspect that as models get more complicated and sample sizes grow, this won't seem like such an issue. The fact that Stan is specifically designed for this type of operation, and the fact that all its models compile directly to byte code, makes it seem like a reasonable choice for large, specialized computations.

I wouldn't put much weight on the timings here; as mentioned above, the packages use very different default sampling schemes (Metropolis-Hastings for PyMC, no U-Turn sampling for PyStan, and affine-invariant ensemble sampling for emcee) so the times are not directly comparable. Each of these sampling schemes has its advantages and disadvantages, which you can read about in the links in above sections.

I hope that I've given a fair comparison of these packages here. Note that I'm not an expert or contributor to any of these packages: I've simply read the online documentation and adapted the tutorial examples to the specific problem I proposed here. If any power users or package developers are reading this and have feedback about how to better use or represent any of these packages, I'd love to hear your response! I'd say two things:

1. Please write a blog comment and let me know how to use your package more effectively!
2. Please realize that the average user likely uses your package as I did here: by reading the intro tutorials, and adapting those examples to their own problems. If that has led me to do something silly, it's likely that many of your users are making the same silly mistake!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view on nbviewer.

Paul Ivanov

starting my job search

I am starting to look for a job in the San Francisco Bay Area.

Since many recruiters ask for and presumably look at GitHub profiles, I decided to give mine a little facelift:

In case you aren't familiar, that banner was motivated by Joel Spolsky's Smart and Gets Things Done, which is a book about hiring good developers . So I decided to tweet it out, mentioning @spolsky and he favorited it!

Yesterday, I decided to tweet out an image that's at the top of my resume as a standalone tweet- mentioning Joel Spolsky again, and he liked it well enough to retweet it to his 90 thousand followers, so it's been getting plenty of love.

Perhaps unsurprisingly, the only person to contact me as a result of this so far is a reporter from Business Insider :

My editor would like to post it on our site as an example of a creative way to format a resume... I'm wondering if we can get your permission to do this?

So that's what prompted this post: I simply added my name and a Creative Commons Attribution Licence (CC-BY) to the two images, and then sent my permission along.

Outside of that, no prospective employers have gotten in touch. But like I always say: you can't win the lottery if you don't buy a ticket. And since I also enjoy mixing metaphors, I'll just keep on fishing!

Maheshakya Wijewardena

Performance evaluation of Approximate Nearest Neighbor search implementations - Part 2

This continues from the post Performance evaluation of Approximate Nearest Neighbor search implementations - Part 1. The evaluation about the memory consumption is already completed in that post. In this post, the next two aspects of the evaluation framework, precision and query speed will be discussed.

When measuring the performance of Approximate nearest neighbor search methods, expressing precision and the query speed independently is less useful since the entire idea of approximating is to obtain the desired precision within a better time period. In order to evaluate precision, an ANN implementation should be able to provide a multiple number of neighbors(rather than just the nearest neighbor). After obtaining all data points in the data set from the ANN method, first few entires in the that neighbors list(ten neighbors in my evaluation tests) are taken as the neighbors of ground truth of that particular ANN method. These set of data points are compared against neighbors retrieved when the number of queried neighbors is varied. Precision tests are adapted from the tests performed for ANNOY.

This precision measure eliminates some of our candidate ANN implementations because those are not capable of producing a multiple number of neighbors. Obtaining multiple neighbors is an essential requirement of for precision tests described above as well the general applications of nearest neighbor search. Therefore, for the precision tests, only ANNOY, FLANN, KGraph and LSH Forest are taken into consideration. All evaluation tests, LSH forest implementation and the plots can be found in this Github repository.

Before jumping into comparisons, I thought it is imperative to get a notion on the characteristics of the LSH forest implementation. Unlike other ANN implementations, LSH forest provides some user controlled parameters to tune the forest and queries in different scenarios.

For the entire forest:

• Number of trees
• Maximum hash length

For queries:

• c : A value which determines the number of candidates chosen into the neighbors set. This acts with the number of trees.
• A lower bound for the maximum depth of hashes considered.

In these precision tests, all the those factors but c are fixed to constant values as follows:

• Number of trees = 10
• Maximum hash length = 32
• Lower bound = 4

The same data set which has been prepared using singular value decomposition on movielens data is used in all of these tests. Following are the resulting graphs of the performance of LSH forest. Time is measured in seconds.

The next section of this post illustrates the performance comparisons among ANNOY, FLANN, LSH forest and KGraph. Precision vs. time graphs are used for this comparison. Comparing ANNOY, FLANN and LSHF in one place:

KGraph has a significantly higher precision rate than other ANN implementations.(Rather than approximating, it gives the actual nearest neighbors according the KGraph documentation )

One of the main considerations of these evaluations is the maintainability of the code. Therefore, any implementation which goes into scikit-learn should have reasonably less complex code. Both FLANN and KGraph use complex data structures and algorithms to achieve higher speeds. ANNOY has a reasonably passable precision-query speed combination with a less complex implementation. Our current implementation of LSH forest has been able to beat ANNOY in precision-query speed comparison.

Indexing speeds of ANN implementations

In addition to precision and query speed, a measure of indexing speed has a major importance. Therefore as a final step for this evaluation, I will provide you a description on indexing speed for the same data set used above for each ANN implementation.

• KGraph : 65.1286380291 seconds
• ANNOY (metric='Angular') : 47.5299789906 seconds
• FLANN : 0.314314842224 seconds
• LSHF : 3.73900985718 seconds

In my next post, I will discuss about the implementation of LSH forest in detail and how ANN methods will be implemented in scikit-learn.

Acronyms:

1. ANN : Approximate Nearest Neighbors
2. LSH : Locality Sensitive Hashing

June 13, 2014

Manoj Kumar

manojbits

Hi, The first part of the last few days of have been the most productive and the last part well not so much.

1. Fixing precompute for ElasticNetCV
The function argument precompute=”auto” was being unused, in ElasticNetCV as mentioned in my previous post. Setting precompute equals auto uses a gram variant of the input matrix which according to the documentation is np.dot(X.T, X) . This theoritically helps the descent algorithm to converge faster. (However at the time of writing, I do not know exactly how). Practically however, (and after testing using the line profiler) it seems to be a bit slower since the computation of the gram matrix takes quite a bit of time. So with the advice of ogrisel, I split it across three Pull Requests. All three are essentially easy to fix.

1. https://github.com/scikit-learn/scikit-learn/pull/3247 – This ensures that the Gram variant is used if precompute is said to True or (auto and if n_samples > n_features)
2. https://github.com/scikit-learn/scikit-learn/pull/3248 – Remove precompute from Multi Task models since it is being unused,
3. https://github.com/scikit-learn/scikit-learn/pull/3249 – This is a WIP, and changes default precompute from auto to False

2. Threading backend for Linear Models.
I have successfully changed the backend from muli-processing to threading after releasing the GIL for all four variants. After a final round of review it can be merged,
a] Simple coordinate descent
b] Sparse coordinate descent
c] Gram variant
There is a huge memory gain and speed is almost the same (if not slightly higher by this Pull Request). https://github.com/scikit-learn/scikit-learn/pull/3102

3. Logistic Regression CV
Reading someone else’s magically vectorised NumPy code isn’t an easy task and I somehow crawled my way through it (which explains the more productive first part).

I fixed a bug in the code to compute the Hessian when the intercept is true. I’ve also fixed sparse matrix support and added multiple tests to it and confirmted that the newton-cg and lbfgs solvers give the exact same result, The liblinear has a slight change due to the penalisation of the intercept.

However benchmarking gives ambiguous results. On standard datasets such as the newsgroup and digits data, almost always the lib-linear solver is the fastest. However in datasets using make_classification, lbfgs seems to be the faster solver.

Right now, my job is just to wait for comments from Alex and Olivier and make the necessary changes. I shall come up with a more detailed description on Log Reg CV next week.

(Week 3) GSoC 2014: Extending Neural Networks Module for Scikit learn

This week, with the help of many reviewers I completed a user friendly multi-layer perceptron algorithm in Python. While it is still a Pull Request , the algorithm can be downloaded by following these steps,

1) git clone https://github.com/scikit-learn/scikit-learn
2) cd scikit-learn/
3) git fetch origin refs/pull/3204/head:mlp
4) git checkout mlp

Creating an MLP classifier is easy. First, import the scikit-learn library; then, initialize an MLP classifier by executing these statements,

from sklearn.neural_network import MultilayerPerceptronClassifier
clf = MultilayerPerceptronClassifier()

If you'd like to have 3 hidden layers of sizes 150-100-50, create an MLP object using this statement,

clf = MultilayerPerceptronClassifier(n_hidden=[150, 100, 50])

Training and testing are done the same way as any learning algorithm in scikit-learn.

In addition, you can tune many parameters for your classifier. Some of the interesting ones are,

1) the algorithm parameter, which allows users to select the type of algorithm for optimizing the neural network weights, which is either stochastic gradient descent SGD and l-bfgs; and

2) the max_iter parameter, which allows users to set the number of maximum iterations the network can run.

After training mlp, you can view the minimum cost achieved by printing mlp.cost_. This gives an idea of how well the algorithm has trained.

The implementation passed high standard tests and achieved expected performance for the MNIST dataset. MLP with one hidden layer of 200 neurons, and 400 iterations achieved great results in the MNIST benchmark compared to other algorithms shown below,

Classification performance
=========================================================================
Classifier                         train-time                          test-time                       error-rate
-------------------------------------------------------------------------------------------------------------
Multi layer Perceptron         655.5s                            0.30s                            0.0169
nystroem_approx_svm         125.0s                            0.91s                            0.0239
ExtraTrees                             79.9s                           0.34s                            0.0272
fourier_approx_svm             148.9s                            0.60s                            0.0488
LogisticRegression                68.9s                            0.14s                            0.0799

For next week, I will implement extreme learning machine algorithms. These use the least-square solution approach for training neural networks, and therefore they are both quick and efficient with interesting applications.

June 12, 2014

Jake Vanderplas

Frequentism and Bayesianism III: Confidence, Credibility, and why Frequentism and Science do not Mix

In Douglas Adams' classic Hitchhiker's Guide to the Galaxy, hyper-intelligent pan-dimensional beings build a computer named Deep Thought in order to calculate "the Answer to the Ultimate Question of Life, the Universe, and Everything". After seven and a half million years spinning its hyper-dimensional gears, before an excited crowd, Deep Thought finally outputs the answer:

42

The disappointed technicians, who trained a lifetime for this moment, are stupefied. They probe Deep Though for more information, and after some back-and-forth, the computer responds: "once you do know what the question actually is, you'll know what the answer means."

An answer does you no good if you don't know the question.

I find this story be an apt metaphor for statistics as sometimes used in the scientific literature. When trying to estimate the value of an unknown parameter, the frequentist approach generally relies on a confidence interval (CI), while the Bayesian approach relies on a credible region (CR). While these concepts sound and look very similar, their subtle difference can be extremely important, as they answer essentially different questions.

Like the poor souls hoping for enlightenment in Douglas Adams' universe, scientists often turn the crank of frequentism hoping for useful answers, but in the process overlook the fact that in science, frequentism is generally answering the wrong question. This is far from simple philosophical navel-gazing: as I'll show, it can have real consequences for the conclusions we draw from observed data.

Confidence vs. Credibility¶

In part I of this series, we discussed the basic philosophical difference between frequentism and Bayesianism: frequentists consider probability a measure of the frequency of (perhaps hypothetical) repeated events; Bayesians consider probability as a measure of the degree of certainty about values. As a result of this, speaking broadly, frequentists consider model parameters to be fixed and data to be random, while Bayesians consider model parameters to be random and data to be fixed.

These philosophies fundamenally affect the way that each approach seeks bounds on the value of a model parameter. Because the differences here are subtle, I'll go right into a simple example to illustrate the difference between a frequentist confidence interval and a Bayesian credible region.

Example 1: The Mean of a Gaussian¶

Let's start by again examining an extremely simple problem; this is the same problem we saw in part I of this series: finding the mean of a Gaussian distribution. Previously we simply looked at the (frequentist) maximum likelihood and (Bayesian) maximum a posteriori estimates; here we'll extend this and look at confidence intervals and credibile regions.

Here is the problem: imagine you're observing a star that you assume has a constant brightness. Simplistically, we can think of this brightness as the number of photons reaching our telescope in one second. Any given measurement of this number will be subject to measurement errors: the source of those errors is not important right now, but let's assume the observations $$x_i$$ are drawn from a normal distribution about the true brightness value with a known standard deviation $$\sigma_x$$.

Given a series of measurements, what are the 95% (i.e. $$2\sigma$$) limits that we would place on the brightness of the star?

1. The Frequentist Approach¶

The frequentist approach to this problem is well-known, and is as follows:

For any set of $$N$$ values $$D = \{x_i\}_{i=1}^N$$, an unbiased estimate of the mean $$\mu$$ of the distribution is given by

$\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i$

The sampling distribution describes the observed frequency of the estimate of the mean; by the central limit theorem we can show that the sampling distribution is normal; i.e.

$f(\bar{x}~||~\mu) \propto \exp\left[\frac{-(\bar{x} - \mu)^2}{2\sigma_\mu^2}\right]$

where we've used the standard error of the mean,

$\sigma_\mu = \sigma_x / \sqrt{N}$

The central limit theorem tells us that this is a reasonable approximation for any generating distribution if $$N$$ is large; if our generating distribution happens to be Gaussian, it also holds for $$N$$ as small as 2.

Let's quickly check this empirically, by looking at $$10^6$$ samples of the mean of 5 numbers:

In [20]:
import numpy as np

N = 5
Nsamp = 10 ** 6
sigma_x = 2

np.random.seed(0)
x = np.random.normal(0, sigma_x, size=(Nsamp, N))
mu_samp = x.mean(1)
sig_samp = sigma_x * N ** -0.5

print("{0:.3f} should equal {1:.3f}".format(np.std(mu_samp), sig_samp))

0.894 should equal 0.894



It checks out: the standard deviation of the observed means is equal to $$\sigma_x N^{-1/2}$$, as expected.

From this normal sampling distribution, we can quickly write the 95% confidence interval by recalling that two standard deviations is roughly equivalent to 95% of the area under the curve. So our confidence interval is

$CI_{\mu} = \left(\bar{x} - 2\sigma_\mu,~\bar{x} + 2\sigma_\mu\right)$

Let's try this with a quick example: say we have three observations with an error (i.e. $$\sigma_x$$) of 10. What is our 95% confidence interval on the mean?

We'll generate our observations assuming a true value of 100:

In [21]:
true_B = 100
sigma_x = 10

np.random.seed(1)
D = np.random.normal(true_B, sigma_x, size=3)
print(D)

[ 116.24345364   93.88243586   94.71828248]



Next let's create a function which will compute the confidence interval:

In [22]:
from scipy.special import erfinv

def freq_CI_mu(D, sigma, frac=0.95):
"""Compute the confidence interval on the mean"""
# we'll compute Nsigma from the desired percentage
Nsigma = np.sqrt(2) * erfinv(frac)
mu = D.mean()
sigma_mu = sigma * D.size ** -0.5
return mu - Nsigma * sigma_mu, mu + Nsigma * sigma_mu

print("95% Confidence Interval: [{0:.0f}, {1:.0f}]".format(*freq_CI_mu(D, 10)))

95% Confidence Interval: [90, 113]



Note here that we've assumed $$\sigma_x$$ is a known quantity; this could also be estimated from the data along with $$\mu$$, but here we kept things simple for sake of example.

2. The Bayesian Approach¶

For the Bayesian approach, we start with Bayes' theorem:

$P(\mu~|~D) = \frac{P(D~|~\mu)P(\mu)}{P(D)}$

We'll use a flat prior on $$\mu$$ (i.e. $$P(\mu) \propto 1$$ over the region of interest) and use the likelihood

$P(D~|~\mu) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma_x^2}}\exp\left[\frac{(\mu - x_i)^2}{2\sigma_x^2}\right]$

Computing this product and manipulating the terms, it's straightforward to show that this gives

$P(\mu~|~D) \propto \exp\left[\frac{-(\mu - \bar{x})^2}{2\sigma_\mu^2}\right]$

which is recognizable as a normal distribution with mean $$\bar{x}$$ and standard deviation $$\sigma_\mu$$. That is, the Bayesian posterior on $$\mu$$ in this case is exactly equal to the frequentist sampling distribution for $$\mu$$.

From this posterior, we can compute the Bayesian credible region, which is the shortest interval that contains 95% of the probability. Here, it looks exactly like the frequentist confidence interval:

$CR_{\mu} = \left(\bar{x} - 2\sigma_\mu,~\bar{x} + 2\sigma_\mu\right)$

For completeness, we'll also create a function to compute the Bayesian credible region:

In [23]:
def bayes_CR_mu(D, sigma, frac=0.95):
"""Compute the credible region on the mean"""
Nsigma = np.sqrt(2) * erfinv(frac)
mu = D.mean()
sigma_mu = sigma * D.size ** -0.5
return mu - Nsigma * sigma_mu, mu + Nsigma * sigma_mu

print("95% Credible Region: [{0:.0f}, {1:.0f}]".format(*bayes_CR_mu(D, 10)))

95% Credible Region: [90, 113]



So What's the Difference?¶

The above derivation is one reason why the frequentist confidence interval and the Bayesian credible region are so often confused. In many simple problems, they correspond exactly. But we must be clear that even though the two are numerically equivalent, their interpretation is very different.

Recall that in Bayesianism, the probability distributions reflect our degree of belief. So when we computed the credible region above, it's equivalent to saying

"Given our observed data, there is a 95% probability that the true value of $$\mu$$ falls within $$CR_\mu$$" - Bayesians

In frequentism, on the other hand, $$\mu$$ is considered a fixed value and the data (and all quantities derived from the data, including the bounds of the confidence interval) are random variables. So the frequentist confidence interval is equivalent to saying

"There is a 95% probability that when I compute $$CI_\mu$$ from data of this sort, the true mean will fall within $$CI_\mu$$." - Frequentists

Note the difference: the Bayesian solution is a statement of probability about the parameter value given fixed bounds. The frequentist solution is a probability about the bounds given a fixed parameter value. This follows directly from the philosophical definitions of probability that the two approaches are based on.

The difference is subtle, but, as I'll discuss below, it has drastic consequences. First, let's further clarify these notions by running some simulations to confirm the interpretation.

Confirming the Bayesian Credible Region¶

To confirm what the Bayesian credible region is claiming, we must do the following:

1. sample random $$\mu$$ values from the prior
2. sample random sets of points given each $$\mu$$
3. select the sets of points which match our observed data
4. ask what fraction of these $$\mu$$ values are within the credible region we've constructed.

In code, that looks like this:

In [24]:
# first define some quantities that we need
Nsamples = 2E7
N = len(D)
sigma_x = 10

# if someone changes N, this could easily cause a memory error
if N * Nsamples > 1E8:
raise ValueError("Are you sure you want this many samples?")

# eps tells us how close to D we need to be to consider
# it a matching sample. The value encodes the tradeoff
# between bias and variance of our simulation
eps = 0.5

# Generate some mean values from the (flat) prior in a reasonable range
np.random.seed(0)
mu = 80 + 40 * np.random.random(Nsamples)

# Generate data for each of these mean values
x = np.random.normal(mu, sigma_x, (N, Nsamples)).T

# find data which matches matches our "observed" data
x.sort(1)
D.sort()
i = np.all(abs(x - D) < eps, 1)
print("number of suitable samples: {0}".format(i.sum()))

number of suitable samples: 528


In [25]:
# Now we ask how many of these mu values fall in our credible region
mu_good = mu[i]
CR = bayes_CR_mu(D, 10)
within_CR = (CR[0] < mu_good) & (mu_good < CR[1])
print "Fraction of means in Credible Region: {0:.3f}".format(within_CR.sum() * 1. / within_CR.size)

Fraction of means in Credible Region: 0.949



We see that, as predicted, roughly 95% of $$\mu$$ values with data matching ours lie in the Credible Region.

The important thing to note here is which of the variables is random, and which are fixed. In the Bayesian approach, we compute a single credible region from our observed data, and we consider it in terms of multiple random draws of $$\mu$$.

Confirming the frequentist Confidence Interval¶

Confirmation of the interpretation of the frequentist confidence interval is a bit less involved. We do the following:

1. draw sets of values from the distribution defined by the single true value of $$\mu$$.
2. for each set of values, compute a new confidence interval.
3. determine what fraction of these confidence intervals contain $$\mu$$.

In code, it looks like this:

In [26]:
# define some quantities we need
N = len(D)
Nsamples = 1E4
mu = 100
sigma_x = 10

# Draw datasets from the true distribution
np.random.seed(0)
x = np.random.normal(mu, sigma_x, (Nsamples, N))

# Compute a confidence interval from each dataset
CIs = np.array([freq_CI_mu(Di, sigma_x) for Di in x])

# find which confidence intervals contain the mean
contains_mu = (CIs[:, 0] < mu) & (mu < CIs[:, 1])
print "Fraction of Confidence Intervals containing the mean: {0:.3f}".format(contains_mu.sum() * 1. / contains_mu.size)

Fraction of Confidence Intervals containing the mean: 0.951



We see that, as predicted, 95% of the confidence intervals contain the true value of $$\mu$$.

Again, the important thing to note here is which of the variables is random. We use a single value of $$\mu$$, and consider it in relation to multiple confidence intervals constructed from multiple random data samples.

Discussion¶

We should remind ourselves again of the difference between the two types of constraints:

• The Bayesian approach fixes the credible region, and guarantees 95% of possible values of $$\mu$$ will fall within it.
• The frequentist approach fixes the parameter, and guarantees that 95% of possible confidence intervals will contain it.

Comparing the frequentist confirmation and the Bayesian confirmation above, we see that the distinctions which stem from the very definition of probability mentioned above:

• Bayesianism treats parameters (e.g. $$\mu$$) as random variables, while frequentism treats parameters as fixed.
• Bayesianism treats observed data (e.g. $$D$$) as fixed, while frequentism treats data as random variables.
• Bayesianism treats its parameter constraints (e.g. $$CR_\mu$$) as fixed, while frequentism treats its constraints (e.g. $$CI_\mu$$) as random variables.

In the above example, as in many simple problems, the confidence interval and the credibility region overlap exactly, so the distinction is not especially important. But scientific analysis is rarely this simple; next we'll consider an example in which the choice of approach makes a big difference.

Example 2: Jaynes' Truncated Exponential¶

For an example of a situation in which the frequentist confidence interval and the Bayesian credibility region do not overlap, I'm going to turn to an example given by E.T. Jaynes, a 20th century physicist who wrote extensively on statistical inference in Physics. In the fifth example of his Confidence Intervals vs. Bayesian Intervals (pdf), he considers a truncated exponential model. Here is the problem, in his words:

A device will operate without failure for a time $$\theta$$ because of a protective chemical inhibitor injected into it; but at time $$\theta$$ the supply of the chemical is exhausted, and failures then commence, following the exponential failure law. It is not feasible to observe the depletion of this inhibitor directly; one can observe only the resulting failures. From data on actual failure times, estimate the time $$\theta$$ of guaranteed safe operation...

Essentially, we have data $$D$$ drawn from the following model:

$p(x~|~\theta) = \left\{ \begin{array}{lll} \exp(\theta - x) &,& x > \theta\\ 0 &,& x < \theta \end{array} \right\}$

where $$p(x~|~\theta)$$ gives the probability of failure at time $$x$$, given an inhibitor which lasts for a time $$\theta$$. Given some observed data $$D = \{x_i\}$$, we want to estimate $$\theta$$.

Let's start by plotting this model for a particular value of $$\theta$$, so we can see what we're working with:

In [27]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def p(x, theta):
return (x > theta) * np.exp(theta - x)

x = np.linspace(5, 18, 1000)
plt.fill(x, p(x, 10), alpha=0.3)
plt.ylim(0, 1.2)
plt.xlabel('x')
plt.ylabel('p(x)');


Imagine now that we've observed some data, $$D = \{10, 12, 15\}$$, and we want to infer the value of $$\theta$$ from this data. We'll explore four approaches to this below.

1. Common Sense Approach¶

One general tip that I'd always recommend: in any problem, before computing anything, think about what you're computing and guess what a reasonable solution might be. We'll start with that here. Thinking about the problem, the hard cutoff in the probability distribution leads to one simple observation: $$\theta$$ must be smaller than the smallest observed value.

This is immediately obvious on examination: the probability of seeing a value less than $$\theta$$ is zero. Thus, a model with $$\theta$$ greater than any observed value is impossible, assuming our model specification is correct. Our fundamental assumption in both Bayesianism and frequentism is that the model is correct, so in this case, we can immediately write our common sense condition:

$\theta < \min(D)$

or, in the particular case of $$D = \{10, 12, 15\}$$,

$\theta < 10$

Any reasonable constraint on $$\theta$$ given this data should meet this criterion. With this in mind, let's go on to some quantitative approaches based on Frequentism and Bayesianism.

2. Frequentist approach #1: Sampling Distribution via the Normal Approximation¶

In the frequentist paradigm, we'd like to compute a confidence interval on the value of $$\theta$$. We can start by observing that the population mean is given by

$\begin{array}{ll} E(x) &= \int_0^\infty xp(x)dx\\ &= \theta + 1 \end{array}$

So, using the sample mean as the point estimate of $$E(x)$$, we have an unbiased estimator for $$\theta$$ given by

$\hat{\theta} = \frac{1}{N} \sum_{i=1}^N x_i - 1$

The exponential distribution has a standard deviation of 1, so in the limit of large $$N$$, we can use the standard error of the mean (as above) to show that the sampling distribution of $$\hat{\theta}$$ will approach normal with variance $$\sigma^2 = 1 / N$$. Given this, we can write our 95% (i.e. 2$$\sigma$$) confidence interval as

$CI_{\rm large~N} = \left(\hat{\theta} - 2 N^{-1/2},~\hat{\theta} + 2 N^{-1/2}\right)$

Let's write a function which will compute this, and evaluate it for our data:

In [28]:
from scipy.special import erfinv

def approx_CI(D, sig=0.95):
"""Approximate truncated exponential confidence interval"""
# use erfinv to convert percentage to number of sigma
Nsigma = np.sqrt(2) * erfinv(sig)
D = np.asarray(D)
N = D.size
theta_hat = np.mean(D) - 1
return [theta_hat - Nsigma / np.sqrt(N),
theta_hat + Nsigma / np.sqrt(N)]

In [29]:
D = [10, 12, 15]
print("approximate CI: ({0:.1f}, {1:.1f})".format(*approx_CI(D)))

approximate CI: (10.2, 12.5)



We immediately see an issue. By our simple common sense argument, we've determined that it is impossible for $$\theta$$ to be greater than 10, yet the entirety of the 95% confidence interval is above this range! Perhaps this issue is due to the small sample size: the above computation is based on a large-$$N$$ approximation, and we have a relatively paltry $$N = 3$$. Maybe this will be improved if we do the more computationally intensive exact approach. Let's try it:

3. Frequentist approach #2: Exact Sampling Distribution¶

Computing the confidence interval from the exact sampling distribution takes a bit more work.

For small $$N$$, the normal approximation will not apply, and we must instead compute the confidence integral from the actual sampling distribution, which is the distribution of the mean of $$N$$ variables each distributed according to $$p(\theta)$$. The sum of random variables is distributed according to the convolution of the distributions for individual variables, so we can exploit the convolution theorem and use the method of [characteristic functions](http://en.wikipedia.org/wiki/Characteristic_function_(probability_theory) to find the following sampling distribution for the sum of $$N$$ variables distributed according to our particular $$p(x~|~\theta)$$:

$f(\theta~|~D) \propto \left\{ \begin{array}{lll} z^{N - 1}\exp(-z) &,& z > 0\\ 0 &,& z < 0 \end{array} \right\} ;~ z = N(\hat{\theta} + 1 - \theta)$

To compute the 95% confidence interval, we can start by computing the cumulative distribution: we integrate $$f(\theta~|~D)$$ from $$0$$ to $$\theta$$. This is relatively painless if we make use of the expression for the incomplete gamma function:

$\Gamma(a, x) = \int_x^\infty t^{a - 1}e^{-t} dt$

which looks strikingly similar to our $$f(\theta)$$.

Using this to perform the integral, we find that the cumulative distribution is given by

$F(\theta~|~D) = \frac{1}{\Gamma(N)}\left[ \Gamma\left(N, \max[0, N(\hat{\theta} + 1 - \theta)]\right) - \Gamma\left(N,~N(\hat{\theta} + 1)\right)\right]$

A contiguous 95% confidence interval $$(\theta_1, \theta_2)$$ satisfies the following equation:

$F(\theta_2~|~D) - F(\theta_1~|~D) = 0.95$

There are in fact an infinite set of solutions to this; what we want is the shortest of these. We'll add the constraint that the probability density is equal at either side of the interval:

$f(\theta_2~|~D) = f(\theta_1~|~D)$

(Jaynes claims that this criterion ensures the shortest possible interval, but I'm not sure how to prove that). Solving this system of two nonlinear equations will give us the desired confidence interval. Let's compute this numerically:

In [30]:
from scipy.special import gammaincc
from scipy import optimize

def exact_CI(D, frac=0.95):
"""Exact truncated exponential confidence interval"""
D = np.asarray(D)
N = D.size
theta_hat = np.mean(D) - 1

def f(theta, D):
z = theta_hat + 1 - theta
return (z > 0) * z ** (N - 1) * np.exp(-N * z)

def F(theta, D):
return gammaincc(N, np.maximum(0, N * (theta_hat + 1 - theta))) - gammaincc(N, N * (theta_hat + 1))

def eqns(CI, D):
"""Equations which should be equal to zero"""
theta1, theta2 = CI
return (F(theta2, D) - F(theta1, D) - frac,
f(theta2, D) - f(theta1, D))

guess = approx_CI(D, 0.68) # use 1-sigma interval as a guess
result = optimize.root(eqns, guess, args=(D,))
if not result.success:
print "warning: CI result did not converge!"
return result.x


As a sanity check, let's make sure that the exact and approximate confidence intervals match for a large number of points:

In [31]:
np.random.seed(0)
Dlarge = 10 + np.random.random(500)
print "approx: ({0:.3f}, {1:.3f})".format(*approx_CI(Dlarge))
print "exact: ({0:.3f}, {1:.3f})".format(*exact_CI(Dlarge))

approx: (9.409, 9.584)
exact: (9.408, 9.584)



As expected, the approximate solution is very close to the exact solution for large $$N$$, which gives us confidence that we're computing the right thing.

Let's return to our 3-point dataset and see the results:

In [32]:
print("approximate CI: ({0:.1f}, {1:.1f})".format(*approx_CI(D)))
print("exact CI:       ({0:.1f}, {1:.1f})".format(*exact_CI(D)))

approximate CI: (10.2, 12.5)
exact CI:       (10.2, 12.2)



The exact confidence interval is slightly different than the approximate one, but still reflects the same problem: we know from common-sense reasoning that $$\theta$$ can't be greater than 10, yet the 95% confidence interval is entirely in this forbidden region! The confidence interval seems to be giving us unreliable results.

We'll discuss this in more depth further below, but first let's see if Bayes can do better.

4. Bayesian Credibility Interval¶

For the Bayesian solution, we start by writing Bayes' rule:

$p(\theta~|~D) = \frac{p(D~|~\theta)p(\theta)}{P(D)}$

Using a constant prior $$p(\theta)$$, and with the likelihood

$p(D~|~\theta) = \prod_{i=1}^N p(x~|~\theta)$

we find

$p(\theta~|~D) \propto \left\{ \begin{array}{lll} N\exp\left[N(\theta - \min(D))\right] &,& \theta < \min(D)\\ 0 &,& \theta > \min(D) \end{array} \right\}$

where $$\min(D)$$ is the smallest value in the data $$D$$, which enters because of the truncation of $$p(x~|~\theta)$$. Because $$p(\theta~|~D)$$ increases exponentially up to the cutoff, the shortest 95% credibility interval $$(\theta_1, \theta_2)$$ will be given by

$\theta_2 = \min(D)$

and $$\theta_1$$ given by the solution to the equation

$\int_{\theta_1}^{\theta_2} N\exp[N(\theta - \theta_2)]d\theta = f$

this can be solved analytically by evaluating the integral, which gives

$\theta_1 = \theta_2 + \frac{\log(1 - f)}{N}$

Let's write a function which computes this:

In [33]:
def bayes_CR(D, frac=0.95):
"""Bayesian Credibility Region"""
D = np.asarray(D)
N = float(D.size)
theta2 = D.min()
theta1 = theta2 + np.log(1. - frac) / N
return theta1, theta2


Now that we have this Bayesian method, we can compare the results of the four methods:

In [34]:
print("common sense:         theta < {0:.1f}".format(np.min(D)))
print("frequentism (approx): 95% CI = ({0:.1f}, {1:.1f})".format(*approx_CI(D)))
print("frequentism (exact):  95% CI = ({0:.1f}, {1:.1f})".format(*exact_CI(D)))
print("Bayesian:             95% CR = ({0:.1f}, {1:.1f})".format(*bayes_CR(D)))

common sense:         theta < 10.0
frequentism (approx): 95% CI = (10.2, 12.5)
frequentism (exact):  95% CI = (10.2, 12.2)
Bayesian:             95% CR = (9.0, 10.0)



What we find is that the Bayesian result agrees with our common sense, while the frequentist approach does not. The problem is that frequentism is answering the wrong question. I'll discuss this more below, but first let's do some simulations to make sure the CI and CR in this case are correct.

Numerical Confirmation¶

To try to quell any doubts about the math here, I want to repeat the exercise we did above and show that the confidence interval derived above is, in fact, correct. We'll use the same approach as before, assuming a "true" value for $$\theta$$ and sampling data from the associated distribution:

In [35]:
from scipy.stats import expon

Nsamples = 1000
N = 3
theta = 10

np.random.seed(42)
data = expon(theta).rvs((Nsamples, N))
CIs = np.array([exact_CI(Di) for Di in data])

# find which confidence intervals contain the mean
contains_theta = (CIs[:, 0] < theta) & (theta < CIs[:, 1])
print "Fraction of Confidence Intervals containing theta: {0:.3f}".format(contains_theta.sum() * 1. / contains_theta.size)

Fraction of Confidence Intervals containing theta: 0.953



As is promised by frequentism, 95% of the computed confidence intervals contain the true value. The procedure we used to compute the confidence intervals is, in fact, correct: our data just happened to be among the 5% where the method breaks down. But here's the thing: we know from the data themselves that we are in the 5% where the CI fails. The fact that the standard frequentist confidence interval ignores this common-sense information should give you pause about blind reliance on the confidence interval for any nontrivial problem.

For good measure, let's check that the Bayesian credible region also passes its test:

In [38]:
np.random.seed(42)
N = 1E7
eps = 0.1

theta = 9 + 2 * np.random.random(N)
data = (theta + expon().rvs((3, N))).T
data.sort(1)
D.sort()
i_good = np.all(abs(data - D) < eps, 1)

print("Number of good samples: {0}".format(i_good.sum()))

Number of good samples: 65


In [39]:
theta_good = theta[i_good]
theta1, theta2 = bayes_CR(D)

within_CR = (theta1 < theta_good) & (theta_good < theta2)
print("Fraction of thetas in Credible Region: {0:.3f}".format(within_CR.sum() * 1. / within_CR.size))

Fraction of thetas in Credible Region: 0.954



Again, we have confirmed that, as promised, ~95% of the suitable values of $$\theta$$ fall in the credible region we computed from our single observed sample.

Frequentism Answers the Wrong Question¶

We've shown that the frequentist approach in the second example is technically correct, but it disagrees with our common sense. What are we to take from this?

Here's the crux of the problem: The frequentist confidence interval, while giving the correct answer, is usually answering the wrong question. And this wrong-question approach is the result of a probability definition which is fundamental to the frequentist paradigm!

Recall the statements about confidence intervals and credible regions that I made above. From the Bayesians:

"Given our observed data, there is a 95% probability that the true value of $$\theta$$ falls within the credible region" - Bayesians

And from the frequentists:

"There is a 95% probability that when I compute a confidence interval from data of this sort, the true value of $$\theta$$ will fall within it." - Frequentists

Now think about what this means. Suppose you've measured three failure times of your device, and you want to estimate $$\theta$$. I would assert that "data of this sort" is not your primary concern: you should be concerned with what you can learn from those particular three observations, not the entire hypothetical space of observations like them. As we saw above, if you follow the frequentists in considering "data of this sort", you are in danger at arriving at an answer that tells you nothing meaningful about the particular data you have measured.

Suppose you attempt to change the question and ask what the frequentist confidence interval can tell you given the particular data that you've observed. Here's what it has to say:

"Given this observed data, the true value of $$\theta$$ is either in our confidence interval or it isn't" - Frequentists

That's all the confidence interval means – and all it can mean! – for this particular data that you have observed. Really. I'm not making this up. You might notice that this is simply a tautology, and can be put more succinctly:

"Given this observed data, I can put no constraint on the value of $$\theta$$" - Frequentists

If you're interested in what your particular, observed data are telling you, frequentism is useless.

Hold on... isn't that a bit harsh?¶

This might be a harsh conclusion for some to swallow, but I want to emphasize that it is not simply a matter of opinion or idealogy; it's an undeniable fact based on the very philosophical stance underlying frequentism and the very definition of the confidence interval. If what you're interested in are conclusions drawn from the particular data you observed, frequentism's standard answers (i.e. the confidence interval and the closely-related $$p$$-values) are entirely useless.

Unfortunately, most people using frequentist principles in practice don't seem to realize this. I'd point out specific examples from the astronomical literature, but I'm not in the business of embarassing people directly (and they're easy enough to find now that you know what to look for). Many scientists operate as if the confidence interval is a Bayesian credible region, but it demonstrably is not. This oversight can perhaps be forgiven for the statistical layperson, as even trained statisticians will often mistake the interpretation of the confidence interval.

I think the reason this mistake is so common is that in many simple cases (as I showed in the first example above) the confidence interval and the credible region happen to coincide. Frequentism, in this case, correctly answers the question you ask, but only because of the happy accident that Bayesianism gives the same result for that problem.

Now, I should point out that I am certainly not the first person to state things this way, or even this strongly. The Physicist E.T. Jaynes was known as an ardent defender of Bayesianism in science; one of my primary inspirations for this post was his 1976 paper, Confidence Intervals vs. Bayesian Intervals (pdf). More recently, statistician and blogger W.M. Briggs posted a diatribe on arXiv called It's Time To Stop Teaching Frequentism to Non-Statisticians which brings up this same point. It's in the same vein of argument that Savage, Cornfield, and other outspoken 20th-century Bayesian practitioners made throughout their writings, talks, and correspondance.

So should you ever use confidence intervals at all? Perhaps in situations (such as analyzing gambling odds) where multiple data realizations are the reality, frequentism makes sense. But in most scientific applications where you're concerned with what one particular observed set of data is telling you, frequentism simply answers the wrong question.

The Moral of the Story: Frequentism and Science Do Not Mix¶

The moral of the story is that frequentism and Science do not mix. Let me say it directly: you should be suspicious of the use of frequentist confidence intervals and p-values in science. In a scientific setting, confidence intervals, and closely-related p-values, provide the correct answer to the wrong question. In particular, if you ever find someone stating or implying that a 95% confidence interval is 95% certain to contain a parameter of interest, do not trust their interpretation or their results. If you happen to be peer-reviewing the paper, reject it. Their data do not back-up their conclusion.

If you have made it this far, I thank you for reading! I hope that this exploration succeeded in clarifying that the philosophical assumptions of frequentism and Bayesianism lead to real, practical consequences in the analysis and interpretation of data. If this post leads just one researcher to stop using frequentist confidence intervals in their scientific research, I will consider it a success.

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view on nbviewer. Many thanks to David W. Hogg for some helpful critiques on an early version of this post.

Philip Herron

redbrain812

Assembly programming is a black hole for new programmers these days, its almost impossible to find really easy to follow tutorial on the subject but some people in work were talking about it and i felt i could shed some light on it.

Consider a really simple program in C

int main (void)
{
return 2;
}


You can compile this:

$gcc -g -O2 -Wall test.c -o test  And whoop we compiled a very useless program but when we run it we get: $ ./test; echo $? 2  So we done something! But how does all of that work i mean how was this able to return a value to the shell. We could write this in assembly (i386/x86) and i can show you what happens here to return this value its call the c-decl ABI (http://en.wikipedia.org/wiki/X86_calling_conventions) .code32 .global main main: mov$2, %eax
ret


To compile this you simply:

$as fib2.s -o fib2.o$ gcc -o test fib2.o
$./test; echo$?
2


So what does this mean, well i am sure it looks some what clear whats going on, we declared a global symbol main and its code simply puts the value 2 into the register eax and returns. The ret opcode is a fairly useful one, it is responsible for restoring the instruction pointer for the program back to the caller. You can do all of this manually by messing with the instruction pointer yourself but thats for another day see (http://pdos.csail.mit.edu/6.893/2009/readings/i386/RET.htm)

In the next post i will write how we can call into libc functions such as printf for a hello world from assembly in i386. This will introduce the stack.

redbrain812

Hey all reading this, its a fairly personal post and probably controversial for me to make.

About a year ago i was working for a great wee company WANdisco i had such a great set of people to work with, we were all so close and work was such a joy to go to. Exciting and vibrant when i was working there for bit over a year and i started to yearn for a more ‘low-level’ challenge such as a low-latency C/C++ application. I felt i had so much to learn in this area and its where i wanted my career to go towards.

So i joined where i am now (not going to write to avoid google picking it up) and i began to realise the pain of legacy systems programming and in my opinion i found it a challenge to some extent but it was never the code base i had issues with i mean with my experience in open source and GCC which is older and bigger than most legacy code-bases you will ever find. The issues i had were more high-level, we were never given freedom to be pro-active about maintaining code (fixing things before they become an issue). This lead to much time being very idle and frustrated at being quite literately un-challenged. And this was seen to be ‘busy’ with everyone embellishing their job and work to sound better to hopefully eventually gain some kind of promotion or some kind of management role.

And though this job i found such frustration i became very depressed to the point i though almost that Software Engineering isn’t a career path that i would want to pursue any longer….

But now its all fixed and for the first time in probably 2 years i feel settled, i mean i went to lunch and everything seemed brighter and better than i have felt in so long and i have been smiling all day. I never realised how depressed i had become, its so dangerous when i bottled things up so much that it would personify as physical illness such as bad colds and coughs on and off. I would bottle up so much because my Significant Other i really didn’t want for her to have to deal with my problems i love her so much and i just want to make her happy but my career was bringing us both down.

But you know what i finally have a new job where i will be starting in about a month (will reveal closer to the time). And you know what the people were so nice to me and challenged me and made me work in the interview i felt at peace again. Like i knew what i was doing again being proactive challenging myself and had the desire to push myself. And more importantly understood why i participate in open-source, at this job very very frequently you will be forced to deal with this question ‘sure, why would you do that? Were you paid? No? What…?’.

Its probably important to note, their reasons were i guess to some extent legitimate such as project approvals from such a large corporate entity and product stability. But regardless of this i would still argue that failure to fix and update is kind of like buying a house that really needs the plumbing and electrics re-done, sure you can live in it and use it but its not that pleasant and has a very high likely hood of costing you a lot in the long run.

Though this year i have talked to many many people on-line and off-line; strangers who become friends on a bus from different walks of life and different careers. And the universal notion we all have had is that, career will only take you so far because in the end it is we who have to live our lives and we can’t let circumstances get in the way of enjoying life.

My depression at this job was so great i believe if it went on for another year or 2 it probably was getting close to costing me my relationship with my Significant Other and maybe even my life. I probably shouldn’t go into the nitty details of what it was like here but if your in a similar position i would love to hear from you via email or if your local to Northern Ireland i would like to share a coffee with you.

Take care of your mental health i mean seriously the stigma it has in some countries still scares me such as Northern Ireland. Your likely to hear people equate Mental Health with the ‘luney bin’.

http://www.nidirect.gov.uk/mental-health

William Stein

I've added task list functionality to SageMathCloud (SMC), so you can keep track of a list of things to do related to a project, paper, etc. Task lists are saved as files on the filesystem, so they can be backed up in the usual way, automatically generated, etc. I doubt anybody is going to use SMC just for the tasks lists, but for people already using SMC in order to use Sage, write LaTeX documents, use IPython notebooks, etc., having a convenient integrated task list should come in handy.
To create a task list, in a project, click "+New", name the task list, then click the "Task List" button.  Here's what a task list looks like:

The Design

I used the task list quite a bit when implementing the task list, and significantly modified the interface many, many times. I also tried out numerous other todo list programs for inspiration. I eventually settled on the following key design choices, some of which are different than anything I've seen. In particular, the design targets highly technical users, which is not something I saw with any other todo list programs I tried.
• Markdown editor: The task description is formatted using client-side rendered Github flavored markdown (using marked), including [ ] for checkboxes. I also include full MathJax support, and spent quite a while working around various subtleties of mixing mathjax and markdown. I'll be rewriting Sage worksheets to use this code. The editor itself is Codemirror 4 in markdown mode, so it respects whatever theme you choose, has nice features like multiple cursors, paren matching, vim/emacs modes, etc. Multiple people can edit the same task at once and everybody will see the changes as they are made (note: I haven't implemented showing other cursors.)
• Relative dates and times: All dates and times are shown relative to right now. If something is due in 20 hours, it says "due about 20 hours from now". I also included a sortable column with the last time when a task was edited, also shown relative to the current time. This display uses the timeago jquery plugin. You can of course click on the due date to see the actual date.
• Hashtags: As I was implementing (and removing) features such as priorities, a way to indicate which task you are currently working on, folders, etc, I decided that hashtags can provide every feature. Moreover, they are very text editor/programmer friendly. When editing a task, if you put #foo, then it is interpreted as a hashtag, which you can click on to show only tasks containing that tag. To disambiguate with markdown headings, to make a heading you have to include a whitespace, so # foo. I haven't added autocomplete for hashtags yet, but will. You can easily click on hashtags anywhere to select them, or use the bar at the top.
• User-specified task order: The default sort order for tasks is custom. There is a drag handle so you can explicitly move tasks up and down in order to indicate how important they are to you (or whatever else). You can also click an up hand or down hand to knock the currently selected task to the bottom of the list of displayed tasks.
Of course, I still have an enormous list of functionality I would like to add, but that will have to wait. For example, I need to enable a chat associated to each task list, like the chats associated to Sage worksheets and other files. I also want to make it so one can select a range of tasks and copy them, move them, paste them into another list, etc. It would also be nice to be able to export task lists to HTML, PDF, etc., which should be fairly easy using pandoc.  I'm also considering making a note list, which is like a task list but without the due date or "done" box.  Because of all the infrastructure already there, it would be easy to add code evaluation functionality, thus creating something like worksheets, but from a different point of view (with maybe hashtags determining the Python process).

Databases and Differential Synchronization

One interesting thing I noticed when implementing task lists is that there are many similarities with the original sagenb.org design (and also IPython notebook), in which a worksheet is a list of cells that get evaluated, can be manually reordered, etc. Similarly, a task list is a list of "cells" that you edit, manually reorder, etc. With sagenb we had longstanding issues involving the order of each cell and assigning an integer numerical id (0, 1, 2, ...) to the cells, which resulted in things like cells seemingly getting randomly reordered, etc. Also, having multiple simultaneous viewers with automatic synchronization is very difficult with that model. For task lists, I've introduced some new models to address these issues.

A natural way to store a task list is in a database, and I initially spent some time coming up with a good database schema and implementing basic lists using Cassandra for the backend. However, I couldn't pursue this approach further, since I was concerned about how to implement realtime synchronization, and also about the lack of easily backing up complete task lists via snapshots, in git, etc. So instead I created an "object database" API built on top of a file that is synchronized across clients (and the server) using differential synchronization. The underlying file format for the database is straightforward -- there is one line in JSON format for each record in the database. When objects are changed, the file is suitably modified, synchronized to other clients, and events are triggered.

Since differential synchronization has no trouble with files that have "a few thousand lines", this approach works well for our purposes (since personal or shared todo lists are typically fairly short). Also, having one line per record is at least more git friendly than something like a sqlite database. I'm considering rewriting my implementation of IPython notebook sync on top of this abstraction.
Since I view the task list as a database, each task has a globally unique uuid. Also, instead of viewing the task order as being defined by an integer 0,1,2,3, which leads to all manner of bugs and programming misery, race conditions, etc., instead we view the order as being determined by floating point positions. So to insert a task between tasks with positions 1 and 2, we just give the task position 1.5.

June 11, 2014

Titus Brown

Thoughts on open science -- my response to Eli Kintisch

Eli Kintisch (@elikint) just wrote a very nice article on "Sharing in Science" for Science Careers; his article contained quotes from my MSU colleague Ian Dworkin as well as from me.

When Eli sent me an e-mail with some questions about open science, I responded at some length (hey, I type fast and have lots of opinions!). Having been through this before, I knew that he would use maybe four sentences at most, but that seemed like a waste! So I told him I'd like to post the entire e-mail response once his article came out. Here it is! I've edited a few things to make them clearer or 'cause I felt like it.

1. How has sharing code, data, R methods helped scientific research in one or two specific examples?

One personal story -- we have a technical approach (called digital normalization) that we have yet to publish. We posted a paper draft on a preprint site and made the software available almost 3 years ago, however. The approach makes certain kinds of difficult sequence analyses much easier, and makes some other previously impossible analyses now possible. (We've published some of these results -- see www.pnas.org/content/early/2014/03/13/1402564111.abstract -- just not the approach itself.)

Since we made the approach available, hundreds or thousands of people have used it. A derivative approach is now part of two major software packages (the Trinity and Mira assemblers), it's been used in about 15 publications, and is soon to appear in dozens more, and it's basically been quite a success. All prepub.

It's hard to cite another clear example in my area of expertise, in part because in genomics data and software tend to be freely available prepub, so it's more the default. There are big wins like the human genome project and the ENCODE project, where the data (and code, in the case of ENCODE) were made freely and openly available, and this has obviously accelerated science.

Hmm.

This is a good article to the state of things in my field (the top bit, at any rate):

http://massgenomics.org/2013/06/data-sharing-embargo.html

1. How has data sharing impacted your career? Do you think it could help you get tenure? Some say that faculty comittees do not consider this.

My career has developed in large part because I've been open about everything. In addition to making my software methods open, and posting my papers, and making all my educational materials available, I blog and tweet about everything. I'm pretty well known largely because of this.

So I would make three points.

One is, going forward, data (and code) sharing will be an expectation. Program managers and funding organizations are eager to see their funding put to maximum use, and this includes making as much data and code available as possible.

Second -- because of this, it will soon be the norm for new faculty. You will no longer stand out if you do it (like I do now) but if you don't do it (because then you will no longer get published or receive grants). It may take 5-10 years for this to be the new norm but I think it's inevitable. Of course, many details need to be worked out, but the incentives for doing so are going to be there.

Third -- everyone knows what tenure and promotion committees care about. It's funding and reputation (which usually means papers and invitations to speak). These can be beneficiaries of openness, and in fact I would argue that my success so far is a direct example: I have received grants where the reviewers cited my openness, and many of my invitations to go speak (i.e. reputation builders) are from my online presence. So I don't expect what committees care about to change, but I do expect the paths to those reputational results to expand in number and variety to include openness as one of the core mechanisms.

1. What barriers exist in your mind to more people sharing their work in this way?

Culture and training are the two main ones. The culture discourages people from doing it, because it's new and edgy and people don't know how to think about it. The training is a block because even people who are ideologically disposed towards openness (which is, in fact, most scientists, I believe) don't know where to get started.

1. Is R Open Science important and if so why?

I'm the wrong person to ask :). rOpenSci are doing things right, as far as I can tell, but I don't actually use R myself so I've never used their software. So from my perspective their main utility is in showing that their approach isn't fatal. Which is awesome

1. Any practical tips to scientists you'd share on making sharing easier?

Just ... get started! Post figures and posters to figshare. Make your published data openly available and easy to use, i.e. in the format that you would have wished for it to be delivered in for your paper. Make your methods explicit. Post preprints at bioRxiv if the journal is ok with it. If there's a technical barrier that prevents you from doing something, note it and move on, but don't let it stop you from doing something else.

--titus

June 10, 2014

Hamzeh Alsalhi

Sparsely Formated Target Data

There are different ways to represent target data, this week I worked on a system that converts the different formats to an easy to use matrix format. The pull request is here. My work on this system introduced support to have this data matrix optionally be represented sparsely. The final result when the pull request is completed will be support for sparse target data ready to be used by the up and coming sparse One vs. Rest classifiers.

The function of this data converter is to take multiclass or multilabel target data and represent it in a binary fashion so classifiers that work on binary data can be used with no modification. For example, target data might come from the user like this. With integer class: 1 for Car, 2-Airplane, 3-Boat, 4-Helicopter. We label each of the following 5 images with the appropriate class.
 2                        1                       3                     2                     4

This data in a list of list format would look like this, we list each images labels one after the other:

Y = [2,1,3,2,4]

This Label binarizer will give a matrix where each column is an indicator for the class and each row is an image/example.

[0,1,0,0] [1,0,0,0] Y =  [0,0,1,0]        [0,1,0,0] [0,0,0,1]

Before my pull request all conversions from label binarizer would give the above matrix in dense format as it appears. My pull request has made it so that the user can specify if they would like the matrix to be returned in sparse format, if so the matrix will be a sparse matrix and has the potential to save a lot of space and runtime depending on how sparse the target data is.

These two calls to the label binarizer illustrate how sparse output can be enabled, the first call will print a dense matrix the second call will return a sparse matrix.

Input:
Y_bin = label_binarize(y,classes=[1,2,3,4])print(type(Y_bin))print(Y_bin)
Output:
<type 'numpy.ndarray'>[[0 1 0 0] [1 0 0 0] [0 0 1 0] [0 1 0 0] [0 0 0 1]]

Input:
Y_bin = label_binarize(y,classes=[1,2,3,4],sparse_output=True)print(type(Y_bin))print(Y_bin)
Output:
<class 'scipy.sparse.csr.csr_matrix'>  (0, 1)    1  (1, 0)    1  (2, 2)    1  (3, 1)    1  (4, 3)    1

The next pull request for sparse One vs. Rest support is what motivated this update because we want to overcome runtime constraints on datasets with large amounts of labels causing extreme runtime and space requirements.

Thank you to the reviewers Arnaud, Joel, and Oliver for their comments this week and to Rohit for starting the code which I based my changes off of.

Jake Vanderplas

Is Seattle Really Seeing an Uptick In Cycling?

Update: as I was putting the finishing touches on this notebook, I noticed this post, the first in a series on Seattle Bike Blog which analyzes much of the same data used here. Apparently great minds think alike! (incidentally, to prove that I'm not just cribbing that content, check the github commit log: I wrote the bulk of this post several days before the SBB blog series was posted. Version control priority FTW!)

Update #2: I added error bars to the estimates in the final section (should have done that from the beginning, I know...

Cycling in Seattle seems to be taking off. This can be seen qualitatively in the increased visibility of advocacy groups like Seattle Neighborhood Greenways and Cascade Bicycle Club, the excellent reporting of sites like the Seattle Bike Blog, and the investment by the city in high-profile traffic safety projects such as Protected Bike Lanes, Road diets/Rechannelizations and the Seattle Bicycle Master Plan.

But, qualitative arguments aside, there is also an increasing array of quantitative data available, primarily from the Bicycle counters installed at key locations around the city. The first was the Fremont Bridge Bicycle Counter, installed in October 2012, which gives daily updates on the number of bicycles crossing the bridge: currently upwards of 5000-6000 per day during sunny commute days.

Bicycle advocates have been pointing out the upward trend of the counter, and I must admit I've been excited as anyone else to see this surge in popularity of cycling (Most days, I bicycle 22 miles round trip, crossing both the Spokane St. and Fremont bridge each way).

But anyone who looks closely at the data must admit: there is a large weekly and monthly swing in the bicycle counts, and people seem most willing to ride on dry, sunny summer days. Given the warm streak we've had in Seattle this spring, I wondered: are we really seeing an increase in cycling, or can it just be attributed to good weather?

Here I've set-out to try and answer this question. Along the way, we'll try to deduce just how much the weather conditions affect Seattleites' transportation choices.

A Quick Aside¶

If anyone is landing on this page via the normal bicycle advocacy channels, I should warn you that this won't look like a typical Seattle-bike-advocate blog post. I currently work as a data scientist at the University of Washington eScience Institute, where I'm incredibly fortunate to have the flexibility to spend a couple hours each week on side-projects like this. Most of my blog posts are pretty technical in nature: I tend to focus on statistical methods and visualization using the Python programming language.

This post is composed in an IPython notebook, which is a fully executable document which combines text, data, code, and visualizations all in one place. The nice thing is that anyone with a bit of desire and initiative could install the (free) IPython software on their computer and open this document, re-running and checking my results, and perhaps modifying my assumptions to see what happens. In a way, this post is as much about how to work with data as it is about what we learn from the data.

In other words, this is an entirely reproducible analysis. Every piece of data and software used here is open and freely available to anyone who wants to use it. It's an example of the direction I think data journalism should go as it starts to more and more emulate data-driven scientific research.

That said, there's a lot of technical stuff below. If you're not familiar with Python or other data analysis frameworks, don't be afraid to skip over the code and look at the plots, which I'll do my best to explain.

The Data¶

This post will use two datasets, which you can easily access with an internet connection. You can find the exact data I used in the GitHub repository, or access it from the original sources below.

First, I'll be using the Fremont Bridge Hourly Bicycle Counts. To download this data, go to the fremont bridge page, and do the following (I accessed this on June 6th, 2014):

• click "export"

Second, I'll be using weather data available at the National Climatic Data Center. We'll use weather data from the SeaTac Airport weather station. To get this data, go to the Climate Data Search page and do the following (I accessed this on June 6th, 2014):

• Choose "Daily Summaries"
• Choose 2012/10/1 to the present date
• Search for "Station", and type in "USW00024233" (ID for SeaTac Airport weather station)
• Click the icon on the map and "Add to Cart"
• go to "Shopping Cart"

• make sure date range is 2012/10/1 to 2014/5/14
• choose Custom GHCN-Daily CSV
• click "Continue"

• next page: click "select all"
• click "continue"
• enter email address and submit order

When the data set is ready, you will get an email with a download link. It was about an hour wait when I did it.

Examining the Fremont Bridge Data¶

The first thing we're going to do is load and examine the data from the Fremont bike counter. We'll use the pandas package, a free and open source set of data analysis tools for the Python language.

In [1]:
# some necessary imports

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
# Load the data file, and create a column with total north/south traffic

hourly = pd.read_csv("FremontHourly.csv", index_col='Date', parse_dates=True)
hourly.columns = ['northbound', 'southbound']
hourly['total'] = hourly['northbound'] + hourly['southbound']

In [3]:
# Resample the data into daily and weekly totals

daily = hourly.resample('d', 'sum')
weekly = daily.resample('w', 'sum')


Now let's take a peek at our data and see what it looks like:

In [4]:
weekly[['northbound', 'southbound', 'total']].plot()
plt.ylabel('Weekly riders');


The red line shows the total number of weekly crossings, which is the sum of the northbound and southbound crossings.

At first glance, April and May 2014 include some spikes in the data: over 32,000 riders per week crossed the bridge one week in May! This trend might be a bit clearer if we use a moving window average: basically, for each day we'll take the average of the 30-day period around it:

In [5]:
pd.stats.moments.rolling_mean(daily['total'], 30).plot();


This is the increased ridership that folks have been talking about. There is some seasonal variation, but the trend seems clear: 2014 has seen a lot of cyclists crossing the bridge.

But it is clear that there is still some seasonal variation. What we're going to try to do below is to model this variation based on our intuition about what factors might come into play in people's decision about whether to ride.

For simplicity, I'm going to stick with a linear model here. It would be possible to go deeper and use a more sophisticated model (I'd eventually like to try Random Forests), but a linear model should give us a good approximation of what's happening.

Step 1: Accounting for hours of daylight¶

The largest component of the variation we see is a seasonal swing. I'm going to hypothesize that that swing is at least partially due to the changing daylight hours. We'll compute the number of hours of daylight and use this to de-trend the data.

Fortunately, my PhD is in Astronomy, so I once-upon-a-time learned how to compute this:

In [6]:
# Define a function which returns the hours of daylight
# given the day of the year, from 0 to 365

def hours_of_daylight(date, axis=23.44, latitude=47.61):
"""Compute the hours of daylight for the given date"""
diff = date - pd.datetime(2000, 12, 21)
day = diff.total_seconds() / 24. / 3600
day %= 365.25
m = 1. - np.tan(np.radians(latitude)) * np.tan(np.radians(axis) * np.cos(day * np.pi / 182.625))
m = max(0, min(m, 2))
return 24. * np.degrees(np.arccos(1 - m)) / 180.

# add this to our weekly data
weekly['daylight'] = map(hours_of_daylight, weekly.index)
daily['daylight'] = map(hours_of_daylight, daily.index)

In [7]:
# Plot the daylight curve

weekly['daylight'].plot()
plt.ylabel('hours of daylight (Seattle)');


This looks reasonable: just over 8 hours of daylight in December, and just under 16 hours in June.

To get a feel for the trend, let's plot the daylight hours versus the weekly bicycle traffic:

In [8]:
plt.scatter(weekly['daylight'], weekly['total'])
plt.xlabel('daylight hours')
plt.ylabel('weekly bicycle traffic');


We see a clear trend, though it's also apparent from the wide vertical scatter that other effects are at play.

Let's apply a linear fit to this data. Basically, we'll draw a best-fit line to the points using some convenient tools in the scikit-learn package, which I've been active in developing:

In [9]:
from sklearn.linear_model import LinearRegression

X = weekly[['daylight']].to_dense()
y = weekly['total']
clf = LinearRegression(fit_intercept=True).fit(X, y)

weekly['daylight_trend'] = clf.predict(X)
weekly['daylight_corrected_total'] = weekly['total'] - weekly['daylight_trend'] + weekly['daylight_trend'].mean()

xfit = np.linspace(7, 17)
yfit = clf.predict(xfit[:, None])
plt.scatter(weekly['daylight'], weekly['total'])
plt.plot(xfit, yfit, '-k')
plt.title("Bicycle traffic through the year")
plt.xlabel('daylight hours')
plt.ylabel('weekly bicycle traffic');


Once such a linear model is fit, we can look at the model coefficients to see, on average, how the change in one variable affects the change in another:

In [10]:
print(clf.coef_[0])

2056.44964989



This tells us that according to this model, each extra hour of daylight leads to about 2000 more riders per week across the bridge! Of course, in Seattle the length of the day also correlates highly with temperature and precipitation; we'll try to untangle those effects later.

Now that we have fit this trend, let's subtract it off and replace it by the mean:

In [11]:
trend = clf.predict(weekly[['daylight']].as_matrix())
plt.scatter(weekly['daylight'], weekly['total'] - trend + np.mean(trend))
plt.plot(xfit, np.mean(trend) + 0 * yfit, '-k')
plt.title("weekly traffic (detrended)")
plt.xlabel('daylight hours')


This is what I mean by "de-trended" data. We've basically removed the component of the data which correlates with the number of hours in a day, so that what is left is in some way agnostic to this quantity. The "adjusted weekly count" plotted here can be thought of as the number of cyclists we'd expect to see if the hours of daylight were not a factor.

Let's visualize this another way. Instead of plotting the number of riders vs daylight hours, we'll again plot the number of riders vs the day of the year, along with the trend:

In [12]:
weekly[['total', 'daylight_trend']].plot()
plt.ylabel("total weekly riders");


We can similarly view the adjusted total number of riders over time by subtracting this green line from the blue line:

In [13]:
weekly['daylight_corrected_total'].plot()
rms = np.std(weekly['daylight_corrected_total'])
plt.ylabel("adjusted total weekly riders")
print("root-mean-square about trend: {0:.0f} riders".format(rms))

root-mean-square about trend: 3100 riders



With the data de-trended, we get a better idea of how bicycling in Seattle has changed over time, corrected for the seasonal variation.

Accounting for Day of the Week¶

Above we've been looking at weekly data. This is because daily data shows a clear swing as a function of the day of the week, which we'll show here.

In [14]:
days = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
daily['dayofweek'] = daily['total'].index.dayofweek

In [15]:
grouped = daily.groupby('dayofweek')['total'].mean()
grouped.index = days

grouped.plot()
plt.title("Average Traffic By Day")
plt.ylabel("Average Daily Crossings");