March 05, 2015

Continuum Analytics

Continuum Analytics - March Tech Events

This month, the Continuum team will be at a variety of meetups and conferences talking about the power of Python in analytics. Take a look at where you can find us, and reach out at info@continuum.io if you’re interested in meeting up, or if you would like us to participate in your event.

by Continuum at March 05, 2015 12:00 AM

February 28, 2015

Matthew Rocklin

Ising models and Numba

This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project

tl;dr I play with Numba and find it effective.

Ising model with numba

Introduction

Confession, I’ve never actually used Numba. Well that’s not quite true; I’ve indirectly used Numba thousands of times because Blaze auto-generates numba ufuncs. Still I’ve never used it for a particular problem. I usually define problems with large array operations and compile those down. Numba takes a different approach and translates Python for loops to efficient LLVM code. This is all lower in the hardware stack than where I usually think.

But when I was looking for applications to motivate recent work in nearest-neighbor communications in dask, a friend pointed me towards the Ising model, a simple physical system that is easy to code up and has nice macro-scale properties. I took this as an opportunity to play with Numba. This post details my experience.

Ising Model

Disclaimer: I am not a physicist

The Ising model represents a regular grid of points where each point has two possible states, spin up and spin down. States like to have the same spin as their immediate neighbors so when a spin-down state is surrounded by more spin-up states it will switch to spin-up and vice versa. Also, due to random fluctuations, points might switch spins, even if this switch is not favorable. In pseudocode an Ising update step might look like the following

for every point in the grid:
    energy = my spin * sum of all of the spins (+1 or -1) of neighboring points
    if energy is improved by switching:
        switch
    else if we're particularly unlucky:
        switch anyway

For this kind of algorithm, imperative for-loopy code is probably the clearest. You can do this with high-level array operations too (e.g. NumPy/Blaze/Theano), but it’s a mess.
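For the curious, here is a rough, illustrative sketch of what a vectorized NumPy variant might look like (a checkerboard update rather than the even/odd column sweep used in the Numba code below):

import numpy as np

def neighbor_sum(x):
    # Sum of the eight neighbors with periodic boundaries, via np.roll
    return sum(np.roll(np.roll(x, di, axis=0), dj, axis=1)
               for di in (-1, 0, 1) for dj in (-1, 0, 1)
               if (di, dj) != (0, 0))

def update_half(x, kT=2.269):
    # One half-sweep: flip spins on the "even" checkerboard sites
    dE = 2 * x * neighbor_sum(x)
    flip = (dE <= 0) | (np.exp(-dE / kT) > np.random.random(x.shape))
    even = (np.indices(x.shape).sum(axis=0) % 2) == 0
    x[flip & even] *= -1
    return x

It works, but the rolled temporaries and the sublattice masking obscure the physics compared to explicit loops.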

Numba code

Here is my solution to the problem with Numba, along with a gif of the result.

Ising model with numba

import numba
import numpy as np
from math import exp, log, e, sqrt

kT = 2 / log(1 + sqrt(2), e)

@numba.jit(nopython=True)
def _update(x, i, j):
    n, m = x.shape
    dE = 2 * x[i, j] * (
                     x[(i-1)%n, (j-1)%m]
                   + x[(i-1)%n,  j     ]
                   + x[(i-1)%n, (j+1)%m]

                   + x[ i     , (j-1)%m]
                   + x[ i     , (j+1)%m]

                   + x[(i+1)%n, (j-1)%m]
                   + x[(i+1)%n,  j     ]
                   + x[(i+1)%n, (j+1)%m]
                   )
    if dE <= 0 or exp(-dE / kT) > np.random.random():
        x[i, j] *= -1

@numba.jit(nopython=True)
def update(x):
    n, m = x.shape

    for i in range(n):
        for j in range(0, m, 2):  # Even columns first to avoid overlap
            _update(x, i, j)

    for i in range(n):
        for j in range(1, m, 2):  # Odd columns second to avoid overlap
            _update(x, i, j)

if __name__ == '__main__':
    x = np.random.randint(2, size=(1000, 1000)).astype('i1')
    x[x == 0] = -1
    for i in range(100):
        update(x)

My old beater laptop executes one update step on a 1000x1000 grid in 50ms. Without Numba this takes 12s. This wasn’t a canned demo by an expert user / numba developer; this was just my out-of-the-box experience.
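For reference, a hypothetical way to reproduce this kind of measurement (not the benchmark code used above):

import time

x = np.random.randint(2, size=(1000, 1000)).astype('i1')
x[x == 0] = -1

update(x)                        # warm-up call pays the JIT compilation cost
start = time.time()
update(x)
print("one update step: %.3f s" % (time.time() - start))

# Commenting out the two @numba.jit decorators and re-running the same
# snippet gives the pure-Python number for comparison.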

Thoughts

I really like the following:

  • I can remove @numba.jit and use the Python debugger
  • I can assert that I’m only using LLVM with nopython=True
  • I can manage data with NumPy (or dask.array) separately from managing computation with Numba

I ran in to some issues and learned some things too:

  • random is only present in the developer preview builds of Numba (conda install -c numba numba). It will be officially released in the 0.18 version this March.
  • You don’t have to provide type signature strings. I tried providing these at first but I didn’t know the syntax and so repeatedly failed to write down the type signature correctly. Turns out the cost of not writing it down is that Numba will jit whenever it sees a new signature. For my application this is essentially free.

February 28, 2015 12:00 AM

February 26, 2015

NeuralEnsemble

Workshop Announcement - "HBP Hippocamp CA1: Collaborative and Integrative Modeling of Hippocampal Area CA1"

Registration is now open for the workshop "HBP Hippocamp CA1: Collaborative and Integrative Modeling of Hippocampal Area CA1", to be held March 31st - April 1st, 2015 at UCL School of Pharmacy in London, supported by the Human Brain Project (www.humanbrainproject.eu).

 In short, the aims of the workshop are two-fold. First, to engage the larger community of experimentalists and modelers working on hippocampus, and highlight existing modeling efforts and strategic datasets for modeling hippocampal area CA1. Second, to define and bootstrap an inclusive community-driven model and data-integration process to achieve open pre-competitive reference models of area CA1 (and, ultimately, the rest of the hippocampus), which are well documented, validated, and released at regular intervals (supported in part by IT infrastructure funded by HBP). Involvement from the community interested in characterization and modeling of the hippocampus is highly encouraged. To keep the meeting focused on the task, participation will be limited to ~30 people, so registration is required.

Please consult the meeting website at http://neuralensemble.org/meetings/HippocampCA1/ for registration and further details.

Organizing committee: Jo Falck (UCL), Szabolcs Káli (Hungarian Academy of Sciences), Sigrun Lange (UCL), Audrey Mercer (UCL), Eilif Muller (EPFL), Armando Romani (EPFL) and Alex Thomson (UCL).

by eilif (noreply@blogger.com) at February 26, 2015 03:50 PM

February 24, 2015

Jake Vanderplas

Optimizing Python in the Real World: NumPy, Numba, and the NUFFT

Donald Knuth famously quipped that "premature optimization is the root of all evil." The reasons are straightforward: optimized code tends to be much more difficult to read and debug than simpler implementations of the same algorithm, and optimizing too early leads to greater costs down the road. In the Python world, there is another cost to optimization: optimized code often is written in a compiled language like Fortran or C, and this leads to barriers to its development, use, and deployment.

Too often, tutorials about optimizing Python use trivial or toy examples which may not map well to the real world. I've certainly been guilty of this myself. Here, I'm going to take a different route: in this post I will outline the process of understanding, implementing, and optimizing a non-trivial algorithm in Python, in this case the Non-uniform Fast Fourier Transform (NUFFT). Along the way, we'll dig into the process of optimizing Python code, and see how a relatively straightforward pure Python implementation, with a little help from Numba, can be made to nearly match the performance of a highly-optimized Fortran implementation of the same algorithm.

Why a Python Implementation?

First, I want to answer the inevitable question: why spend the time to make a Python implementation of an algorithm that's already out there in Fortran? The reason is that I've found in my research and teaching that pure-Python implementations of algorithms are far more valuable than C or Fortran implementations, even if they might be a bit slower. This is for a number of reasons:

  • Pure-Python code is easier to read, understand, and contribute to. Good Python implementations are much higher-level than C or Fortran, and abstract-away loop indices, bit twiddling, workspace arrays, and other sources of code clutter. A typical student reading good Python code can immediately understand and modify the algorithm, while the same student would be lost trying to understand typical optimized Fortran code.

  • Pure-python packages are much easier to install than Python-wrapped C or Fortran code. This is especially true on non-Linux systems. Fortran in particular can require some installation prerequisites that are non-trivial for many users. In practice, I've seen people give up on better tools when there is an installation barrier for those tools.

  • Pure-python code often works for many data types. Because of the way it is written, pure Python code is often automatically applicable to single or double precision, and perhaps even to extensions to complex numbers. For compiled packages, supporting and compiling for all possible types can be a burden.

  • Pure-python is easier to use at scale. Because it does not require complicated installation, pure Python packages can be much easier to install on cloud VMs and/or shared clusters for computation at scale. If you can easily pip-install a pure-Python package on a VM, then services like AWS and TravisCI are much easier to set up.

Certainly code speed will overcome these considerations if the performance gap is great enough, but I've found that for many applications a pure Python package, cleverly designed and optimized, can be made fast enough that these larger considerations win-out. The challenge is making the Python fast. We'll explore this below.

Background: The Non-Uniform Fast Fourier Transform

The Fast Fourier Transform (FFT) is perhaps the most important and fundamental of modern numerical algorithms. It provides a fast, \(O[N\log N]\) method of computing the discrete Fourier transform: \[ Y_k^\pm = \sum_{n=0}^{N-1} y_n e^{\pm i k n / N} \] You can read more about the FFT in my previous post on the subject.

One important limitation of the FFT is that it requires that input data be evenly-spaced: that is, we can think of the values \(y_n\) as samples of a function \(y_n = y(x_n)\) where \(x_n = x_0 + n\Delta x\) is a regular grid of points. But what about when your grid is not uniform? That is, what if you want to compute this result: \[ Y_k^\pm = \sum_{j=1}^N y(x_j) e^{\pm i k x_j} \] where \(y(x)\) is evaluated at an arbitrary set of points \(x_j\)? In this case, the FFT is no longer directly applicable, and you're stuck using a much slower \(O[N^2]\) direct summation.

Stuck, that is, until the NUFFT came along.

The NUFFT is an algorithm which converts the non-uniform transform into an approximate uniform transform, not with error-prone interpolation, but instead using a clever "gridding" operation motivated by the convolution theorem. If you'd like to read about the algorithm in detail, the Courant Institute's NUFFT page has a nice set of resources.

Below we'll take a look at implementing this algorithm in Python.

Direct Non-Uniform Fourier Transform

When developing optimized code, it is important to start with something easy to make sure you're on the right track. Here we'll start with a straightforward direct version of the non-uniform Fourier transform. We'll allow non-uniform inputs \(x_j\), but compute the output on a grid of \(M\) evenly-spaced frequencies in the range \(-M/2 \le f/\delta f < M/2\). This is what the NUFFT group calls the Type-1 NUFFT.

First we'll implement nufftfreqs(), which returns the frequency grid for a given \(M\), and nudft() which computes the non-uniform discrete Fourier transform using a slow direct method. The arguments for the latter include iflag, which is a positive or negative number indicating the desired sign of the exponent:

In [1]:
from __future__ import print_function, division
import numpy as np

def nufftfreqs(M, df=1):
    """Compute the frequency range used in nufft for M frequency bins"""
    return df * np.arange(-(M // 2), M - (M // 2))


def nudft(x, y, M, df=1.0, iflag=1):
    """Non-Uniform Direct Fourier Transform"""
    sign = -1 if iflag < 0 else 1
    return (1 / len(x)) * np.dot(y, np.exp(sign * 1j * nufftfreqs(M, df) * x[:, np.newaxis]))

Again, I can't emphasize this enough: when writing fast code, start with a slow-and-simple version of the code which you know gives the correct result, and then optimize from there.

Comparing to the Fortran NUFFT

We can double-check that this is producing the desired result by comparing to the Fortran NUFFT implementation, using Python wrappers written by Dan Foreman-Mackey, available at http://github.com/dfm/python-nufft/:

In [2]:
# Install nufft from http://github.com/dfm/python-nufft/
from nufft import nufft1 as nufft_fortran

x = 100 * np.random.random(1000)
y = np.sin(x)

Y1 = nudft(x, y, 1000)
Y2 = nufft_fortran(x, y, 1000)

np.allclose(Y1, Y2)
Out[2]:
True

The results match! A quick check shows that, as we might expect, the Fortran algorithm is orders of magnitude faster:

In [3]:
%timeit nudft(x, y, 1000)
%timeit nufft_fortran(x, y, 1000)
10 loops, best of 3: 47.9 ms per loop
1000 loops, best of 3: 265 µs per loop

On top of this, for \(N\) points and \(N\) frequencies, the Fortran NUFFT will scale as \(O[N\log N]\), while our simple implementation will scale as \(O[N^2]\), making the difference even greater as \(N\) increases! Let's see if we can do better.

NUFFT with Python

Here we'll attempt a pure-Python version of the fast, FFT-based NUFFT. We'll follow the basics of the algorithm presented on the NUFFT page, using NumPy broadcasting tricks to push loops into the compiled layer of NumPy. For later convenience, we'll start by defining a utility to compute the grid parameters as detailed in the NUFFT paper.

In [4]:
def _compute_grid_params(M, eps):
    # Choose Msp & tau from eps following Dutt & Rokhlin (1993)
    if eps <= 1E-33 or eps >= 1E-1:
        raise ValueError("eps = {0:.0e}; must satisfy "
                         "1e-33 < eps < 1e-1.".format(eps))
    ratio = 2 if eps > 1E-11 else 3
    Msp = int(-np.log(eps) / (np.pi * (ratio - 1) / (ratio - 0.5)) + 0.5)
    Mr = max(ratio * M, 2 * Msp)
    lambda_ = Msp / (ratio * (ratio - 0.5))
    tau = np.pi * lambda_ / M ** 2
    return Msp, Mr, tau


def nufft_python(x, c, M, df=1.0, eps=1E-15, iflag=1):
    """Fast Non-Uniform Fourier Transform with Python"""
    Msp, Mr, tau = _compute_grid_params(M, eps)
    N = len(x)

    # Construct the convolved grid
    ftau = np.zeros(Mr, dtype=c.dtype)
    Mr = ftau.shape[0]
    hx = 2 * np.pi / Mr
    mm = np.arange(-Msp, Msp)
    for i in range(N):
        xi = (x[i] * df) % (2 * np.pi)
        m = 1 + int(xi // hx)
        spread = np.exp(-0.25 * (xi - hx * (m + mm)) ** 2 / tau)
        ftau[(m + mm) % Mr] += c[i] * spread

    # Compute the FFT on the convolved grid
    if iflag < 0:
        Ftau = (1 / Mr) * np.fft.fft(ftau)
    else:
        Ftau = np.fft.ifft(ftau)
    Ftau = np.concatenate([Ftau[-(M//2):], Ftau[:M//2 + M % 2]])

    # Deconvolve the grid using convolution theorem
    k = nufftfreqs(M)
    return (1 / N) * np.sqrt(np.pi / tau) * np.exp(tau * k ** 2) * Ftau

Let's compare this to the previous results. For convenience, we'll define a single routine which validates the results and times the execution:

In [5]:
from time import time

def test_nufft(nufft_func, M=1000, Mtime=100000):
    # Test vs the direct method
    print(30 * '-')
    name = {'nufft1':'nufft_fortran'}.get(nufft_func.__name__,
                                          nufft_func.__name__)
    print("testing {0}".format(name))
    rng = np.random.RandomState(0)
    x = 100 * rng.rand(M + 1)
    y = np.sin(x)
    for df in [1, 2.0]:
        for iflag in [1, -1]:
            F1 = nudft(x, y, M, df=df, iflag=iflag)
            F2 = nufft_func(x, y, M, df=df, iflag=iflag)
            assert np.allclose(F1, F2)
    print("- Results match the DFT")
    
    # Time the nufft function
    x = 100 * rng.rand(Mtime)
    y = np.sin(x)
    times = []
    for i in range(5):
        t0 = time()
        F = nufft_func(x, y, Mtime)
        t1 = time()
        times.append(t1 - t0)
    print("- Execution time (M={0}): {1:.2g} sec".format(Mtime, np.median(times)))
In [6]:
test_nufft(nufft_python)
test_nufft(nufft_fortran)
------------------------------
testing nufft_python
- Results match the DFT
- Execution time (M=100000): 1.6 sec
------------------------------
testing nufft_fortran
- Results match the DFT
- Execution time (M=100000): 0.043 sec

The good news is that our Python implementation works; the bad news is that it remains several orders of magnitude slower than the Fortran result!

Let's make it faster.

Making Code Faster: Line Profiling

We know that our Python function is slow, but we'd like to determine where this speed bottleneck lies. One convenient way to do this is with the line_profiler utility, a Python/IPython addon which can be installed using $ pip install line_profiler Once it's installed, we can load the line profiler extension into the IPython notebook using the %load_ext magic function:

In [7]:
%load_ext line_profiler

With the line profiler loaded, the %lprun magic function is now available, which we can use to profile our function line-by-line. In order to display these results here, we'll save them to file and then use %cat to view the file. The lines wrap in the notebook, making it difficult to read, so you may wish to view the raw file instead.

In [8]:
%lprun -s -f nufft_python -T lp_results.txt nufft_python(x, y, 1000)
%cat lp_results.txt

*** Profile printout saved to text file 'lp_results.txt'. 
Timer unit: 1e-06 s

Total time: 0.040097 s
File: <ipython-input-4-02aa33b61f03>
Function: nufft_python at line 14

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    14                                           def nufft_python(x, c, M, df=1.0, eps=1E-15, iflag=1):
    15                                               """Fast Non-Uniform Fourier Transform with Python"""
    16         1           41     41.0      0.1      Msp, Mr, tau = _compute_grid_params(M, eps)
    17         1            3      3.0      0.0      N = len(x)
    18                                           
    19                                               # Construct the convolved grid
    20         1           19     19.0      0.0      ftau = np.zeros(Mr, dtype=c.dtype)
    21         1            2      2.0      0.0      Mr = ftau.shape[0]
    22         1            3      3.0      0.0      hx = 2 * np.pi / Mr
    23         1           11     11.0      0.0      mm = np.arange(-Msp, Msp)
    24      1001         1493      1.5      3.7      for i in range(N):
    25      1000         2801      2.8      7.0          xi = (x[i] * df) % (2 * np.pi)
    26      1000         3024      3.0      7.5          m = 1 + int(xi // hx)
    27      1000        21048     21.0     52.5          spread = np.exp(-0.25 * (xi - hx * (m + mm)) ** 2 / tau)
    28      1000        11406     11.4     28.4          ftau[(m + mm) % Mr] += c[i] * spread
    29                                           
    30                                               # Compute the FFT on the convolved grid
    31         1            2      2.0      0.0      if iflag < 0:
    32                                                   Ftau = (1 / Mr) * np.fft.fft(ftau)
    33                                               else:
    34         1          183    183.0      0.5          Ftau = np.fft.ifft(ftau)
    35         1           15     15.0      0.0      Ftau = np.concatenate([Ftau[-(M//2):], Ftau[:M//2 + M % 2]])
    36                                           
    37                                               # Deconvolve the grid using convolution theorem
    38         1           14     14.0      0.0      k = nufftfreqs(M)
    39         1           32     32.0      0.1      return (1 / N) * np.sqrt(np.pi / tau) * np.exp(tau * k ** 2) * Ftau

The output shows us where, line-by-line, the algorithm is spending the most time. We see that nearly 99% of the execution time is being spent in the single for loop at the center of our code. The loop is so expensive that even the FFT computation is just a trivial piece of the cost! This is actually pretty typical: due to dynamic typing, loops are generally very slow in Python.

One of the surest strategies for speeding-up your code is to use broadcasting tricks in NumPy to remove these kinds of large loops: you can read one of my course lectures on the subject here. We'll do this next.

NUFFT with NumPy Broadcasting

Let's rewrite the above implementation and use broadcasting tricks to eliminate the loops. Because of the structure of this problem, the approach is a bit complicated here, but it turns out that we can take advantage of the little-known at() method of NumPy's ufunc (available since NumPy 1.8). Briefly,

>>> np.add.at(x, i, y)

is similar to

>>> x[i] += y

but works as desired even if the indices i have duplicate entries.
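
A quick toy demonstration of the difference:

>>> x = np.zeros(4)
>>> i = np.array([0, 1, 1, 2])   # index 1 appears twice
>>> x[i] += 1
>>> x                            # the duplicated index only counts once
array([ 1.,  1.,  1.,  0.])

>>> x[:] = 0
>>> np.add.at(x, i, 1)
>>> x                            # every occurrence accumulates
array([ 1.,  2.,  1.,  0.])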

Using this, we can adjust our implementation as follows:

In [9]:
def nufft_numpy(x, y, M, df=1.0, iflag=1, eps=1E-15):
    """Fast Non-Uniform Fourier Transform"""
    Msp, Mr, tau = _compute_grid_params(M, eps)
    N = len(x)

    # Construct the convolved grid ftau:
    # this replaces the loop used above
    ftau = np.zeros(Mr, dtype=y.dtype)
    hx = 2 * np.pi / Mr
    xmod = (x * df) % (2 * np.pi)
    m = 1 + (xmod // hx).astype(int)
    mm = np.arange(-Msp, Msp)
    mpmm = m + mm[:, np.newaxis]
    spread = y * np.exp(-0.25 * (xmod - hx * mpmm) ** 2 / tau)
    np.add.at(ftau, mpmm % Mr, spread)

    # Compute the FFT on the convolved grid
    if iflag < 0:
        Ftau = (1 / Mr) * np.fft.fft(ftau)
    else:
        Ftau = np.fft.ifft(ftau)
    Ftau = np.concatenate([Ftau[-(M//2):], Ftau[:M//2 + M % 2]])

    # Deconvolve the grid using convolution theorem
    k = nufftfreqs(M)
    return (1 / N) * np.sqrt(np.pi / tau) * np.exp(tau * k ** 2) * Ftau

Let's test it:

In [10]:
test_nufft(nufft_numpy)
test_nufft(nufft_python)
test_nufft(nufft_fortran)
------------------------------
testing nufft_numpy
- Results match the DFT
- Execution time (M=100000): 0.45 sec
------------------------------
testing nufft_python
- Results match the DFT
- Execution time (M=100000): 1.6 sec
------------------------------
testing nufft_fortran
- Results match the DFT
- Execution time (M=100000): 0.044 sec

It worked! We gained around a factor of 4 speedup in replacing the Python loop with the np.add.at() call. Still, though, we're sitting at about a factor of 10 slower than the Fortran version. The problem is that the np.add.at() call here requires construction of some very large and costly temporary arrays. If we want a faster execution time, we need to further optimize that main loop, and we can't do this with NumPy alone.

Optimization with Numba

When NumPy broadcasting tricks aren't enough, there are a few options: you can write Fortran or C code directly, you can use Cython, Weave, or other tools as a bridge to include compiled snippets in your script, or you can use a tool like Numba to speed-up your loops without ever leaving Python.

Numba is a slick tool which runs Python functions through an LLVM just-in-time (JIT) compiler, leading to orders-of-magnitude faster code for certain operations. In this case, we need to optimize what amounts to a nested for-loop, so Numba fits the bill perfectly. For clarity, we'll pull-out the grid construction code that we want to optimize, and write it as follows:

In [11]:
import numba

# nopython=True means an error will be raised
# if fast compilation is not possible.
@numba.jit(nopython=True)
def build_grid(x, c, tau, Msp, ftau):
    Mr = ftau.shape[0]
    hx = 2 * np.pi / Mr
    for i in range(x.shape[0]):
        xi = x[i] % (2 * np.pi)
        m = 1 + int(xi // hx)
        for mm in range(-Msp, Msp):
            ftau[(m + mm) % Mr] += c[i] * np.exp(-0.25 * (xi - hx * (m + mm)) ** 2 / tau)
    return ftau


def nufft_numba(x, c, M, df=1.0, eps=1E-15, iflag=1):
    """Fast Non-Uniform Fourier Transform with Numba"""
    Msp, Mr, tau = _compute_grid_params(M, eps)
    N = len(x)

    # Construct the convolved grid
    ftau = build_grid(x * df, c, tau, Msp,
                      np.zeros(Mr, dtype=c.dtype))

    # Compute the FFT on the convolved grid
    if iflag < 0:
        Ftau = (1 / Mr) * np.fft.fft(ftau)
    else:
        Ftau = np.fft.ifft(ftau)
    Ftau = np.concatenate([Ftau[-(M//2):], Ftau[:M//2 + M % 2]])

    # Deconvolve the grid using convolution theorem
    k = nufftfreqs(M)
    return (1 / N) * np.sqrt(np.pi / tau) * np.exp(tau * k ** 2) * Ftau

Let's test this now:

In [12]:
test_nufft(nufft_numba)
test_nufft(nufft_fortran)
------------------------------
testing nufft_numba
- Results match the DFT
- Execution time (M=100000): 0.1 sec
------------------------------
testing nufft_fortran
- Results match the DFT
- Execution time (M=100000): 0.046 sec

Much better! We're now within about a factor of 3 of the Fortran speed, and we're still writing pure Python!

Having plucked all the low-hanging fruit, any further optimization will now be very low-level: that is, thinking about things like reduction of the number of exp() evaluations through application of mathematical identities. This type of careful logic is one reason the Fortran implementation is so fast, and many of these low-level strategies are discussed in the NUFFT paper linked above.
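Concretely, the key identity exploited in the build_grid_fast code below is a factorization of the Gaussian spreading kernel: writing \(h = 2\pi/M_r\) and letting \(\xi\) denote the recentered offset (the xi variable after subtracting \(h m\)),
\[
e^{-(\xi - h m')^2 / 4\tau} \;=\; \underbrace{e^{-\xi^2/4\tau}}_{E_1}\;\underbrace{\left(e^{\xi h/2\tau}\right)^{m'}}_{E_2^{m'}}\;\underbrace{e^{-(\pi m'/M_r)^2/\tau}}_{E_3[m']},
\]
so one exponential per point plus a small precomputed table replaces one exponential per (point, grid-offset) pair.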

To gain some more speed, we can follow their advice and optimize at this level, precomputing expensive expressions and recombining them later. This obfuscates the logic of the algorithm a bit, but it does lead to some faster execution. Here is an example of this:

In [13]:
import numba

@numba.jit(nopython=True)
def build_grid_fast(x, c, tau, Msp, ftau, E3):
    Mr = ftau.shape[0]
    hx = 2 * np.pi / Mr
    
    # precompute some exponents
    for j in range(Msp + 1):
        E3[j] = np.exp(-(np.pi * j / Mr) ** 2 / tau)
        
    # spread values onto ftau
    for i in range(x.shape[0]):
        xi = x[i] % (2 * np.pi)
        m = 1 + int(xi // hx)
        xi = (xi - hx * m)
        E1 = np.exp(-0.25 * xi ** 2 / tau)
        E2 = np.exp((xi * np.pi) / (Mr * tau))
        E2mm = 1
        for mm in range(Msp):
            ftau[(m + mm) % Mr] += c[i] * E1 * E2mm * E3[mm]
            E2mm *= E2
            ftau[(m - mm - 1) % Mr] += c[i] * E1 / E2mm * E3[mm + 1]
    return ftau


def nufft_numba_fast(x, c, M, df=1.0, eps=1E-15, iflag=1):
    """Fast Non-Uniform Fourier Transform with Numba"""
    Msp, Mr, tau = _compute_grid_params(M, eps)
    N = len(x)

    # Construct the convolved grid
    ftau = build_grid_fast(x * df, c, tau, Msp,
                           np.zeros(Mr, dtype=c.dtype),
                           np.zeros(Msp + 1, dtype=x.dtype))

    # Compute the FFT on the convolved grid
    if iflag < 0:
        Ftau = (1 / Mr) * np.fft.fft(ftau)
    else:
        Ftau = np.fft.ifft(ftau)
    Ftau = np.concatenate([Ftau[-(M//2):], Ftau[:M//2 + M % 2]])

    # Deconvolve the grid using convolution theorem
    k = nufftfreqs(M)
    return (1 / N) * np.sqrt(np.pi / tau) * np.exp(tau * k ** 2) * Ftau

Let's test the result:

In [14]:
test_nufft(nufft_numba_fast)
test_nufft(nufft_fortran)
------------------------------
testing nufft_numba_fast
- Results match the DFT
- Execution time (M=100000): 0.06 sec
------------------------------
testing nufft_fortran
- Results match the DFT
- Execution time (M=100000): 0.047 sec

This is looking good! With a bit of effort we are now within about 25% of the Fortran speed, and we retain all the advantages of having pure Python code!

Final Timing Comparison

For good measure, let's take a look at the scaling with \(M\) for all the fast algorithms we created. We'll compute the times for a range of input sizes for each algorithm. Be aware that the following code will take several minutes to run!

In [15]:
%matplotlib inline
import matplotlib.pyplot as plt
# use seaborn for nice default plot settings
import seaborn; seaborn.set()
In [16]:
Mrange = (2 ** np.arange(3, 18)).astype(int)

t_python = []
t_numpy = []
t_numba = []
t_numba_fast = []
t_fortran = []

for M in Mrange:
    x = 100 * np.random.random(M)
    c = np.sin(x)
    
    t1 = %timeit -oq nufft_python(x, c, M)
    t2 = %timeit -oq nufft_numpy(x, c, M)
    t3 = %timeit -oq nufft_numba(x, c, M)
    t4 = %timeit -oq nufft_numba_fast(x, c, M)
    t5 = %timeit -oq nufft_fortran(x, c, M)
    
    t_python.append(t1.best)
    t_numpy.append(t2.best)
    t_numba.append(t3.best)
    t_numba_fast.append(t4.best)
    t_fortran.append(t5.best)
In [17]:
plt.loglog(Mrange, t_python, label='python')
plt.loglog(Mrange, t_numpy, label='numpy')
plt.loglog(Mrange, t_numba, label='numba #1')
plt.loglog(Mrange, t_numba_fast, label='numba #2')
plt.loglog(Mrange, t_fortran, label='fortran')
plt.legend(loc='upper left')
plt.xlabel('Number of Elements')
plt.ylabel('Execution Time (s)');

As we see, all the algorithms scale as \(\sim O[N\log N]\) in the large \(N\) limit, albeit with very different constants of proportionality. Our final optimized Numba implementation nearly matches the Fortran version as \(N\) grows large, and because it is written in pure Python, it retains all the advantages of pure Python code listed above. For that benefit, I think the cost of a ~25% slow-down is well worth it!

Conclusion

I hope you've enjoyed this exploration of how to write fast numerical code in pure Python. As you think about writing efficient implementations of useful algorithms, I invite you to consider the points I listed above: in particular, how difficult will it be for your users to install, read, modify, and contribute to your code? In the long run, this may be much more important than shaving a few milliseconds off the execution time. Writing a fast implementation of a useful algorithm is an excellent and useful pursuit, but we should be careful to not forget the costs that come along with such optimization.

If you're interested in using the pure-Python NUFFT implementation, I've adapted much of the above code in a repository at http://github.com/jakevdp/nufftpy/. It contains a packaged and unit-tested version of some of the above code, as well as a (hopefully) growing compendium of related routines.

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

by Jake Vanderplas at February 24, 2015 08:00 PM

February 23, 2015

Juan Nunez-Iglesias

jnuneziglesias

Short version

Thank you to everyone who has already submitted, retweeted, and spread the word about our book, Elegant SciPy! We are still looking for code submissions meeting these criteria:
– Submissions must use NumPy, SciPy, or a closely related library in a non-trivial way.
– Submissions must be (re)licensed as BSD, MIT, public domain, or something similarly liberal. (This is easy if you are the author.)
– Code should be satisfying in some way, such as speed, conciseness, broad applicability…
– Preferably, nominate someone else’s code that impressed you.
– Include a scientific application on real data.

Submit by one of:
– Twitter: mention @hdashnow, @stefanvdwalt, or @jnuneziglesias, or just use the hashtag #ElegantSciPy;
– Email: Stéfan van der Walt, Juan Nunez-Iglesias, or Harriet Dashnow; or
– GitHub: create a new issue here.
(All submissions will be credited as “written by” and “nominated by”.)

Long version

A big thank you to everyone who has submitted code, retweeted, posted on mailing lists, and otherwise spread the word about the book! We’re still pretty far from a book’s worth of material, though. I was also a bit vague about the kinds of submissions we wanted. I’ll elaborate a bit on each of the above points:

NumPy and SciPy use

Some excellent submissions did not use the SciPy library, but rather did amazing things with the Python standard library. I should have mentioned that the book will be specifically focused on the NumPy and SciPy libraries. That’s just the scope of the book that O’Reilly has contracted us to write. Therefore, although we might try to fit examples of great Python uses into a chapter, they are not suitable to be the centerpieces.

We will make some exceptions, for example for very closely related libraries such as pandas and scikit-learn. But, generally, the scope is SciPy the library.

Licensing

This one’s pretty obvious. We can’t use your submission if it’s under a restrictive license. And we don’t want to publish GPL-licensed code, which could force our readers to GPL-license their own code when using it. For more on this, see Choose A License, as well as Jake Vanderplas’s excellent blog post encouraging the use of the BSD license for scientific code.

Note that the author of a piece of code is free to relicense as they choose, so if you think some GPL code is perfect and might be amenable to relicensing, do let us know about it!

Submitting someone else’s code

I suspect that a lot of people are shy about submitting their own code. Two things should alleviate this. First, you can now submit via email, so you don’t have to be public about the self-promotion. (Not that there’s anything wrong with that, but I know I sometimes struggle with it.) And second, I want to explicitly state that we prefer it if you submit others’ code. This is not to discourage self-promotion, but to drive code quality even higher. It’s a high bar to convince ourselves that our code is worthy of being called elegant, but it takes another level entirely to find someone else’s code elegant! (The usual reaction when reading other people’s code is more like, “and what the #$&^ is going on here????“) So, try to think about times you saw great code during a code review, reading a blog post, or while grokking and fiddling with someone else’s library.

Data

Beautiful code is kind of a goal unto itself, but we really want to demonstrate how useful SciPy is in real scientific analysis. Therefore, although cute examples on synthetic data can illustrate quite well what a piece of code does, it would be extremely useful for us if you can point to examples with real scientific data behind them.

Thank you, and we hope to hear from you!


by Juan Nunez-Iglesias at February 23, 2015 01:31 PM

February 17, 2015

Continuum Analytics

Bokeh 0.8 Released

We are excited to announce the release of version 0.8 of Bokeh, an interactive web plotting library for Python… and other languages! This release includes many major new features:

  • New and updated language bindings: R, JavaScript, Julia, Scala, and Lua now available
  • More bokeh-server features geared towards production deployments
  • Live gallery of server examples and apps!
  • Simpler, more easily extensible design for charts API, plus new Horizon chart
  • New build automation and substantial documentation improvements
  • Shaded grid bands, configurable hover tool, and pan/zoom for categorical plots
  • Improved and more robust crossfilter application
  • AjaxDataSource for clients to stream data without a Bokeh server

by Bryan Van de Ven at February 17, 2015 08:30 AM

Matthew Rocklin

Towards Out-of-core ND-Arrays -- Dask + Toolz = Bag

This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project

tl;dr We use dask to build a parallel Python list.

Introduction

This is the seventh in a sequence of posts constructing an out-of-core nd-array using NumPy, and dask. You can view these posts here:

  1. Simple task scheduling,
  2. Frontend usability
  3. A multi-threaded scheduler
  4. Matrix Multiply Benchmark
  5. Spilling to disk
  6. Slicing and Stacking

Today we take a break from ND-Arrays and show how task scheduling can attack other collections like the simple list of Python objects.

Unstructured Data

Often, before we have an ndarray or a table/DataFrame, we have unstructured data like log files. In this case tuned subsets of the language (e.g. numpy/pandas) aren’t sufficient; we need the full Python language.

My usual approach to the inconveniently large dump of log files is to use Python streaming iterators along with multiprocessing or IPython Parallel on a single large machine. I often write/speak about this workflow when discussing toolz.

This workflow grows complex for most users when you introduce many processes. To resolve this we build our normal tricks into a new dask.Bag collection.

Bag

In the same way that dask.array mimics NumPy operations (e.g. matrix multiply, slicing), dask.bag mimics functional operations like map, filter, reduce found in the standard library as well as many of the streaming functions found in toolz.

  • Dask array = NumPy + threads
  • Dask bag = Python/Toolz + processes

Example

Here’s the obligatory wordcount example

>>> from dask.bag import Bag

>>> b = Bag.from_filenames('data/*.txt')

>>> def stem(word):
...     """ Stem word to primitive form """
...     return word.lower().rstrip(",.!:;'-\"").lstrip("'\"")

>>> dict(b.map(str.split).map(concat).map(stem).frequencies())
{...}

We use all of our cores and stream through memory on each core. We use multiprocessing but could get fancier with some work.

There are a lot of much larger, much more powerful systems that have a similar API, notably Spark and DPark. If you have Big Data and need to use very many machines then you should stop reading this and go install them.

I mostly made dask.bag because

  1. It was very easy given the work already done on dask.array
  2. I often only need multiprocessing + a heavy machine
  3. I wanted something trivially pip installable that didn’t use the JVM

But again, if you have Big Data, then this isn’t for you.

Design

As before, a Bag is just a dict holding tasks, along with a little meta data.

>>> d = {('x', 0): (range, 5),
...      ('x', 1): (range, 5),
...      ('x', 2): (range, 5)}

>>> from dask.bag import Bag
>>> b = Bag(d, 'x', npartitions=3)

In this way we break up one collection

[0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

into three independent pieces

[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]

When we abstractly operate on the large collection…

>>> b2 = b.map(lambda x: x * 10)

… we generate new tasks to operate on each of the components.

>>> b2.dask
{('x', 0): (range, 5),
 ('x', 1): (range, 5),
 ('x', 2): (range, 5),
 ('bag-1', 0): (map, lambda x: x * 10, ('x', 0)),
 ('bag-1', 1): (map, lambda x: x * 10, ('x', 1)),
 ('bag-1', 2): (map, lambda x: x * 10, ('x', 2))}

And when we ask for concrete results (the call to list) we spin up a scheduler to execute the resulting dependency graph of tasks.

>>> list(b2)
[0, 10, 20, 30, 40, 0, 10, 20, 30, 40, 0, 10, 20, 30, 40]

More complex operations yield more complex dasks. Beware, dask code is pretty Lispy. Fortunately these dasks are internal; users don’t interact with them.

>>> iseven = lambda x: x % 2 == 0
>>> b3 = b.filter(iseven).count().dask
{'bag-3': (sum, [('bag-2', 1), ('bag-2', 2), ('bag-2', 0)]),
 ('bag-2', 0): (count,
                (filter, iseven, (range, 5))),
 ('bag-2', 1): (count,
                (filter, iseven, (range, 5))),
 ('bag-2', 2): (count,
                (filter, iseven, (range, 5)))}

The current interface for Bag has the following operations:

all             frequencies         min
any             join                product
count           map                 std
filter          map_partitions      sum
fold            max                 topk
foldby          mean                var

Manipulations of bags create task dependency graphs. We eventually execute these graphs in parallel.

Execution

We repurpose the threaded scheduler we used for arrays to support multiprocessing to provide parallelism even on pure Python code. We’re careful to avoid unnecessary data transfer. None of the operations listed above requires significant communication. Notably we don’t have any concept of shuffling or scatter/gather.

We use dill to take care to serialize functions properly and collect/report errors, two issues that plague naive use of multiprocessing in Python.

>>> list(b.map(lambda x: x * 10))  # This works!
[0, 10, 20, 30, 40, 0, 10, 20, 30, 40, 0, 10, 20, 30, 40]

>>> list(b.map(lambda x: x / 0))   # This errs gracefully!
ZeroDivisionError:  Execption in remote Process

integer division or modulo by zero

Traceback:
    ...

These tricks remove the need for user expertise.

Productive Sweet Spot

I think that there is a productive sweet spot in the following configuration

  1. Pure Python functions
  2. Streaming/lazy data
  3. Multiprocessing
  4. A single large machine or a few machines in an informal cluster

This setup is common and it’s capable of handling terabyte scale workflows. In my brief experience people rarely take this route. They use single-threaded in-memory Python until it breaks, and then seek out Big Data Infrastructure like Hadoop/Spark at relatively high productivity overhead.

Your workstation can scale bigger than you think.

Example

Here is about a gigabyte of network flow data, recording which computers made connections to which other computers on the UC-Berkeley campus in 1996.

846890339:661920 846890339:755475 846890340:197141 168.237.7.10:1163 83.153.38.208:80 2 8 4294967295 4294967295 846615753 176 2462 39 GET 21068906053917068819..html HTTP/1.0

846890340:989181 846890341:2147 846890341:2268 13.35.251.117:1269 207.83.232.163:80 10 0 842099997 4294967295 4294967295 64 1 38 GET 20271810743860818265..gif HTTP/1.0

846890341:80714 846890341:90331 846890341:90451 13.35.251.117:1270 207.83.232.163:80 10 0 842099995 4294967295 4294967295 64 1 38 GET 38127854093537985420..gif HTTP/1.0

This is actually relatively clean. Many of the fields are space delimited (not all) and I’ve already compiled and run the decade-old C code needed to decompress it from its original format.

Let’s use Bag and regular expressions to parse this.

In [1]: from dask.bag import Bag, into

In [2]: b = Bag.from_filenames('UCB-home-IP*.log')

In [3]: import re

In [4]: pattern = """
   ...: (?P<request_time>\d+:\d+)
   ...: (?P<response_start>\d+:\d+)
   ...: (?P<response_end>\d+:\d+)
   ...: (?P<client_ip>\d+\.\d+\.\d+\.\d+):(?P<client_port>\d+)
   ...: (?P<server_ip>\d+\.\d+\.\d+\.\d+):(?P<server_port>\d+)
   ...: (?P<client_headers>\d+)
   ...: (?P<server_headers>\d+)
   ...: (?P<if_modified_since>\d+)
   ...: (?P<response_header_length>\d+)
   ...: (?P<response_data_length>\d+)
   ...: (?P<request_url_length>\d+)
   ...: (?P<expires>\d+)
   ...: (?P<last_modified>\d+)
   ...: (?P<method>\w+)
   ...: (?P<domain>\d+..)\.(?P<extension>\w*)(?P<rest_of_url>\S*)
   ...: (?P<protocol>.*)""".strip().replace('\n', '\s+')

In [5]: prog = re.compile(pattern)

In [6]: records = b.map(prog.match).map(lambda m: m.groupdict())

This returns instantly. We only compute results when necessary. We trigger computation by calling list.

In [7]: list(records.take(1))
Out[7]:
[{'client_headers': '2',
  'client_ip': '168.237.7.10',
  'client_port': '1163',
  'domain': '21068906053917068819.',
  'expires': '2462',
  'extension': 'html',
  'if_modified_since': '4294967295',
  'last_modified': '39',
  'method': 'GET',
  'protocol': 'HTTP/1.0',
  'request_time': '846890339:661920',
  'request_url_length': '176',
  'response_data_length': '846615753',
  'response_end': '846890340:197141',
  'response_header_length': '4294967295',
  'response_start': '846890339:755475',
  'rest_of_url': '',
  'server_headers': '8',
  'server_ip': '83.153.38.208',
  'server_port': '80'}]

Because bag operates lazily this small result also returns immediately.

To demonstrate depth we find the ten client/server pairs with the most connections.

In [8]: counts = records.pluck(['client_ip', 'server_ip']).frequencies()

In [9]: %time list(counts.topk(10, key=lambda x: x[1]))
CPU times: user 11.2 s, sys: 1.15 s, total: 12.3 s
Wall time: 50.4 s
Out[9]:
[(('247.193.34.56', '243.182.247.102'), 35353),
 (('172.219.28.251', '47.61.128.1'), 22333),
 (('240.97.200.0', '108.146.202.184'), 17492),
 (('229.112.177.58', '47.61.128.1'), 12993),
 (('146.214.34.69', '119.153.78.6'), 12554),
 (('17.32.139.174', '179.135.20.36'), 10166),
 (('97.166.76.88', '65.81.49.125'), 8155),
 (('55.156.159.21', '157.229.248.255'), 7533),
 (('55.156.159.21', '124.77.75.86'), 7506),
 (('55.156.159.21', '97.5.181.76'), 7501)]

Comparison with Spark

First, it is silly and unfair to compare with PySpark running locally. PySpark offers much more in a distributed context.

In [1]: import pyspark

In [2]: sc = pyspark.SparkContext('local')

In [3]: from glob import glob
In [4]: filenames = sorted(glob('UCB-home-*.log'))
In [5]: rdd = sc.parallelize(filenames, numSlices=4)

In [6]: import re
In [7]: pattern = ...
In [8]: prog = re.compile(pattern)

In [9]: lines = rdd.flatMap(lambda fn: list(open(fn)))
In [10]: records = lines.map(lambda line: prog.match(line).groupdict())
In [11]: ips = records.map(lambda rec: (rec['client_ip'], rec['server_ip']))

In [12]: from toolz import topk
In [13]: %time dict(topk(10, ips.countByValue().items(), key=1))
CPU times: user 1.32 s, sys: 52.2 ms, total: 1.37 s
Wall time: 1min 21s
Out[13]:
{('146.214.34.69', '119.153.78.6'): 12554,
 ('17.32.139.174', '179.135.20.36'): 10166,
 ('172.219.28.251', '47.61.128.1'): 22333,
 ('229.112.177.58', '47.61.128.1'): 12993,
 ('240.97.200.0', '108.146.202.184'): 17492,
 ('247.193.34.56', '243.182.247.102'): 35353,
 ('55.156.159.21', '124.77.75.86'): 7506,
 ('55.156.159.21', '157.229.248.255'): 7533,
 ('55.156.159.21', '97.5.181.76'): 7501,
 ('97.166.76.88', '65.81.49.125'): 8155}

So, given a compute-bound, mostly embarrassingly parallel task (regexes are comparatively expensive) on a single machine, the two are comparable.

Reasons you would want to use Spark

  • You want to use many machines and interact with HDFS
  • Shuffling operations

Reasons you would want to use dask.bag

  • Trivial installation
  • No mucking about with JVM heap sizes or config files
  • Nice error reporting. Spark error reporting includes the typical giant Java Stack trace that comes with any JVM solution.
  • Easier/simpler for Python programmers to hack on. The implementation is 350 lines including comments.

Again, this is really just a toy experiment to show that the dask model isn’t just about arrays. I absolutely do not want to throw Dask in the ring with Spark.

Conclusion

However I do want to stress the importance of single-machine parallelism. Dask.bag targets this application well and leverages common hardware in a simple way that is both natural and accessible to most Python programmers.

A skilled developer could extend this to work in a distributed memory context. The logic to create the task dependency graphs is separate from the scheduler.
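
As a rough illustration of that separation (a toy graph run with the threaded scheduler’s get function; the exact import location in this young codebase may shift):

>>> from dask.threaded import get

>>> dsk = {'a': (range, 5),
...        'b': (sum, 'a')}

>>> get(dsk, 'b')
10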

Special thanks to Erik Welch for finely crafting the dask optimization passes that keep the data flowing smoothly.

February 17, 2015 12:00 AM

February 13, 2015

Matthew Rocklin

Towards Out-of-core ND-Arrays -- Slicing and Stacking

This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project

tl;dr Dask.arrays can slice and stack. This is useful for weather data.

Introduction

This is the sixth in a sequence of posts constructing an out-of-core nd-array using NumPy, and dask. You can view these posts here:

  1. Simple task scheduling,
  2. Frontend usability
  3. A multi-threaded scheduler
  4. Matrix Multiply Benchmark
  5. Spilling to disk

Now we talk about slicing and stacking. We use meteorological data as an example use case.

Slicing

Dask.array now supports most of the NumPy slicing syntax. In particular it supports the following:

  • Slicing by integers and slices x[0, :5]
  • Slicing by a list/np.ndarray of integers x[[1, 2, 4]]
  • Slicing by a list/np.ndarray of booleans x[[False, True, True, False, True]]

It does not currently support the following:

  • Slicing one dask.array with another x[x > 0]
  • Slicing with lists in multiple axes x[[1, 2, 3], [3, 2, 1]]
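
For concreteness, a small sketch of the supported forms (using the blockshape keyword from this stage of dask.array):

>>> import numpy as np
>>> import dask.array as da

>>> x = da.from_array(np.arange(100).reshape(10, 10), blockshape=(5, 5))

>>> x[0, :5]                        # integers and slices
>>> x[[1, 2, 4]]                    # a list of integers
>>> x[np.arange(10) % 2 == 0]       # a boolean np.ndarray along an axis

Each of these returns a new (lazy) dask array; nothing is computed until you ask for it.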

Stack and Concatenate

We often store large arrays on disk in many different files. We want to stack or concatenate these arrays together into one logical array. Dask solves this problem with the stack and concatenate functions, which stitch many arrays together into a single array, either creating a new dimension with stack or along an existing dimension with concatenate.

Stack

We stack many existing dask arrays into a new array, creating a new dimension as we go.

>>> import dask.array as da
>>> arrays = [da.from_array(np.ones((4, 4)), blockshape=(2, 2))
...            for i in range(3)]  # A small stack of dask arrays

>>> da.stack(arrays, axis=0).shape
(3, 4, 4)

>>> da.stack(arrays, axis=1).shape
(4, 3, 4)

>>> da.stack(arrays, axis=2).shape
(4, 4, 3)

This creates a new dimension with length equal to the number of slices

Concatenate

We concatenate existing arrays into a new array, extending them along an existing dimension

>>> import dask.array as da
>>> arrays = [da.from_array(np.ones((4, 4)), blockshape=(2, 2))
...            for i in range(3)]  # small stack of dask arrays

>>> da.concatenate(arrays, axis=0).shape
(12, 4)

>>> da.concatenate(arrays, axis=1).shape
(4, 12)

Case Study with Meteorological Data

To test this new functionality we download meteorological data from the European Centre for Medium-Range Weather Forecasts. In particular we have the temperature for the Earth every six hours for all of 2014 with a spatial resolution of a quarter degree. We download this data using this script (please don’t hammer their servers unnecessarily; thanks to Stephan Hoyer for pointing me to this dataset).

As a result, I now have a bunch of netCDF files!

$ ls
2014-01-01.nc3  2014-03-18.nc3  2014-06-02.nc3  2014-08-17.nc3  2014-11-01.nc3
2014-01-02.nc3  2014-03-19.nc3  2014-06-03.nc3  2014-08-18.nc3  2014-11-02.nc3
2014-01-03.nc3  2014-03-20.nc3  2014-06-04.nc3  2014-08-19.nc3  2014-11-03.nc3
2014-01-04.nc3  2014-03-21.nc3  2014-06-05.nc3  2014-08-20.nc3  2014-11-04.nc3
...             ...             ...             ...             ...
>>> import netCDF4
>>> t = netCDF4.Dataset('2014-01-01.nc3').variables['t2m']
>>> t.shape
(4, 721, 1440)

The shape corresponds to four measurements per day (24h / 6h), 721 measurements North/South (180 / 0.25, including both endpoints) and 1440 measurements East/West (360 / 0.25). There are 365 files.

Great! We collect these under one logical dask array, concatenating along the time axis.

>>> from glob import glob
>>> filenames = sorted(glob('2014-*.nc3'))
>>> temps = [netCDF4.Dataset(fn).variables['t2m'] for fn in filenames]

>>> import dask.array as da
>>> arrays = [da.from_array(t, blockshape=(4, 200, 200)) for t in temps]
>>> x = da.concatenate(arrays, axis=0)

>>> x.shape
(1464, 721, 1440)

Now we can play with x as though it were a NumPy array.

avg = x.mean(axis=0)
diff = x[1000] - avg

If we want to actually compute these results we have a few options

>>> diff.compute()  # compute result, return as array, float, int, whatever is appropriate
>>> np.array(diff)  # compute result and turn into `np.ndarray`
>>> diff.store(anything_that_supports_setitem)  # For out-of-core storage

Alternatively, because many scientific Python libraries call np.array on inputs, we can just feed our da.Array objects directly into matplotlib (hooray for the __array__ protocol!):

>>> from matplotlib.pyplot import imshow
>>> imshow(x.mean(axis=0), cmap='bone')
>>> imshow(x[1000] - x.mean(axis=0), cmap='RdBu_r')

I suspect that the temperature scale is in Kelvin. It looks like the random day is taken during Northern Summer. Another fun one: let’s look at the difference between the temperatures at 00:00 and at 12:00

>>> imshow(x[::4].mean(axis=0) - x[2::4].mean(axis=0), cmap='RdBu_r')

Even though this looks and feels like NumPy we’re actually operating off of disk using blocked algorithms. We execute these operations using only a small amount of memory. If these operations were computationally intense (they aren’t) then we would also benefit from multiple cores.

What just happened

To be totally clear the following steps just occurred:

  1. Open up a bunch of netCDF files and locate a temperature variable within each file. This is cheap.
  2. For each of those temperature variables create a da.Array object, specifying how we want to block up that variable. This is also cheap.
  3. Make a new da.Array by concatenating all of our da.Arrays for each day. This, like the other steps, is just book-keeping. We haven’t loaded data or computed anything yet.
  4. Write numpy-style code x[::4].mean(axis=0) - x[2::4].mean(axis=0). This creates yet another da.Array with a more complex task graph. It takes a few hundred milliseconds to create this dictionary.
  5. Call imshow on our da.Array object
  6. imshow calls np.array on its input; this starts the multi-core task scheduler
  7. A flurry of chunks fly out of all the netCDF files. These chunks meet various NumPy functions and create new chunks. Well organized magic occurs and an np.ndarray emerges.
  8. Matplotlib makes a pretty graph

Problems that Popped Up

The threaded scheduler is introducing significant overhead in its planning. For this workflow the single-threaded naive scheduler is actually significantly faster. We’ll have to find better solutions to reduce scheduling overhead.

Conclusion

I hope that this shows off how dask.array can be useful when dealing with collections of on-disk arrays. As always I’m very happy to hear how we can make this project more useful for your work. If you have large n-dimensional datasets I’d love to hear about what you do and how dask.array can help. I can be reached either in the comments below or at blaze-dev@continuum.io.

Acknowledgements

First, other projects can already do this. In particular if this seemed useful for your work then you should probably also know about Biggus, produced by the UK Met office, which has been around for much longer than dask.array and is used in production.

Second, this post shows off work from the following people:

  1. Erik Welch (Continuum) wrote optimization passes to clean up dask graphs before execution.
  2. Wesley Emeneker (Continuum) wrote a good deal of the slicing code
  3. Stephan Hoyer (Climate Corp) talked me through the application and pointed me to the data. If you’d like to see dask integrated with xray then you should definitely bug Stephan :)

February 13, 2015 12:00 AM

February 12, 2015

Continuum Analytics

Find Continuum Analytics at the Strata + Hadoop World Conference 2015

The Continuum Analytics team will be attending the Strata + Hadoop World Conference from February 17-20. We’re excited to be showcasing the story of Anaconda with the power of Python at Strata, and we look forward to meeting those attending.

by Continuum at February 12, 2015 12:00 AM

February 11, 2015

Matthew Rocklin

Into and Remote Data

tl;dr into now handles data on remote machines, including HDFS and the Hive Metastore (kinda).

Motivation

Last week I wrote about into, a library to migrate data between formats. We saw that a network of pairwise data conversions can robustly migrate data, eliminating some of the frustration of data science.

This frustration compounds when data lives on other computers or distributed file systems like HDFS. Moving data from your local machine into something like the Hive metastore often requires several steps.

  1. scp data to cluster
  2. hadoop fs -cp data to HDFS
  3. CREATE TABLE in Hive/Impala to register data with metastore
  4. Write SQL queries

While each of these steps may be relatively straightforward, their combination can be daunting to the casual analyst.

Remote data and into

So we took this as a case study and extended the into network appropriately. We integrate the following libraries and protocols

  • ssh://hostname:myfile.csv accesses data on remote machines through paramiko
  • hdfs://hostname:myfile.csv accesses data on the Hadoop distributed file system through WebHDFS using the pywebhdfs library
  • hive://hostname::tablename accesses data on the Hive Metastore using a combination of SQLAlchemy and hand crafted CREATE TABLE / LOAD statements

SSH

into is now a fancy scp.

>>> auth = {'username': 'alice',
...         'key_filename': '.ssh/id_rsa'}

>>> into('ssh://hostname:myfile.csv', 'myfile.csv', **auth)   # Move local file
>>> into('ssh://hostname:myfile.csv', 'myfile.json', **auth)  # Move local file

Because we’re connected to the network, lots of other things work too.

>>> df = into(pd.DataFrame, 'ssh://hostname:myfile.json', **auth)

Note that we’re not calling any code on the remote machine so fancy conversions always happen locally.

If you’d like to use ssh generally you might want to take a look at Paramiko which is really doing all of the heavy lifting here.

HDFS

WebHDFS is a web interface to the Hadoop File System. It is surprisingly high-performance (I often erroneously think of HTTP as slow) but isn’t always turned on in every installation. If it is, then you should be able to transfer data in and out easily, just as we did for SSH:

>>> auth = {'user': 'hdfs',
...         'port': '14000'}
>>> into('hdfs://hostname:myfile.csv', 'myfile.csv', **auth)

Hive

The interesting piece is Hive, which, in into parlance, expects one of the following kinds of data:

ssh://single-file.csv
ssh://directory-of-files/*.csv
hdfs://directory-of-files/*.csv

And so we build these routes, enabling operations like the following:

>>> into('hive://hostname/default::mytable',
...      'ssh://hostname:myfile.csv', **auth)
>>> into('hive://hostname/default::mytable',
...      'ssh://hostname:mydata/*.csv', **auth)
>>> into('hive://hostname/default::mytable',
...      'hdfs://hostname:mydata/*.csv', **auth)

But Hive is also a bit finicky. Blaze uses the PyHive SQLAlchemy dialect to query Hive tables; unfortunately, because of the way Hive works, we need to create those tables by hand. Hive is different from most databases in that it doesn’t have an internal format. Instead, it represents tables as directories of CSV files (or other things). This distinction mucks up into’s approach a bit, but things work ok in normal situations.
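
To give a flavor of those hand-crafted statements, here is a hedged sketch using the PyHive dialect; the table name, columns, and HDFS path are made up, and this is not the exact SQL that into generates.

# A rough sketch of hand-built Hive DDL and load statements via SQLAlchemy + PyHive
# (table, columns, and paths are illustrative assumptions)
from sqlalchemy import create_engine

engine = create_engine('hive://hostname:10000/default')
with engine.connect() as conn:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS mytable (
            id BIGINT, name STRING, balance BIGINT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE""")
    conn.execute("LOAD DATA INPATH '/user/ubuntu/mydata' INTO TABLE mytable")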

Lessons Learned

We had to add a couple new ideas to into to expand out to these systems.

Type Modifiers

First, we needed a way to refer to different variants of the same format of file. For example, for CSV files we now have the following variants

A local CSV file
A CSV file accessible through HDFS
A CSV file accessible through SSH
A directory of CSV files
A directory of CSV files on HDFS
...

And the same for JSON, text, etc. Into decides which conversion functions to run based on the type of the data, so in principle we need subclasses for all combinations of format and location. Yuck.

To solve this problem we create functions SSH, HDFS, and Directory that build these subclasses; we call them type modifiers. So SSH(CSV) is a new type that acts both like a CSV file and like the hidden _SSH superclass.

>>> SSH(CSV)('/path/to/data', delimiter=',', hostname='54.131.11.43', user='ubuntu')
>>> Directory(JSON)('/path/to/data/')

Note that users don’t usually see these types (unless they want to be explicit); they usually interact with URI strings.
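
For a feel of how a URI string becomes one of these modified types, here is a small sketch; it assumes into exposes a resource helper that parses URI strings, so treat the exact call and output as illustrative.

# A hedged sketch: parse a URI string into a modified type
# (assumes into's resource helper; hostname and credentials are placeholders)
from into import resource

csv = resource('ssh://hostname:myfile.csv',
               username='ubuntu', key_filename='.ssh/id_rsa')
print(type(csv).__name__)   # expected: something like 'SSH(CSV)'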

Temporary files

Second, we need a way to route through temporary files. E.g. consider the following route:

SSH(CSV) -> CSV -> pd.DataFrame

Both steps of this path are easy given paramiko and pandas. However, we don’t want the intermediate CSV file to hang around (users would hate us if we slowly filled up their /tmp folder). We need to clean it up when we’re done.

To solve this problem, we introduce a new type modifier, Temp, that drops itself on garbage collection (drop is another magic function in into, see docs). This lets us tie the Python garbage collector to persistent data outside of the Python process. It’s not fool-proof, but it is pretty effective.

SSH(CSV) -> Temp(CSV) -> pd.DataFrame

This is also a good example of how we build type modifiers. You can safely ignore the following code:

from toolz import memoize
from into import drop  # drop deletes the underlying resource (file, table, ...)

class _Temp(object):
    """ Temporary version of persistent storage

    >>> from into import Temp, CSV
    >>> csv = Temp(CSV)('/tmp/myfile.csv', delimiter=',')
    """
    def __del__(self):
        drop(self)  # remove the on-disk resource when the Python object is collected

def Temp(cls):
    # Dynamically build a subclass that mixes _Temp into the persistent type
    return type('Temp(%s)' % cls.__name__, (_Temp, cls), {'persistent_type': cls})

Temp.__doc__ = _Temp.__doc__
Temp = memoize(Temp)  # so Temp(CSV) always returns the same class

I won’t be surprised if this approach concerns a few people but I’ve found it to be effective so far.

Keyword Arguments

The number of possible keyword arguments to a single into call is increasing. We don’t have a good mechanism to help users discover the valid options for their situation. Docstrings are hard here because the options depend on the source and target inputs. For the moment we’re solving this with online documentation for each complicated format but there is probably a better solution out there.

Help!

The behavior around ssh://, hdfs://, and hive:// is new, error-prone, and could really use play-testing. I strongly welcome feedback and error reporting here. You could file an issue or e-mail blaze-dev@continuum.io.

Other

I didn’t mention anything about the S3 and Redshift support that was also recently merged. This is because I think Phil Cloud might write a separate blogpost about it. We did this work in parallel in an effort to hash out how best to solve the problems above. I think it worked decently well.

Also, we’ve added an into command line interface. It works just like the into function with strings, except that we’ve reversed the order of the arguments to be more like cp. An example is below:

$ into source target --key value --key value --key value
$ into myfile.csv ssh://hostname:myfile.json --delimiter ','

We also have docs! http://into.readthedocs.org/en/latest/

February 11, 2015 12:00 AM

February 09, 2015

numfocus

NumFOCUS 2014 Review

2014 Mosaic

While not represented in our blog activity, 2014 turned out to be a very busy year for NumFOCUS. We sponsored more projects, gained new board members, and added programs. It has been hard to keep up with all the activity within the organization, but perhaps much harder for those outside. Let me go through a few of the highlights from the year to get folks caught up.

First and foremost, we held our first election process. This has allowed us to welcome Lorena Barba, Brian Granger, Stefan Karpinski, and Cindee Madison to the board of directors. In an effort to grow the number of people helping out, we have also formed two new ways for people to contribute: the Advisor Council and Vice President roles. The inaugural Advisor Council consists of Fernando Perez and Travis Oliphant; they will help guide the organization as it grows in both scope and operations. The new Vice Presidents are Sheila Miguez, James Powell, and Ben Zaitlen, all committed community members who have helped in various ways to get us started. We are always looking for more volunteers to help build our organization, but with these added team members we hope to continue to deliver high-quality services to our community.

Over the year we held four PyData conferences, supported one John Hunter Technology Fellow, led Women in Technology bootcamps, helped between 20 and 30 conference attendees travel to important conferences, and more. As the new year starts, we will be doing many of these things again, plus working with the Training Up program. We see the need not only to sponsor projects but also to create important programs that bring the community together. As always, we are looking for more volunteers for these events, so please do get in touch.

Perhaps one of the most exciting things we did this year was expanding our language scope from a Python-centric organization to include several non-Python projects. ROpenSci and Julia are now part of our lineup. Joining us on the education front are Software Carpentry and its sister organization Data Carpentry. We also finished signing the Python projects we had been working with for a while, including SymPy, IPython, and AstroPy. While fiscal sponsorship doesn’t make sense for every project, it is a service we happily provide to help projects sustain themselves and grow in natural ways.

In the coming year, we plan to be a great deal more vocal about our activities. Personally, my two major goals are to expand our reach into Europe and to start making headway on project governance recommendations. But that is another story.

by Andy R. Terrel at February 09, 2015 06:00 AM

February 08, 2015

Titus Brown

Review: "Accurate multi-kb reads resolve complex populations and detect rare microorganisms"

Here is the top bit of a review I wrote of a very nice paper by Itai Sharon et al., from Jill Banfield's lab, on using Illumina TruSeq long reads (a.k.a. Moleculo), to look at complex metagenomes.

The paper is newly available here (although it is behind a paywall ;(.

Citation: Accurate, multi-kb reads resolve complex populations and detect rare microorganisms Genome Res. gr.183012.114. Published in Advance February 9, 2015. doi: 10.1101/gr.183012.114


This is an excellent application of new long-read technology to further illuminate the characteristics of several medium-to-high complexity microbial communities. The methods are expert, the results are fascinating, and the discussion is well done.

Objectives:

  1. test the efficacy of assembling Moleculo reads to improve short-read contigs;
  2. evaluate accuracy of curated short-read assemblies;
  3. look at organisms present at very low abundance levels;
  4. evaluate levels of sequence variation & genomic content in strains that could not otherwise be resolved by short-read assembly;

Results:

  1. Long-read data revealed many very abundant organisms...
  2. ...that were entirely missed in short-read assemblies.
  3. Genome architecture and metabolic potential were reconstructed using a new synteny based method.
  4. "Long tail" of low-abundance organisms belong to phyla represented by highly abundant organisms.
  5. Diversity of closely-related strains & rare organisms account for major portion of the communities.

The portion of the results that is most novel and most fascinating is the extensive analysis of rare sequences and the disparity in observations from Illumina (assemblies) and Moleculo (long reads and assemblies). The basic results are, on first examination, counter-intuitive: many long-read sequences are obtained from abundant organisms that simply don't show up in Illumina short-read assemblies. The statement is made that this is because of strain variation in the community, i.e. that Illumina assemblies are fragmented due to strain variation and this blocks the observation of the majority of the community. This is to some extent borne out by the low mapping percentages (which are the primary evidence offered by the authors), and also matches our own observations.


--titus

by C. Titus Brown at February 08, 2015 11:00 PM

February 07, 2015

Titus Brown

How we develop software (2015 version)

A colleague who is starting their own computational lab just asked me for some advice on how to run software projects, and I wrote up the following. Comments welcome!


A brief summary of what we've converged on for our own needs is this:


--titus

by C. Titus Brown at February 07, 2015 11:00 PM

February 05, 2015

Continuum Analytics

Continuum Analytics - February Tech Events

The Continuum Analytics team believes in engaging with our community and fostering relationships throughout the tech world. To keep you all in the loop, we’re updating our blog each month with a list of events where you can find us attending, hosting, or participating!

by Continuum at February 05, 2015 12:00 PM

February 04, 2015

Juan Nunez-Iglesias

jnuneziglesias

It’s official! Harriet Dashnow, Stéfan van der Walt, and I will be writing an O’Reilly book about the SciPy library and the surrounding ecosystem. The book is called Elegant SciPy, and is intended to teach SciPy to fledgling Pythonistas, guided by the most elegant SciPy code examples we can find.

So, if you recently came across scientific Python code that made you go “Wow!” with its elegance, simplicity, cleverness, or power, please point us to it! As an example, have a look at Vighnesh Birodkar’s code to build a region adjacency graph from an n-dimensional image, which I highlighted previously here.

Each chapter will have one or two snippets that we will work towards. Each of these will be credited as “written by/nominated by”, and needs to be published under a permissive license such as MIT, BSD, or public domain to be considered for inclusion. We would especially like nominations in the following categories:

  • statistics (using scipy.stats)
  • image processing and computer vision
  • Fourier transforms
  • sparse matrix operations
  • eigendecomposition and linear algebra
  • optimization
  • streaming data analysis
  • spatial and geophysical data analysis

We’ll also consider other parts of the SciPy library and ecosystem.

We invite you to submit code snippets for inclusion in the book. We’d also appreciate a small description of about one paragraph explaining what the code is used for and why you think it’s elegant, even though this is often self-evident. =)

How to submit

Thank you,

Juan, Harriet, and Stéfan.


by Juan Nunez-Iglesias at February 04, 2015 11:23 PM

February 03, 2015

Matthew Rocklin

ReIntroducing Into

This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project

tl;dr into efficiently migrates data between formats.

Motivation

We spend a lot of time migrating data from common interchange formats, like CSV, to efficient computation formats like an array, a database or binary store. Worse, many don’t migrate data to efficient formats because they don’t know how or can’t manage the particular migration process for their tools.

Your choice of data format is important. It strongly impacts performance (10x is a good rule of thumb) and who can easily use and interpret your data.

When advocating for Blaze I often say “Blaze can help you query your data in a variety of formats.” This assumes that you’re able to actually get it into that format.

Enter the into project

The into function efficiently migrates data between formats. These formats include both in-memory data structures like the following:

list, set, tuple, Iterator
numpy.ndarray, pandas.DataFrame, dynd.array
Streaming Sequences of any of the above

as well as persistent data living outside of Python like the following:

CSV, JSON, line-delimited-JSON
Remote versions of the above
HDF5 (both standard and Pandas formatting), BColz, SAS
SQL databases (anything supported by SQLAlchemy), Mongo

The into project migrates data between any pair of these formats efficiently by using a network of pairwise conversions. (visualized towards the bottom of this post)

How to use it

The into function takes two arguments, a source and a target. It moves data in the source to the target. The source and target can take the following forms

Target    Source    Example
Object    Object    A particular DataFrame or list
String    String    'file.csv', 'postgresql://hostname::tablename'
Type                Like list or pd.DataFrame

So the following would be valid calls to into

>>> into(list, df)  # create new list from Pandas DataFrame
>>> into([], df)  # append onto existing list
>>> into('myfile.json', df)  # Dump dataframe to line-delimited JSON
>>> into(Iterator, 'myfiles.*.csv') # Stream through many CSV files
>>> into('postgresql://hostname::tablename', df)  # Migrate dataframe to Postgres
>>> into('postgresql://hostname::tablename', 'myfile.*.csv')  # Load CSVs to Postgres
>>> into('myfile.json', 'postgresql://hostname::tablename') # Dump Postgres to JSON
>>> into(pd.DataFrame, 'mongodb://hostname/db::collection') # Dump Mongo to DataFrame

Note that into is a single function. We’re used to doing this with various to_csv and from_sql methods on various types. The into API is very small; here is what you need in order to get started:

$ pip install into
>>> from into import into

See into on github

Examples

We now show some of those same examples in more depth.

Turn list into numpy array

>>> import numpy as np
>>> into(np.ndarray, [1, 2, 3])
array([1, 2, 3])

Load CSV file into Python list

>>> into(list, 'accounts.csv')
[(1, 'Alice', 100),
 (2, 'Bob', 200),
 (3, 'Charlie', 300),
 (4, 'Denis', 400),
 (5, 'Edith', 500)]

Translate CSV file into JSON

>>> into('accounts.json', 'accounts.csv')
$ head accounts.json
{"balance": 100, "id": 1, "name": "Alice"}
{"balance": 200, "id": 2, "name": "Bob"}
{"balance": 300, "id": 3, "name": "Charlie"}
{"balance": 400, "id": 4, "name": "Denis"}
{"balance": 500, "id": 5, "name": "Edith"}

Translate line-delimited JSON into a Pandas DataFrame

>>> import pandas as pd
>>> into(pd.DataFrame, 'accounts.json')
   balance  id      name
0      100   1     Alice
1      200   2       Bob
2      300   3   Charlie
3      400   4     Denis
4      500   5     Edith

How does it work?

This is challenging. Robust and efficient conversion between any pair of formats is fraught with special cases and bizarre libraries. The common solution is to convert through a common format like a DataFrame, or streaming in-memory lists, dicts, etc. (see dat), or through a serialization format like ProtoBuf or Thrift. These are excellent options and often what you want. Sometimes, however, this can be slow, particularly when dealing with live computational systems or with finicky storage solutions.

Consider for example, migrating between a numpy.recarray and a pandas.DataFrame. We can migrate this data very quickly in place. The bytes of data don’t need to change, only the metadata surrounding them. We don’t need to serialize to an interchange format or translate to intermediate pure Python objects.
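
As a hedged illustration of such a cheap conversion (plain NumPy/pandas calls with made-up values, not into’s internal edge functions):

# Convert between a structured NumPy array and a DataFrame without any
# serialization to an interchange format
import numpy as np
import pandas as pd

rec = np.array([(1, 100.0), (2, 200.0)],
               dtype=[('id', 'i8'), ('balance', 'f8')])

df = pd.DataFrame.from_records(rec)   # recarray -> DataFrame
rec2 = df.to_records(index=False)     # DataFrame -> recarray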

Consider migrating data from a CSV file to a PostgreSQL database. Using Python iterators through SQLAlchemy we rarely exceed migration speeds greater than 2000 records per second. However using direct CSV loaders native to PostgreSQL we can achieve speeds greater than 50000 records per second. This is the difference between an overnight job and a cup of coffee. However this requires that we’re flexible enough to use special code in special situations.
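
To make the contrast concrete, here is a hedged sketch of the two routes with psycopg2; the database, table, and file names are made up and the generic route is deliberately simplified.

# A hedged sketch of the contrast (database, table, and file names are made up)
import psycopg2

conn = psycopg2.connect('dbname=mydb')

# Generic route: push Python tuples through the DB-API one batch at a time.
# Robust, but rarely faster than a few thousand records per second.
records = [(1, 'Alice', 100), (2, 'Bob', 200)]
with conn, conn.cursor() as cur:
    cur.executemany("INSERT INTO accounts VALUES (%s, %s, %s)", records)

# Specialized route: hand the file to PostgreSQL's native CSV loader.
# Often an order of magnitude faster for large files.
with conn, conn.cursor() as cur, open('accounts.csv') as f:
    cur.copy_expert("COPY accounts FROM STDIN WITH CSV HEADER", f)

conn.close()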

Expert pairwise interactions are often an order of magnitude faster than generic solutions.

Into is a network of these pairwise migrations. We visualize that network below:

Into's decentralized migration scheme. Complex but powerful

Each node is a data format. Each directed edge is a function that transforms data between two formats. A single call to into may traverse multiple edges and multiple intermediate formats. For example, when we migrate a CSV file to a Mongo database we might take the following route:

  • Load into a DataFrame (pandas.read_csv)
  • Convert to np.recarray (DataFrame.to_records)
  • Then to a Python Iterator (np.ndarray.tolist)
  • Finally to Mongo (pymongo.Collection.insert)

Alternatively we could write a special function that uses MongoDB’s native CSV loader and shortcut this entire process with a direct edge CSV -> Mongo.

To find the most efficient route we weight the edges of this network with relative costs (measured ad hoc). We use networkx to find the shortest path before we start the migration. If for some reason an edge fails (raises NotImplementedError) we can reroute automatically. In this way we’re both efficient and robust to failure.
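
The routing itself is ordinary graph search. The sketch below is illustrative only: the node names, edge costs, and graph construction are stand-ins, not into’s internal data structures.

# A toy version of the routing idea with networkx (costs are made up)
import networkx as nx

g = nx.DiGraph()
g.add_edge('CSV', 'DataFrame', cost=1.0)       # pandas.read_csv
g.add_edge('DataFrame', 'recarray', cost=0.1)  # DataFrame.to_records
g.add_edge('recarray', 'Iterator', cost=1.0)   # ndarray.tolist
g.add_edge('Iterator', 'Mongo', cost=2.0)      # pymongo insert

path = nx.shortest_path(g, 'CSV', 'Mongo', weight='cost')
print(path)  # ['CSV', 'DataFrame', 'recarray', 'Iterator', 'Mongo']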

Note that we color some nodes red. These nodes can be larger than memory. When we migrate between two red nodes (both the input and output may be larger than memory) we limit our path to the red subgraph to ensure that we don’t blow up mid-migration. One format to note is chunks(...), e.g. chunks(DataFrame), which is an iterable of in-memory DataFrames. This convenient meta-format allows us to use compact data structures like numpy arrays and pandas DataFrames on large data while keeping only a few tens of megabytes in memory at a time.
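
A minimal sketch of the chunking pattern, using pandas’ own chunked CSV reader rather than into’s actual chunks implementation:

# Stream an arbitrarily large CSV as an iterable of modest in-memory DataFrames
import pandas as pd

def iter_dataframe_chunks(path, chunksize=100000):
    for chunk in pd.read_csv(path, chunksize=chunksize):
        yield chunk

# e.g. feed each chunk to a larger-than-memory target, one piece at a time
# for chunk in iter_dataframe_chunks('big.csv'):
#     append_to_target(chunk)   # hypothetical sink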

The networked approach allows developers to write specialized code for special situations and know that this code will only be used in the right situation. This approach allows us to handle a very complex problem in an isolated and separable manner. The central dispatching system keeps us sane.

History

I wrote about into long ago in connection to Blaze. I then promptly shut up about it. This was because the old implementation (before the network approach) was difficult to extend/maintain and wasn’t ready for prime-time.

I am very happy with this network. Unexpected applications very often just work and into is now ready for prime-time. It’s also available independently from Blaze, both via conda and via pip. The major dependencies are NumPy, Pandas, and NetworkX so it’s relatively lightweight for most people who read my blog. If you want to take advantage of some of the higher performing formats, like HDF5, you’ll need to install those libraries as well (pro-tip, use conda).

How do I get started?

You should download a recent version.

$ pip install --upgrade git+https://github.com/ContinuumIO/into
or
$ conda install into --channel blaze

You then might want to go through the first half of this tutorial

Or read the docs.

Or just give it a shot without reading anything. My hope is that the interface is simple enough (just one function!) that users can pick it up naturally. If you run into issues then I’d love to hear about them at blaze-dev@continuum.io.

February 03, 2015 12:00 AM

February 02, 2015

Titus Brown

khmer development sprint: Feb 19-20 and 23-25th, 2015

Michael R. Crusoe and I are throwing a sprint!

Somewhat in the vein of last year's mini-Hackathon, Michael and I and other members of the lab are going to focus in on reviewing contributions and closing issues on the khmer project for a 5 day period.

To track the sprint, subscribe to the github issue.

Michael and I will be working ~9-5pm Eastern, Thu/Fri/Mon/Tues/Wed, Feb 19-25th (so weekend excepted), and people are welcome to drop in anytime. We are planning to focus on khmer, screed, khmer-protocols, and khmer-recipes; other lab projects (like paper pipelines or workshop materials) are fair game, too. We'll have a list of issues posted for people who are looking for a small task.

We don't have any particular physical location reserved, but you're welcome to join us at Michigan State University to hang out in person. We also plan to be fully online-enabled within the 9-5pm EDT period, and will have a Google Hangout running that anyone can join. We can also set up one-to-one video conferences and screen sharing with people who need help. (We will not be working outside of those hours: family, sanity, etc.)

Here are reasons you might want to join us:

  • you're interested in seeing the "github flow" model in action, for scientific software (including automated tests, continuous integration, code review, and a pre-merge checklist!);
  • you have a particular bug or problem you want to fix in any of our software;
  • you work for or with the lab;
  • you want to see some of the technical underpinnings of our approach(es) to open science;

Again, subscribe here to be kept in the loop as our plans progress. And check out our WSSPE report on last July's hackathon!

cheers, --titus

by C. Titus Brown at February 02, 2015 11:00 PM

February 01, 2015

Titus Brown

Lab for Data Intensive Biology at UC Davis joins Software Carpentry as an Affiliate

We are pleased to announce that the Laboratory for Data Intensive Biology at UC Davis has joined the Software Carpentry Foundation as an Affiliate Member for three years, starting in January 2015.

"We've been long-term supporters of Software Carpentry, and Affiliate status lets us support the Software Carpentry Foundation in a tangible way," said Dr. C. Titus Brown, the lab director. "This status also gives us the opportunity to include Software Carpentry as part of a larger biological data science training program at UC Davis."

Software Carpentry is a volunteer organisation whose goal is to make scientists more productive, and their work more reliable, by teaching them basic computing skills. Founded in 1998, it runs short, intensive workshops that cover program design, version control, testing, and task automation. It has taught over 10,000 learners, and primarily uses a two-day workshop format.

In October 2014, a non-profit Software Carpentry Foundation was created to act as a governing body for the project, with Dr. Greg Wilson as the founding Executive Director. Affiliate status with the SCF provides preferential access to instructor training as well as waiving per-workshop fees.

The Laboratory for Data Intensive Biology (DIB) is newly started with Dr. Brown's move to UC Davis in January 2015. The lab is in the Department of Population Health and Reproduction in the School of Veterinary Medicine, and is part of both the Data Science Initiative and the Genome Center at UC Davis. DIB does fundamental and applied research on making use of the increasing volume and variety of biological data, and Dr. Brown currently has funding from the NIH, the USDA, and the Moore Foundation.

Read more about Software Carpentry at the Software Carpentry blog, and also see Software Carpentry's announcement of UC Davis' affiliate status.

by C. Titus Brown at February 01, 2015 11:00 PM

January 31, 2015

The Nipy blog

Ugly speaking

It can be miserable reading blog posts about software.

I just skip-read this post about Python 3 by Armin Ronacher. I skip-read it because I found it so ugly.

There was a BBC sitcom called One foot in the grave. It had a character called Victor Meldrew who was always angry. His catchphrase was an enraged I can't believe it.

For some reason, this genre seems to be a standard for blog posts. The author just can't believe that X (a language or operating system) could be so stupid as to do Y.

It is a little puzzling, because I guess most people reading someone saying how stupid they are, do not say "ah yes, thank you" but "f#@& you asshole". So, if we do write in that style, we make it less likely the problem will be fixed. I suppose that's obvious to the person writing, so there must be some other thing that is attractive about labeling the other person / system / idea as a loser.

We technical types don't like to think about soft issues like tone. Like the drunk under the lamp-post we prefer technical problems, and ignore other important problems such as the way we are making our readers feel.

When I am reading a blog post of the "I can't believe it" type, I feel like someone with bad breath is shouting in my face, and my first and strong reaction is to move away and avoid that person whenever I see them. For example, I suppose I won't click on a link to blog post by Armin Ronacher again, because I predict it will make me feel bad. I am sure that is a shame for me, because I have every reason to think Mr Ronacher is competent, intelligent and well-informed. It is just I don't want to feel that way. I find it harder to work when I feel that way. I am less gentle with my friends when I feel that way.

by Matthew Brett at January 31, 2015 07:44 PM

Randy Olson

Python usage survey 2014

Remember that Python usage survey that went around the interwebs late last year? Well, the results are finally out and I’ve visualized them below for your perusal. This survey has been running for two years now (2013-2014), so where we have data for both years, I’ve charted the results so we can see the changes in Python usage over time.

I’ll note that a big focus of this survey is to find out if Python users are transitioning over to Python 3, and if they aren’t, then why they aren’t making that transition. If you’re on the fence about switching from Python 2 to 3, there are some great articles out there about the key differences between the two versions and the many, many, …many advantages that Python 3 offers over Python 2.

If you want to check out the underlying data yourself, head on over to Bruno Cauet’s page.


The first big question that we all want to know is: How many Python users have actually used Python 2 and 3?

python-survey-2014-used-2x-3x

As expected, nearly every Python user has used Python 2 at some point in their career. We also see good news for Python 3 over the past year: Python 3 usage increased 12 percentage points in 2014, up to nearly 3/4 of the surveyed Python users.


Of course, “writing code with Python 3 once” doesn’t mean that they actually use it regularly. The next question gets at that more directly.

python-survey-2014-use-2x-3x-more

Here we see yet more good news for Python 3: As much as 10% more Python users are primarily using Python 3 than in 2013, now accounting for 1/3 of all Python users. It seems the transition to Python 3 will be slow but steady for the next few years.


The transition we see above may be caused by project managers at work. What version do Python users go to when working on their personal projects?

python-survey-2014-personal-project

Here we see a closer divide between the two versions: Nearly half of all users will start up their own projects with Python 3, whereas the other half still ardently remains pro-Python 2.


Let’s break these usage patterns down by the specific version now.

python-survey-2014-versions-regularly-use

Somehow, Python v. 2.5 and 2.6 are still in use in some places, but 2.7 still dominates the Python 2 landscape. We also see telltale signs that Python 3 is becoming a mature language in its own right, with users stuck in the older 3.2 and 3.3 versions.


To summarize so far: Over 2/3 of all Python users are still using Python 2, with the majority of them sitting at 2.7. This tells us that Python is still very much a divided language, with a large portion of the user base unwilling to upgrade to the latest version despite the fact that the latest Python 2 release (2.7) is nearly 5 years old now. (That’s ancient in programming language years!)

A common complaint of the ardent Python 2 users is that Python 3 was a huge mistake. How does that turn out in this survey?

python-survey-2014-python3-mistake

Surprisingly, it seems the complainers are a minority: Only 12% of the Python users surveyed think Python 3 was a mistake. 1/3 of Python users think Python 3 is great (probably the ones who made the switch!), whereas over half of Python users think the Python developers could’ve made the Python 2 -> 3 transition more fluid.


So, most Python users don’t think Python 3 was a mistake, but 2 in 3 of them still haven’t made the switch. That leaves us to wonder: Why haven’t the Python 2 users made the switch yet?

python-survey-2014-prevent-upgrade

Package dependencies are — by far — the most-cited reasons for refusing to switch to Python 3. Many of us rely on specific packages — such as numpy, scipy, pandas, etc. — for our day-to-day work. If the packages haven’t been upgraded to Python 3 yet, then why should we? Lucky for us, there’s a web site dedicated to tracking the packages that are Python 3 ready. Are all of your packages on there? Then maybe it’s time to make the switch.

Another common reason for refusing to switch to Python 3 is that there’s no incentive. We’re comfortable with Python 2, we know its ins and outs, and everything works fine. Why bother upgrading? If you fall in this category, I’ll point you to this article again that compares the key differences between Python 2 and 3. And don’t forget about the 2 pounds of reasons why Python 3 is a huge upgrade over Python 2.

The last two major reasons for refusing to switch to Python 3 are a little harder to address. If you have a large legacy code base or your manager simply refuses to make the upgrade, then it’s up to you to convince management (or yourself) that the upgrade is worth the work. The links I included in the above paragraph are a good start.


Finally, the survey wanted to look at cross-compatibility between Python 2 and 3.

python-survey-2014-ported-code-2x-3x

Here we start to see more good news for the future of Python 3: We saw a significant increase in the number of users who have ported their code from Python 2 to 3.


It’s not very well-known that Python has packages for converting code from Python 2 to 3 and Python 3 to 2. When the survey asked users whether they’d used either of these packages even once, well, see for yourself…

python-survey-2014-used-u2to3-3to2

It seems like the Python devs need to do a better job of raising awareness of these code conversion packages.


Because Python 2 and 3 really aren’t that different, it’s not necessary to write code for just one version. As the famous taco shell commercial goes: “¿Por qué no los dos?” (Why not both?)

To that end, the survey also polled users on whether they write Python 2- and 3- compatible code.

python-survey-2014-written-code-2x-3x-unmodified

Apparently only 1 in 3 Python developers bother to write multi-version compatible code, and there hasn’t been much of a change since 2013.
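
For readers who haven’t tried it, single-source code often just leans on __future__ imports; the snippet below is a generic illustration, not something taken from the survey.

# A small, generic example of code that runs unmodified on Python 2 and 3
from __future__ import division, print_function

def mean(values):
    # print() and true division behave the same on both versions here
    print("averaging", len(values), "values")
    return sum(values) / len(values)

print(mean([1, 2, 3, 4]))   # 2.5 on both Python 2 and Python 3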


I look forward to seeing how the Python 3 transition progresses in 2015. If you have any more resources for helping (and convincing!) Python developers to make the transition, please share them in the comments below.

by Randy Olson at January 31, 2015 12:31 AM

January 29, 2015

Jan Hendrik Metzen

Advice for applying Machine Learning

Advice for applying Machine Learning

This post is based on a tutorial given in a machine learning course at University of Bremen. It summarizes some recommendations on how to get started with machine learning on a new problem. This includes

  • ways of visualizing your data
  • choosing a machine learning method suitable for the problem at hand
  • identifying and dealing with over- and underfitting
  • dealing with large (read: not very small) datasets
  • pros and cons of different loss functions.

The post is based on "Advice for applying Machine Learning" from Andrew Ng. The purpose of this notebook is to illustrate the ideas in an interactive way. Some of the recommendations are debatable. Take them as suggestions, not as strict rules.

In [1]:
import time
import numpy as np
np.random.seed(0)
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [3]:
# Modified from http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html
from sklearn.learning_curve import learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects
    """
    
    plt.figure()
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=1, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.legend(loc="best")
    plt.grid("on") 
    if ylim:
        plt.ylim(ylim)
    plt.title(title)

Dataset

We will generate some simple toy data using sklearn's make_classification function:

In [4]:
from sklearn.datasets import make_classification
X, y = make_classification(1000, n_features=20, n_informative=2, 
                           n_redundant=2, n_classes=2, random_state=0)

from pandas import DataFrame
df = DataFrame(np.hstack((X, y[:, None])), 
               columns = range(20) + ["class"])

Notice that we generate a dataset for binary classification consisting of 1000 datapoints and 20 feature dimensions. We have used the DataFrame class from pandas to encapsulate the data and the classes into one joint data structure. Let's take a look at the first 5 datapoints:

In [5]:
df[:5]
Out[5]:
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 class
0 -1.063780 0.676409 1.069356 -0.217580 0.460215 -0.399167 -0.079188 1.209385 -0.785315 -0.172186 ... -0.993119 0.306935 0.064058 -1.054233 -0.527496 -0.074183 -0.355628 1.057214 -0.902592 0
1 0.070848 -1.695281 2.449449 -0.530494 -0.932962 2.865204 2.435729 -1.618500 1.300717 0.348402 ... 0.225324 0.605563 -0.192101 -0.068027 0.971681 -1.792048 0.017083 -0.375669 -0.623236 1
2 0.940284 -0.492146 0.677956 -0.227754 1.401753 1.231653 -0.777464 0.015616 1.331713 1.084773 ... -0.050120 0.948386 -0.173428 -0.477672 0.760896 1.001158 -0.069464 1.359046 -1.189590 1
3 -0.299517 0.759890 0.182803 -1.550233 0.338218 0.363241 -2.100525 -0.438068 -0.166393 -0.340835 ... 1.178724 2.831480 0.142414 -0.202819 2.405715 0.313305 0.404356 -0.287546 -2.847803 1
4 -2.630627 0.231034 0.042463 0.478851 1.546742 1.637956 -1.532072 -0.734445 0.465855 0.473836 ... -1.061194 -0.888880 1.238409 -0.572829 -1.275339 1.003007 -0.477128 0.098536 0.527804 0

5 rows × 21 columns

It is hard to get any sense of the problem by looking at the raw feature values directly, even in this low-dimensional example. There is thus a wealth of ways to provide more accessible views of your data; a small subset of these is discussed in the next section.

Visualization

The first step when approaching a new problem should nearly always be visualization, i.e., looking at your data.

Seaborn is a great package for statistical data visualization. We use some of its functions to explore the data.

The first step is to generate scatter plots and histograms using pairplot. The two colors correspond to the two classes, and we use a subset of the features and only the first 50 datapoints to keep things simple.

In [6]:
_ = sns.pairplot(df[:50], vars=[8, 11, 12, 14, 19], hue="class", size=1.5)

Based on the histograms, we can see that some features are more helpful for distinguishing the two classes than others. In particular, features 11 and 14 seem to be informative. The scatterplot of those two features shows that the classes are almost linearly separable in this 2D space. A further thing to note is that features 12 and 19 are highly anti-correlated. We can explore correlations more systematically using corrplot:

In [7]:
plt.figure(figsize=(12, 10))
_ = sns.corrplot(df, annot=False)

We can see our observations from before confirmed here: features 11 and 14 are strongly correlated with the class (they are informative). They are also strongly correlated with each other (via the class). Furthermore, feature 12 is highly anti-correlated with feature 19, which in turn is correlated with feature 14. We thus have some redundant features. This can be problematic for some classifiers, e.g., naive Bayes, which assumes the features are independent given the class. The remaining features are mostly noise; they are neither correlated with each other nor with the class.

Notice that data visualization becomes more challenging as you have more feature dimensions and fewer datapoints. We give an example of visualizing high-dimensional data later.

Choice of the method

Once we have visually explored the data, we can start applying machine learning to it. Given the wealth of methods for machine learning, it is often not easy to decide which method to try first. This simple cheat-sheet (credit goes to Andreas Müller and the sklearn-team) can help to select an appropriate ML method for your problem (see http://dlib.net/ml_guide.svg for an alternative cheat sheet).

In [8]:
from IPython.display import Image
Image(filename='ml_map.png', width=800, height=600) 
Out[8]: