Fortran in scipy (or the bus factor in some python projects)

A couple of weeks ago I was at Python in Astronomy. #pyastro is a great place to meet key people behind a lot of astro-related projects and to learn and share knowledge with a lot of pycurious astronomers - I recommend reading Sophie’s wonderful summary of the conference.

The morning before we started the hack day, someone mentioned Fortran and how much of it is used in SciPy (25.6%), as shown on its GitHub page.

scipy stats

That made me think… How many people actually know what’s going on in these files? I clicked on it, which took me to the list of files, and opened a few at random. Surprise! Many of them have only one contributor. That made me think even more… Can I get some metrics about these files? The next morning we had to pitch our ideas for the hack day and, of course, I pitched this.

Only Zé joined me, and between the two of us we worked our way around the git API in Python and created our Buss tool. The tool is quite simple: after you download the repository you want to analyse and run Buss on it, it gives you some stats about the repository, such as the number and list of files that have only one contributor and when each of those contributors last committed to the repository.
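To give an idea of what "files with only one contributor" means in practice, here is a minimal sketch. It is not the actual Buss code: the function names and the `@@` marker are made up for illustration, and it reads the plain output of `git log` through `subprocess` rather than going through a git library. It also does not follow renames, and it will list files that have since been deleted, so take it as a rough first pass.

```python
import subprocess
from collections import defaultdict
from pathlib import PurePosixPath

# Files skipped because they are usually trivial (same two names the post excludes).
EXCLUDED_NAMES = {"__init__.py", "setup_package.py"}


def read_git_log(repo_path):
    """Return the history as text: one '@@<author>' line per commit,
    followed by the paths that commit touched."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", "--format=@@%an"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_git_log(log_text):
    """Map every path seen in the log to the set of authors who touched it."""
    authors_by_file = defaultdict(set)
    author = None
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@@"):
            author = line[2:]          # start of a new commit
        elif author is not None:
            authors_by_file[line].add(author)
    return dict(authors_by_file)


def single_author_files(authors_by_file, excluded=EXCLUDED_NAMES):
    """Return {path: author} for files touched by exactly one author."""
    sole = {}
    for path, authors in authors_by_file.items():
        if PurePosixPath(path).name in excluded:
            continue
        if len(authors) == 1:
            sole[path] = next(iter(authors))
    return sole
```

With a local clone you would call something like `single_author_files(parse_git_log(read_git_log("path/to/repo")))`. Authors are compared by name only, so the same person committing under two names would count as two contributors.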

In my opinion, this simplistic approach shows only the best-case scenario of a cruel reality. For example, there may be files with two or more contributors where only one person actually understands what’s in the file, and the others’ contributions were just changes to the code formatting.

Therefore, if what git tells us is the truth (and our program is correct - we need tests!), then you should take care of the people shown as the only contributors to many files. Why don’t you buy them a beer in exchange for explaining to you what’s going on in these files?

I should also say that many of these files may be so trivial that no one has needed to modify or add anything to them. Maybe new versions of Buss will include better ways of measuring that. In any case, I’ve tried my best to exclude known files such as __init__.py.

Let’s go with the results!

For the conference I produced some figures for four projects: SciPy (the one that started it all!), NumPy, Astropy and my beloved SunPy. Here I’m expanding this to all the NumFOCUS Python projects and other Python communities I am familiar with, to get a better picture.

How many files have only been touched by one person?

These pie charts show the percentage of files that have only one contributing author. Starting with the four main projects, we found the following:

For the NumFOCUS python projects (IPython, matplotlib, pandas, Pymc3, Stan, PyTables, QuantEcon and yt):

and the final four (django, Scikit-image, Scikit-learn and SQLAlchemy):

It’s satisfying to find that most of the projects have fewer than 25% of their files controlled by just one person. But we should all aim for coverage as good as matplotlib’s.

Are they active?

For each of these projects we’ve visualised each author and the number of files for which they are the only contributor, colour-coded by the number of years that have passed since they last committed to the repository. Ideally we should also get information from GitHub on whether they’ve been active discussing issues and reviewing pull requests.
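The numbers behind such a plot can be sketched as follows. This is only an illustration under some assumptions: the inputs (a `{file: sole author}` mapping and each author's last commit date) are taken as already known, the names `years_inactive` and `risk_table` are made up here, and the sort order is my choice rather than the one used in the figures.

```python
from collections import Counter
from datetime import date


def years_inactive(last_commit, today):
    """Whole years elapsed between an author's last commit and today."""
    not_yet_anniversary = (today.month, today.day) < (last_commit.month, last_commit.day)
    return today.year - last_commit.year - not_yet_anniversary


def risk_table(sole_author_by_file, last_commit_by_author, today):
    """Return (author, files owned alone, years inactive) rows,
    with the authors who own the most files first."""
    files_owned = Counter(sole_author_by_file.values())
    rows = [
        (author, n_files, years_inactive(last_commit_by_author[author], today))
        for author, n_files in files_owned.items()
    ]
    # Most files first; ties broken alphabetically to keep the output stable.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Made-up example data, not taken from any real project.
sole = {"a.f": "Ana", "b.f": "Ana", "c.py": "Ben"}
last = {"Ana": date(2012, 6, 1), "Ben": date(2016, 1, 15)}
for row in risk_table(sole, last, today=date(2016, 5, 1)):
    print(row)
```

Each row is what one bar of the plot would need: who the author is, how many files depend on them alone, and how long it has been since they last committed.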

If you know someone named towards the bottom of the plots and they are in yellow, the project needs to find a replacement quickly. If the person at the bottom has a lot of files and is not so yellowish, then try to take some time to work on these files.

Here’s the list by project:

Next steps

I would say the first obvious step, without adding much complexity, would be to analyse the content of these critical files by language, purpose (are they code or tests?), etc.

If we want to go into more detail we could look at:

the lines of code instead of the number of files, which may be a better indicator,
the activity of these contributors in the project (are they still active?),
the files with other contributors where only formatting changes have been applied (I have no idea how to test for that),
and finally a better way to select which files to count or not (for now I’m only excluding __init__.py and setup_package.py).
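As a rough idea of how the lines-of-code point could be explored, here is a sketch that counts the surviving lines per author in one file from `git blame` output. It is an assumption about how one might do it, not something Buss does today, and the function names are made up. Passing `-w` to `git blame` makes it ignore whitespace-only changes when assigning lines, which covers only a small part of the "formatting changes" problem: a reformat that reorders or rewraps code would still take over the lines.

```python
import subprocess
from collections import Counter


def read_blame(repo_path, file_path):
    """Return `git blame` output with full commit info repeated for every line."""
    result = subprocess.run(
        ["git", "-C", repo_path, "blame", "-w", "--line-porcelain", "--", file_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def lines_per_author(blame_text):
    """Count how many lines of the file are attributed to each author.

    In the porcelain format each source line comes with an 'author <name>'
    header, while the source line itself is prefixed with a tab, so file
    content that happens to start with 'author ' is not counted.
    """
    counts = Counter()
    for line in blame_text.splitlines():
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return dict(counts)


def top_author_share(counts):
    """Return (author, fraction of lines) for the author with the most lines."""
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    author, n_lines = max(counts.items(), key=lambda item: item[1])
    return author, n_lines / total
```

A file where `top_author_share` is close to 1.0 is effectively a one-person file even if git lists several contributors for it.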

Also, notice that Buss only looks at the code directory, not the docs, data, scripts or other external directories.

David PS