Graphs of developer counts over time

As an example of the data and analysis in the system, here is a graphic of developer counts over time, taken from the Project Summaries pages developed by James Howison and Kevin Crowston using the OSSmole data. The time series are sorted programmatically into 6 categories: consistently rising, mostly rising, not trending, mostly falling, consistently falling, and dead projects.
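To make the six categories concrete, here is a minimal sketch of how one series of developer counts might be sorted. It is not the code behind the Project Summaries pages: the function name `categorize`, the majority rule for "mostly", and the rule that a project is dead when its latest snapshot is NA or 0 are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def categorize(series: Sequence[Optional[int]]) -> str:
    """Assign one developer-count series to one of six trend categories.

    None stands for NA (no observation at that snapshot).
    """
    # Assumption: a project whose latest snapshot is NA or 0 counts as dead.
    if not series or series[-1] is None or series[-1] == 0:
        return "dead"

    # Drop NA snapshots (e.g. before the project was registered) and
    # look at the change between consecutive observed counts.
    observed = [n for n in series if n is not None]
    steps = [b - a for a, b in zip(observed, observed[1:])]
    ups = sum(1 for s in steps if s > 0)
    downs = sum(1 for s in steps if s < 0)

    if ups == 0 and downs == 0:
        return "not trending"
    if downs == 0 and ups == len(steps):
        return "consistently rising"
    if ups == 0 and downs == len(steps):
        return "consistently falling"
    # Assumption: "mostly" means a simple majority of the changes.
    if ups > downs:
        return "mostly rising"
    if downs > ups:
        return "mostly falling"
    return "not trending"
```

Under these assumptions a series such as NA,NA,1,1,1 (written `[None, None, 1, 1, 1]`) lands in the not trending category, because the count never changes once the project appears.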

This picture shows curves and categories for a sample of only 120 projects. It can be compared against the categorization of the total population of Sourceforge projects, shown in the histogram.

The sample of 120 projects has substantially more consistently rising projects than the total population, which suggests that the sample is generally more successful. The large unchanging (or not trending) category in the total population reflects the fact that the most common series is NA,NA,1,1,1, and our finding that 65,561 of the total 98,568 projects (i.e. 67%) seen over 5 years have never had more than 1 developer.

The latest Database schema

Megan and I have done some work on commenting the proposed database schema, explaining what each field is and why it is there. The schema is in CVS and is available via the web interface. It is easier to read in an editor capable of syntax coloring for MySQL.

We'd very much appreciate feedback on the generality, or lack thereof, and coverage of people's interest areas. The best place for feedback is the ossmole-discuss mailing list.

using 'free' and 'open' to name Sourceforge projects

The following graph shows the relative popularity of the words 'free' and 'open' in the names of new Sourceforge projects from 11-1999 through 12-2004. Note that 1999 only includes 2 months' worth of new project registration data (November and December), which is why the 1999 totals are much lower than those of the other years on the chart. Even so, 10 new projects in 1999 had 'free' in their names, while only 9 had 'open'. Looking at the chart, we might surmise that during the years 2000-2001, 'open' overtook 'free' as the preferred term.


Limitations: Note that the words 'free' and 'open' are found at the beginning, middle and end of Sourceforge project names - so, for example, 'canopen' and 'bugfree' are both projects that are included in this count.
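The counting rule described above can be sketched in a few lines. This is only an illustration of the substring match and its limitation, not the query used to build the chart; the function name `count_word_by_year` and the sample rows (other than 'canopen' and 'bugfree') are hypothetical, as are the registration years attached to them.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_word_by_year(projects: Iterable[Tuple[str, int]], word: str) -> Counter:
    """Count projects per registration year whose name contains `word`.

    The match is a plain substring test, so the word may sit at the
    beginning, middle, or end of the project name.
    """
    counts: Counter = Counter()
    for name, year in projects:
        if word in name.lower():
            counts[year] += 1
    return counts


# Hypothetical sample rows: (project name, registration year).
sample = [
    ("canopen", 2001),    # counted for 'open' although the word is embedded
    ("bugfree", 2002),    # counted for 'free' for the same reason
    ("openfoo", 2001),
    ("freebar", 1999),
    ("plainname", 2001),  # matches neither word
]

open_counts = count_word_by_year(sample, "open")
free_counts = count_word_by_year(sample, "free")
```

A stricter rule, for example only counting names that start with the word, would drop 'canopen' and 'bugfree' but would change the totals on the chart, so the looser substring match is what is sketched here.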

donated data, yay!

The OSSmole team has successfully imported data from Dawid Weiss' crawl of Sourceforge from December 2004. (Moles: This information has datasourceID=4 in the database.) Thanks, Dawid, for making your data available and for donating it to this project!

Anyone else who wants to be a mole: if you have data from ANY open source repository, ANY time frame, please let us know if you'd like to donate it. You can email Megan Conklin (mconklin AT elon DOT edu) or James Howison (jhowison AT syr DOT edu) or hop on IRC (#ossmole) to chat about what you have, and how it can be integrated into the OSSmole repository.

the good and the bad news

There's good news and there's bad news. The bad news is that we've found some problems with the developer data collected during the October 2004 run, namely that projects in the last half of the letter 'z' (specifically, those with unixnames > 'zin') weren't collected. This means that there could be other problems lurking under the surface of the data for the October run, such as other missing chunks of information. Yuck.

The good news is multifold:
(a) we found the problem (yay);
(b) we have an active developer community that is using the data for real problems and is able to fix things like this when they arise;
(c) the newer developer runs (January) don't seem to be affected;
(d) we're adding in some donated data soon that will help fill in holes like this;
(e) we've got a brand new collection engine in alpha right now that will get the runs done faster and more accurately, thus reducing these risks in the future!
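A gap like the missing 'z' projects can be spotted by comparing the unixnames collected in a run against a reference list of names. The sketch below is one way to do that and is not part of the OSSmole collection engine; the function name `missing_by_initial` and the sample names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def missing_by_initial(expected: Iterable[str], collected: Iterable[str]) -> Counter:
    """Count expected unixnames absent from a run, grouped by first letter."""
    got = set(collected)
    return Counter(name[0] for name in expected if name and name not in got)


# Hypothetical data: a reference list and what a run actually collected.
expected_names = ["abc", "zap", "zip", "zoo"]
collected_names = ["abc", "zap"]

gaps = missing_by_initial(expected_names, collected_names)

# Plain string comparison reproduces the "unixnames > 'zin'" condition.
after_zin = [n for n in expected_names if n > "zin"]
```

A large count for a single letter, as with 'z' here, points at a chunk of the alphabet that the run skipped.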

a september pattern at sourceforge?

Last September, right about the time we started up OSSmole, Sourceforge sent out a monthly email newsletter that included this observation:

(9/20/2004) Welcome to the September sitewide email. September is typically our busiest month for new traffic on [Sourceforge]. Students are arriving at college and getting on high speed connections. Open Source developers and consumers of Open Source software are returning from their summer vacations. If you are back from vacation, it's good to have you back.

I was of course reminded of The Long September on usenet. Early participants on usenet began noticing that every September a new wave of clueless college students would flood in, ask dumb questions, and make life miserable for a couple of months each year (until 1993, when usenet was made available to AOL users and the so-called "Long September" was born).

The SF message above talks about lots of "new traffic" on the site during September, but I'm not sure how "new traffic" is defined. It could mean generic web site traffic, as in "new users visiting the web site". Or it could mean "new projects being built", or "new users signing up". Or it could be some vestige of September memories from usenet. Or, most likely, it is some combination of all of these things.

some graphs

I'm experimenting with making some graphs of the data we've collected.

Here is a graph of the growth in programming languages used on Sourceforge projects from October 2004 until January 2005.


Here is a graph showing the growth in the number of projects added per month to the Sourceforge repository from November 1999 until January 2005.


Here is a graph showing growth in total numbers of Sourceforge projects, by month, from November 1999 until January 2005.


January 2005 Summary Reports

The January 2005 Sourceforge summary reports have been posted:

sourceforge raw data files released

We have issued a new release of raw data files on sourceforge projects.

  • Raw data: full lists of projects and programming languages used, operating systems used, target user interfaces, etc. The information included is for the October 2004 run and the January 2005 run. Download raw data files here.
  • Raw developer data: lists of all developers; list of projects and developers on each. The information included is for the October 2004 run and the January 2005 run. Download raw developer files here.

full project list released

We have released the full list of sourceforge project names as of 28-Jan-2005. You can get the list here.
