February 2010 Data Released

Lots of new data for you to peruse on our FLOSSmole Data Downloads Page.

Here's what was recently added:

Google Code, March 2010 (GC) - list of all GC projects donated by Audris Mockus (HUGE THANK YOU TO AUDRIS FOR THIS!!)
Freshmeat, February 2010 (FM)
Objectweb, February 2010 (OW)
Rubyforge, February 2010 (RF)
Github, February 2010 (GH)
Free Software Foundation, February 2010 (FSF)
Savannah, February 2010 (SV)
and Sourceforge from December 2009 (SF)

We still have another set of bugs to fix in the Sourceforge collection for 2010, and those fixes are forthcoming. I'm running a collection now. Hopefully the data will be good. We may even have stats this time. Hallelujah.

Also, thanks to my phenomenal undergraduate superstar Steven Norris, Tigris is coming soon, and Debian after that!! We are rocking the repository collection...

December Sourceforge Data released

After a long delay, the December Sourceforge data has been released. You may recall that over summer 2009, SF redesigned their web site, which broke many of our crawlers and all of our parsers.

We have rewritten these and, with only a few exceptions, have pretty much the same data as we always had.

Here are some release notes:

1. The datasource_id is 206.
2. Donors data is not available in the Dec 2009 release. Donors were moved to their own page, so we have to add this to the collection for next time.
3. Statistics data is not available in the Dec 2009 release. We accidentally collected the wrong stats pages, so we had to throw these out and re-write for next time.
4. Status data (alpha, beta, mature, etc) is not available in the Dec 2009 release. This information is still being collected and kept by SF, but we can't find where it's being reported on their web site. If you have any ideas, send them to the mailing list (ossmole-discuss@lists.sourceforge.net).

Files are located at our Google Code page: http://code.google.com/p/flossmole/downloads/list

For those of you with database access on the sdsc server, I'll get these files over there ASAP.

December 2009 data released

December data has been released for the following forges:

(datasource-abbreviation-full name)
203-fsf-free software foundation

Sourceforge is in progress... it will be datasource_id=206.

Get the data here:

Remember that the files marked "DM" are SQL files (MySQL), while the files marked .txt are flat (delimited) text files.

November 2009 data released

This month we have data from Freshmeat, Rubyforge, Objectweb, Savannah, Github, Free Software Foundation.

Downloads available at Google Code

Remember, the SQL is available in the datamart*.sql.bz files; the flat (delimited) data is available in the other files.
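For anyone who would rather skip the SQL, here is a minimal sketch of reading one of the flat files in Python. The tab delimiter and the UTF-8 encoding are assumptions, not something these release notes specify, so check a file by eye first; `read_delimited` is a hypothetical helper name, and a compressed download would need to be decompressed beforehand.

```python
import csv


def read_delimited(path, delimiter="\t"):
    """Yield each row of a flat delimited file as a list of strings.

    The default tab delimiter is an assumption; pass a different
    delimiter if the file you downloaded uses one.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            yield row
```

Call it as `for row in read_delimited("some_file.txt"): ...`, where the file name is a placeholder for whichever .txt file you downloaded.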

We're still working on getting our Sourceforge scraper back up and running, and we thank you for your patience.

October 2009 data released

October 2009 data has been released. Here are the forges we have this month:
Free Software Foundation directory
Savannah (new)
GitHub (new)

FLOSSmole Downloads

Our Sourceforge collection is still undergoing a rewrite, but we will be collecting from there again soon. In the meantime, don't forget that the June 2009 data is available, and there is also the Notre Dame data if that helps at all.


September 2009 data released

Data has been released for FSF, FM, RF, OW. Go get it!! Have fun.

Google Code Downloads Page

That Freshmeat data looks fairly popular. Anyone want to tell us how you use this data?

Savannah data available

Savannah data has been released for July. See what you think! (Datasource_id = 182)

July 2009 data

Hello moles, our July 2009 data has been released: this month we have Objectweb, Freshmeat, Rubyforge, Free Software Foundation directory.

Go to our Google Code pages to download the data.

The most recent datasource_ids are:

SourceKibitzer Collections

SourceKibitzer, now defunct, was an initiative to collect metrics about the performance of various open source software products. (Here is a Wikipedia article about SourceKibitzer.)

SourceKibitzer sent FLOSSmole their data on a regular basis from February 2007 through September 2007. We dutifully stored this data, and it is available to researchers interested in the SK metrics from this time period. The datasource ids are as follows:

  • 51: 2007-Feb SourceKibitzer
  • 56: 2007-Mar SourceKibitzer
  • 62: 2007-Apr SK
  • 67: 2007-May SK
  • 73: 2007-Jun SK
  • 79: 2007-Jul SK
  • 85: 2007-Aug SK
  • 91: 2007-Sep SK

Data explanation
Here are the metrics provided for 500-odd projects by SourceKibitzer:

  • project name
  • density of comments (DC): ratio of the sum of the comment lines to the sum of all lines in all source files of the package. Indicates how much of the code is commented.
  • todo count (TODO_COUNT): number of TODO comment lines, summed over all source files of the package. The following patterns are recognized as TODO comments: FIX-ME, FIXME, FIX-IT, FIXIT, TO-DO, TODO, XXX, TBD.
  • commented lines of code (CLOC): number of lines that contain comments.
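
To make the three metric definitions concrete, here is a rough sketch of how they could be computed. This is not SourceKibitzer's implementation, which we do not have. It assumes comments start with `//` and run to the end of the line, ignores block comments and string literals, and matches the TODO patterns case-insensitively; `comment_metrics` and its parameters are hypothetical names.

```python
import re

# TODO markers from the metric description; case-insensitive matching is an assumption.
TODO_PATTERN = re.compile(r"FIX-?ME|FIX-?IT|TO-?DO|XXX|TBD", re.IGNORECASE)


def comment_metrics(source_files, comment_prefix="//"):
    """Compute DC, TODO_COUNT and CLOC over a package's source files.

    source_files is an iterable of strings, one per file. Only line
    comments starting with comment_prefix are recognized (a simplification).
    """
    total_lines = 0
    comment_lines = 0
    todo_count = 0
    for text in source_files:
        for line in text.splitlines():
            total_lines += 1
            start = line.find(comment_prefix)
            if start == -1:
                continue  # no comment on this line
            comment_lines += 1
            if TODO_PATTERN.search(line[start:]):
                todo_count += 1
    density = comment_lines / total_lines if total_lines else 0.0
    return {"DC": density, "TODO_COUNT": todo_count, "CLOC": comment_lines}
```

Under these assumptions, DC is simply CLOC divided by the total line count, so a package with 3 comment lines out of 5 lines has a DC of 0.6.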

Free Software Foundation Collections

The Free Software Foundation Directory of open source projects lists projects that run on "free" systems, particularly GNU and GNU/Linux variants.

Every month we collect the available project-level metadata from the Free Software Foundation's directory and load it into our database. We then parse those HTML pages, extract the interesting data elements, and save each piece of data in our database as well. Finally, we provide this data back to researchers to use as you wish.

Project Items:

  • Developers on each project (includes maintainers, developers, and contributors)
  • Project registration date
  • Project description (textual)
  • Project interface(s)
  • Project language(s) (programming language)
  • Project license(s)
  • Project names and a unique number
  • Project URLs (both FSF URL and 'real' URL)

Frequently Asked Questions:

  1. What is the difference between this FSF directory and Savannah?
    Savannah is a code repository for free software, and includes development tools and all the things you might find in a code forge. Conversely, the FSF directory is just that, a directory of projects. We have collections from both.
  2. Why is there data missing for FSF?
    We were unable to collect from FSF for a large part of 2007 and 2008 because they were redesigning their site, which made it difficult to get the complete list of projects. Starting with data source id 142 in August 2008, we were able to begin collecting again. Another redesign happened in the 2010-2011 timeframe, so we deprecated some columns and added others.
