Collection information

Details about the repository collections

July 2009 data

Hello moles, our July 2009 data has been released. This month we have Objectweb, Freshmeat, Rubyforge, and the Free Software Foundation directory.

Go to our Google Code pages to download the data.

The most recent datasource_ids are:
178-fm-July2009
179-rf-July2009
180-ow-July2009
181-fsf-July2009
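
The labels above appear to follow an id-forge-period pattern. As a minimal sketch, assuming that pattern holds, a label can be split into its parts like this. The names `DataSource`, `parse_label`, and `FORGE_CODES` are hypothetical, and the mapping from short codes to forge names is our guess from the list of forges in this release, not something taken from the database.

```python
import re
from typing import NamedTuple

# Assumed mapping from the short codes in the labels to forge names
# (inferred from the forges named in this release, not from the database).
FORGE_CODES = {
    "fm": "Freshmeat",
    "rf": "Rubyforge",
    "ow": "Objectweb",
    "fsf": "Free Software Foundation directory",
}


class DataSource(NamedTuple):
    datasource_id: int
    forge_code: str
    period: str


# Assumed label shape: <numeric id>-<forge code>-<period>, e.g. 178-fm-July2009
LABEL_PATTERN = re.compile(r"^(\d+)-([a-z]+)-(\w+)$")


def parse_label(label: str) -> DataSource:
    """Split a label such as '178-fm-July2009' into id, forge code, and period."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized datasource label: {label!r}")
    return DataSource(int(match.group(1)), match.group(2), match.group(3))


if __name__ == "__main__":
    for label in ["178-fm-July2009", "179-rf-July2009"]:
        source = parse_label(label)
        print(source.datasource_id, FORGE_CODES.get(source.forge_code, "unknown"))
```

The numeric part is what you would use to filter tables by datasource_id.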

SourceKibitzer Collections

SourceKibitzer, now defunct, was an initiative to collect metrics about the performance of various open source software products. (Here is a Wikipedia article about SourceKibitzer.)

SourceKibitzer sent FLOSSmole their data on a regular basis from February 2007 through September 2007. We dutifully stored this data and it is available for researchers to use if they are interested in the SK metrics from this time period. The datasource ids are as follows:

  • 51: 2007-Feb SourceKibitzer
  • 56: 2007-Mar SourceKibitzer
  • 62: 2007-Apr SK
  • 67: 2007-May SK
  • 73: 2007-Jun SK
  • 79: 2007-Jul SK
  • 85: 2007-Aug SK
  • 91: 2007-Sep SK

Data explanation
Here are the metrics provided for 500-odd projects by SourceKibitzer:

  • project name
  • density of comments (DC: Density of comments. Ratio of sum of the comment lines to sum of all lines in all source files of the package. Indicates how much of the code is commented.)
  • todo count (TODO_COUNT: Number of TODO comments. Sums up the number of TODO comment lines in all source files of the package. The following patterns are recognized as TODO comments: FIX-ME, FIXME, FIX-IT, FIXIT, TO-DO, TODO, XXX, TBD.)
  • commented lines of code (CLOC: Number of lines that contain comments.)
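
To make the three definitions concrete, here is a rough sketch of how such metrics could be computed for a list of source lines. This is not SourceKibitzer's code. It assumes a line "contains a comment" when it includes a single-line comment marker (`//` by default), and it treats the comment-line count used for DC as the same number as CLOC. The function name `comment_metrics` and the `marker` parameter are hypothetical.

```python
import re
from typing import Dict, Iterable

# The TODO patterns listed in the TODO_COUNT definition above.
TODO_PATTERN = re.compile(r"\b(FIX-ME|FIXME|FIX-IT|FIXIT|TO-DO|TODO|XXX|TBD)\b")


def comment_metrics(lines: Iterable[str], marker: str = "//") -> Dict[str, float]:
    """Compute DC, TODO_COUNT, and CLOC under the simplifying assumptions above."""
    total = 0   # all lines seen
    cloc = 0    # lines that contain the comment marker
    todo = 0    # comment lines whose comment text matches a TODO pattern
    for line in lines:
        total += 1
        pos = line.find(marker)
        if pos == -1:
            continue
        cloc += 1
        # Only look at the text from the comment marker onward.
        if TODO_PATTERN.search(line[pos:]):
            todo += 1
    dc = cloc / total if total else 0.0
    return {"DC": dc, "TODO_COUNT": todo, "CLOC": cloc}


if __name__ == "__main__":
    sample = [
        "int x = 1; // TODO: rename",
        "// FIXME broken",
        "int y = 2;",
        "return x + y;",
    ]
    print(comment_metrics(sample))
```

A real implementation would also need to handle block comments and comment markers inside string literals, which this sketch ignores.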

Free Software Foundation Collections

The Free Software Foundation Directory of open source projects lists those that run under "free" systems, particularly GNU and GNU/Linux variants.

Every month we collect the available project-level metadata from the Free Software Foundation's directory and load it into our database. We then parse those HTML pages, extract the interesting data elements, and save each piece of data in our database as well. We then provide this data back to researchers to do with as they wish.

Project Items:

  • Developers on each project (includes maintainers, developers, and contributors)
  • Project registration date
  • Project description (textual)
  • Project interface(s)
  • Project language(s) (programming language)
  • Project license(s)
  • Project names and a unique number
  • Project URLs (both FSF URL and 'real' URL)

Frequently Asked Questions:

  1. What is the difference between this FSF directory and Savannah?
    Savannah is a code repository for free software, and includes development tools and all the things you might find in a code forge. By contrast, the FSF directory is just that, a directory of projects. We have collections from both.
  2. Why is there data missing for FSF?
    We were unable to collect from FSF for a large part of 2007 and 2008 because they were redesigning their site, which made it difficult to get the complete list of projects. Starting with data source id 142 in August 2008, we were able to begin collecting again. Another redesign happened in the 2010-2011 timeframe, so we deprecated some columns and added others.

ObjectWeb Collections

OW2 Forge (formerly ObjectWeb forge) is used as a repository for projects created by developers of open source middleware in the OW2 community.

Every month we download metadata about the projects hosted on OW2 from their web site. First we construct a list of all projects, then we grab the HTML page for each project and save it to our database. From these pages, we parse out the relevant information about each project and save it in our database as well. After the interesting variables are parsed out of the pages and stored in the database, we release the data in several different formats for you to download and use as you please.
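
The collection steps described above can be sketched roughly as follows. This is an illustration, not our production collector. The table names `raw_pages` and `projects`, the `collect` function, and the injected `fetch_page` callable are all hypothetical, and pulling the long name out of the page `<title>` is only a stand-in for the real parsing rules.

```python
import re
import sqlite3
from typing import Callable, Iterable

# Stand-in parsing rule: treat the page <title> as the project's long name.
TITLE_PATTERN = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def collect(
    conn: sqlite3.Connection,
    project_names: Iterable[str],
    fetch_page: Callable[[str], str],
) -> None:
    """Store the raw HTML for each project, then store the parsed fields."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_pages "
        "(unixname TEXT PRIMARY KEY, html TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS projects "
        "(unixname TEXT PRIMARY KEY, long_name TEXT)"
    )
    for unixname in project_names:
        # Step 1: grab the page for this project and keep the raw copy.
        html = fetch_page(unixname)
        conn.execute(
            "INSERT OR REPLACE INTO raw_pages VALUES (?, ?)", (unixname, html)
        )
        # Step 2: parse out a field and save it separately.
        match = TITLE_PATTERN.search(html)
        long_name = match.group(1).strip() if match else None
        conn.execute(
            "INSERT OR REPLACE INTO projects VALUES (?, ?)", (unixname, long_name)
        )
    conn.commit()


if __name__ == "__main__":
    # Canned pages stand in for the network so the sketch runs offline.
    pages = {"demo": "<html><title> Demo Project </title></html>"}
    connection = sqlite3.connect(":memory:")
    collect(connection, pages.keys(), pages.__getitem__)
    print(connection.execute("SELECT * FROM projects").fetchall())
```

Keeping the raw page alongside the parsed fields means the parsing can be redone later without collecting again.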

Project Metadata:

  • Developers (name, username, etc)
  • Developers on each project (username, role on project, whether they are an admin or not)
  • Project description (textual)
  • Project environment(s)
  • Project names (long name and short 'unixname')
  • Project activity percentile
  • Total number of developers displayed on page
  • Project registration date
  • Project URL
  • Project intended audience(s)
  • Project license(s)
  • Project natural language(s)
  • Project operating system(s)
  • Project programming language(s)
  • Project status (alpha, beta, etc)
  • Project topic(s)

Frequently Asked Questions:

  1. How come you don't have statistics for this forge like you do for Sourceforge?
    As of this writing, Objectweb does not publish per-project statistics like Sourceforge does.

Rubyforge Collections

Rubyforge is a repository designed to support the open source development community working in the Ruby programming language.

Every month (or so) we collect the Rubyforge list of projects and some basic developer information. We insert this data into our database, then parse out various interesting data elements and store those in the database also. We then provide this data back to you in several formats for you to do with as you please.

Project Items:

  • Developer names, emails, and whether they are an admin or not.
  • Developers on this project
  • Project descriptions
  • Project environments
  • Project names (long and short 'unixnames')
  • Total number of developers for the project
  • Project registration date
  • Project URL
  • Project intended audience(s)
  • Project license(s)
  • Project natural language(s)
  • Project operating system(s)
  • Project programming language(s)
  • Project status (alpha, beta, etc)
  • Project topic(s)

Frequently Asked Questions:

  1. Why do you not have the per-project statistics for Rubyforge that you have for Sourceforge?
    Sourceforge runs its statistics on a separate server and on a per-project basis, but Rubyforge only runs some aggregate stats (which you can check out on their stats page) and some very basic chart-based stats for each project. The short answer is that the Rubyforge stats are just not as robust as the Sourceforge stats. On the other hand, the Sourceforge stats server is pretty flaky. So, there is good and bad in both systems.

Freecode (Freshmeat) Collections

Freecode (formerly Freshmeat) is a directory of open source projects.

Every month we download Freecode's own RDF file of information about projects listed on that directory, parse the information, and load it into our database. We then provide that data freely back to you to do with as you wish.

Project Metadata:

  • Project names (long name and short names)
  • Project textual descriptions
  • Project URL (the Freshmeat URL and the 'real' project URL)
  • Project license(s)
  • Project author(s) by project
  • Project stats (vitality, popularity, etc, as determined by Freshmeat)
  • Project trove categories (tags)

After the interesting variables are parsed out of the pages and stored in the database, we release the data in several different formats: flat files (delimited), SQL files, and live query db access.

Frequently Asked Questions:

  1. How come you don't provide the trove categories in the file downloads?
    We'd really like to make some sense of the trove categories, actually. Ideally, we'd like to relate each numeric trove category to a textual description of that trove category, and create a "key" table for this information. Then we'd feel more comfortable releasing the trove categories for each project. Let us know if you'd like to work on this.
  2. How come some of the data is missing from your Freshmeat downloads in early 2009?
    Mostly because Freshmeat embarked on a total site redesign during this timeframe and stopped putting out their RDF file of project data. In mid-2009 they started putting the file out again, and we were able to begin collecting this data again in May and June. However, we were told that this method was deprecated and that we'd have to start using their API to collect data. So far, we have found that the RDF files are still being produced, and we're still using them.
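
For anyone who wants to work on the trove "key" table mentioned in the first question, here is one possible shape for it, sketched with an in-memory SQLite database. The table names, column names, ids, and descriptions are all made up for illustration. They are not the real Freshmeat trove ids or our actual schema.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create a key table and a project-to-trove table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE trove_key (
            trove_id    INTEGER PRIMARY KEY,
            description TEXT NOT NULL
        );
        CREATE TABLE project_trove (
            project_name TEXT NOT NULL,
            trove_id     INTEGER NOT NULL
        );
        """
    )
    # Hypothetical ids and descriptions, for illustration only.
    conn.executemany(
        "INSERT INTO trove_key VALUES (?, ?)",
        [(1, "hypothetical category A"), (2, "hypothetical category B")],
    )
    conn.executemany(
        "INSERT INTO project_trove VALUES (?, ?)",
        [("demo-project", 1), ("demo-project", 2), ("other-project", 3)],
    )
    return conn


def describe_troves(conn: sqlite3.Connection) -> List[Tuple[str, int, str]]:
    """Join each project's numeric trove ids to their text descriptions."""
    cursor = conn.execute(
        """
        SELECT p.project_name, p.trove_id, COALESCE(k.description, 'unknown')
        FROM project_trove AS p
        LEFT JOIN trove_key AS k ON k.trove_id = p.trove_id
        ORDER BY p.project_name, p.trove_id
        """
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for row in describe_troves(build_demo_db()):
        print(row)
```

The LEFT JOIN keeps trove ids that have no entry in the key table yet, so gaps in the key show up as 'unknown' instead of silently dropping rows.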

Sourceforge Collections

Sourceforge is a large repository of open source software development projects.

From 2004 to 2009, approximately six times per year (every other month), FLOSSmole collected, parsed, and stored metadata about each of the projects on Sourceforge.

However, as of 2009, FLOSSmole can no longer support this effort. Instead, we recommend that researchers use the SRDA repository of SF data hosted at Notre Dame.

Project-level metadata that we collected

  • Project names (long name and short unique 'unixname')
  • Project descriptions
  • Project URLs (URL on Sourceforge and 'real'/external URL if available)
  • Project registration date
  • Project intended audience(s)
  • Project license(s)
  • Project programming language(s)
  • Project database environment(s)
  • Project operating system(s)
  • Project donor(s)
  • Project status (alpha, beta, mature, etc)
  • Project topic(s)
  • Project user interface(s)
  • Bugs, number: open/total
  • Support Requests, number: open/total
  • Patches, number: open/total
  • Rejected Patches, number: open/total
  • Smiley Themes, number: open/total
  • Translations, number: open/total
  • Themes, number: open/total
  • Feature Requests, number: open/total
  • Plugins, number: open/total
  • Public Forums, number: open/total
  • Mailing Lists, number: total
  • CVS Repositories, number: commits/reads
  • SVN Repositories, number: commits/reads

Developer metadata:
Note about developer items: we only have information on Sourceforge users (developers) associated with a project. If someone is signed up as a Sourceforge user but is not associated with any project, then we will not know about that person. Similarly, if a person is on a SF project in one month (say April), then leaves the project before our next collection (say June) and does not join another project, that person will no longer appear in our data set as a developer for June even though they were in our data files for April.

  • Project developers (username, real name, Sourceforge email address)
  • Developer role(s) on project(s), including whether an administrator or not

Statistical metadata
We collect 60-day statistics for each project.

  • Project downloads (sum of project downloads over 60-day window)
  • Project ranks (project rank averaged over 60-day window)
  • Project tracker sums (sums of tracker opens and closes over 60-day window)
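
As a rough illustration of the three 60-day figures above, the sketch below sums downloads, averages rank, and sums tracker opens and closes over a window ending on a given day. It assumes one record per project per day, which is our simplification for the example and not a description of how Sourceforge publishes its statistics. `DailyStat` and `summarize_window` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import mean
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class DailyStat:
    day: date
    downloads: int
    rank: int
    tracker_opens: int
    tracker_closes: int


def summarize_window(
    stats: Iterable[DailyStat], end: date, days: int = 60
) -> Optional[Dict[str, float]]:
    """Summarize the records in the `days`-long window that ends on `end`."""
    start = end - timedelta(days=days - 1)  # inclusive window of `days` days
    window = [s for s in stats if start <= s.day <= end]
    if not window:
        return None
    return {
        "downloads": sum(s.downloads for s in window),
        "avg_rank": mean(s.rank for s in window),
        "tracker_opens": sum(s.tracker_opens for s in window),
        "tracker_closes": sum(s.tracker_closes for s in window),
    }


if __name__ == "__main__":
    end_day = date(2009, 6, 30)
    records = [
        DailyStat(end_day, 100, 10, 2, 1),
        DailyStat(end_day - timedelta(days=59), 50, 20, 1, 1),
        DailyStat(end_day - timedelta(days=60), 999, 1, 9, 9),  # outside window
    ]
    print(summarize_window(records, end_day))
```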

Frequently Asked Questions:

  1. Why did you only spider/collect every 2 months?
    The Sourceforge data sets are very large, and the data collection phase takes a bit of time. Two months seemed like a good compromise between too often and too rarely. (Plus, we could sync the 60-day statistics view with our collection.)
  2. How come I get banned from Sourceforge when I try to do this collection myself?
    You've most likely been banned because you've violated some of the rules on the SF routers designed to stop denial of service (DoS) attacks. If SF detects that you're hitting its site too much, it will ban your IP address. We learned this the hard way too, about 5 years ago. The better solution is to work with a data set that's already been collected, such as the FLOSSmole data or the SRDA data. (If we don't have the data you need, let us know on our mailing list and we might be able to give you some pointers about where to get it.) If you find that you absolutely must scrape the SF site, follow the SF instructions for researchers.
  3. Where can I get XYZ piece of data that you don't have?
    The first thing to do is let us know which piece of data you need, because sometimes we have the data in the database, but we didn't know anyone wanted it. If you indicate that you need a piece of data, we'll certainly do our best to get it for you. In a few cases, users of our data have needed to supplement our data with stuff we didn't have. We recommend trying the Notre Dame SourceForge Research Data Archive.