Collection information

Details about the repository collections

ObjectWeb Collections

OW2 Forge (formerly ObjectWeb forge) is used as a repository for projects created by developers of open source middleware in the OW2 community.

Every month we download metadata about the projects hosted on OW2 from its web site. First we construct a list of all projects, then we fetch the HTML page for each project and save it to our database. From these pages we parse out the relevant information about each project and store that in our database as well. Once the variables of interest are parsed and stored, we release the data in several different formats for you to download and use as you please.
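The two phases described above (store the raw page first, parse it afterwards) can be sketched roughly as follows. This is only an illustration, not our actual collector: the table names, the `fetch` callable, and the choice of the page title as the parsed field are all assumptions made for the example.

```python
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Pulls the text of the first <title> element out of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data


def collect(conn, project_names, fetch):
    """Phase 1: save the raw HTML page for each project (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_pages (unixname TEXT PRIMARY KEY, html TEXT)"
    )
    for name in project_names:
        # fetch is supplied by the caller, so this sketch does no network I/O itself.
        conn.execute(
            "INSERT OR REPLACE INTO raw_pages VALUES (?, ?)", (name, fetch(name))
        )
    conn.commit()


def parse(conn):
    """Phase 2: parse the stored pages into a separate table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS projects (unixname TEXT PRIMARY KEY, long_name TEXT)"
    )
    pages = conn.execute("SELECT unixname, html FROM raw_pages").fetchall()
    for unixname, html in pages:
        parser = TitleParser()
        parser.feed(html)
        conn.execute(
            "INSERT OR REPLACE INTO projects VALUES (?, ?)", (unixname, parser.title)
        )
    conn.commit()
```

Keeping the raw pages means the parsing step can be rerun later, for example when a new variable turns out to be of interest, without downloading anything again.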

Project Metadata:

  • Developers (name, username, etc)
  • Developers on each project (username, role on project, whether they are an admin or not)
  • Project description (textual)
  • Project environment(s)
  • Project names (long name and short 'unixname')
  • Project activity percentile
  • Total number of developers displayed on page
  • Project registration date
  • Project URL
  • Project intended audience(s)
  • Project license(s)
  • Project natural language(s)
  • Project operating system(s)

Rubyforge Collections

Rubyforge is a repository designed to support the open source development community working in the Ruby programming language.

Every month (or so) we collect the Rubyforge list of projects and some basic developer information. We insert this data into our database, then parse out various interesting data elements and store those in the database also. We then provide this data back to you in several formats for you to do with as you please.

Project Items:

  • Developer names, emails, and whether they are an admin or not.
  • Developers on this project
  • Project descriptions
  • Project environments
  • Project names (long and short 'unixnames')
  • Total number of developers for the project
  • Project registration date
  • Project URL
  • Project intended audience(s)
  • Project license(s)
  • Project natural language(s)
  • Project operating system(s)
  • Project programming language(s)
  • Project status (alpha, beta, etc)
  • Project topic(s)

Frequently Asked Questions:

  1. Why do you not have the per-project statistics for Rubyforge that you have for Sourceforge?
    Sourceforge runs its statistics on a separate server and on a per-project basis, but Rubyforge only runs some aggregate stats (which you can check out on their stats page) and some very basic chart-based stats for each project. The short answer is that the Rubyforge stats are just not as robust as the Sourceforge stats. On the other hand, the Sourceforge stats server is pretty flaky, so there is good and bad in both systems.

Freecode (Freshmeat) Collections

Freecode (formerly Freshmeat) is a directory of open source projects.

Every month we download Freecode's own RDF file of information about projects listed on that directory, parse the information, and load it into our database. We then provide that data freely back to you to do with as you wish.

Project Metadata:

  • Project names (long name and short names)
  • Project textual descriptions
  • Project URL (the Freshmeat URL and the 'real' project URL)
  • Project license(s)
  • Project author(s) by project
  • Project stats (vitality, popularity, etc, as determined by Freshmeat)
  • Project trove categories (tags)

After the interesting variables are parsed out of the pages and stored in the database, we release the data in several different formats: flat files (delimited), SQL files, and live query db access.
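As a rough illustration of the first two release formats (delimited flat files and SQL files), the sketch below writes the same rows out both ways. The function names, the tab delimiter, and the table and column names are assumptions for the example, not a description of our actual export scripts.

```python
import csv
import io


def to_delimited(rows, columns, delimiter="\t"):
    """Render rows as delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()


def sql_literal(value):
    """Format one Python value as a SQL literal (None, numbers, and strings only)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Double any single quotes so the string stays a valid SQL literal.
    return "'" + str(value).replace("'", "''") + "'"


def to_sql_inserts(table, rows, columns):
    """Render rows as one INSERT statement per row."""
    col_list = ", ".join(columns)
    lines = []
    for row in rows:
        values = ", ".join(sql_literal(v) for v in row)
        lines.append(f"INSERT INTO {table} ({col_list}) VALUES ({values});")
    return "\n".join(lines) + "\n"
```

For example, `to_delimited([("example", "GPL")], ["name", "license"])` yields a header line followed by one tab-separated line per project.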

Frequently Asked Questions:

  1. How come you don't provide the trove categories in the file downloads?
    We'd really like to make some sense of the trove categories, actually. Ideally, we'd like to relate each numeric trove category to a textual description of that trove category, and create a "key" table for this information. Then we'd feel more comfortable releasing the trove categories for each project. Let us know if you'd like to work on this.
  2. How come some of the data is missing from your Freshmeat downloads in early 2009?
    Mostly because Freshmeat embarked on a total site redesign during this timeframe and stopped putting out their RDF file of project data. In mid-2009 they started putting the file out again, and we were able to resume collecting this data in May and June. However, we were told that this method was deprecated and that we'd have to start using their API to collect data. So far, we have found that the RDF files are still being produced, and we're still using them.
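The "key" table mentioned in question 1 could look something like the sketch below: one table maps each numeric trove category to a textual description, and a join attaches those descriptions to each project. The table names, the numeric ids, and the descriptions are all made up for illustration; they are not real Freshmeat trove values.

```python
import sqlite3


def build_trove_db():
    """Create an in-memory database with a hypothetical trove key table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE trove_key (
            trove_id    INTEGER PRIMARY KEY,
            description TEXT NOT NULL
        );
        CREATE TABLE project_troves (
            project  TEXT NOT NULL,
            trove_id INTEGER NOT NULL
        );
        """
    )
    # Placeholder ids and descriptions, invented for this example.
    conn.executemany(
        "INSERT INTO trove_key VALUES (?, ?)",
        [(1001, "example category A"), (1002, "example category B")],
    )
    conn.executemany(
        "INSERT INTO project_troves VALUES (?, ?)",
        [("demo", 1001), ("demo", 1002), ("demo", 9999)],
    )
    conn.commit()
    return conn


def describe_troves(conn, project):
    """Return (trove_id, description) pairs; ids missing from the key read 'unknown'."""
    return conn.execute(
        """
        SELECT pt.trove_id, COALESCE(k.description, 'unknown')
        FROM project_troves AS pt
        LEFT JOIN trove_key AS k ON k.trove_id = pt.trove_id
        WHERE pt.project = ?
        ORDER BY pt.trove_id
        """,
        (project,),
    ).fetchall()
```

The LEFT JOIN is a deliberate choice here: a category that has not yet been given a description still shows up, so gaps in the key are visible rather than silently dropped.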

Sourceforge Collections

Sourceforge is a large repository of open source software development projects.

From 2004 to 2009, approximately six times per year (every other month), FLOSSmole collected, parsed, and stored metadata about each of the projects on Sourceforge.

However, as of 2009, FLOSSmole can no longer support this effort. Instead, we recommend that researchers use the SRDA repository of SF data hosted at Notre Dame.

Project-level metadata that we collected

  • Project names (long name and short unique 'unixname')
  • Project descriptions
  • Project URLs (URL on Sourceforge and 'real'/external URL if available)
  • Project registration date
  • Project intended audience(s)
  • Project license(s)
  • Project programming language(s)
  • Project database environment(s)
  • Project operating system(s)
  • Project donor(s)
  • Project status (alpha, beta, mature, etc)
  • Project topic(s)
  • Project user interface(s)
  • Bugs, number: open/total
  • Support Requests, number: open/total
  • Patches, number: open/total
  • Rejected Patches, number: open/total
  • Smiley Themes, number: open/total
  • Translations, number: open/total
  • Themes, number: open/total
  • Feature Requests, number: open/total
  • Plugins, number: open/total
  • Public Forums, number: open/total
  • Mailing Lists, number: total
  • CVS Repositories, number: commits/reads
  • SVN Repositories, number: commits/reads

Developer metadata:
Note about developer items: we only have information on Sourceforge users (developers) associated with a project. If someone is signed up as a Sourceforge user but is not associated with any project, then we will not know about that person. Similarly, if a person is on a SF project in one month (say April), then leaves the project before our next collection (say June) and does not join another project, that person will no longer appear in our data set as a developer for June, even though they were in our data files for April.

  • Project developers (username, real name, Sourceforge email address)
  • Developer role(s) on project(s), including whether an administrator or not
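The note above about developers dropping out of the data set between collections can be checked with a simple set difference. This is a minimal sketch that assumes each collection has been loaded as a list of (project, username) pairs; that shape and the function names are assumptions for the example, not the layout of our data files.

```python
def developers_in(snapshot):
    """Return the set of usernames that appear on any project in a snapshot."""
    return {username for _project, username in snapshot}


def dropped_developers(earlier, later):
    """Usernames present in the earlier collection but on no project in the later one."""
    return developers_in(earlier) - developers_in(later)
```

A developer who merely moved from one project to another is still present in the later snapshot, so only people who are on no project at all are reported.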

Statistical metadata
We collected 60-day statistics for each project.

  • Project downloads (sum of project downloads over 60-day window)
  • Project ranks (project rank averaged over 60-day window)
  • Project tracker sums (sums of tracker opens and closes over 60-day window)
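To make the three 60-day figures above concrete, here is a small sketch that derives them from per-day records: downloads and tracker activity are summed, and rank is averaged. The per-day record layout and the field names are assumptions for the example; the source statistics came from Sourceforge, and this is not a description of how Sourceforge computed them.

```python
from datetime import date, timedelta
from statistics import mean


def window_stats(daily, end, days=60):
    """Summarise per-day records over the `days`-day window ending on `end`.

    `daily` is a list of dicts with the hypothetical keys
    'day', 'downloads', 'rank', 'opened', and 'closed'.
    """
    start = end - timedelta(days=days - 1)
    in_window = [d for d in daily if start <= d["day"] <= end]
    if not in_window:
        return None
    return {
        "downloads": sum(d["downloads"] for d in in_window),
        "rank": mean(d["rank"] for d in in_window),
        "tracker_opened": sum(d["opened"] for d in in_window),
        "tracker_closed": sum(d["closed"] for d in in_window),
    }
```

Records dated before the start of the window are ignored, so the same list of daily records can be reused for successive collection dates.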

Frequently Asked Questions:

  1. Why did you only spider/collect every 2 months?
    The Sourceforge data sets are very large, and the data collection phase takes a bit of time. Two months seemed like a good compromise between too often and too rarely. (Plus, we could sync the 60-day statistics view with our collection.)
  2. How come I get banned from Sourceforge when I try to do this collection myself?
    You've most likely been banned because you've violated some of the rules on the SF routers designed to stop denial of service (DoS) attacks. If SF detects that you're hitting its site too much, it will ban your IP address. We learned this the hard way too, about 5 years ago. The better solution is to work with a data set that's already been collected, such as the FLOSSmole data or the SRDA data. (If we don't have the data you need, let us know on our mailing list and we might be able to give you some pointers about where to get it.) If you find that you absolutely must scrape the SF site, follow the SF instructions for researchers.
  3. Where can I get XYZ piece of data that you don't have?
    The first thing to do is let us know which piece of data you need, because sometimes we have the data in the database, but we didn't know anyone wanted it. If you indicate that you need a piece of data, we'll certainly do our best to get it for you. In a few cases, users of our data have needed to supplement our data with stuff we didn't have. We recommend trying the Notre Dame SourceForge Research Data Archive.
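On question 2 above: if you really do have to collect pages yourself, spacing out your requests is one general way to avoid hitting a site too hard. The sketch below is a generic throttle, not a recipe for staying under any particular SF limit. We do not know the actual thresholds, and the class name and interval are assumptions for the example.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the previous request: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call `wait()` immediately before each request. The clock and sleep functions are parameters only so the behaviour can be tested without real delays.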