megan's blog

December data released, some updates

December data has been released. But first, some informational updates.

1. Project_indexes (the table where Sourceforge web pages are stored) is now set to use the UTF-8 encoding. This addresses the concerns about character sets and corruption in our storage of Sourceforge project home pages.

2. The "forges" table is now in use. We have 5 forges, as follows:
0 - TE - test
1 - SF - Sourceforge
2 - FM - Freshmeat
3 - RF - Rubyforge
4 - OW - Objectweb

The datasources table now shows which forge the datasource is pulled from. The join column between the "forges" and "datasources" tables is "forge_id".
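For anyone who wants to see what that join looks like, here is a minimal sketch using an in-memory SQLite database as a stand-in. Only the table names and the "forge_id" join column come from this post; the other column names (abbrev, name, datasource_id, label) are assumptions made up for illustration, and the real schema may differ.

```python
import sqlite3

# In-memory stand-in for the real database. Only the table names and the
# "forge_id" join column come from the post; all other columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE forges (forge_id INTEGER PRIMARY KEY, abbrev TEXT, name TEXT);
    CREATE TABLE datasources (datasource_id INTEGER PRIMARY KEY, forge_id INTEGER, label TEXT);
""")
conn.executemany(
    "INSERT INTO forges VALUES (?, ?, ?)",
    [(0, "TE", "test"), (1, "SF", "Sourceforge"), (2, "FM", "Freshmeat"),
     (3, "RF", "Rubyforge"), (4, "OW", "Objectweb")],
)
# Datasource ids 38 and 41 are the December SF and FM datasources named in this post.
conn.executemany(
    "INSERT INTO datasources VALUES (?, ?, ?)",
    [(38, 1, "December"), (41, 2, "December")],
)

# Look up which forge each datasource was pulled from.
rows = conn.execute("""
    SELECT d.datasource_id, f.abbrev, f.name
    FROM datasources AS d
    JOIN forges AS f ON f.forge_id = d.forge_id
    ORDER BY d.datasource_id
""").fetchall()

for row in rows:
    print(row)
```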

3. December data has been released for Sourceforge (SF), Objectweb (OW), Rubyforge (RF), and Freshmeat (FM). Get the files on Sourceforge as follows:

-- All projects, SF, FM, OW, RF link
-- Sourceforge, December link (Datasource_id = 38)
-- Freshmeat, December link (Datasource_id = 41)

added "home pages"

There are 2 types of "home pages" for projects on Sourceforge:

1. A project's summary page. This is not a real home page, but sometimes people call it one. It lives on the SF servers and it has the URL format, where "projectname" is replaced by the actual name of the project. In our system, we call this address "url", and it's located in the projects table.

2. A request came in this week for us to parse out the "real" home pages of a project. There are 2 types of home pages:
a. Homepages that live on the SF servers and give a project a URL like this:
b. Homepages that live on some other server and are not hosted by SF in any way.

I wrote a parser for these "real urls" today and created a new column in the projects table called "real_url" to hold this data. I then released files in the sfRawData package for August 2006 and October 2006 showing these "real urls". Remember that real urls are reported by the project administrators. For the vast majority of projects, the URL is of type "a" above. But for the projects whose URL is of type "b", this may help in tracking down additional info about them.
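If you want to split the "real_url" values into type "a" and type "b" yourself, something like the following sketch might do. It is not the parser described above. The function name is hypothetical, and the sf_domain default of "sourceforge.net" is an assumption about what "hosted on the SF servers" looks like in a URL, so adjust it if your data says otherwise.

```python
from urllib.parse import urlparse

def classify_real_url(real_url, sf_domain="sourceforge.net"):
    """Return "a" for home pages hosted under sf_domain, "b" for everything else.

    sf_domain is an assumption, not something confirmed by the data release.
    """
    hostname = (urlparse(real_url).hostname or "").lower()
    if hostname == sf_domain or hostname.endswith("." + sf_domain):
        return "a"
    return "b"

# Hypothetical example URLs, not taken from the released files.
examples = [
    "http://someproject.sourceforge.net/",
    "http://www.example.org/someproject/",
]
for url in examples:
    print(url, "->", classify_real_url(url))
```

Matching on the hostname rather than on the whole string avoids counting an external page as type "a" just because "sourceforge.net" shows up somewhere in its path.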

October SF, FM, OW, RF data released

It's my Fall Break, so you know that means the October releases are finally here! (This includes Sourceforge releases, yay.)

Get the text files here.

1- FRESHMEAT fmProjectInfo (fmProjectInfo2006-Oct)
2- RUBYFORGE rfRawData (rfRawData2006-Oct)
3- SOURCEFORGE sfProjectInfo (sfProjectInfo01-oct-2006); sfRawData (sfRawData01-Oct-2006); sfRawDeveloperData (sfRawDeveloperData01-Oct-2006)
4- OBJECTWEB owRawData (osRawData2006-Oct)

September RF, FM, OW data released

The September 2006 data was released today for Rubyforge (RF), Freshmeat (FM), and Objectweb (OW).

You can pick up those datafiles here on our FLOSSmole Files Page on Sourceforge.


August Data released

Go to our file release page on Sourceforge to get the latest files for August.

What's included here:

1- FRESHMEAT fmProjectInfo (fmProjectInfo2006-Aug)
2- RUBYFORGE rfRawData (rfRawData2006-Aug)
3- SOURCEFORGE sfProjectInfo (sfProjectInfo01-Aug-2006); sfRawData (sfRawData01-Aug-2006); sfRawDeveloperData (sfRawDeveloperData01-Aug-2006)
4- OBJECTWEB owRawData (osRawData2006-Aug)

rubyforge data released

Hello moles, and happy summer! I've just released Rubyforge data from July, 2006.

Now granted, Rubyforge is not as large as Sourceforge. But it has considerable "buzz", for what that's worth. And with Ruby being a relatively new language and Rubyforge a new forge, I figure it's worth watching, especially considering how easy it is to collect their data! (They put out an XML file with a bit of the data in it, and with only 1700 or so projects, it's much easier to scrape the rest than on CERTAIN OTHER FORGES. Thank you for that, Rubyforge!)

Rubyforge files available here:

Sourceforge: Number of Downloads per Project

UPDATE (2006-Jul-11): As far as I can tell, the problem below has been fixed and the "number of downloads" files are all set for you to use! Enjoy.

UPDATE (2006-Jul-10): Today, an alert user pointed out a problem with the data that I released yesterday for number of downloads. Sure enough, there was a problem with errant commas in numeric values greater than 999. This was causing the SQL sum() to add values incorrectly for projects with large numbers of downloads. New files are being generated now, and they'll be posted shortly! Thanks for your patience. (I've removed the bad files, so for the time being the links below won't work.)
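For anyone who scraped similar numbers and hit the same thing, here is a minimal sketch of the kind of cleanup involved: strip the thousands separators before treating the value as a number. This is an illustration, not the actual fix applied to the released files. The function name and the sample values are made up.

```python
def parse_download_count(raw):
    """Convert a scraped count such as "1,234" to the integer 1234."""
    return int(raw.replace(",", "").strip())

# Hypothetical scraped values for one project; anything above 999 carries a comma.
scraped = ["987", "1,234", "12,005"]

# Sum only after every value has been turned into a clean integer.
total = sum(parse_download_count(value) for value in scraped)
print(total)
```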

Original posting:
From the Sourceforge stats page, you can get a variety of measures, such as number of downloads, rank, etc for a particular project.

I have begun releasing these measures (summed per project, over the 60 days between SF scrapes) as Raw Downloads under the SFRawData package. Here are the links, going back to December 2005:

June, 2006 (link to release)
Apr, 2006 (link to release)

CVSAnalY_SF update - project unixname

Today we got notice from the CVSAnalY folks that their data now maps to ours via unix projectname. (CVSAnalY_SF is a project that mines data from CVS repositories, such as those for projects hosted on Sourceforge.)

He writes:
You can find data, schema, explanation and known bugs at:

this data set may be more interesting to you as it includes now the
project table which allows to know the unix_name of the project (so that

June 2006 Sourceforge data released

This has to be a record! It's only the 4th of June and already the files have been posted. Hooray for 28 machines working on the alphabet! Hooray for not procrastinating!

Pick up the files from our FLOSSmole file release page on Sourceforge.

Here's what's included:
Package: sfProjectInfo
Release: sfProjectInfo01-Jun-2006
--ProjectList01-Jun-2006.csv.bz2: list of just project names
--ProjectInfo01-Jun-2006.csv.bz2: list of all basic project info (e.g. number of developers, registration dates)
--ProjectDescriptions01-Jun-2006.csv.bz2: project names and their text descriptions (this file is quite large)

Package: sfRawDeveloperData
Release: sfRawDeveloperData01-Jun-2006
--sfRawDevelopers01-Jun-2006.csv.bz2: list of all developers
--sfRawDevProjects01-Jun-2006.csv.bz2: list of which projects are worked on by which developers

Package: sfRawData
Release: sfRawData01-Jun-2006
--sfRawDbEnvData01-Jun-2006.csv.bz2: list of projects and their database environments
--sfRawDonorData01-Jun-2006.csv.bz2: list of projects and their donors
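
All of these files are bzip2-compressed CSV, so they can be read without unpacking them first. Below is a minimal sketch using Python's standard library. The helper name, the UTF-8 encoding, and the comma delimiter are assumptions (the post doesn't say how the files are encoded or quoted), and the demo file and its rows are made up so the example runs on its own.

```python
import bz2
import csv
import os
import tempfile

def read_bz2_csv(path, encoding="utf-8"):
    """Yield each row of a bzip2-compressed CSV file as a list of strings."""
    with bz2.open(path, mode="rt", encoding=encoding, newline="") as handle:
        for row in csv.reader(handle):
            yield row

# Demo with a tiny stand-in file; the rows are invented, not real release data.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv.bz2")
with bz2.open(sample_path, mode="wt", encoding="utf-8", newline="") as handle:
    csv.writer(handle).writerows([["projecta"], ["projectb"]])

rows = list(read_bz2_csv(sample_path))
print(rows)
```

To use it on a real download, pass the path of a file such as ProjectList01-Jun-2006.csv.bz2 to read_bz2_csv instead of the sample file.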

Freshmeat Data Debuts

We have been collecting Freshmeat files for a long time in the database, but I finally made a release available for general consumption via the Sourceforge file release system. Get the Freshmeat March 2006 Files now! (April coming soon)

Included in this release:

--fmRawProjectAuthors (authors/roles for each project)
--fmRawProjectDesc (textual description of each project)
