FLOSSmole now includes FSF data

FLOSSmole now includes data from the FSF (Free Software Foundation) directory (original directory link).

The flat files containing the data can be found on our FSF sourceforge file release page.

Some facts of note:
--the FSF directory contains 5226 projects
--the FSF directory treats project names as case-sensitive, so names that differ only in case (e.g. ANT and ant) are considered different projects
--our datasource_id for this initial run is "45" if anyone is checking the query tool
--FSF is forge #6 in our FLOSSmole system
--our FSF table names all start with the "fsf" prefix (e.g. "fsf_projects").
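
Because names that differ only in case are separate FSF projects, any tool that compares names case-insensitively could merge them by accident. Here is a minimal sketch for spotting such names before loading the data elsewhere. The function name and the sample list are hypothetical, not part of the FLOSSmole code.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Return groups of project names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        # casefold() gives a case-insensitive key for grouping.
        groups[name.casefold()].append(name)
    # Keep only the keys that have more than one distinct spelling.
    return {
        key: sorted(set(spellings))
        for key, spellings in groups.items()
        if len(set(spellings)) > 1
    }


# Hypothetical sample; real names would come from the fsf_projects data.
sample_names = ["ANT", "ant", "gcc", "Emacs"]
print(find_case_collisions(sample_names))
```

If you store the names in a database, it is worth checking that the column comparison is case-sensitive too.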

January FM, RF, OW files released

We've released the January 2007 files for Freshmeat, Objectweb, and Rubyforge.

In February, look forward to the next Sourceforge release, as well as a new feature: "All Time Stats" for Sourceforge!

Get the latest files here

One more newish thing... if you're on the mailing list, this is old news... but for you data junkies, you can now play around with the content generated about our data model from schema spy. If we like this format, I'll generate these bi-monthly along with the SF data and post them on the sidebar permanently.

As always, if you feel like helping out the team with coding or documentation, send me an email and we'll chat. (mconklin at elon dot edu)

December data released, some updates

December data has been released. But first, some informational updates.

1. Project_indexes (the table where sourceforge web pages are stored) is now set to use the UTF-8 encoding. This is to address the concerns about character sets and corruption in our storage of Sourceforge project home pages.

2. The "forges" table is now in use. We have 5 forges, as follows:
0 - TE - test
1 - SF - Sourceforge
2 - FM - Freshmeat
3 - RF - Rubyforge
4 - OW - Objectweb

The datasources table now shows which forge the datasource is pulled from. The join column between the "forges" and "datasources" tables is "forge_id".
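
To illustrate the join, here is a small sketch using an in-memory SQLite database. Only the table names, the "forge_id" join column, the forge list, and the datasource ids 38 (SF) and 41 (FM) come from these posts. The other column names (forge_abbr, forge_name) are made up for the example and may not match the real schema.

```python
import sqlite3


def build_demo_db():
    """Create a toy copy of the forges and datasources tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE forges (
            forge_id   INTEGER PRIMARY KEY,
            forge_abbr TEXT,   -- hypothetical column name
            forge_name TEXT    -- hypothetical column name
        );
        CREATE TABLE datasources (
            datasource_id INTEGER PRIMARY KEY,
            forge_id      INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO forges VALUES (?, ?, ?)",
        [
            (0, "TE", "test"),
            (1, "SF", "Sourceforge"),
            (2, "FM", "Freshmeat"),
            (3, "RF", "Rubyforge"),
            (4, "OW", "Objectweb"),
        ],
    )
    conn.executemany(
        "INSERT INTO datasources VALUES (?, ?)",
        [(38, 1), (41, 2)],
    )
    return conn


def forge_for_datasource(conn, datasource_id):
    """Look up the forge a datasource was pulled from, or None if unknown."""
    return conn.execute(
        """
        SELECT f.forge_abbr, f.forge_name
        FROM datasources AS d
        JOIN forges AS f ON f.forge_id = d.forge_id
        WHERE d.datasource_id = ?
        """,
        (datasource_id,),
    ).fetchone()


conn = build_demo_db()
print(forge_for_datasource(conn, 38))
```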

3. December data has been released for Sourceforge (SF), Objectweb (OW), Rubyforge (RF), and Freshmeat (FM). Get the files on Sourceforge as follows:

-- All projects, SF, FM, OW, RF link
-- Sourceforge, December link (Datasource_id = 38)
-- Freshmeat, December link (Datasource_id = 41)

added "home pages"

There are 2 types of "home pages" for projects on Sourceforge:

1. A project's summary page. This is not a real home page, but sometimes people call it one. It lives on the SF servers and it has the URL format http://sf.net/projects/projectname, where "projectname" is replaced by the actual name of the project. In our system, we call this address "url", and it's located in the projects table.

2. A project's "real" home page. A request came in this week for us to parse these out. There are 2 types of real home pages:
a. Homepages that live on the shell.sf.net servers and give a project a URL like this: http://projectname.sf.net
b. Homepages that live on some other server and are not hosted by SF in any way.

I wrote a parser for these "real urls" today and created a new column in the projects table called "real_url" to hold this data. I then released files in the sfRawData package for August 2006 and October 2006 showing these "real urls". Remember that real urls are reported by the project administrators. For the vast majority of projects, the URL is of type "a" above. But for projects whose real url is of type "b", this column may help in tracking down additional info about them.
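
If you want to separate type "a" from type "b" urls yourself, something like the sketch below might be a starting point. This is not the parser mentioned above, just an illustration. It assumes that SF-hosted pages end in ".sf.net" or ".sourceforge.net" (the second suffix is my assumption, not stated in this post), and the function name is made up.

```python
from urllib.parse import urlparse

# Host suffixes treated as "hosted by SF" (assumed for this sketch).
SF_HOSTED_SUFFIXES = (".sf.net", ".sourceforge.net")


def classify_real_url(real_url):
    """Return "a" for SF-hosted home pages, "b" for external ones."""
    host = (urlparse(real_url).hostname or "").lower()
    if not host:
        # Empty or unparseable value, e.g. a project with no home page.
        return "unknown"
    if host.endswith(SF_HOSTED_SUFFIXES):
        return "a"
    return "b"


print(classify_real_url("http://projectname.sf.net"))
print(classify_real_url("http://www.example.org/projectname/"))
```

Since administrators type these urls in by hand, expect some values that fit neither type cleanly.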

October SF, FM, OW, RF data released

It's my Fall Break, so you know that means the October releases are finally here! (This includes Sourceforge releases, yay.)

Get the text files here.

1- FRESHMEAT fmProjectInfo (fmProjectInfo2006-Oct)
2- RUBYFORGE rfRawData (rfRawData2006-Oct)
3- SOURCEFORGE sfProjectInfo (sfProjectInfo01-oct-2006); sfRawData (sfRawData01-Oct-2006); sfRawDeveloperData (sfRawDeveloperData01-Oct-2006)
4- OBJECTWEB owRawData (osRawData2006-Oct)

Check the release notes for more guidance on what is inside each file. Also the "how to use this data" posting might be helpful. (Even though the original date on this posting is "April 2005", I continue to update it, almost like a mini-FAQ.)

September RF, FM, OW data released

The September 2006 data was released today for Rubyforge (RF), Freshmeat (FM), and Objectweb (OW).

You can pick up those datafiles here on our FLOSSmole Files Page on Sourceforge.


August Data released

Go to our file release page on Sourceforge to get the latest files for August.

What's included here:

1- FRESHMEAT fmProjectInfo (fmProjectInfo2006-Aug)
2- RUBYFORGE rfRawData (rfRawData2006-Aug)
3- SOURCEFORGE sfProjectInfo (sfProjectInfo01-Aug-2006); sfRawData (sfRawData01-Aug-2006); sfRawDeveloperData (sfRawDeveloperData01-Aug-2006)
4- OBJECTWEB owRawData (osRawData2006-Aug)

Check the release notes for more guidance on what is inside each file. Also the "how to use this data" posting might be helpful. (Even though the original date on this posting is "April 2005", I continue to update it, almost like a mini-FAQ.)


rubyforge data released

Hello moles, and happy summer! I've just released Rubyforge data from July, 2006.

Now granted, Rubyforge is not as large as Sourceforge. But it has considerable "buzz" for what that's worth. And as a relatively new language and new forge, I figure it's worth watching, especially considering how easy it is to collect their data! (They put out an XML file with a bit of the data in it, and with only 1700 or so projects, it's much easier to scrape the rest than on CERTAIN OTHER FORGES. Thank you for that, Rubyforge!)

Rubyforge files available here:

Unfortunately, even though RF is using the same software as SF, they don't have a donation system (so no donor files), and they don't have a statistics engine like SF. So the statistics are a little weak.

One other note: along with Freshmeat, Rubyforge will be scraped MONTHLY. Sourceforge will continue to be scraped BI-MONTHLY (every other month). This is due to the size and complexity of the SF scrape.

ObjectWeb coming next.

Sourceforge: Number of Downloads per Project

UPDATE (2006-Jul-11): As far as I can tell, the problem below has been fixed and the "number of downloads" files are all set for you to use! Enjoy.

UPDATE (2006-Jul-10): Today, an alert user pointed out a problem with the data that I released yesterday for number of downloads. Sure enough, numeric values greater than 999 contained errant commas (thousands separators). This was causing the SQL sum() to add values incorrectly for projects with large numbers of downloads. New files are being generated now, and they'll be posted shortly! Thanks for your patience. (I've removed the bad files, so for the time being the links below won't work.)
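
For anyone who runs into the same comma problem in their own scraped numbers, here is a minimal sketch of cleaning the values before summing. The function name and sample values are hypothetical. This is not the script used to regenerate the files.

```python
def parse_download_count(raw):
    """Convert a scraped count such as "1,234" into the integer 1234."""
    return int(raw.replace(",", "").strip())


# Hypothetical scraped values for one project.
raw_counts = ["1,234", "999", "12,000"]

# Sum only after every value has been converted to a real integer.
total = sum(parse_download_count(value) for value in raw_counts)
print(total)
```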

Original posting:
From the Sourceforge stats page, you can get a variety of measures, such as number of downloads, rank, etc for a particular project.

I have begun releasing these measures (summed per project, over the 60 days between SF scrapes) as Raw Downloads under the SFRawData package. Here are the links, going back to December 2005:

June, 2006 (link to release)
Apr, 2006 (link to release)

CVSAnalY_SF update - project unixname

Today we got notice from the CVSAnalY folks that their data now maps to ours via unix projectname. (CVSAnalY_SF is a project that mines data from CVS repositories, such as those for projects hosted on Sourceforge.)

Their message reads:
You can find data, schema, explanation and known bugs at:


this data set may be more interesting to you as it includes now the
project table which allows to know the unix_name of the project (so that
you can link data from here with FLOSSMole, among others).

So, if you've been using our FLOSSmole data, you know that each Sourceforge project has a unique project "unixname". Well, you can now grab our data, and grab the CVSAnalY_SF data, and use them together to create even more interesting data sets.
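
As a rough illustration of using the two data sets together, the sketch below matches rows on the project unixname. The "unix_name" field follows the wording of the message above. The FLOSSmole key "proj_unixname" and the other fields are hypothetical placeholders, so substitute whatever column names your files actually contain.

```python
def join_on_unixname(flossmole_rows, cvsanaly_rows):
    """Combine rows from both sources that share a project unixname."""
    # Index the CVSAnalY_SF rows by unixname for quick lookup.
    cvs_by_name = {row["unix_name"]: row for row in cvsanaly_rows}
    joined = []
    for row in flossmole_rows:
        match = cvs_by_name.get(row["proj_unixname"])
        if match is not None:
            # Merge the two dictionaries into one combined record.
            joined.append({**row, **match})
    return joined


# Hypothetical sample rows; real rows would be read from the released files.
flossmole_rows = [
    {"proj_unixname": "alpha", "real_url": "http://alpha.sf.net"},
    {"proj_unixname": "beta", "real_url": "http://www.example.org/beta/"},
]
cvsanaly_rows = [{"unix_name": "alpha", "commits": 10}]

print(join_on_unixname(flossmole_rows, cvsanaly_rows))
```

Projects that appear in only one of the two sources are simply dropped here, which is the behavior of an inner join.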