Blog

rubyforge data released

Hello moles, and happy summer! I've just released Rubyforge data from July, 2006.

Now granted, Rubyforge is not as large as Sourceforge. But it has considerable "buzz," for what that's worth. And since Ruby is a relatively new language and Rubyforge a relatively new forge, I figure it's worth watching, especially considering how easy it is to collect their data! (They put out an XML file with a bit of the data in it, and with only 1700 or so projects, it's much easier to scrape the rest than on CERTAIN OTHER FORGES. Thank you for that, Rubyforge!)

Rubyforge files available here:

Sourceforge: Number of Downloads per Project

UPDATE (2006-Jul-11): As far as I can tell, the problem below has been fixed and the "number of downloads" files are all set for you to use! Enjoy.

UPDATE (2006-Jul-10): Today, an alert user pointed out a problem with the number-of-downloads data that I released yesterday. Sure enough, numeric values greater than 999 contained errant commas, which caused the SQL sum() to add values incorrectly for projects with large numbers of downloads. New files are being generated now, and they'll be posted shortly! Thanks for your patience. (I've removed the bad files, so for the time being the links below won't work.)
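
For anyone cleaning similar scraped numbers on their own, here is a minimal Python sketch of the idea: strip the thousands separators before summing. The function name and the sample values are made up for illustration, and this is not the script used to regenerate the released files.

```python
def parse_count(raw: str) -> int:
    """Convert a scraped count such as '1,234' into the integer 1234."""
    return int(raw.replace(",", "").strip())


# Hypothetical download counts as they might look straight after scraping.
scraped = ["87", "1,234", "12,005"]

# Sum the cleaned integers rather than the raw strings.
total = sum(parse_count(value) for value in scraped)
print(total)
```

Cleaning the values before they are loaded means any later sum() works on plain integers.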

Original posting:
=================
From the Sourceforge stats page, you can get a variety of measures, such as number of downloads, rank, etc., for a particular project.

I have begun releasing these measures (summed per project, over the 60 days between SF scrapes) as Raw Downloads under the SFRawData package. Here are the links, going back to December 2005:

June, 2006 (link to release)
Apr, 2006 (link to release)

CVSAnalY_SF update - project unixname

Today we got notice from the CVSAnalY folks that their data now maps to ours via unix projectname. (CVSAnalY_SF is a project that mines data from CVS repositories, such as those for projects hosted on Sourceforge.)

He writes:
You can find data, schema, explanation and known bugs at:

http://libresoft.urjc.es/Data/CVSAnalY_SF

this data set may be more interesting to you as it includes now the
project table which allows to know the unix_name of the project (so that

June 2006 Sourceforge data released

This has to be a record! It's only the 4th of June and already the files have been posted. Hooray for 28 machines working on the alphabet! Hooray for not procrastinating!

Pick up the files from our FLOSSmole file release page on Sourceforge.

Here's what's included:
Package: sfProjectInfo
Release: sfProjectInfo01-Jun-2006
Files:
--ProjectList01-Jun-2006.csv.bz2: list of just project names
--ProjectInfo01-Jun-2006.csv.bz2: list of all basic project info (e.g. number of developers, registration dates)
--ProjectDescriptions01-Jun-2006.csv.bz2: project names and their text descriptions (this file is quite large)


Package: sfRawDeveloperData
Release: sfRawDeveloperData01-Jun-2006
Files:
--sfRawDevelopers01-Jun-2006.csv.bz2: list of all developers
--sfRawDevProjects01-Jun-2006.csv.bz2: list of which projects are worked on by which developers


Package: sfRawData
Release: sfRawData01-Jun-2006
Files:
--sfRawDbEnvData01-Jun-2006.csv.bz2: list of projects and their database environments
--sfRawDonorData01-Jun-2006.csv.bz2: list of projects and their donors
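
If you want to peek inside one of these .csv.bz2 files without unpacking it first, something like the following Python sketch should do. It assumes comma-separated fields and UTF-8 text, neither of which is guaranteed by the release notes above, and read_rows is a hypothetical helper name.

```python
import bz2
import csv


def read_rows(path, limit=5):
    """Return up to `limit` rows from a bz2-compressed delimited text file."""
    rows = []
    # "rt" opens the compressed file in text mode so csv can read it directly.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace", newline="") as handle:
        for row in csv.reader(handle):
            rows.append(row)
            if len(rows) >= limit:
                break
    return rows


# Example call, assuming the file has been downloaded to the current directory:
# rows = read_rows("ProjectList01-Jun-2006.csv.bz2")
```

Reading only the first few rows is a cheap way to check the delimiter and column order before loading a large file such as the project descriptions.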

Freshmeat Data Debuts

We have been collecting Freshmeat data in the database for a long time, and I have finally made a release available for general consumption via the Sourceforge file release system. Get the Freshmeat March 2006 Files now! (April coming soon.)

Included in this release:

--fmRawProjectAuthors (authors/roles for each project)
--fmRawProjectDesc (textual description of each project)

April Sourceforge data released

The April Sourceforge data has been released. You can pick up the files from our Sourceforge file release page.

Here's what's included:
Package: sfProjectInfo
Release: sfProjectInfo02-Apr-2006
Files:
--ProjectList02-Apr-2006.csv.bz2: list of just project names
--ProjectInfo02-Apr-2006.csv.bz2: list of all basic project info (e.g. number of developers, registration dates)
--ProjectDescriptions02-Apr-2006.csv.bz2: project names and their text descriptions (this file is quite large)

Package: sfRawDeveloperData
Release: sfRawDeveloperData02-Apr-2006
Files:
--sfRawDevelopers02-Apr-2006.csv.bz2: list of all developers
--sfRawDevProjects02-Apr-2006.csv.bz2: list of which projects are worked on by which developers

Package: sfRawData
Release: sfRawData02-Apr-2006
Files:
--sfRawDbEnvData02-Apr-2006.csv.bz2: list of projects and their database environments
--sfRawDonorData02-Apr-2006.csv.bz2: list of projects and their donors
--sfRawIntAudData02-Apr-2006.csv.bz2: list of projects and their intended audiences
--sfRawLicenseData02-Apr-2006.csv.bz2: list of projects and their open source licenses
--sfRawOpSysData02-Apr-2006.csv.bz2: list of projects and their operating systems

Social Network analysis over time using FLOSSmole data

Just sent off the camera-ready version of a paper built using data available in the tracker tables of the FLOSSmole database.

Howison, J., Inoue, K., and Crowston, K. (2006). Social dynamics of free and open source team communications. In Proceedings of the IFIP 2nd International Conference on Open Source Software, Lake Como, Italy. Available from: http://floss.syr.edu/publications/howison_dynamic_sna_intoss_ifip_short.pdf

This paper furthers inquiry into the social structure of free and open source software (FLOSS) teams by undertaking social network analysis across time. Contrary to expectations, we confirmed earlier findings of a wide distribution of centralizations even when examining the networks over time. The paper also provides empirical evidence that while change at the center of FLOSS projects is relatively uncommon, participation across the project communities is highly skewed, with many participants appearing for only one period. Surprisingly, large project teams are not more likely to undergo change at their centers.

February 2006 files released

Sourceforge data has been released for February, 2006. Get the files from our Sourceforge file release page.

What's included in this release:

Package: sfProjectInfo
Release: sfProjectInfo02-Feb-2006
Files:
--ProjectList02-Feb-2006.csv.bz2: list of just project names
--ProjectInfo02-Feb-2006.csv.bz2: list of all basic project info
--ProjectDescriptions02-Feb-2006.csv.bz2: project names and their text descriptions (this file is quite large)

Package: sfRawDeveloperData
Release: sfRawDeveloperData02-Feb-2006
Files:
--developers02-Feb-2006.csv.bz2: list of all developers
--developer_projects02-Feb-2006.csv.bz2: list of which projects are worked on by which developers

Package: sfRawData
Release: sfRawData02-Feb-2006
Files:
--project_dbenv02-Feb-2006.csv.bz2: list of projects and their database environments
--project_donors02-Feb-2006.csv.bz2: list of projects and their donors
--project_intaud02-Feb-2006.csv.bz2: list of projects and their intended audiences
--project_licenses02-Feb-2006.csv.bz2: list of projects and their open source licenses
--project_opsys02-Feb-2006.csv.bz2: list of projects and their operating systems

tips for using the query tool

NOTE: This message describes an old query tool, which has since been replaced. The new tool is located here: New Query Tool


Original message:
If you use the query tool, be aware that the amount of data in some of our tables is truly immense.

Tips:

1. Do a "describe" on each table to see what's in there first:

"describe fm_projects"

This will tell you the structure of the table.

2. If you want to see a few sample rows, and you feel as though you simply MUST do a "select *", at least do your select with a mysql-style "limit" phrase like this:

"select * from fm_projects limit 25"

3. If you get an error describing something like a "timeout", this means your query was probably just too large. Email us, or chat with us on IRC or AIM, to figure out what went wrong or how to optimize the query.

4. Use the text files - many of the queries you want are the same queries that everyone wants! So we've taken the liberty of making text files of these items for your convenience.
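
Tips 1 and 2 can be tried out locally before touching the real database. The sketch below uses Python's built-in sqlite3 module as a stand-in: SQLite has no "describe", so PRAGMA table_info plays that role, and the two columns of the stand-in fm_projects table are invented for illustration (the real table will differ).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for fm_projects; the real table has different columns.
conn.execute("CREATE TABLE fm_projects (project_id INTEGER, project_name TEXT)")
conn.executemany(
    "INSERT INTO fm_projects VALUES (?, ?)",
    [(i, f"project{i}") for i in range(1000)],
)

# Tip 1: inspect the structure first (SQLite's counterpart to MySQL's "describe").
columns = [row[1] for row in conn.execute("PRAGMA table_info(fm_projects)")]

# Tip 2: cap the number of rows returned instead of pulling the whole table.
sample = conn.execute("SELECT * FROM fm_projects LIMIT 25").fetchall()

print(columns)
print(len(sample))
```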

tidbit: freshmeat and sourceforge

Freshmeat (FM) describes itself thusly: "freshmeat maintains the Web's largest index of Unix and cross-platform software, themes and related 'eye-candy', and Palm OS software."

And Sourceforge (SF) is, of course, "the world's largest Open Source software development web site, hosting more than 100,000 projects and over 1,000,000 registered users with a centralized resource for managing projects, issues, communications, and code."

Here at FLOSSmole, we keep tabs on Freshmeat AND Sourceforge projects. Some of the projects listed on Freshmeat are also listed in Sourceforge, and some of them are not. One way to tell which SF projects are listed on FM is to query our Freshmeat tables and ask which Freshmeat projects resolve to a "sf.net" or "sourceforge" URL:

SELECT count(*)
FROM fm_project_homepages
WHERE datasource_id=18
AND real_url_homepage LIKE "%sourceforge%"
OR real_url_homepage LIKE "%sf.net"


For the March 5 data (datasource_id=18), this yields a count of 10278.
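
One thing to keep in mind when a WHERE clause mixes AND and OR: SQL evaluates AND before OR, so a datasource_id filter applies to both LIKE patterns only when the patterns are grouped in parentheses. Here is a minimal sketch using Python's sqlite3 module with four made-up rows; only the table and column names come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows; only the table and column names come from the post.
conn.execute(
    "CREATE TABLE fm_project_homepages (datasource_id INTEGER, real_url_homepage TEXT)"
)
conn.executemany(
    "INSERT INTO fm_project_homepages VALUES (?, ?)",
    [
        (18, "http://sourceforge.net/projects/foo"),
        (18, "http://bar.sf.net"),
        (17, "http://baz.sf.net"),  # belongs to a different datasource
        (18, "http://example.org"),
    ],
)

# Without parentheses: (datasource_id = 18 AND sourceforge) OR sf.net
ungrouped = conn.execute(
    """SELECT count(*) FROM fm_project_homepages
       WHERE datasource_id = 18
         AND real_url_homepage LIKE '%sourceforge%'
          OR real_url_homepage LIKE '%sf.net'"""
).fetchone()[0]

# With parentheses: datasource_id = 18 AND (sourceforge OR sf.net)
grouped = conn.execute(
    """SELECT count(*) FROM fm_project_homepages
       WHERE datasource_id = 18
         AND (real_url_homepage LIKE '%sourceforge%'
              OR real_url_homepage LIKE '%sf.net')"""
).fetchone()[0]

print(ungrouped, grouped)
```

In this toy table the ungrouped form also counts the sf.net row from datasource 17, while the grouped form does not. Note too that a pattern without a trailing % only matches URLs that end in exactly that text.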

Other things we track about Freshmeat projects are the authors, the dependencies (what other software is this software dependent upon?), and how the project is classified in the trove.

The tables you'll be interested in are:

fm_project_authors
fm_project_dependencies
fm_project_homepages
