megan's blog

April Sourceforge data released

The April Sourceforge data has been released. You can pick up the files from our Sourceforge file release page.

Here's what's included:
Package: sfProjectInfo
Release: sfProjectInfo02-Apr-2006
Files:
--ProjectList02-Apr-2006.csv.bz2: list of just project names
--ProjectInfo02-Apr-2006.csv.bz2: list of all basic project info (e.g. number of developers, registration dates, etc.)
--ProjectDescriptions02-Apr-2006.csv.bz2: project names and their text descriptions (this file is quite large)

Package: sfRawDeveloperData
Release: sfRawDeveloperData02-Apr-2006
Files:
--sfRawDevelopers02-Apr-2006.csv.bz2: list of all developers
--sfRawDevProjects02-Apr-2006.csv.bz2: list of which projects are worked on by which developers

Package: sfRawData
Release: sfRawData02-Apr-2006
Files:
--sfRawDbEnvData02-Apr-2006.csv.bz2: list of projects and their database environments
--sfRawDonorData02-Apr-2006.csv.bz2: list of projects and their donors
--sfRawIntAudData02-Apr-2006.csv.bz2: list of projects and their intended audiences
--sfRawLicenseData02-Apr-2006.csv.bz2: list of projects and their open source licenses
--sfRawOpSysData02-Apr-2006.csv.bz2: list of projects and their operating systems
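
If you'd rather not decompress the files by hand, here is a minimal Python sketch for reading one of the .csv.bz2 files directly. The function name read_bz2_csv is made up for illustration, and the comma delimiter and UTF-8 encoding are assumptions; check the actual file before relying on either.

```python
import bz2
import csv


def read_bz2_csv(path, delimiter=",", limit=None):
    """Return rows (lists of strings) from a bz2-compressed CSV file.

    The delimiter and encoding are assumptions, not documented facts
    about the FLOSSmole files.
    """
    rows = []
    # bz2.open in text mode decompresses on the fly, so the whole
    # archive never has to be unpacked to disk first.
    with bz2.open(path, mode="rt", encoding="utf-8",
                  errors="replace", newline="") as handle:
        for index, row in enumerate(csv.reader(handle, delimiter=delimiter)):
            if limit is not None and index >= limit:
                break  # stop early when only a sample is wanted
            rows.append(row)
    return rows
```

For example, read_bz2_csv("ProjectList02-Apr-2006.csv.bz2", limit=5) would return the first five rows, which is a cheap way to check the layout of the larger files before loading them whole.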

February 2006 files released

Sourceforge data has been released for February, 2006. Get the files from our Sourceforge file release page.

What's included in this release:

Package: sfProjectInfo
Release: sfProjectInfo02-Feb-2006
Files:
--ProjectList02-Feb-2006.csv.bz2: list of just project names
--ProjectInfo02-Feb-2006.csv.bz2: list of all basic project info
--ProjectDescriptions02-Feb-2006.csv.bz2: project names and their text descriptions (this file is quite large)

Package: sfRawDeveloperData
Release: sfRawDeveloperData02-Feb-2006
Files:
--developers02-Feb-2006.csv.bz2: list of all developers
--developer_projects02-Feb-2006.csv.bz2: list of which projects are worked on by which developers

Package: sfRawData
Release: sfRawData02-Feb-2006
Files:
--project_dbenv02-Feb-2006.csv.bz2: list of projects and their database environments
--project_donors02-Feb-2006.csv.bz2: list of projects and their donors
--project_intaud02-Feb-2006.csv.bz2: list of projects and their intended audiences
--project_licenses02-Feb-2006.csv.bz2: list of projects and their open source licenses
--project_opsys02-Feb-2006.csv.bz2: list of projects and their operating systems

tips for using the query tool

NOTE: This message describes the old query tool, which has since been replaced. The new tool is located here: New Query Tool


Original message:
If you use the query tool, be aware that the amount of data in some of our tables is truly immense.

Tips:

1. do a "describe" on each table to see what's in there first:

"describe fm_projects"

This will tell you the structure of the table.

2. If you want to see a few sample rows, and you feel as though you simply MUST do a "select *", at least do your select with a MySQL-style "limit" clause like this:

"select * from fm_projects limit 25"

3. If you get an error describing something like a "timeout", this means your query was probably just too large. Email us, or chat with us on IRC or AIM, to figure out what went wrong or how to optimize the query.

4. Use the text files - many of the queries you want are the same queries that everyone wants! So we've taken the liberty of making text files of these items for your convenience.
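
The first two tips can be tried out locally with a small sketch. This one uses Python's built-in sqlite3 module as a stand-in for the MySQL-style query tool, so "PRAGMA table_info" plays the role of "describe". The two columns are hypothetical and are not the real fm_projects schema.

```python
import sqlite3

# In-memory stand-in table; the columns are hypothetical placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fm_projects (project_id INTEGER, project_name TEXT)")
conn.executemany(
    "INSERT INTO fm_projects VALUES (?, ?)",
    [(i, f"project_{i}") for i in range(1000)],
)

# Tip 1: look at the structure first (SQLite's rough counterpart of "describe").
columns = [row[1] for row in conn.execute("PRAGMA table_info(fm_projects)")]

# Tip 2: sample a handful of rows instead of pulling the whole table.
sample = conn.execute("SELECT * FROM fm_projects LIMIT 25").fetchall()

print(columns)      # ['project_id', 'project_name']
print(len(sample))  # 25
```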

tidbit: freshmeat and sourceforge

Freshmeat (FM) describes itself thusly: "freshmeat maintains the Web's largest index of Unix and cross-platform software, themes and related 'eye-candy', and Palm OS software."

And Sourceforge (SF) is, of course, "the world's largest Open Source software development web site, hosting more than 100,000 projects and over 1,000,000 registered users with a centralized resource for managing projects, issues, communications, and code."

Here at FLOSSmole, we keep tabs on Freshmeat AND Sourceforge projects. Some of the projects listed on Freshmeat are also listed in Sourceforge, and some of them are not. One way to tell which SF projects are listed on FM is to query our Freshmeat tables and ask which Freshmeat projects resolve to a "sf.net" or "sourceforge" URL:

SELECT count(*)
FROM fm_project_homepages
WHERE datasource_id=18
AND (real_url_homepage LIKE "%sourceforge%"
OR real_url_homepage LIKE "%sf.net")


For the March 5 data (datasource_id=18), this yielded a count of 10278.
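
One thing to watch with queries like this: in SQL, AND binds more tightly than OR, so a datasource filter only covers both LIKE alternatives when they are grouped in parentheses. Here is a small sketch using Python's sqlite3 module and four made-up rows. The table and column names are borrowed from the query above, the data is hypothetical, and SQLite is standing in for the MySQL-style query tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fm_project_homepages "
    "(datasource_id INTEGER, real_url_homepage TEXT)"
)
# Hypothetical rows: three from datasource 18, one from an older datasource.
conn.executemany(
    "INSERT INTO fm_project_homepages VALUES (?, ?)",
    [
        (18, "http://sourceforge.net/projects/alpha"),
        (18, "http://beta.sf.net"),
        (18, "http://example.org/gamma"),
        (15, "http://delta.sf.net"),
    ],
)

# Without parentheses this parses as (A AND B) OR C, so the
# datasource filter does not apply to the sf.net pattern.
unparenthesized = """
    SELECT count(*) FROM fm_project_homepages
    WHERE datasource_id = 18
    AND real_url_homepage LIKE '%sourceforge%'
    OR real_url_homepage LIKE '%sf.net'
"""

# With parentheses the datasource filter covers both patterns.
parenthesized = """
    SELECT count(*) FROM fm_project_homepages
    WHERE datasource_id = 18
    AND (real_url_homepage LIKE '%sourceforge%'
         OR real_url_homepage LIKE '%sf.net')
"""

loose_count = conn.execute(unparenthesized).fetchone()[0]
strict_count = conn.execute(parenthesized).fetchone()[0]
print(loose_count)   # 3: the datasource 15 row sneaks in
print(strict_count)  # 2: only datasource 18 rows are counted
```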

Other things we track about Freshmeat projects are the authors, the dependencies (what other software is this software dependent upon?), and how the project is classified in the trove.

The tables you'll be interested in are:

fm_project_authors
fm_project_dependencies
fm_project_homepages

Some pretty pictures to amuse you...

...while we get February's data parsed and loaded.

These graphs showing December 2005 trends were made by FLOSSmole's newest developer.

Most connected developers - I really like this chart because it shows that the most connected developer on Sourceforge (i.e. member of the most projects) is a graphic designer! How cool is that? This makes perfect sense when you think about it for a second, but it wouldn't have been MY first guess.
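
If you want to compute something similar yourself, here is a minimal sketch of the counting behind a "most connected" ranking. It assumes you have already loaded (developer, project) pairs, for example from the developer-projects file. The function name and the sample pairs are hypothetical, and this is not necessarily how the chart itself was produced.

```python
from collections import Counter


def most_connected(dev_project_pairs, top_n=3):
    """Rank developers by the number of distinct projects they belong to."""
    projects_by_dev = Counter()
    # set() drops duplicate memberships so each project counts once.
    for developer, _project in set(dev_project_pairs):
        projects_by_dev[developer] += 1
    return projects_by_dev.most_common(top_n)


# Hypothetical sample data; note the repeated ("alice", "p1") pair.
pairs = [
    ("alice", "p1"), ("alice", "p2"), ("alice", "p3"),
    ("bob", "p1"), ("bob", "p2"),
    ("carol", "p3"),
    ("alice", "p1"),
]
print(most_connected(pairs, top_n=2))  # [('alice', 3), ('bob', 2)]
```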

Here are some older charts, similar to things we have run before - these graphs show the kinds of reports you can run using FLOSSmole data:

Database Environments

SF project descriptions

We got a request for Sourceforge project descriptions. These are the little paragraphs that the project owners write to describe a given project. I've parsed out the descriptions and put them in this file release. Also, I created a new table called project_description to hold this information if you're using the query tool.

freshmeat dec and jan

December and January Freshmeat files have been added as datasource_ids 14 (Dec) and 15 (Jan). Use the Query Tool to explore the fm_* tables (these are the tables that hold the Freshmeat data).

december 2005 data

We've run December 2005 Sourceforge data. The raw HTML has been stored as datasource_id #13 if you're using the query tool; otherwise, text files are over here at Sourceforge on our project page.

We've got the usual stuff: all the Sourceforge project names, all project data, developer counts, who is working on what projects, what programming languages are being used, operating system counts, all that good stuff. Have fun!

query tool

Version .01 of our query tool is up and running. Thanks, Dawid!

October Data, updated

The SF and Freshmeat (surprise!) data collections for October are DONE. We had a 10-machine grid working on the collection this time. Very speedy! From here on out, we plan to move to collections on a 60-day rotation rather than a 90-day one. This will also match up nicely with the 60-day Sourceforge stats interval.

Also, we have a working prototype of our live query tool (thanks, Dawid!). We're just waiting for the production environment to be set up, and then it will be available for you all to use.

Here is the master file list on our SF project page: Master List of FLOSSmole Files, but we also have quicker links to:
