New Google Code Data Released

Hello moles! I've released a new set of Google Code project data to our own downloads page (on Google Code, no less!) - the datasource_id is 226.

This data took over a month to collect. Included are the following:

--project names (info)
--project licenses, for both code and content (info)
--project summary (info)
--project description (info)
--project activity level (info)
--who works on what project and what their role is (people)
--what blogs are listed for each project (blogs)
--what links are listed for each project (links)
--what labels are used to describe each project (labels)

Crawlers vs API

Interesting article by some folks at 80Legs about crawling the web versus using an API to gather data. On several occasions we've chosen to use an API rather than crawling. The article pretty much summarizes the limitations around that choice.

Launchpad data released for June 2010

Introducing a new data source: Launchpad data. In this collection, Launchpad has about 19k projects and about 34k developers.

mysql> select count(*) from lp_projects where datasource_id=227;

mysql> select count(distinct dev_loginname) from lp_developer_indexes where datasource_id=227;

Available on our Google Code downloads page: Launchpad data
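
If you'd rather run the two counts in this post from a script, here is a rough Python sketch. It uses an in-memory SQLite database with made-up rows so it runs anywhere. The `project_name` column, the helper names, and the toy data are assumptions for illustration only, not the real FLOSSmole schema. Only the table names, `datasource_id`, and `dev_loginname` come from the queries in this post.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a few made-up rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lp_projects (
            project_name TEXT,       -- assumed column, for illustration
            datasource_id INTEGER
        );
        CREATE TABLE lp_developer_indexes (
            dev_loginname TEXT,
            project_name TEXT,       -- assumed column, for illustration
            datasource_id INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO lp_projects VALUES (?, ?)",
        [("alpha", 227), ("beta", 227), ("old", 100)],
    )
    conn.executemany(
        "INSERT INTO lp_developer_indexes VALUES (?, ?, ?)",
        [("ann", "alpha", 227), ("ann", "beta", 227),
         ("bob", "beta", 227), ("cat", "old", 100)],
    )
    return conn


def count_launchpad(conn, datasource_id):
    """Return (project_count, distinct_developer_count) for one datasource."""
    cur = conn.cursor()
    cur.execute(
        "SELECT COUNT(*) FROM lp_projects WHERE datasource_id = ?",
        (datasource_id,),
    )
    projects = cur.fetchone()[0]
    # A developer on several projects is counted once, hence DISTINCT.
    cur.execute(
        "SELECT COUNT(DISTINCT dev_loginname) FROM lp_developer_indexes "
        "WHERE datasource_id = ?",
        (datasource_id,),
    )
    developers = cur.fetchone()[0]
    return projects, developers


if __name__ == "__main__":
    toy = build_toy_db()
    print(count_launchpad(toy, 227))
```

Against the real MySQL data you would swap the SQLite connection for a MySQL one; the placeholder style (`?` here) differs between database drivers, so check the one you use.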

Github data released for May 2010

Data has been released for Github for May 2010. It is on our FLOSSmole Google Code downloads page.

Database Schema updated

The database schema page has been updated. I've got a single-page PNG of the schema, and an MWB file for those of you who like MySQL Workbench.

New changelog of activities

I've decided to start using the wiki on our Google Code site to mark changes as they are made to the project. You can see what we're working on and where we left off on a particular project. This is as much for me as it is for you! Sometimes I forget where I was in a particular project and this will help me.

Here is a link to the wiki of the May 2010 Changelog on Google Code.

May 2010 - all data backed up to Teragrid

It's been a long time since we shipped new data to Teragrid. I apologize for that oversight.

The new connection info is and then the mysql port and your username/password.

Differences/shortcomings/things to know:
1) We are still collecting data all the time, so the Teragrid copy is always a bit behind the master collection at Elon and the project file downloads at Google Code.

2) This includes datasources up to 222, although 223 is listed in the datasources table.

May 2010 Data released

May 2010 data has been released for some forges:

-Freshmeat (datasource 218)
-Rubyforge (datasource 219)
-ObjectWeb (datasource 220)
-Free Software Foundation (datasource 221)
-Google Code (datasource 222) - list of projects only

Our collectors for Savannah, Sourceforge, Github, Tigris, and Launchpad are all undergoing maintenance at the moment.

UPDATE May 28, 2010
-Savannah data has been released (datasource 224)

Link to download the FLOSSmole data on Google Code.

February 2010 Data Released

Lots of new data for you to peruse out on our FLOSSmole Data Downloads Page.

Here's what's out there, recently added:

Google Code, March 2010 (GC) - list of all GC projects donated by Audris Mockus (HUGE THANK YOU TO AUDRIS FOR THIS!!)
Freshmeat, February 2010 (FM)
Objectweb, February 2010 (OW)
Rubyforge, February 2010 (RF)
Github, February 2010 (GH)
Free Software Foundation, February 2010 (FSF)
Savannah, February 2010 (SV)
and Sourceforge from December 2009 (SF)

We have another set of bugs to fix in the Sourceforge collection for 2010, but those fixes are forthcoming. I'm running a collection now. Hopefully the data will be good. We may even have stats this time. Hallelujah.

Also, thanks to my phenomenal undergraduate superstar Steven Norris, Tigris is coming soon, and Debian after that!! We are rocking the repository collection...

December Sourceforge Data released

After a long delay, the December Sourceforge data has been released. You may recall that over summer 2009, SF redesigned their web site, which broke many of our crawlers and all of our parsers.

We have re-written these and, with only a few exceptions, we have pretty much the same data as we always had.

Here are some release notes:

1. The datasource_id is 206.
2. Donors data is not available in the Dec 2009 release. Donors were moved to their own page, so we have to add this to the collection for next time.
3. Statistics data is not available in the Dec 2009 release. We accidentally collected the wrong stats pages, so we had to throw these out and re-write for next time.
4. Status data (alpha, beta, mature, etc) is not available in the Dec 2009 release. This information is still being collected and kept by SF, but we can't find where it's being reported on their web site. If you have any ideas, send them to the mailing list.

Files are located at our Google Code page.

For those of you with database access on the sdsc server, I'll get these files over there ASAP.
