Google Code data available

Google Code is our longest data collection effort each month. We've collected everything for November and posted it for your data mining pleasure. Get the files or access it on the Teragrid with direct database access (datasource_id=285).

Freshmeat becomes Freecode, and how our data is affected

Three things happened recently that affect our Freshmeat collection:

1. Freshmeat announced a name change to Freecode.
2. We have an issue (issue #43) that talks about how the trove definitions for Freshmeat are out of date.
3. Freshmeat replaced trove with tagging and we missed the memo.

What I've done is as follows:

For item 1 - decided not to rename our abbreviation for Freshmeat. It will remain "FM".

For items 2 & 3 - Added a new table to hold the tags associated with a project. It's called fm_project_tags.

CREATE TABLE IF NOT EXISTS `fm_project_tags` (
`project_id` int(11) NOT NULL DEFAULT '0',
`datasource_id` int(11) NOT NULL DEFAULT '0',
`tag_name` varchar(50) NOT NULL DEFAULT '0',
`timestamp` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`project_id`,`datasource_id`,`tag_name`),
KEY `datasource_id_index18` (`datasource_id`)
);

Added a new release file to hold the data from this table. The new file is called fmProjectTags2011-Nov.txt. Did not remove trove; we are still collecting it. However, there is no longer any "trove definition" list that I know of to describe each trove number, so the trove values are not as useful as the tags. I'm leaving them in the database for historical purposes.
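For anyone who wants to try the tags table out, here is a minimal sketch of two queries against it: the tags for one project, and how many projects use each tag. It makes several assumptions: an in-memory SQLite database stands in for the MySQL database, the column types are simplified, and the project ids, the datasource_id 999, and the tag names are all made-up placeholders rather than real FLOSSmole data.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE fm_project_tags (
        project_id    INTEGER NOT NULL,
        datasource_id INTEGER NOT NULL,
        tag_name      TEXT    NOT NULL,
        timestamp     TEXT    NOT NULL,
        PRIMARY KEY (project_id, datasource_id, tag_name)
    )
    """
)

# Hypothetical sample rows: project ids, datasource_id 999, and tag names
# are placeholders, not real FLOSSmole data.
SAMPLE_ROWS = [
    (1, 999, "example-tag-a", "2011-11-01 00:00:00"),
    (1, 999, "example-tag-b", "2011-11-01 00:00:00"),
    (2, 999, "example-tag-a", "2011-11-01 00:00:00"),
]
conn.executemany("INSERT INTO fm_project_tags VALUES (?, ?, ?, ?)", SAMPLE_ROWS)


def tags_for_project(db, project_id, datasource_id):
    """Return the sorted tag names for one project in one collection run."""
    cur = db.execute(
        "SELECT tag_name FROM fm_project_tags "
        "WHERE project_id = ? AND datasource_id = ? "
        "ORDER BY tag_name",
        (project_id, datasource_id),
    )
    return [row[0] for row in cur]


def tag_counts(db, datasource_id):
    """Return (tag_name, project_count) pairs, most-used tags first."""
    cur = db.execute(
        "SELECT tag_name, COUNT(*) AS n FROM fm_project_tags "
        "WHERE datasource_id = ? "
        "GROUP BY tag_name ORDER BY n DESC, tag_name",
        (datasource_id,),
    )
    return cur.fetchall()


print(tags_for_project(conn, 1, 999))
print(tag_counts(conn, 999))
```

The two SELECT statements should carry over to MySQL with only the placeholder style changed; filtering on datasource_id matters because the same project can appear in more than one collection run.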

Here is a shot of the tags page for a sample project on Freshmeat, called amms.

Here is a shot of the way the tags look now in our release files (or database table) for that same project (#78922)

November 2011 data entered

Here is the status of the November 2011 collection:

done & ready to download on Google Code or query in Teragrid...

still collecting...

collectors broken and waiting to be fixed...
UDD (BUG # 50)

FLOSSmole as a catalyst for research

One of the papers at the 2011 OSS conference is entitled "Building Knowledge in Open Source Software Research in Six Years of Conferences". It surveys the contributions of papers presented at the OSS conferences, and builds social networks of the papers, identifying research streams along the way.

Findings particular to FLOSSmole:

"Cluster #82. The largest cluster originates from node #82. Paper #82 introduces the OSSmole project (later called FLOSSmole). OSSmole is a repository of data, scripts, and analysis of data collected from OSS projects."


"Large clusters are initiated by empirical papers with the only exception being the paper on the FLOSSmole repository."


"Papers with a large number of citations [ed: such as FLOSSmole paper] are synthesizers of research often presenting a framework or a platform to guide research in OSS."


"In particular, we have found that the creation of a big repository for data mining (FLOSSmole) has originated research in social network analysis, tools for data mining, and analysis of code artefacts to understand maintenance processes, specifically bug fixing."

Current challenges for Fall

1. The Free Software Foundation directory changed its layout to a wiki, so we're rewriting our collector to parse RDF instead. This will change the tables we use for FSF data.

2. We were able to convince our dear colleague Audris Mockus to run his Google Code collector and gather the latest list of project names for us. SWEET! This means a Google Code run is imminent.

3. UDD and Debian still need to be re-run, and automated.

4. In case you are keeping track of the different forges, Berlios is shutting down as of Dec 31 2011.

We're still plugging along with all this stuff. Hope you are finding the data helpful. Let us know what we can provide. (and join the Mailing List!)

Forges paper pre-print

Here is the pre-print copy of the paper on forges that David and I have written. I am going to present at HICSS 45 in January.

Squire, M. and Williams, D. (2012). Describing the software forge ecosystem. 45th Hawaii International Conference on System Sciences. Maui, Hawaii. January 4-7. Forthcoming.

September 2011 data, in progress

Here is the status of each collection for September 2011:

The stages are
1. collecting (some projects have sub-stages here)
2. parsing
3. files released to Google Code
4. data released to Teragrid

UPDATED as of 05-Sep-2011 at 12:41PM:
Freshmeat - collector/parser being re-written for accuracy and bugfixes

Rubyforge - files released to Google Code & data uploaded into Teragrid

Objectweb - files released to Google Code & data uploaded into Teragrid

Free Software Foundation Directory - files released to Google Code & data uploaded into Teragrid

Savannah - files released to Google Code & data uploaded into Teragrid

Github - files released to Google Code & data uploaded into Teragrid

Tigris - files released to Google Code & data uploaded into Teragrid

Google Code - I am looking at getting a new list of projects as the one we've been using is quite old now (Oct 2010)

Launchpad - files released to Google Code & data uploaded to Teragrid

Alioth - files released to Google Code & data uploaded to Teragrid (new tables made)

Debian Metrics - waiting on README

Ultimate Debian Database - importing into database; error on table create

June Data: Google Code, Launchpad, Github

Summer is a beautiful thing. Moles, we've got a huge Google Code release for you (ds=271), and the re-vamped Launchpad (ds=272), and also Github (ds=273).

Get your FRESH June data on our Google Code Downloads Page or LIVE on the Teragrid.

Tigris is fixed and is running right now. We're also writing a new collector for Alioth! Lots of new stuff.

Got a bug in the Freshmeat collector, so I'm wrangling that. Thanks to a user for reporting that bug. Don't forget we do have a bug-tracking system on Google Code.

Finally, we've got a fresh UDD upload and Debian data coming soon also. We're just so productive right now!

Also don't forget to check out our collection of Everything You Ever Wanted to Know About Code Forges - data also available on our Google Code download site.

May 2011 Data Released

May 2011 data has been released to Google Code and uploaded into Data Central at Teragrid.

datasource_id  forge      date      description
263            UDD        2011-Mar  bugfix replaces 262
264            UDD        2011-Mar  bugfix replaces 263
265            UDD        2011-May  May 2011 UDD donation
266            Rubyforge  2011-May  Rubyforge 2011-May
267            Objectweb  2011-May  Objectweb 2011-May
268            FSF        2011-May  Free Software Foundation 2011-May
269            Savannah   2011-May  Savannah 2011-May
270            FM         2011-May  May 2011 Freshmeat

Status of other collectors:
Launchpad - parsing problem
Tigris - mailing list collector problem
Github - collection problem
Google Code - still running (it will be about a month until these are out)

Link to FLOSSmole files on Google Code
Link to instructions for how to access FLOSSmole db at Teragrid

Debian data, Ultimate Debian Database

Hello moles. A quick update on the Debian collections.

I told you earlier that we'd been collecting some Debian data and calculating software engineering metrics for each C/C++ package, and providing that data on both the raw data downloads page and in the database at Teragrid.

Well, today we also integrated the Ultimate Debian Database[1] into the FLOSSmole family of collections. Thank you to my super-talented student Carter Kozak for this work. (Such a great student, and only a second year undergraduate!) As you may know, UDD is a rich source of information about all the many facets of Debian. The UDD project provides a data dump every two days in PostgreSQL format. We've taken the Postgres tables, converted them to MySQL, and thrown all the data into our FLOSSmole database.

(Thanks to everyone for being super-patient on this. We've been in agreement about integrating UDD into FLOSSmole since Fall of 2009 [2] but technical problems and a lack of time conspired to make this horrible delay.)

What this means for you is two things:
(1) If you want to poke around in the Ultimate Debian Database but you don't want to install your own PostgreSQL server and download their dump file, you can use our install on the Teragrid (instructions for getting a Teragrid login [3])

(2) If you want to tie the UDD info to our Debian metrics info, you can now do that from within the same database instance. Fun! (Example below)

Quick How-To:
We grab the UDD data periodically and give it a datasource_id (the current one is 262), just like our other collections. If you want to tie UDD to the Debian metrics we generate, simply find the UDD collection from about the same time period, grab the two datasource IDs, and inner join the day away... Package name is a common field between our metrics and the UDD data, but note that the match is not always perfect [4]. Example: "Give me the Debian packages, their lines of code and their Popcon vote, and sort them from most popular to least popular". Beware, there are some quirks. [5]

SELECT dm.project_name, dm.loc, uddp.vote
FROM deb_metrics dm
INNER JOIN udd_popcon uddp ON dm.project_name = uddp.package
WHERE dm.datasource_id = 261
AND uddp.datasource_id = 262
ORDER BY uddp.vote DESC;

project_name  loc    vote
gzip          13041  95885
findutils     59560  95796
sed           24306  95734
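Because the package names do not always line up (see [4]), it can help to check which Popcon packages have no matching metrics row before trusting the join. Here is a rough sketch of both queries, under some assumptions: an in-memory SQLite database stands in for the Teragrid MySQL install, the two tables are cut down to just the columns used here, and the package "example-only-in-popcon" with its vote count is a made-up row added to show a mismatch. The other three rows are the sample results above.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL database, and the
# tables carry only the columns this example needs.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE deb_metrics (project_name TEXT, loc INTEGER, datasource_id INTEGER);
    CREATE TABLE udd_popcon  (package TEXT, vote INTEGER, datasource_id INTEGER);
    """
)
conn.executemany(
    "INSERT INTO deb_metrics VALUES (?, ?, 261)",
    [("gzip", 13041), ("findutils", 59560), ("sed", 24306)],
)
conn.executemany(
    "INSERT INTO udd_popcon VALUES (?, ?, 262)",
    [
        ("gzip", 95885),
        ("findutils", 95796),
        ("sed", 95734),
        ("example-only-in-popcon", 10),  # hypothetical row with no metrics
    ],
)


def popularity(db, metrics_ds, udd_ds):
    """Packages with their lines of code and Popcon vote, most popular first."""
    cur = db.execute(
        "SELECT dm.project_name, dm.loc, p.vote "
        "FROM deb_metrics dm "
        "INNER JOIN udd_popcon p ON dm.project_name = p.package "
        "WHERE dm.datasource_id = ? AND p.datasource_id = ? "
        "ORDER BY p.vote DESC",
        (metrics_ds, udd_ds),
    )
    return cur.fetchall()


def popcon_without_metrics(db, metrics_ds, udd_ds):
    """Popcon packages that the inner join silently drops."""
    cur = db.execute(
        "SELECT p.package "
        "FROM udd_popcon p "
        "LEFT JOIN deb_metrics dm "
        "  ON dm.project_name = p.package AND dm.datasource_id = ? "
        "WHERE p.datasource_id = ? AND dm.project_name IS NULL "
        "ORDER BY p.package",
        (metrics_ds, udd_ds),
    )
    return [row[0] for row in cur]


print(popularity(conn, 261, 262))
print(popcon_without_metrics(conn, 261, 262))
```

Note that the metrics datasource_id sits in the LEFT JOIN's ON clause rather than the WHERE clause. Putting it in WHERE would discard the unmatched rows the query is trying to find.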

Whew! Need technical help? Try the FLOSSmole discussion list (my first choice) or the #flossmole IRC channel on Freenode (my second choice since I'm not always on there).


[4] We don't collect metrics on non-C/C++ code for instance, so those packages may show up in popcon and not in our list of packages. Also, there are thousands more packages in popcon than are in our standard Debian install that we use for the metrics, so we don't have metrics on packages that aren't in the official Debian base install. Finally, packages that have lots of sub-sub-libraries and that sort of thing may not have the same names exactly.
[5] Other quirks: Right now, many of the PostgreSQL data types have been hastily converted to not-quite-equivalent MySQL types. This means that a lot of tables have fields marked as "text" that should not be. Indexes are also limited, although I tried to add them where I could. We're working on fixing that situation.
