June Data: Google Code, Launchpad, Github

Summer is a beautiful thing. Moles, we've got a huge Google Code release for you (ds=271), the revamped Launchpad (ds=272), and Github (ds=273).

Get your FRESH June data on our Google Code Downloads Page or LIVE on the Teragrid.

Tigris is fixed and is running right now. We're also writing a new collector for Alioth! Lots of new stuff.

Got a bug in the Freshmeat collector, so I'm wrangling that. Thanks to a user for reporting that bug. Don't forget we do have a bug-tracking system on Google Code.

Finally, we've got a fresh UDD upload and Debian data coming soon also. We're just so productive right now!

Also don't forget to check out our collection of Everything You Ever Wanted to Know About Code Forges - data also available on our Google Code download site.

May 2011 Data Released

May 2011 data has been released to Google Code and uploaded into Data Central at Teragrid.

datasource_id  name                 description
263            2011-Mar UDD         bugfix replaces 262
264            2011-Mar UDD         bugfix replaces 263
265            2011-May UDD         May 2011 UDD donation
266            Rubyforge 2011-May   Rubyforge 2011-May
267            Objectweb 2011-May   Objectweb 2011-May
268            FSF 2011-May         Free Software Foundation 2011-May
269            Savannah 2011-May    Savannah 2011-May
270            2011-May FM          May 2011 Freshmeat

Status of other collectors:
Launchpad - parsing problem
Tigris - mailing list collector problem
Github - collection problem
Google Code - still running (it will be about a month until these are out)

Link to FLOSSmole files on Google Code
Link to instructions for how to access FLOSSmole db at Teragrid

Debian data, Ultimate Debian Database

Hello moles. A quick update on the Debian collections.

I told you earlier that we'd been collecting some Debian data and calculating software engineering metrics for each C/C++ package, and providing that data on both the raw data downloads page and in the database at Teragrid.

Well, today we also integrated the Ultimate Debian Database[1] into the FLOSSmole family of collections. Thank you to my super-talented student Carter Kozak for this work. (Such a great student, and only a second-year undergraduate!) As you may know, UDD is a rich source of information about all the many facets of Debian. The UDD project provides a data dump every two days in PostgreSQL format. We've taken the Postgres tables, converted them to MySQL, and thrown all the data into our FLOSSmole database.

(Thanks to everyone for being super-patient on this. We've been in agreement about integrating UDD into FLOSSmole since Fall of 2009 [2] but technical problems and a lack of time conspired to cause this horrible delay.)

What this means for you is two things:
(1) If you want to poke around in the Ultimate Debian Database but you don't want to install your own PostgreSQL server and download their dump file, you can use our install on the Teragrid (instructions for getting a Teragrid login [3])

(2) If you want to tie the UDD info to our Debian metrics info, you can now do that from within the same database instance. Fun! (Example below)

Quick How-To:
We grab the UDD data periodically and give it a datasource_id (the current one is 262), just like our other collections. If you want to tie UDD to the Debian metrics we generate, simply find the UDD collection from about the same time period, grab the two datasource IDs, and inner join the day away... Package name is a common field between our metrics and the UDD data, but note that the match is not always perfect [4]. Example: "Give me the Debian packages, their lines of code and their Popcon vote, and sort them from most popular to least popular". Beware, there are some quirks. [5]

SELECT dm.project_name, dm.loc, uddp.vote
FROM deb_metrics dm
INNER JOIN udd_popcon uddp ON dm.project_name = uddp.package
WHERE dm.datasource_id = 261
AND uddp.datasource_id = 262
ORDER BY uddp.vote DESC

project_name  loc    vote
gzip          13041  95885
findutils     59560  95796
sed           24306  95734
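
If you'd like to try the join pattern before you have a Teragrid login, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table and column names follow the query above; the cut-down schemas, the column types, and the use of SQLite are assumptions for illustration only (the real FLOSSmole database is MySQL and its tables have more columns). The three rows are the ones from the sample output above.

```python
import sqlite3

# In-memory stand-in for the FLOSSmole database (assumption: the real one is MySQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE deb_metrics (project_name TEXT, loc INTEGER, datasource_id INTEGER);
CREATE TABLE udd_popcon  (package TEXT, vote INTEGER, datasource_id INTEGER);
""")

# Rows taken from the sample output above; the schemas are simplified.
conn.executemany(
    "INSERT INTO deb_metrics VALUES (?, ?, ?)",
    [("sed", 24306, 261), ("gzip", 13041, 261), ("findutils", 59560, 261)],
)
conn.executemany(
    "INSERT INTO udd_popcon VALUES (?, ?, ?)",
    [("findutils", 95796, 262), ("sed", 95734, 262), ("gzip", 95885, 262)],
)


def loc_and_votes(conn, metrics_ds, udd_ds):
    """Join Debian metrics to popcon votes for one pair of datasource IDs."""
    sql = """
        SELECT dm.project_name, dm.loc, uddp.vote
        FROM deb_metrics dm
        INNER JOIN udd_popcon uddp ON dm.project_name = uddp.package
        WHERE dm.datasource_id = ? AND uddp.datasource_id = ?
        ORDER BY uddp.vote DESC
    """
    return conn.execute(sql, (metrics_ds, udd_ds)).fetchall()


rows = loc_and_votes(conn, 261, 262)
for name, loc, vote in rows:
    print(name, loc, vote)
```

Passing the two datasource IDs as parameters makes it easy to rerun the same join against a different pair of collections.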

Whew! Need technical help? Try the FLOSSmole discussion list (my first choice) or the #flossmole IRC channel on Freenode (my second choice since I'm not always on there).


[4] We don't collect metrics on non-C/C++ code, for instance, so those packages may show up in popcon but not in our list of packages. Also, there are thousands more packages in popcon than in the standard Debian install that we use for the metrics, so we don't have metrics on packages that aren't in the official Debian base install. Finally, packages that have lots of sub-sub-libraries and that sort of thing may not have exactly the same names.
[5] Other quirks: Right now, a lot of the PostgreSQL data types have been hastily converted to not-quite-equivalent MySQL types. This means that there are a lot of tables with fields marked as "text" that should not be. There are also only a limited number of indexes, although I tried to add them where I could. We're working on fixing that situation.
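
One practical consequence of the "text" quirk in [5]: a numeric column stored as text sorts character by character, so "9" lands ahead of "10". Here is a minimal sketch of the problem and a workaround, again using Python's sqlite3. The table popcon_text and its values are hypothetical, made up for the illustration, and are not FLOSSmole data. In MySQL the equivalent cast would be CAST(vote AS UNSIGNED).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: vote is stored as TEXT, mimicking a hastily converted column.
conn.execute("CREATE TABLE popcon_text (package TEXT, vote TEXT)")
conn.executemany(
    "INSERT INTO popcon_text VALUES (?, ?)",
    [("a", "9"), ("b", "10"), ("c", "100")],
)

# Sorting the raw text column compares strings: '9' > '100' > '10'.
as_text = [row[0] for row in conn.execute(
    "SELECT package FROM popcon_text ORDER BY vote DESC")]

# Casting to an integer first restores numeric order: 100 > 10 > 9.
as_number = [row[0] for row in conn.execute(
    "SELECT package FROM popcon_text ORDER BY CAST(vote AS INTEGER) DESC")]

print(as_text)
print(as_number)
```

If a "most popular first" query returns a suspicious ordering, checking the column type and adding a cast is a cheap first thing to try.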

Bug on Google File Download page

Heads up Moles - If you've been searching our very, very long file list on the Google Code site, you might have noticed that "search" acts strangely over there. (Strange that Google Code would have search issues...but anyway...)

Today I turned in bug #5211, for some odd behavior in the way search results are returned for (in this case) the keyword "Debian".

Hopefully they'll get back to me soon; I hope this is not due to our very large number of files. That has broken our page at Google Code before.

UPDATE: Fixed. It was something about re-indexing our list.

Mozilla uses metrics

Here is an interesting post about how Mozilla is beginning to study itself using metrics gathered about contributors and contributions over time. They create charts and tables about patch rates and the like.

[Image: Mozilla project; the image links to the original article]

April 2011 releases

Here is the current list of files heading up to our downloads page.

ObjectWeb, datasource 257
Free Software Foundation, datasource 258
Freshmeat, datasource 255
Rubyforge, datasource 256
Savannah, datasource 259
Debian Metrics (January), datasource 254
Debian Metrics (March), datasource 261

DOI will be generated following the release of all files.
Everything is backed up to Teragrid if you prefer direct DB access.

March data for Google Code posted

Here is the March 2011 data for Google Code projects, available on our own GC page. Upload to Teragrid is happening now if you prefer direct db access.

The datasource is 252. Available data includes:

--basic project info for each project on Google Code
--links for each project
--people on each project (some hashed)
--blogs for each project
--labels for each project
--groups for each project

2011 database schema update

I've updated the database schema to show some of the newer forges and their tables. To open the .mwb file, use MySQL Workbench. The other file is just a PNG of the same information.

New data for March 2011

Most of the March data has been released to our page on Google Code. Included forge collections are: Free Software Foundation, Freshmeat, Rubyforge, Objectweb, Savannah, Tigris. Google Code is still running. Github and Launchpad are not functional right now (waiting on bug fixes).

There are two ways to get the data:
You can download the data at our downloads page - the flat files are so marked, and the SQL files are marked "datamarts". Note that datamarts only contain the latest collection. If you want previous months' worth of data, you'll have to grab those datamarts too.

You can also log into our database on the Teragrid and live-query the data. Read these instructions on getting a login.

Have fun!

Jan/Feb 2011 data uploaded to Teragrid

I've backed up the Jan/Feb data to Teragrid for your live queries. Be sure to log in there and use your database querying tool of choice to check out the data. (If you need an account, read these instructions for how to get yourself an account.)

The datasource_id information is as follows:

237 FM-Freshmeat
238 RF-Rubyforge
239 OW-ObjectWeb
240 FSF-FreeSoftwareFndtn
241 SV-Savannah
243 GC-GoogleCode
244 TG-Tigris
246 Debian metrics

