Debian data, Ultimate Debian Database

Hello moles. A quick update on the Debian collections.

I told you earlier that we'd been collecting some Debian data and calculating software engineering metrics for each C/C++ package, and providing that data on both the raw data downloads page and in the database at Teragrid.

Well, today we also integrated the Ultimate Debian Database[1] into the FLOSSmole family of collections. Thank you to my super-talented student Carter Kozak for this work. (Such a great student, and only a second year undergraduate!) As you may know, UDD is a rich source of information about all the many facets of Debian. The UDD project provides a data dump every two days in PostgreSQL format. We've taken the Postgres tables and converted them to Mysql and thrown all the data into our FLOSSmole database.

(Thanks to everyone for being super-patient on this. We've been in agreement about integrating UDD into FLOSSmole since Fall of 2009 [2] but technical problems and a lack of time conspired to make this horrible delay.)

What this means for you is two things:
(1) If you want to poke around in the Ultimate Debian Database but you don't want to install your own PostgreSQL server and download their dump file, you can use our install on the Teragrid (instructions for getting a Teragrid login [3])

(2) If you want to tie the UDD info to our Debian metrics info, you can now do that from within the same database instance. Fun! (Example below)

Quick How-To:
We grab the UDD data periodically and give it a datasource_id (the current one is 262), just like our other collections. If you want to tie UDD to the Debian metrics we generate, simply find the UDD collection from the about same time period, grab the two datasource IDs, and inner join the day away... Package name is a common field between our metrics and the UDD metrics, but note that it is not always perfect [4]. Example: "Give me the Debian packages, their lines of code and their Popcon vote, and sort them from most popular to least popular". Beware, there are some quirks. [5]

SELECT dm.project_name, dm.loc, uddp.vote
FROM deb_metrics dm
INNER JOIN udd_popcon uddp ON dm.project_name = uddp.package
WHERE dm.datasource_id =261
AND uddp.datasource_id =262
ORDER BY 3 DESC

project_name loc vote
gzip 13041 95885
findutils 59560 95796
sed 24306 95734
etc...

Whew! Need technical help? Try the FLOSSmole discussion list (my first choice) or the #flossmole IRC channel on Freenode (my second choice since I'm not always on there).

-megan

[1] http://udd.debian.org
[2] http://www.mail-archive.com/debian-qa@lists.debian.org/msg15528.html
[3] http://flossmole.org/content/direct-db-access-flossmole-collection-available
[4] We don't collect metrics on non-C/C++ code for instance, so those packages may show up in popcon and not in our list of packages. Also, there are thousands more packages in popcon than are in our standard Debian install that we use for the metrics, so we don't have metrics on packages that aren't in the official Debian base install. Finally, packages that have lots of sub-sub-libraries and that sort of thing may not have the same names exactly.
[5] Other quirks: Right now a lot of the data types supported by PostgreSQL we hastily converted to not-quite-equivalent MySQL types. This means that there are a lot of tables with fields that are marked as "text" that should not be. There are also limited numbers of indexes - although I tried to add them where I could. So we're working on fixing that situation.

Tags: