New Apache project IRC data

Hello moles! Happy January. Here are some fresh new data sources for your mining pleasure:

1. Freenode channel list and topics (all public channels with 3 or more users). The table is called "fn_irc_channels".
2. Apache Activemq IRC logs (one datasource_id per day, one row per message).
3. Apache Aries IRC logs
4. Apache Camel IRC logs
5. Apache CXF IRC logs
6. Apache Karaf IRC logs
7. Apache Kalumet IRC logs
8. Apache Servicemix IRC logs

Here is a sample of the table structure for items 2-8:

CREATE TABLE `apache_servicemix_irc` (
  `datasource_id` int(11) NOT NULL,
  `line_num` int(11) NOT NULL,
  `full_line_text` varchar(500) NOT NULL,
  `line_message` varchar(500) NOT NULL,
  `date_of_entry` date NOT NULL,
  `time_of_entry` varchar(5) NOT NULL,
  `type` enum('action','system','message') NOT NULL,
  `about_user` varchar(50) NOT NULL,
  `last_updated` datetime NOT NULL,
  PRIMARY KEY (`datasource_id`,`line_num`)
);

These are available on the live MySQL connection.
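
If you want to get a feel for the structure before connecting, here is a rough sketch that counts lines by `type`. It is only an illustration: it uses Python's built-in sqlite3 as a stand-in for the live MySQL connection, keeps just a subset of the columns above, and the rows are invented toy data, not real log lines.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL table (assumption: only a subset
# of the real columns is kept, and every row below is invented toy data).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE apache_servicemix_irc (
        datasource_id INTEGER NOT NULL,
        line_num      INTEGER NOT NULL,
        line_message  TEXT    NOT NULL,
        date_of_entry TEXT    NOT NULL,
        type          TEXT    NOT NULL,
        about_user    TEXT    NOT NULL,
        PRIMARY KEY (datasource_id, line_num)
    )
    """
)
toy_rows = [
    (1, 1, "joined the channel", "2014-01-02", "system", "alice"),
    (1, 2, "hello all", "2014-01-02", "message", "alice"),
    (1, 3, "waves", "2014-01-02", "action", "bob"),
    (2, 1, "any news on the release?", "2014-01-03", "message", "bob"),
]
conn.executemany(
    "INSERT INTO apache_servicemix_irc VALUES (?, ?, ?, ?, ?, ?)", toy_rows
)

# Count how many lines fall into each of the three enum values.
TYPE_BREAKDOWN = """
    SELECT type, count(*) AS line_count
    FROM apache_servicemix_irc
    GROUP BY type
    ORDER BY line_count DESC, type
"""
type_counts = conn.execute(TYPE_BREAKDOWN).fetchall()
print(type_counts)
```

The SELECT itself is plain SQL, so it should run unchanged against any of the seven IRC tables once you swap in the table name.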

New Apache People-Roles-Projects data

Hot off the presses! Another update to the Apache people-roles-projects data:

Datasources 1578-1585 have updated information on people working on Apache projects, including committer lists, PMC lists, PMC chairs, etc.

Timezones are now being collected as well.

This is an update to the original dataset described in the paper "Project Roles in the Apache Software Foundation: A Dataset" (2013), written by yours truly.

Apache Camel data

We have released several files of Apache Camel IRC log data, originally stored by Dan Kulp.
More about Apache Camel

Related Data Sets
Apache Twitter Handles
Apache Project People & Roles

Sample Queries for the IRC data:

List the most prolific IRC posters, in order of their post count
SELECT about_user, count(*) AS post_count
FROM apache_camel_irc
GROUP BY about_user
ORDER BY post_count DESC;

List the Twitter handles and svn_ids (if known) for anyone who is also on Apache Camel's IRC
SELECT DISTINCT i.about_user, t.twitter_screen_name, t.svn_id
FROM apache_camel_irc i
INNER JOIN apache_twitter t
ON i.about_user = t.svn_id;

The datasources for the IRC data are (currently) #393-1572. Each daily log file gets its own datasource_id, since each one is a separate source.
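
Because every daily log has its own datasource_id, grouping on it gives you per-day activity for free. Here is a hedged sketch of that idea, again using Python's sqlite3 as a stand-in for the MySQL connection. The two datasource_ids fall inside the range above, but the dates, nicknames, and rows are made up for illustration.

```python
import sqlite3

# In-memory stand-in for apache_camel_irc (assumption: subset of columns,
# invented rows; two daily logs means two datasource_ids).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE apache_camel_irc (
        datasource_id INTEGER NOT NULL,
        line_num      INTEGER NOT NULL,
        date_of_entry TEXT    NOT NULL,
        type          TEXT    NOT NULL,
        about_user    TEXT    NOT NULL,
        PRIMARY KEY (datasource_id, line_num)
    )
    """
)
toy_rows = [
    (393, 1, "2013-12-01", "message", "alice"),
    (393, 2, "2013-12-01", "message", "bob"),
    (393, 3, "2013-12-01", "system", "bob"),
    (394, 1, "2013-12-02", "message", "alice"),
]
conn.executemany("INSERT INTO apache_camel_irc VALUES (?, ?, ?, ?, ?)", toy_rows)

# One output row per daily log: how many chat messages, from how many users.
DAILY_ACTIVITY = """
    SELECT datasource_id,
           date_of_entry,
           count(*) AS messages,
           count(DISTINCT about_user) AS speakers
    FROM apache_camel_irc
    WHERE type = 'message'
    GROUP BY datasource_id, date_of_entry
    ORDER BY datasource_id
"""
daily = conn.execute(DAILY_ACTIVITY).fetchall()
for row in daily:
    print(row)
```

Filtering on type = 'message' leaves out joins, parts, and other system lines; drop the WHERE clause if you want every logged line counted.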

December 2013 data released

December data has been released. We have a few old standbys (fc, rf, ow, sv, al) and some hot fresh data as well.

What is new, you ask? Well, we have some IRC chat log data for the Apache project Camel [1]. A nice new social data set, all parsed and organized into relational database format for you to query.

Get direct database access, or download the files as either delimited flat files or .sql files ("marts").

[1] Why Apache Camel? Well, that project was one of the few Apache projects for which we were able to get both email logs in mbox format and IRC chat logs. More projects are forthcoming (we're working on that now).

Rubyforge goes into the sunset

We've been collecting Rubyforge data almost since the beginning. Last month we reported on the decline of Rubyforge in light of newer forges, like Github. Here's the chart we drew:

Now we've got this lovely pair of images to contend with:


It's apparently been offline for four or five days now.

We'll keep collecting as long as we can.

New home for flat files

Many of you know that we provide flat files of our data for download by anyone at any time. Until recently we had hosted these on Google Code (before 2009 or so, we hosted them on Sourceforge). Recently, Google Code announced that projects will not be able to have file downloads as of January 2014. So we had to find a new home for our files.

The new home is a server called FLOSSdata. Hope you enjoy the flat file access. As usual, if you require direct database access, we have a procedure for giving you that as well.

On FLOSSdata, there are about 3300 files, either delimited or SQL dumps. These are divided into forge abbreviations and then stored by date. Enjoy!

A decade of forges

We here at FLOSSmole have been gathering data about how free, libre, and open source software is made for about 10 years now, actually a little more.

In that time, a lot has changed in the forge landscape, both with the players and with the tools.

Just for fun, I decided to run a few quick queries to show the ascendance of Github and the concurrent decline of some smaller forges. These two graphs show the rate of new project creation (called 'registration' on Rubyforge and 'creation' at Github - and yes, the Github numbers do include forks).


Data Sources:
For the Rubyforge data, I used FLOSSmole datasource_id=388 (Sept 2013), available on our public database. The query is:

SELECT year(date_registered), month(date_registered), count(*)
FROM rf_projects
WHERE datasource_id=388
GROUP BY 1, 2
ORDER BY 1, 2;

Then I fed the data into Google Charts, saved as a png and annotated it in Preview to add the little circles and stuff.

For the Github data, I used the very excellent Ghtorrent tool. The query is:

SELECT year(created_at), month(created_at), count(*)
FROM projects
GROUP BY 1, 2
ORDER BY 1, 2;

Then I fed that data into a separate Google chart.


6 Things to Know about Successful Open Source Projects

FLOSSmole and SRDA data were used in Internet Success: A Study of Open Source Software Commons by Charlie Schweik and Bob English, and a story summarizing key findings from the book has now been published. If you haven't read it yet, do it!

Two new data sets

Hi moles! I've got two new datasets for you to play with. These aren't perfect, but they're a start of a new type of dataset for FLOSSmole!

(1) Apache Roles: This dataset stores information about people affiliated with all the subprojects of the Apache Software Foundation, their roles, and what project they're working on with that role. Data sources include: Apache web site pages, board meeting minutes, etc. (Pre-Print on FLOSShub describing collection, curation, storage, sample queries)

Sample data:

Column            Sample Row
Svn_id            jsmith
Real_name         John Smith
Project_name      Apache Axiom
Role_on_project   Committer
Organization      BigCorp
Datasource_id     367
Details           Appendix T

(2) Apache Twitter Screen Names: This dataset stores the verified twitter screen names of people affiliated with the Apache Software Foundation projects. Useful for matching to emails or source code commits, or to be used in tandem with the Apache roles dataset above. (Pre-print on FLOSShub describing collection, curation, storage, sample queries)

Sample data:

Column                Sample Row
Svn_id                jsmith
Twitter_screen_name   jsmith
Real_name             John Smith
Project_name          Apache Cayenne

Get the MySQL dumps on our FLOSSmole downloads page on Google Code or via direct database access.
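
Since both datasets carry an svn_id, they can be matched up with a simple join. The sketch below is an assumption-heavy illustration: `apache_roles` is a hypothetical table name, the column names are lowercased versions of the sample columns above, the second person is invented, and Python's sqlite3 stands in for the MySQL database.

```python
import sqlite3

# In-memory stand-in; "apache_roles" is a hypothetical table name and the
# rows are toy data built around the jsmith sample row from this post.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE apache_roles (
        svn_id TEXT, real_name TEXT, project_name TEXT, role_on_project TEXT
    );
    CREATE TABLE apache_twitter (
        svn_id TEXT, twitter_screen_name TEXT, real_name TEXT, project_name TEXT
    );
    INSERT INTO apache_roles VALUES
        ('jsmith', 'John Smith', 'Apache Axiom', 'Committer'),
        ('adoe', 'Ann Doe', 'Apache Axiom', 'PMC Chair');
    INSERT INTO apache_twitter VALUES
        ('jsmith', 'jsmith', 'John Smith', 'Apache Cayenne');
    """
)

# LEFT JOIN keeps people with no known Twitter handle (handle comes back NULL).
ROLES_WITH_HANDLES = """
    SELECT r.svn_id, r.project_name, r.role_on_project, t.twitter_screen_name
    FROM apache_roles r
    LEFT JOIN apache_twitter t ON r.svn_id = t.svn_id
    ORDER BY r.svn_id
"""
rows = conn.execute(ROLES_WITH_HANDLES).fetchall()
for row in rows:
    print(row)
```

If a person appears more than once in the Twitter table (for example, once per project), this join will repeat their role rows, so add DISTINCT or pre-aggregate the handles if that matters for your counts.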

Got a cool FLOSS-oriented dataset you want to share? If you wish to donate data to FLOSSmole, we can host it.

FLOSSdata move is complete

Hello moles! I have completed the move of all our data from the Teragrid over to FLOSSdata and it is ready to go.

There are some new things to know:

1. New schemas for 'old', 'sf' and 'udd' data. You're used to seeing those in the 'ossmole_merged' schema, but they've been moved out into their own schemas.

2. Reminder that the 'sf' data is quite old and we recommend that you use SRDA instead.

3. There are some new tables coming for Apache data. More information on these will be forthcoming.

4. Github was updated with 3.5 million new repos, which opens up a lot of possibilities for those of you mining GH! The datasource_id for those is 359.

5. If you still need a username / password on the new system, let me know (megansquire at gmail) and I will set you up.

