How do various corporations populate the Apache projects?

With the FLOSSmole Apache Project/Contributor/Roles data we updated earlier today, we thought an interesting initial analysis would be to figure out how various corporations populate the Apache projects (at least according to the official lists of contributors posted on each Apache project page).

Here is a list of the Apache projects with the highest density of participation by a single corporation:

We only show the first page of results here.

How did we get this result?

1. We used the Apache Contributor Data Set described in this previous FLOSSmole blog posting. Many projects in the Apache family list their members, and some also list the company each person works for. Here is an example from the Geronimo Project.

Not every project lists its members, and not every project lists its members' affiliations.

2. We limited our analysis to datasource_id 65935, the collection from May 18, 2016.

3. We created a view in SQL so that we could more easily calculate the percentage of the total number of developers for each project:

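-- View name taken from the join in step 4
CREATE VIEW apache_project_dev_count_65935 AS
SELECT project_name, count(*) AS devcount
FROM apache_people_projects
WHERE real_name IN (
SELECT DISTINCT real_name FROM `apache_people_projects` WHERE datasource_id=65935)
AND datasource_id=65935
-- one row per project; without GROUP BY the count collapses to a single total
GROUP BY project_name;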

4. Then we ran this SQL query to generate the data shown in the table above. Rows are sorted by percentage, highest first.

-- decimal(5,2) rather than decimal(4,2) so that a value of 100.00 fits
SELECT app.project_name, app.organization, count(*) AS `org dev count`, app2.devcount AS `all devs`, cast((count(*)/app2.devcount)*100 AS decimal(5,2)) AS `pct of team`
FROM apache_people_projects app
INNER JOIN apache_project_dev_count_65935 app2
ON app.project_name = app2.project_name
WHERE app.real_name
IN (SELECT DISTINCT real_name FROM `apache_people_projects` WHERE datasource_id=65935)
AND app.organization IS NOT NULL
AND app.organization != ""
AND app.datasource_id=65935
GROUP BY app.project_name, app.organization, app2.devcount
ORDER BY 5 DESC, 1 ASC;
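
To make steps 3 and 4 easier to try without database access, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. Everything in it is an assumption made for illustration: the rows are invented, the table is cut down to four columns, the view is called dev_count, and the SQL is SQLite dialect rather than the MySQL used on the FLOSSmole server.

```python
import sqlite3

# Invented rows standing in for apache_people_projects (not real FLOSSmole data).
rows = [
    ("Alice A", "Apache Foo", "BigCorp", 65935),
    ("Bob B", "Apache Foo", "BigCorp", 65935),
    ("Carol C", "Apache Foo", "OtherCo", 65935),
    ("Dan D", "Apache Foo", None, 65935),      # no affiliation listed
    ("Erin E", "Apache Bar", "BigCorp", 65935),
    ("Frank F", "Apache Bar", "", 65935),       # empty affiliation
    ("Old O", "Apache Bar", "BigCorp", 1578),   # other collection, ignored
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apache_people_projects ("
    "real_name TEXT, project_name TEXT, organization TEXT, datasource_id INTEGER)"
)
conn.executemany("INSERT INTO apache_people_projects VALUES (?, ?, ?, ?)", rows)

# Step 3: number of listed developers per project for one collection.
conn.execute("""
    CREATE VIEW dev_count AS
    SELECT project_name, COUNT(*) AS devcount
    FROM apache_people_projects
    WHERE datasource_id = 65935
    GROUP BY project_name
""")

# Step 4: each organization's share of a project's listed developers.
result = conn.execute("""
    SELECT app.project_name, app.organization,
           COUNT(*) AS org_dev_count,
           dc.devcount AS all_devs,
           ROUND(COUNT(*) * 100.0 / dc.devcount, 2) AS pct_of_team
    FROM apache_people_projects app
    JOIN dev_count dc ON app.project_name = dc.project_name
    WHERE app.datasource_id = 65935
      AND app.organization IS NOT NULL
      AND app.organization != ''
    GROUP BY app.project_name, app.organization, dc.devcount
    ORDER BY pct_of_team DESC, app.project_name ASC
""").fetchall()

for row in result:
    print(row)
```

Note that the denominator counts every listed developer, including those with no affiliation, so a project where few people list an employer will show low percentages for every organization.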

Interested in getting this data? Apache Contributor Data Set

Want to see more examples of how to use FLOSSmole data? Examples

New "Apache Projects & Contributors" data dump

I spent a few days in May updating the list of all Apache project contributors (full name and Apache system name when available) and their organizations when available. This data set was first released in 2013 in the MSR paper entitled "Project Roles in the Apache Software Foundation: A Dataset".

The data set includes these fields:

  • svn_id
  • real_name
  • web_site
  • datasource_id
  • project_name
  • role_on_project
  • details
  • email
  • organization
  • timezone
  • last_updated

Here is a sample of what the data looks like:


Most of the fields are nullable because the source data is often incomplete.

Download the flat file, or use the live database ('apache') on the FLOSSmole MySQL server.

New schema for IRC data

In my continuing quest to be organized, I've created a new schema to hold just the IRC log data. On the database server (access instructions here), there is a new schema called 'irc'. For now it includes Ubuntu logs, Django logs, logs from 7 Apache projects, and the topic lines from Freenode for all channels with 3+ users.

Coming soon: email updates, including the Linux Kernel Mailing List (LKML), and more IRC (WordPress, etc.).


New Apache project IRC data

Hello moles! Happy January. Here are some fresh new data sources for your mining pleasure:

1. Freenode channel list and topics (all public channels with 3 or more users). The table is called "fn_irc_channels".
2. Apache ActiveMQ IRC logs (one datasource_id per day, one row per message).
3. Apache Aries IRC logs
4. Apache Camel IRC logs
5. Apache CXF IRC logs
6. Apache Karaf IRC logs
7. Apache Kalumet IRC logs
8. Apache ServiceMix IRC logs

Here is a sample of what the structure looks like for items 2-8:

CREATE TABLE `apache_servicemix_irc` (
`datasource_id` int(11) NOT NULL,
`line_num` int(11) NOT NULL,
`full_line_text` varchar(500) NOT NULL,
`line_message` varchar(500) NOT NULL,
`date_of_entry` date NOT NULL,
`time_of_entry` varchar(5) NOT NULL,
`type` enum('action','system','message') NOT NULL,
`about_user` varchar(50) NOT NULL,
`last_updated` datetime NOT NULL,
PRIMARY KEY (`datasource_id`,`line_num`)
);

These are available on the live MySQL connection.
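
As a rough illustration of how this layout can be queried, the sketch below rebuilds a simplified copy of the table in an in-memory SQLite database and counts lines per day and per type. The rows, the datasource_id values (9001, 9002), and the raw log line format are all invented for the example, and the MySQL enum is approximated with a CHECK constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified SQLite version of the MySQL table; column names match the original.
conn.execute("""
    CREATE TABLE apache_servicemix_irc (
        datasource_id  INTEGER NOT NULL,
        line_num       INTEGER NOT NULL,
        full_line_text TEXT NOT NULL,
        line_message   TEXT NOT NULL,
        date_of_entry  TEXT NOT NULL,
        time_of_entry  TEXT NOT NULL,
        type           TEXT NOT NULL CHECK (type IN ('action', 'system', 'message')),
        about_user     TEXT NOT NULL,
        last_updated   TEXT NOT NULL,
        PRIMARY KEY (datasource_id, line_num)
    )
""")

# Invented sample rows: one datasource_id per day, one row per log line.
updated = "2014-01-20 00:00:00"
sample = [
    (9001, 1, "[10:01] <ann> hello", "hello", "2014-01-02", "10:01", "message", "ann", updated),
    (9001, 2, "[10:02] * bob waves", "waves", "2014-01-02", "10:02", "action", "bob", updated),
    (9001, 3, "[10:03] <bob> hi ann", "hi ann", "2014-01-02", "10:03", "message", "bob", updated),
    (9002, 1, "[09:15] ann has joined", "has joined", "2014-01-03", "09:15", "system", "ann", updated),
    (9002, 2, "[09:16] <ann> morning", "morning", "2014-01-03", "09:16", "message", "ann", updated),
]
conn.executemany(
    "INSERT INTO apache_servicemix_irc VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", sample
)

# Count log lines per day, split by line type.
per_day = conn.execute("""
    SELECT date_of_entry, type, COUNT(*)
    FROM apache_servicemix_irc
    GROUP BY date_of_entry, type
    ORDER BY date_of_entry, type
""").fetchall()

for row in per_day:
    print(row)
```

Filtering on type = 'message' is probably the first thing to do in most analyses, since join and quit notices are stored as 'system' rows in the same table.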

New Apache People-Roles-Projects data

Hot off the presses! Another update to the Apache people-roles-projects data:

Datasources 1578-1585 have updated information on people working on Apache projects, including committer lists, PMC lists, PMC chairs, etc.

Timezones are now being collected as well.

This is an update to the original dataset described in the paper "Project Roles in the Apache Software Foundation: A Dataset" (2013), written by yours truly.

Apache Camel data

We have released several files of Apache Camel IRC log data, originally stored by Dan Kulp.
More about Apache Camel

Related Data Sets
Apache Twitter Handles
Apache Project People & Roles

Sample Queries for the IRC data:

List the most prolific IRC posters, in order of their post count
SELECT about_user, count(*) AS post_count
FROM apache_camel_irc
GROUP BY about_user
ORDER BY post_count DESC;

List the Twitter handles and svn_ids (if known) for anyone who is also on Apache Camel's IRC
SELECT DISTINCT i.about_user, t.twitter_screen_name, t.svn_id
FROM apache_camel_irc i
INNER JOIN apache_twitter t
ON i.about_user = t.svn_id;
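
For readers who want to experiment before connecting to the server, here is a small sketch of both sample queries running against toy tables in SQLite. The rows and the cut-down column lists are assumptions for illustration only; the real tables have more columns and live in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apache_camel_irc (datasource_id INTEGER, line_num INTEGER, about_user TEXT)"
)
conn.execute("CREATE TABLE apache_twitter (svn_id TEXT, twitter_screen_name TEXT)")

# Invented IRC lines: (datasource_id, line_num, about_user).
conn.executemany(
    "INSERT INTO apache_camel_irc VALUES (?, ?, ?)",
    [(393, 1, "jsmith"), (393, 2, "jsmith"), (393, 3, "guest42"),
     (394, 1, "jdoe"), (394, 2, "jsmith")],
)
# Invented Twitter mappings: (svn_id, twitter_screen_name).
conn.executemany(
    "INSERT INTO apache_twitter VALUES (?, ?)",
    [("jsmith", "jsmith_tweets"), ("jdoe", "jdoe")],
)

# Query 1: most prolific posters, highest count first (ties broken by name).
posters = conn.execute("""
    SELECT about_user, COUNT(*) AS post_count
    FROM apache_camel_irc
    GROUP BY about_user
    ORDER BY post_count DESC, about_user
""").fetchall()

# Query 2: Twitter handles for IRC users whose nick equals a known svn_id.
handles = conn.execute("""
    SELECT DISTINCT i.about_user, t.twitter_screen_name, t.svn_id
    FROM apache_camel_irc i
    INNER JOIN apache_twitter t ON i.about_user = t.svn_id
    ORDER BY i.about_user
""").fetchall()

print(posters)
print(handles)
```

The second query only finds people whose IRC nickname is identical to their svn_id, so a user such as guest42 in the toy data drops out of the join even if they have a Twitter account on record under another name.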

The datasources for the IRC data are (currently) #393-1572. Each log file (daily) gets its own datasource_id (since each one is a separate source)!

December 2013 data released

December data has been released. We have a few old standbys (fc, rf, ow, sv, al) and some hot fresh data as well.

What is new, you ask? Well, we have some IRC chat log data for the Apache project Camel [1]. A nice new social data set, all parsed and organized into relational database format for you to query.

Get direct database access
Or, download the files either as delimited flat files or as .sql files ("marts")

[1] Why Apache Camel? Well, that project was one of the few Apache projects where we were able to get both email logs in mbox format and IRC chat logs. More projects are forthcoming (we're working on that now).

Two new data sets

Hi moles! I've got two new datasets for you to play with. These aren't perfect, but they're the start of a new type of dataset for FLOSSmole!

(1) Apache Roles: This dataset stores information about people affiliated with all the subprojects of the Apache Software Foundation, their roles, and what project they're working on with that role. Data sources include: Apache web site pages, board meeting minutes, etc. (Pre-Print on FLOSShub describing collection, curation, storage, sample queries)

Sample data:

Column            Sample Row
Svn_id            jsmith
Real_name         John Smith
Project_name      Apache Axiom
Role_on_project   Committer
Organization      BigCorp
Email             jsmith@bigcorp.com
Web_site          http://www.apache.org/~jsmith
Datasource_id     367
Details           Appendix T

(2) Apache Twitter Screen Names: This dataset stores the verified Twitter screen names of people affiliated with the Apache Software Foundation projects. It is useful for matching to emails or source code commits, or for use in tandem with the Apache roles dataset above. (Pre-print on FLOSShub describing collection, curation, storage, sample queries)

Sample data:

Column                Sample Row
Svn_id                jsmith
Twitter_screen_name   jsmith
Real_name             John Smith
Project_name          Apache Cayenne

Get the MySQL dumps on our FLOSSmole downloads page on Google Code or via direct database access.

Got a cool FLOSS-oriented dataset you want to share? If you wish to donate data to FLOSSmole, we can host it.
