megan's blog

RubyGems collections updated

RubyGems collections for datasource_id # 61243 (July 2017) have been updated. They can be found in the database in the "rubygems" schema, or as flat files (latest datasource_id only) on the FLOSSmole data server.

Some sample queries:

SELECT project_license, count(*)
FROM `rubygems_project_facts`
where datasource_id=61243
group by 1 order by 2 desc;

ObjectWeb Collections updated

We updated our ObjectWeb collections. You can find out more about the data we have, or download flat data files in raw format or query the data directly from the FLOSSmole database. The most recent datasource_id is 70912.

Some sample queries:

What licenses are used most by ObjectWeb projects?

SELECT description, count(*)
FROM ow_project_licenses
WHERE datasource_id = 70912

Microsoft CodePlex data

Codeplex was Microsoft's open source code forge. It began in 2006 and shut down in 2017. We collected the data at the time of shutdown, and provided it here at FLOSSmole for anyone to use.

Data is available in raw format or in the FLOSSmole database.

Sample graphics

SELECT create_date, count(*)
FROM cp_projects

Data Model

cp_projects table:
This table holds 400k records of projects that were - at some point - created on Codeplex. This includes "spam" projects and removed projects. However, not every project will have many details filled in about it. This is because the details come from their "index" page (see next table) and we were not able to get an index page for about 300k of the spam/canceled projects.

proj_name: the short name of the project
datasource_id: the collection number (70910 in this case)
proj_long_name: the long name of the project
proj_url: the url on codeplex
description: the long description of the project (some of these are quite long)
current_dl_version: the latest version number of the software that you can download
current_dl_date: the date that the current downloadable version was posted
download_count: the count of downloads of the latest version
proj_status: the last state the project was in (alpha, beta, etc)
proj_license: what license does this project use?
create_date: the date the project was first created (month/year only - comes from the history page)
last_updated: the last datetime that this record was updated in our database

cp_projects_indexes table (database only):
This table holds 100k home page, history page, and developers page records for every non-canceled project on Codeplex. NOTE: even though some spam projects were removed from the 400k list above, there are still many projects in this list that I would consider "spam" since they are gibberish and hold no real code or data, and are being used for advertising.

proj_name: the short name of the project
datasource_id: the collection number (70910)
home_html: the Codeplex home page for the project
history_html: the Codeplex history page for the project (this is where we get the cp_projects.create_date from)
people_html: the Codeplex developer page where the developers for each project are listed

Lists the 864k unique history events (such as page changes) for the projects, during each collection (70910 in this case). Each project can have multiple of the same event during a given period.

history_id: a unique identifier for each history event (primary key)
proj_name: the short name for a project
datasource_id: the collection number
month, year: when the history event took place
page_name: what page was updated
page_url: what is the url of the page that was updated
author, author_url: who updated the page?
last_updated: the timestamp for when this record was last updated

Lists the 49k individual people that are working on ANY Codeplex project.

username: the Codeplex system username of the person
datasource_id: the collection number
personal_statement: typed by the person, tells something about themselves
member_since: the date the person joined
last_visit: the last date they used the site
user_html: the home page for the user on Codeplex
last_updated: the timestamp for when this record was last updated

Lists the person, project, and what role they had on that project at the time of collection. A person can only have a single role on a given project at one time.

proj_name: the short name of the project
datasource_id: the collection number
username: the person's Codeplex username
role: what their job was on this project
project_member_since: when they joined the project
last_updated: the timestamp for when this record was last updated

Google Code Project Create Dates

Project creation dates for every Google Code project from February 4, 2011 (when they first started tracking project creation dates) and when Google Code was shut down March 12, 2015.

Click to enlarge

Data is available in a gzipped CSV file or in the FLOSSmole database.

LKML (email) study: data/paper available

We presented this paper at the 2016 OpenSym this week.

Schneider, D., Spurlock, S., and M. Squire. (2016). Differentiating Communication Patterns of Leaders on the Linux Kernel Mailing List. In Proceedings of the 12th International Symposium on Open Collaboration (OpenSym 2016).

Slides (Google Docs)
Data Available (Torvalds & Kroah-Hartman emails, source code removed): data dump, request MySQL database access

How do various corporations populate the Apache projects?

With the FLOSSmole Apache Project/Contributor/Roles data we updated earlier today, we thought an interesting initial analysis would be to figure out how various corporations populate the Apache projects (at least according to the official lists of contributors posted on each Apache project page).

Here is a list of the Apache projects with the highest density of participation by a single corporation:

We only show the first page of results here.

How did we get this result?

1. We used the Apache Contributor Data Set described in this previous FLOSSmole blog posting. Each project in Apache family of projects lists their members, and sometimes they list what company that person works for. Here is an example from the Geronimo Project.

Not every project lists its members, and not every project lists its members' affiliations.

2. We limited our analysis to datasource_id 65935, or May 18, 2016

3. We created a view in SQL so that we could more easily calculate the percentage of the total number of developers for each project:

SELECT project_name, count(*) as 'devcount'
FROM apache_people_projects
WHERE real_name IN (
SELECT distinct(real_name) FROM `apache_people_projects` WHERE datasource_id=65935
ORDER BY `apache_people_projects`.`real_name`)
AND datasource_id=65935

4. Then we ran this SQL query to generate the data shown in the table above. The rows are sorted by the highest percent.

SELECT app.project_name, app.organization, count(*) as 'org dev count', app2.devcount as 'all devs', cast((count(*)/devcount)*100 as decimal(4,2)) as 'pct of team' FROM apache_people_projects app
INNER JOIN apache_project_dev_count_65935 app2
ON app.project_name = app2.project_name
WHERE app.real_name
IN (SELECT distinct(real_name) FROM `apache_people_projects` where datasource_id=65935)
AND app.organization IS NOT NULL
AND app.organization !=""
AND app.datasource_id=65935
ORDER BY 5 DESC, 1 asc;

Interested in getting this data? Apache Contributor Data Set

Want to see more examples of how to use FLOSSmole data? Examples

New "Apache Projects & Contributors" data dump

I spent a few days in May updating the list of all the Apache project contributors (full name & Apache system name when available) and their organizations when available. This data set was first released in 2013 in the MSR paper entitled "Project Roles in the Apache Foundation: A Data Set".


  • svn_id
  • real_name
  • web_site
  • datasource_id
  • project_name
  • role_on_project
  • details
  • email
  • organization
  • timezone
  • last_updated

here is a sample of what the data looks like:

click to enlarge

Most of the fields are nullable since many times the data is incomplete.

Download the flat file, or use the live database ('apache') on the FLOSSmole MySQL server.

RubyGems data updated June 2016

Hello moles, the latest RubyGems data has been collected. We now have two RubyGems collections:

  • 61240: November 2015
  • 61243: June 2016

The data can be found in two places:

Tables include:

  • rubygems_project_authors (the author(s) listed for each gem/project)
  • rubygems_project_create_dates (earliest known release date for a gem/project)
  • rubygems_project_devdep (development dependencies for each gem/ project)
  • rubygems_project_facts (basic project metadata scraped from project page)
  • rubygems_project_links (the list of links provided by each gem/project, ex: home page, documentation, etc)
  • rubygems_project_owners (the owner(s) listed for each gem/project)
  • rubygems_project_pages (the html and rss where we got this data; one per gem, per datasource_id)
  • rubygems_project_rtdep (runtime dependencies for each gem/project)
  • rubygems_project_versions (each time the gem/project is released, it creates a new version)

Bitcoin-dev, Ubuntu, Perl6, Django, Puppet IRC logs are updated

Thanks to the work of my two summer research assistants Evan Ashwell & Greg Batchelor, the IRC channels for #bitcoin-dev, perl6, #ubuntu, #django, and puppet (#gen, #dev, and #razor) have been updated.

Things to know:

  • These IRC chats are only available on the FLOSSmole MySQL database server (how to get access) and not as flat files. Why? Well, they started out as flat files, so we don't want to just re-host flat archives. The original flat files are available for Puppet (, Bitcoin-dev (, Ubuntu (Ubuntu Logs), Perl6 (Perl6 logs), and Django (Django IRC logs)
  • The data model is one day = one datasource id
  • The chat logs have been divided into the following columns (some logs have fewer columns):
    • datasource_id
    • line_num
    • line_message
    • type
    • send_user
    • date_of_entry
    • time_of_entry
    • unix_time
    • last_updated
  • An example row looks like the following:
    • 61835
    • 42
    • ah thanks. I'll search.
    • message
    • arubi
    • 2016-05-28
    • 21:38:00
    • l1464471492.0
    • 2016-06-02 13:03:3

All paper metadata from OSS 2016 are in

All metadata & citations for papers from OSS 2016 have been uploaded to As I find pre-prints available online I will add those too. If you are an author and you wish to have FLOSSpapers host your pre-print, just email it to me (msquire at elon dot edu) and I'll add it to the site!

Syndicate content