Google Code Project Create Dates

Project creation dates for every Google Code project created between February 4, 2011 (when Google Code first started tracking project creation dates) and March 12, 2015, when Google Code was shut down.

Data is available in a gzipped CSV file or in the FLOSSmole database.
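
For anyone loading the gzipped CSV in Python, here is a minimal sketch. The column names (`project_name`, `create_date`) and the YYYY-MM-DD date format are assumptions for illustration only, not the documented layout of the file, and the sample rows are made up. Check the header of the actual download and adjust.

```python
import csv
import gzip
import io
from datetime import date

# Date range covered by the data set, as described above.
TRACKING_START = date(2011, 2, 4)
SHUTDOWN = date(2015, 3, 12)


def read_create_dates(gz_bytes):
    """Return {project_name: create_date} from gzipped CSV bytes.

    Assumes a header row with the hypothetical columns `project_name`
    and `create_date` (YYYY-MM-DD); the real file may differ.
    """
    buffer = io.BytesIO(gz_bytes)
    with gzip.open(buffer, mode="rt", newline="", encoding="utf-8") as handle:
        return {
            row["project_name"]: date.fromisoformat(row["create_date"])
            for row in csv.DictReader(handle)
        }


# Made-up sample standing in for the downloaded file.
sample = gzip.compress(
    b"project_name,create_date\n"
    b"example-one,2011-02-04\n"
    b"example-two,2014-07-19\n"
)
dates = read_create_dates(sample)

# Sanity check: every date should fall inside the tracked window.
in_range = all(TRACKING_START <= d <= SHUTDOWN for d in dates.values())
```

For the real file, read it with `open(path, "rb").read()` and pass the bytes in, or call `gzip.open(path, "rt")` on the path directly.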

LKML (email) study: data/paper available

We presented this paper at OpenSym 2016 this week.

Schneider, D., Spurlock, S., and Squire, M. (2016). Differentiating Communication Patterns of Leaders on the Linux Kernel Mailing List. In Proceedings of the 12th International Symposium on Open Collaboration (OpenSym 2016).

Data available (Torvalds & Kroah-Hartman emails, source code removed): data dump, or request MySQL database access

New "Apache Projects & Contributors" data dump

I spent a few days in May updating the list of all the Apache project contributors (full name & Apache system name when available) and their organizations when available. This data set was first released in 2013 in the MSR paper entitled "Project Roles in the Apache Foundation: A Data Set".

The data set has the following fields:


  • svn_id
  • real_name
  • web_site
  • datasource_id
  • project_name
  • role_on_project
  • details
  • email
  • organization
  • timezone
  • last_updated


Most of the fields are nullable because the data is often incomplete.

Download the flat file, or use the live database ('apache') on the FLOSSmole MySQL server.
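
Because so many fields can be empty, scripts that read the flat file should handle missing values explicitly. The sketch below is one way to do that in Python. It assumes a tab-delimited file with the fields in the order listed above, and it assumes missing values appear as an empty string, `NULL`, or `\N`. The real export format is not documented here, and the sample record is made up.

```python
import csv
import io

# Field order as listed above; assumed to match the flat file's column order.
FIELD_NAMES = [
    "svn_id", "real_name", "web_site", "datasource_id", "project_name",
    "role_on_project", "details", "email", "organization", "timezone",
    "last_updated",
]

# Assumed spellings of "no value" in the export.
NULL_MARKERS = {"", "NULL", r"\N"}


def parse_rows(text, delimiter="\t"):
    """Parse delimited text into dicts, mapping null markers to None."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [
        {
            name: (None if value in NULL_MARKERS else value)
            for name, value in zip(FIELD_NAMES, values)
        }
        for values in reader
    ]


# Made-up record with several missing fields.
sample_values = [
    "jdoe", "Jane Doe", "", "0", "example-project", "committer",
    "", r"\N", "Example Org", "NULL", "2016-05-20 10:00:00",
]
rows = parse_rows("\t".join(sample_values) + "\n")
missing = [name for name, value in rows[0].items() if value is None]
```

Here `missing` lists the fields that were empty for the record, which makes it easy to count how complete each column is across the whole file.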

RubyGems data updated June 2016

Hello moles, the latest RubyGems data has been collected. We now have two RubyGems collections:

  • 61240: November 2015
  • 61243: June 2016

The data can be found in two places:

Tables include:

  • rubygems_project_authors (the author(s) listed for each gem/project)
  • rubygems_project_create_dates (earliest known release date for a gem/project)
  • rubygems_project_devdep (development dependencies for each gem/project)
  • rubygems_project_facts (basic project metadata scraped from project page)
  • rubygems_project_links (the list of links provided by each gem/project, e.g. home page, documentation, etc.)
  • rubygems_project_owners (the owner(s) listed for each gem/project)
  • rubygems_project_pages (the html and rss where we got this data; one per gem, per datasource_id)
  • rubygems_project_rtdep (runtime dependencies for each gem/project)
  • rubygems_project_versions (each time the gem/project is released, it creates a new version)

Bitcoin-dev, Ubuntu, Perl6, Django, Puppet IRC logs are updated

Thanks to the work of my two summer research assistants, Evan Ashwell & Greg Batchelor, the IRC channels for #bitcoin-dev, #perl6, #ubuntu, #django, and puppet (#gen, #dev, and #razor) have been updated.

Things to know:

  • These IRC chats are only available on the FLOSSmole MySQL database server (how to get access) and not as flat files. Why? Well, they started out as flat files, so we don't want to just re-host flat archives. The original flat files are available for Puppet (puppetlogs.com), Bitcoin-dev (bitcoinstats.com), Ubuntu (Ubuntu Logs), Perl6 (Perl6 logs), and Django (Django IRC logs)
  • The data model is one day of logs = one datasource_id
  • The chat logs have been divided into the following columns (some logs have fewer columns):
    • datasource_id
    • line_num
    • line_message
    • type
    • send_user
    • date_of_entry
    • time_of_entry
    • unix_time
    • last_updated
  • An example row looks like the following:
    • 61835
    • 42
    • ah thanks. I'll search.
    • message
    • arubi
    • 2016-05-28
    • 21:38:00
    • 1464471492.0
    • 2016-06-02 13:03:3
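
To show how these columns fit together, here is a small Python sketch that loads the example row into a record and converts `unix_time` to a timestamp. It assumes `unix_time` is seconds since the Unix epoch in UTC (taken here as 1464471492.0). For this row that assumption gives 21:38:12 on 2016-05-28, which agrees with the date and time columns to the minute. The `IrcLine` class is a hypothetical helper, not part of FLOSSmole, and `last_updated` is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class IrcLine:
    """One row of an IRC log table, using the column names listed above."""
    datasource_id: int
    line_num: int
    line_message: str
    type: str
    send_user: str
    date_of_entry: str
    time_of_entry: str
    unix_time: float

    def timestamp_utc(self):
        # Assumption: unix_time is seconds since the epoch, in UTC.
        return datetime.fromtimestamp(self.unix_time, tz=timezone.utc)


row = IrcLine(
    datasource_id=61835,
    line_num=42,
    line_message="ah thanks. I'll search.",
    type="message",
    send_user="arubi",
    date_of_entry="2016-05-28",
    time_of_entry="21:38:00",
    unix_time=1464471492.0,
)
stamp = row.timestamp_utc()
```

Since the time column in this row appears truncated to the minute, `unix_time` is probably the better choice when ordering messages within a channel.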

RubyGems.org collection, Nov 2015

We have added RubyGems.org data under datasource_id 61240. RubyGems.org is the official gem host for Ruby projects.

The scripts we used to collect this data are available on GitHub and the SQL dumps are available on our data server. Direct database access is also available. Existing database users were given access to this new database on the MySQL server, called 'Rubygems'.

August 2015 Launchpad data

We have added Launchpad data under datasource_id 58458. Launchpad is a repository for projects affiliated with Ubuntu. Summer research assistant Gavan Roth wrote some scripts to collect this data.

--Download the flat files, or
--Access and query this data via the MySQL interface

Here is a query to show some of the data that is available:

SELECT programming_language, COUNT(*)
FROM `lpd_programming_languages`
WHERE datasource_id = 58458
GROUP BY programming_language
ORDER BY COUNT(*) DESC;
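
A per-language count like this can also be tried locally before connecting to the server. The sketch below uses Python with an in-memory SQLite table as a stand-in for the MySQL table. Only the table name and the `programming_language` and `datasource_id` columns come from the query above. The `project_name` column and all of the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in table; `project_name` is a hypothetical column.
conn.execute(
    "CREATE TABLE lpd_programming_languages ("
    "datasource_id INTEGER, project_name TEXT, programming_language TEXT)"
)
conn.executemany(
    "INSERT INTO lpd_programming_languages VALUES (?, ?, ?)",
    [
        (58458, "example-a", "Python"),
        (58458, "example-b", "Python"),
        (58458, "example-c", "C"),
        (1, "example-d", "Java"),  # different collection, filtered out below
    ],
)

# Count rows per language for one collection, most common first.
counts = conn.execute(
    """
    SELECT programming_language, COUNT(*) AS row_count
    FROM lpd_programming_languages
    WHERE datasource_id = ?
    GROUP BY programming_language
    ORDER BY row_count DESC, programming_language
    """,
    (58458,),
).fetchall()
conn.close()
```

The `GROUP BY` clause is what produces one count per language. Without it, the aggregate collapses the whole collection into a single row.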

Database reorganization

Today my new research assistant Gavan & I are performing some maintenance tasks on the database, including a reorganization of the places where the data tables live. Hopefully this will mean that the data is much better organized.

Here is the GitHub summary of what we are doing; a brief summary follows below.

We will leave old copies of the most popular tables for a few days, in order to give everyone time to rework scripts, etc.

sf database to 'sourceforge' (remember this is only pre-2009 metadata)

al_* tables to 'alioth' schema
fm_* tables to 'freecode' schema
fsf_* tables to 'free_software_foundation' schema
gh_* tables to 'github' schema
gc_* tables to 'google_code' schema
lpd_* tables to 'launchpad' schema
ow_* tables to 'objectweb' schema
rf_* tables to 'rubyforge' schema
sv_* tables to 'savannah' schema
tig_* tables to 'tigris' schema
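
If you have scripts that refer to the old table names, the prefix-to-schema mapping above can be captured in a small lookup. This Python sketch only reports which schema a table has moved to. Whether table names keep their prefixes after the move is not stated here, so it does not try to build a fully qualified name. The `schema_for` helper and the `gh_projects` table name are hypothetical.

```python
# Table-name prefix -> new schema, copied from the list above.
PREFIX_TO_SCHEMA = {
    "al_": "alioth",
    "fm_": "freecode",
    "fsf_": "free_software_foundation",
    "gh_": "github",
    "gc_": "google_code",
    "lpd_": "launchpad",
    "ow_": "objectweb",
    "rf_": "rubyforge",
    "sv_": "savannah",
    "tig_": "tigris",
}


def schema_for(table_name):
    """Return the new schema for a prefixed table name, or None if unknown."""
    for prefix, schema in PREFIX_TO_SCHEMA.items():
        if table_name.startswith(prefix):
            return schema
    return None


launchpad_schema = schema_for("lpd_programming_languages")
github_schema = schema_for("gh_projects")  # hypothetical table name
unknown_schema = schema_for("some_other_table")
```

None of the prefixes is a prefix of another, so the order of the lookup does not matter.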

Software Archaeology: GNUe IRC data & summaries

Back in the 2000s, the GNU Enterprise (GNUe) project chat logs (and human-created chat log summaries!) were used by several papers in the area of text summarization, especially dialogue summarization.

The reason the GNUe chat logs and summaries were used is that the logs were accompanied by summaries that were compiled periodically (manually) by a human. The summarized chat logs can thus be considered a kind of "gold standard" for what kind of summary a machine summarizer should produce.

Here are some papers that reference the GNUe chat logs or the summaries:
--Zhou & Hovy (2005) Digesting virtual "geek" culture: the summarization of technical internet relay chats
--Elliott & Scacchi (2007) Free Software Development: Cooperation and Conflict in a Virtual Organizational Culture
--Uthus & Aha. Multiparticipant chat analysis: A survey
--Sood, Mohamed, & Varma. Topic-focused summarization of chat conversations

Unfortunately, the group that put together the summaries ("Kernel Traffic") no longer has a web presence, and the summaries and original log files are no longer available at any of the locations those papers link to.

FLOSSmole to the rescue! Here are the files that have been reconstructed, using what we could find on the Wayback Machine (archive.org):
-1-original chat logs for GNUe
-2-original Kernel Traffic GNUe chat summaries
-3-text list of URLs for chat logs, taken from Archive.org (used to build #1 above)
-4-text list of URLs for XML summaries of the chat logs, taken from Archive.org (used to build #2 above)
-5-all source code for how I collected and parsed this data, on GitHub
-6-all data loaded into the FLOSSmole database interface on MySQL server (get your username and password)

In the database (#6 in the list above), there are 6 tables:
--GNUeIRCLogs (the log files themselves)
--GNUeSummaryItems (the text & metadata for the weekly summary)
--GNUeSummaryMentions (all the people mentioned in each summary)
--GNUeSummaryPara (the paragraph summary text, links removed)
--GNUeSummaryParaQuote (the quoted text from the logs that made it into the summary itself as quoted text)
--GNUeSummaryTopic (the topic that the summarizer classified each summary into)

IRC log updates: perl6, ubuntu, django

Hi moles! New IRC chat logs are now cleaned and stored in the irc database on the FLOSSmole MySQL server, thanks to Andrea Black, one of our intrepid FLOSSmole research assistants. This data is part of an overall IRC collection started by another student, Becca Gazda, last summer.

We now have the following IRC chat histories available:


--general irc


Perl6 -- NEW

Coming soon: Bitcoin

