megan's blog

Mining Software Repositories '16 paper, slides & data

This weekend I'll be presenting at Mining Software Repositories 2016 in Austin, TX. My talk is in the data sets track, and it is entitled Data Sets: The Circle of Life in Ruby Hosting, 2003-2015 (PDF). Here are the slides.

Rubygems collection, Nov 2015

We have added Rubygems data under datasource_id 61240. Rubygems is the official gem host for Ruby projects.

The scripts we used to collect this data are available on GitHub, and the SQL dumps are available on our data server. Direct database access is also available: existing database users have been given access to the new database on the MySQL server, called 'Rubygems'.

OSS 2016 The 12th International Conference on Open Source Systems

The International Conference on Open Source Systems (OSS) is a long-standing international forum for researchers, practitioners from business and industry, enthusiasts, and students to present and discuss the latest trends, experiences, and concerns in the field of Free/Libre Open Source Software.

The 12th OSS Conference will take place in the city of Gothenburg, 30 May - 02 June 2016.

Call for contributions

August 2015 Launchpad data

We have added Launchpad data under datasource_id 58458. Launchpad is a repository for projects affiliated with Ubuntu. Summer research assistant Gavan Roth wrote some scripts to collect this data.

--Download the flat files, or
--Access and query this data via the MySQL interface

Here is a query to show some of the data that is available:

New papers added to FLOSShub

I've uploaded metadata (and when available, actual papers) for some relevant 2015 open source conferences (including the 2015 OSS conference, HICSS, and Mining Software Repositories) to FLOSShub/biblio.

There are now 1589 papers on FLOSShub/biblio. It makes a nice addition/backup/source for Google Scholar and the other larger publishing sites.


Database reorganization

Today my new research assistant Gavan & I are performing some maintenance tasks on the database, including a reorganization of the places where the data tables live. Hopefully this will mean that the data is much better organized.

Here is the GitHub summary of what we are doing, with a brief summary below.

We will leave old copies of the most popular tables for a few days, in order to give everyone time to rework scripts, etc.

Software Archaeology: GNUe IRC data & summaries

Back in the 2000s, the GNU Enterprise (GNUe) project chat logs (and human-created chat log summaries!) were used by several papers in the area of text summarization, especially dialogue summarization.

The reason the GNUe chat logs and summaries were used is that the logs were accompanied by summaries that were compiled periodically (manually) by a human. The summarized chat logs can thus be considered a kind of "gold standard" for what kind of summary a machine summarizer should produce.

Pastebin use on FLOSS social media, and the downsides

A pastebin is a web site where developers can paste in some code, get back a URL, and then share that URL with others. Pastebins are handy in IRC chat or in email, where a large block of source code would look ugly or lose its formatting. However, pastebin URLs disappear over time, and this presents a problem for those of us who collect old data or want to study software evolution.
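
As a rough illustration of how one might locate these links in an archive of chat logs or email, here is a minimal Python sketch. The host list, the function name find_pastebin_urls, and the sample log lines are all hypothetical and chosen only for illustration; a real study would need its own list of pastebin services.

```python
import re
from urllib.parse import urlsplit

# Hypothetical set of pastebin hosts; adjust for the archive being studied.
PASTEBIN_HOSTS = {"pastebin.com", "paste.ubuntu.com", "dpaste.com"}

# Grab anything that looks like an http(s) URL; cleanup happens afterwards.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def find_pastebin_urls(lines):
    """Yield (line_number, url) for each URL whose host is a listed pastebin."""
    for number, line in enumerate(lines, start=1):
        for raw in URL_PATTERN.findall(line):
            # Drop punctuation that often trails a URL in chat or email text.
            url = raw.rstrip(".,;:!?)>\"'")
            try:
                host = urlsplit(url).hostname or ""
            except ValueError:
                continue  # malformed URL, skip it
            if host.startswith("www."):
                host = host[4:]
            if host in PASTEBIN_HOSTS:
                yield number, url


# Made-up log lines, for illustration only.
sample_log = [
    "<alice> the traceback is at http://pastebin.com/abc123",
    "<bob> thanks, see also https://example.org/docs.",
    "<carol> my patch: https://paste.ubuntu.com/p/XyZ/ and http://www.pastebin.com/def456)",
]

hits = list(find_pastebin_urls(sample_log))
for number, url in hits:
    print(number, url)
```

A list like this only shows where the pastes were referenced. Whether the pasted code itself can still be retrieved is a separate question, since the URLs may have expired.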

FLOSS as a source for insults

FLOSSmole is hosting the data from a new paper: FLOSS as a source for profanity and insults: Collecting the data

Get the data

Get the slides

Hearsay Culture radio show link

I appeared (?) on David Levine's radio show Hearsay Culture out of KZSU-FM (Stanford U.) today. 10am PST for the stream, or listen later on a podcast at