Database Schema June 2015

The PNG (graphic) files below display the current database schema for FLOSSmole. You can also find all of the PNG files and the MWB (MySQL Workbench) file in the zipped file below.

Each file below shows a different section of the database. Because the database holds a large amount of information, splitting the schema this way lets you look at a single file and quickly decide whether that section interests you.

Database reorganization

Today my new research assistant Gavan & I are performing some maintenance tasks on the database, including a reorganization of the places where the data tables live. Hopefully this will mean that the data is much better organized.

Here is the GitHub summary of what we are doing, with a brief summary below.

We will leave old copies of the most popular tables for a few days, in order to give everyone time to rework scripts, etc.

Renamed:
sf database to 'sourceforge' (remember this is only pre-2009 metadata)

Moved:
OSSMOLE_MERGED SCHEMA
al_* tables to 'alioth' schema
fm_* tables to 'freecode' schema
fsf_* tables to 'free_software_foundation' schema
gh_* tables to 'github' schema
gc_* tables to 'google_code' schema
lpd_* tables to 'launchpad' schema
ow_* tables to 'objectweb' schema
rf_* tables to 'rubyforge' schema
sv_* tables to 'savannah' schema
tig_* tables to 'tigris' schema
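
For anyone reworking scripts, the move list above can be sketched as a small Python lookup. This is only an illustration, not part of our maintenance scripts: it assumes that tables kept their original names (prefix included) when they moved, that old queries referred to tables as ossmole_merged.table_name, and the table names in the usage note are hypothetical.

```python
import re

# Prefix-to-schema mapping copied from the "Moved" list above.
PREFIX_TO_SCHEMA = {
    "al_": "alioth",
    "fm_": "freecode",
    "fsf_": "free_software_foundation",
    "gh_": "github",
    "gc_": "google_code",
    "lpd_": "launchpad",
    "ow_": "objectweb",
    "rf_": "rubyforge",
    "sv_": "savannah",
    "tig_": "tigris",
}

# Assumption: old scripts qualified tables as "ossmole_merged.<table>".
_QUALIFIED = re.compile(r"\bossmole_merged\.(\w+)", re.IGNORECASE)


def new_location(table_name, old_schema="ossmole_merged"):
    """Return 'schema.table' for a table that used to live in ossmole_merged.

    Assumes the table name itself is unchanged by the move. Tables whose
    prefix is not in the list are left in the old schema.
    """
    for prefix, schema in PREFIX_TO_SCHEMA.items():
        if table_name.startswith(prefix):
            return f"{schema}.{table_name}"
    return f"{old_schema}.{table_name}"


def rewrite_query(sql):
    """Rewrite every 'ossmole_merged.<table>' reference in a SQL string."""
    return _QUALIFIED.sub(lambda m: new_location(m.group(1)), sql)
```

For example, rewrite_query("SELECT * FROM ossmole_merged.sv_projects") would produce a query against savannah.sv_projects (sv_projects being a made-up table name). Check the result against the actual schema before relying on it.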

Software Archaeology: GNUe IRC data & summaries

Back in the 2000s, the GNU Enterprise (GNUe) project chat logs (and human-created chat log summaries!) were used by several papers in the area of text summarization, especially dialogue summarization.

The reason the GNUe chat logs and summaries were used is that the logs were accompanied by summaries that were compiled periodically (manually) by a human. The summarized chat logs can thus be considered a kind of "gold standard" for what kind of summary a machine summarizer should produce.

Here are some papers that reference the GNUe chat logs or the summaries:
--Zhou & Hovy (2005) Digesting virtual "geek" culture: the summarization of technical internet relay chats
--Elliott & Scacchi (2007) Free Software Development: Cooperation and Conflict in a Virtual Organizational Culture
--Uthus & Aha. Multiparticipant chat analysis: A survey
--Sood, Mohamed, & Varma. Topic-focused summarization of chat conversations

Unfortunately, the group that put together the summaries ("Kernel Traffic") no longer has a web presence, and the summaries and original log files are no longer available at any of the locations those papers link to.

FLOSSmole to the rescue! Here are the files that have been reconstructed, using what we could find on the Wayback Machine (archive.org):
-1-original chat logs for GNUe
-2-original Kernel Traffic GNUe chat summaries
-3-text list of URLs for chat logs, taken from Archive.org (used to build #1 above)
-4-text list of URLs for XML summaries of the chat logs, taken from Archive.org (used to build #2 above)
-5-all source code for how I collected and parsed this data, on GitHub
-6-all data loaded into the FLOSSmole database interface on MySQL server (get your username and password)

In the database (#6 in the list above), there are 6 tables:
--GNUeIRCLogs (the log files themselves)
--GNUeSummaryItems (the text & metadata for the weekly summary)
--GNUeSummaryMentions (all the people mentioned in each summary)
--GNUeSummaryPara (the paragraph summary text, links removed)
--GNUeSummaryParaQuote (the quoted text from the logs that made it into the summary itself as quoted text)
--GNUeSummaryTopic (the topic that the summarizer classified each summary into)

Pastebin use on FLOSS social media, and the downsides

A pastebin is a web site where developers can paste in some code, get back a URL, and then share that URL with others. Pastebins are handy in IRC chat or email, where a large block of source code would look ugly or lose its formatting. However, pastebin URLs disappear over time, which is a problem for those of us who collect old data or want to study software evolution.
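
To make the idea concrete, here is a minimal Python sketch of how pastebin links might be spotted in the text of an email or chat message. It is not the method used in any FLOSSmole collection or paper; the host list is a small illustrative assumption, and the function name is hypothetical.

```python
import re
from urllib.parse import urlparse

# Illustrative host list only (an assumption); a real study would need
# a much longer, carefully curated list of pastebin services.
PASTEBIN_HOSTS = {
    "pastebin.com",
    "gist.github.com",
    "paste.ubuntu.com",
    "dpaste.com",
}

# Rough URL matcher: stops at whitespace, quotes, and closing brackets.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def pastebin_urls(text):
    """Return the URLs in `text` whose host is in PASTEBIN_HOSTS."""
    found = []
    for url in URL_RE.findall(text):
        # Drop punctuation that commonly trails a URL in running prose.
        url = url.rstrip(".,;:")
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host in PASTEBIN_HOSTS:
            found.append(url)
    return found
```

Running pastebin_urls over each message in an archive gives a list of links that could then be checked to see which ones still resolve.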

My former student Amber Smith (Elon, '14) and I wrote a paper investigating the rise of the pastebin, specifically as it is used in FLOSS mailing lists. We wanted to know how pastebins developed and whether their adoption curve followed classic Diffusion of Innovation theory. We also wondered how different types of pastebins evolved differently, and whether there was pushback about using them.

The paper was presented at OSS2015 in Florence, May 16-17. Get the paper over at FLOSShub.org/biblio!

IRC log updates: perl6, ubuntu, django

Hi moles! New IRC chat logs are now cleaned and stored in the irc database on the FLOSSmole MySQL server, thanks to Andrea Black, one of our intrepid FLOSSmole research assistants. This data is part of an overall IRC collection started by another student, Becca Gazda, last summer.

We now have the following IRC chat histories available:

Apache
--activemq
--aries
--camel
--cxf
--kalumet
--karaf
--servicemix

Wordpress
--bbpress
--buddypress
--buddypress-dev
--coreplugins
--dev
--events
--gsoc
--meta
--mobile
--polyglots
--sfd
--themes
--ui
--general irc

Openstack
--devdns
--infra
--meeting3
--meetingalt
--meeting
--irc

Django
Perl6 -- NEW
Ubuntu

Coming soon: Bitcoin

FLOSS as a source for insults

FLOSSmole is hosting the data from a new paper: FLOSS as a source for profanity and insults: Collecting the data

Get the data

Get the slides

Hearsay Culture radio show link

I appeared (?) on David Levine's radio show Hearsay Culture out of KZSU-FM (Stanford U.) today. 10am PST for the stream, or listen later on a podcast at HearsayCulture.com.

A few times in the show we referred to various things; here are the relevant links:
--some slides we were looking at
--a forthcoming paper about insults and profanity in FLOSS will be presented at HICSS in January 2015
--if you want to check out the data, we have flat files and database access available. There are a few datasources not yet cleaned enough for public consumption (mostly irc and email data) but you can get a sense of the types of data we have

Slides for All Things Open presentation

I'm presenting today at All Things Open in Raleigh on Why and How Researchers are Studying Open Source. My goal is to show a wide variety of papers (not necessarily the "best" or most oft-cited papers, but a variety of techniques and motivations) in order to give developers a high-level idea of what kind of research is happening with the open source artifacts that they create.

Slides

UPDATE: based on a cursory count of Twitter attention, the most popular part of the presentation seems to have been the findings from Guzman et al. (MSR 2014) showing that Java code comments were the most negative and that code comments were more negative on Mondays. :)

New schema for IRC data

In my continuing quest to be organized, I've created a new schema to hold just the IRC log data. On the database server (access instructions here), there is a new schema called 'irc' and it includes (for now) Ubuntu logs, Django logs, 7 Apache projects, and the topic lines from Freenode for all channels with 3+ users.

Coming soon: email updates, including Linux Kernel Mailing List (LKML) and more IRC (Wordpress, etc).

Enjoy!

Freecode is no longer updating

Freecode (formerly Freshmeat), the directory of Free and Open Source Software Projects, is no longer accepting new submissions. As of June 18, their site has this message on top:

The last full scrape of the Freecode RDF files took place in March. The data for the March 2014 Freecode collection is available for download from the FLOSSmole FC data site (or in the MySQL database).

RIP Freecode, née Freshmeat.
