megan's blog

RubyGems data updated June 2016


Hello moles, the latest RubyGems data has been collected. We now have two RubyGems collections:

  • 61240: November 2015
  • 61243: June 2016

The data can be found in two places:

Tables include:

  • rubygems_project_authors (the author(s) listed for each gem/project)
  • rubygems_project_create_dates (earliest known release date for a gem/project)
  • rubygems_project_devdep (development dependencies for each gem/project)
  • rubygems_project_facts (basic project metadata scraped from project page)
  • rubygems_project_links (the list of links provided by each gem/project, such as home page and documentation)
  • rubygems_project_owners (the owner(s) listed for each gem/project)
  • rubygems_project_pages (the html and rss where we got this data; one per gem, per datasource_id)
  • rubygems_project_rtdep (runtime dependencies for each gem/project)
  • rubygems_project_versions (each time the gem/project is released, it creates a new version)

Bitcoin-dev, Ubuntu, Perl6, Django, Puppet IRC logs are updated

Thanks to the work of my two summer research assistants Evan Ashwell & Greg Batchelor, the IRC channels for #bitcoin-dev, #perl6, #ubuntu, #django, and puppet (#gen, #dev, and #razor) have been updated.

Things to know:

  • These IRC chats are only available on the FLOSSmole MySQL database server (how to get access) and not as flat files. Why? Well, they started out as flat files, so we don't want to just re-host flat archives. The original flat files are available for Puppet (puppetlogs.com), Bitcoin-dev (bitcoinstats.com), Ubuntu (Ubuntu Logs), Perl6 (Perl6 logs), and Django (Django IRC logs).
  • The data model is one day = one datasource id
  • The chat logs have been divided into the following columns (some logs have fewer columns):
    • datasource_id
    • line_num
    • line_message
    • type
    • send_user
    • date_of_entry
    • time_of_entry
    • unix_time
    • last_updated
  • An example row looks like the following:
    • 61835
    • 42
    • ah thanks. I'll search.
    • message
    • arubi
    • 2016-05-28
    • 21:38:00
    • 1464471492.0
    • 2016-06-02 13:03:3
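
As a rough sketch of how the column list and the example row fit together, the Python below pairs each column with its example value and checks the epoch timestamp against the date and time fields. The class name IrcLine is hypothetical, reading unix_time as seconds since the epoch in UTC is an assumption, and last_updated is left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class IrcLine:
    """One chat log row (hypothetical class; field names follow the column list)."""
    datasource_id: int
    line_num: int
    line_message: str
    type: str
    send_user: str
    date_of_entry: str
    time_of_entry: str
    unix_time: float


# Values copied from the example row.
row = IrcLine(61835, 42, "ah thanks. I'll search.", "message", "arubi",
              "2016-05-28", "21:38:00", 1464471492.0)

# Assumption: unix_time counts seconds since the epoch, interpreted as UTC.
stamp = datetime.fromtimestamp(row.unix_time, tz=timezone.utc)

# time_of_entry appears to be recorded to the minute, so compare at that precision.
same_day = stamp.strftime("%Y-%m-%d") == row.date_of_entry
same_minute = stamp.strftime("%H:%M") == row.time_of_entry[:5]
print(stamp.isoformat(), same_day, same_minute)
```

Under that UTC assumption the three time columns agree for this row, so unix_time is probably the easiest column to sort or bucket on.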

All paper metadata from OSS 2016 are in FLOSSpapers.org

All metadata & citations for papers from OSS 2016 have been uploaded to FLOSSpapers.org. As I find pre-prints available online I will add those too. If you are an author and you wish to have FLOSSpapers host your pre-print, just email it to me (msquire at elon dot edu) and I'll add it to the site!

Mining Software Repositories '16 paper, slides & data

This weekend I'll be presenting at Mining Software Repositories 2016 in Austin, TX. My talk is in the data sets track, and it is entitled Data Sets: The Circle of Life in Ruby Hosting, 2003-2015 (PDF). Here are the slides. And here are the quick links to the flat data: RF and RG.

RubyGems.org collection, Nov 2015

We have added RubyGems.org data under datasource_id 61240. RubyGems.org is the official gem host for Ruby projects.

The scripts we used to collect this data are available on GitHub and the SQL dumps are available on our data server. Direct database access is also available. Existing database users were given access to this new database on the MySQL server, called 'Rubygems'.

OSS 2016 The 12th International Conference on Open Source Systems

The International Conference on Open Source Systems (OSS) is a long-standing international forum for researchers, practitioners from business and industry, enthusiasts, and students to present and discuss the latest trends, experiences, and concerns in the field of Free/Libre Open Source Software.

The 12th OSS Conference will take place in the city of Gothenburg, 30 May - 02 June 2016.

Call for contributions

IFIP WG 2.13, University of Gothenburg, Chalmers University of Technology

August 2015 Launchpad data

We have added Launchpad data under datasource_id 58458. Launchpad is a repository for projects affiliated with Ubuntu. Summer research assistant Gavan Roth wrote some scripts to collect this data.

--Download the flat files, or
--Access and query this data via the MySQL interface

Here is a query to show some of the data that is available:

SELECT programming_language, COUNT(*)
FROM `lpd_programming_languages`
WHERE datasource_id = 58458
GROUP BY 1
ORDER BY 2 DESC;
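
If you want to try the shape of that query before you have database access, here is a small sketch that runs an equivalent query against an in-memory SQLite table from Python. The rows are made up for illustration and are not FLOSSmole data, only the two columns named in the query are modelled, and SQLite stands in for the MySQL server. The query spells out the column names instead of using the positional GROUP BY 1 / ORDER BY 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lpd_programming_languages ("
    "datasource_id INTEGER, programming_language TEXT)"
)

# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO lpd_programming_languages VALUES (?, ?)",
    [
        (58458, "Python"),
        (58458, "Python"),
        (58458, "Python"),
        (58458, "C"),
        (58458, "C"),
        (58458, "Java"),
        (11111, "Python"),  # different datasource_id, filtered out below
    ],
)

# Same idea as the query above: count rows per language for one collection.
rows = conn.execute(
    "SELECT programming_language, COUNT(*) AS n "
    "FROM lpd_programming_languages "
    "WHERE datasource_id = ? "
    "GROUP BY programming_language "
    "ORDER BY n DESC",
    (58458,),
).fetchall()

for language, n in rows:
    print(language, n)
conn.close()
```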

New papers added to FLOSShub

I've uploaded metadata (and when available, actual papers) for some relevant 2015 open source conferences (including the 2015 OSS conference, HICSS, and Mining Software Repositories) to FLOSShub/biblio.

There are now 1589 papers on FLOSShub/biblio. It makes a nice addition/backup/source for Google Scholar and the other larger publishing sites.

Please send information about your own papers if you'd like them listed. I will be happy to take any citation - as long as it is about open source software - and list it. You can send a DOI, BibTeX, or just a regular citation in any format.

Database reorganization

Today my new research assistant Gavan & I are performing some maintenance tasks on the database, including a reorganization of the places where the data tables live. Hopefully this will mean that the data is much better organized.

Here is the GitHub summary of what we are doing; a brief summary follows.

We will leave old copies of the most popular tables for a few days, in order to give everyone time to rework scripts, etc.

Renamed:
sf database to 'sourceforge' (remember this is only pre-2009 metadata)

Moved:
From the OSSMOLE_MERGED schema:
al_* tables to 'alioth' schema
fm_* tables to 'freecode' schema
fsf_* tables to 'free_software_foundation' schema
gh_* tables to 'github' schema
gc_* tables to 'google_code' schema
lpd_* tables to 'launchpad' schema
ow_* tables to 'objectweb' schema
rf_* tables to 'rubyforge' schema
sv_* tables to 'savannah' schema
tig_* tables to 'tigris' schema
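
If you are reworking scripts, a lookup like the sketch below may help. It maps each table prefix in the Moved list to its new schema. The function name qualify is hypothetical, and the sketch assumes tables keep their original names inside the new schema, so check against the live database before relying on it.

```python
# Prefix-to-schema mapping copied from the "Moved" list.
SCHEMA_BY_PREFIX = {
    "al_": "alioth",
    "fm_": "freecode",
    "fsf_": "free_software_foundation",
    "gh_": "github",
    "gc_": "google_code",
    "lpd_": "launchpad",
    "ow_": "objectweb",
    "rf_": "rubyforge",
    "sv_": "savannah",
    "tig_": "tigris",
}


def qualify(table_name: str) -> str:
    """Return 'schema.table' for a moved table, or the name unchanged if no prefix matches.

    Assumption: table names themselves did not change in the move.
    """
    for prefix, schema in SCHEMA_BY_PREFIX.items():
        if table_name.startswith(prefix):
            return f"{schema}.{table_name}"
    return table_name


print(qualify("lpd_programming_languages"))
```

Names with no known prefix pass through untouched, so the helper can be run over every table name in an old script.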

Software Archaeology: GNUe IRC data & summaries

Back in the 2000s, the GNU Enterprise (GNUe) project chat logs (and human-created chat log summaries!) were used by several papers in the area of text summarization, especially dialogue summarization.

The reason the GNUe chat logs and summaries were used is that the logs were accompanied by summaries that were compiled periodically (manually) by a human. The summarized chat logs can thus be considered a kind of "gold standard" for what kind of summary a machine summarizer should produce.

Here are some papers that reference the GNUe chat logs or the summaries:
--Zhou & Hovy (2005) Digesting virtual "geek" culture: the summarization of technical internet relay chats
--Elliott & Scacchi (2007) Free Software Development: Cooperation and Conflict in a Virtual Organizational Culture
--Uthus & Aha. Multiparticipant chat analysis: A survey
--Sood, Mohamed, & Varma. Topic-focused summarization of chat conversations

Unfortunately, the group that put together the summaries ("Kernel Traffic") no longer has a web presence, and the summaries and original log files are no longer available at any of the locations those papers link to.

FLOSSmole to the rescue! Here are the files that have been reconstructed, using what we could find on the Wayback Machine (archive.org):
-1-original chat logs for GNUe
-2-original Kernel Traffic GNUe chat summaries
-3-text list of URLs for chat logs, taken from Archive.org (used to build #1 above)
-4-text list of URLs for XML summaries of the chat logs, taken from Archive.org (used to build #2 above)
-5-all source code for how I collected and parsed this data on GitHub
-6-all data loaded into the FLOSSmole database interface on MySQL server (get your username and password)

In the database (#6 in the list above), there are 6 tables:
--GNUeIRCLogs (the log files themselves)
--GNUeSummaryItems (the text & metadata for the weekly summary)
--GNUeSummaryMentions (all the people mentioned in each summary)
--GNUeSummaryPara (the paragraph summary text, links removed)
--GNUeSummaryParaQuote (the quoted text from the logs that made it into the summary itself as quoted text)
--GNUeSummaryTopic (the topic that the summarizer classified each summary into)
