megan's blog

How do various corporations populate the Apache projects?

With the FLOSSmole Apache Project/Contributor/Roles data we updated earlier today, we thought an interesting initial analysis would be to figure out how various corporations populate the Apache projects (at least according to the official lists of contributors posted on each Apache project page).

Here is a list of the Apache projects with the highest density of participation by a single corporation:


We only show the first page of results here.

How did we get this result?

1. We used the Apache Contributor Data Set described in this previous FLOSSmole blog posting. Projects in the Apache family list their members, and some also list the company each person works for. Here is an example from the Geronimo Project.

Not every project lists its members, and not every project lists its members' affiliations.

2. We limited our analysis to datasource_id 65935 (the May 18, 2016 collection).

3. We created a view in SQL so that we could more easily calculate the percentage of the total number of developers for each project:

CREATE VIEW apache_project_dev_count_65935 AS
SELECT project_name, count(*) as 'devcount'
FROM apache_people_projects
WHERE real_name IN (
SELECT distinct(real_name) FROM `apache_people_projects` WHERE datasource_id=65935
ORDER BY `apache_people_projects`.`real_name`)
AND datasource_id=65935
GROUP BY 1 ORDER BY 1 ASC;

4. Then we ran this SQL query to generate the data shown in the table above. The rows are sorted by percentage, highest first.

-- decimal(5,2) leaves room for a value of 100.00
SELECT app.project_name, app.organization, count(*) as 'org dev count', app2.devcount as 'all devs', cast((count(*)/devcount)*100 as decimal(5,2)) as 'pct of team'
FROM apache_people_projects app
INNER JOIN apache_project_dev_count_65935 app2
ON app.project_name = app2.project_name
WHERE app.real_name
IN (SELECT distinct(real_name) FROM `apache_people_projects` where datasource_id=65935)
AND app.organization IS NOT NULL
AND app.organization !=""
AND app.datasource_id=65935
GROUP BY 1, 2
ORDER BY 5 DESC, 1 asc;
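
If you don't have access to the MySQL server, here is a minimal sketch of the same calculation in Python using an in-memory SQLite database. It is only an illustration: the rows are made up, the view is inlined as a subquery, the function name team_share is hypothetical, and only the columns the query needs are created.

```python
import sqlite3

# Made-up rows for illustration only:
# (project_name, real_name, organization, datasource_id)
ROWS = [
    ("alpha", "Ann", "Acme", 65935),
    ("alpha", "Bob", "Acme", 65935),
    ("alpha", "Cy", "", 65935),        # empty organization, skipped
    ("alpha", "Di", None, 65935),      # NULL organization, skipped
    ("beta", "Eve", "Initech", 65935),
    ("beta", "Fay", "Initech", 65935),
    ("beta", "Gus", "Initech", 65935),
    ("beta", "Hal", "Acme", 65935),
    ("beta", "Old", "Acme", 1),        # different collection, ignored
]


def team_share(rows, datasource_id=65935):
    """Return (project, organization, org devs, all devs, pct of team) rows."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE apache_people_projects ("
        "project_name TEXT, real_name TEXT, organization TEXT, "
        "datasource_id INTEGER)"
    )
    con.executemany("INSERT INTO apache_people_projects VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT app.project_name, app.organization, COUNT(*) AS org_devs,
               totals.devcount AS all_devs,
               ROUND(100.0 * COUNT(*) / totals.devcount, 2) AS pct_of_team
        FROM apache_people_projects app
        JOIN (SELECT project_name, COUNT(*) AS devcount
              FROM apache_people_projects
              WHERE datasource_id = ?
              GROUP BY project_name) totals
          ON app.project_name = totals.project_name
        WHERE app.datasource_id = ?
          AND app.organization IS NOT NULL
          AND app.organization != ''
        GROUP BY app.project_name, app.organization
        ORDER BY pct_of_team DESC, app.project_name ASC
    """
    result = con.execute(query, (datasource_id, datasource_id)).fetchall()
    con.close()
    return result


for row in team_share(ROWS):
    print(row)
```

Note that the total developer count includes people with no listed organization, so the percentages for one project do not have to add up to 100.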

Interested in getting this data? Apache Contributor Data Set

Want to see more examples of how to use FLOSSmole data? Examples

New "Apache Projects & Contributors" data dump

I spent a few days in May updating the list of all the Apache project contributors (full name & Apache system name when available) and their organizations when available. This data set was first released in 2013 in the MSR paper entitled "Project Roles in the Apache Foundation: A Data Set".

Fields:

  • svn_id
  • real_name
  • web_site
  • datasource_id
  • project_name
  • role_on_project
  • details
  • email
  • organization
  • timezone
  • last_updated

Here is a sample of what the data looks like:



Most of the fields are nullable because the data is often incomplete.

Download the flat file, or use the live database ('apache') on the FLOSSmole MySQL server.

RubyGems data updated June 2016


Hello moles, the latest RubyGems data has been collected. We now have two RubyGems collections:

  • 61240: November 2015
  • 61243: June 2016

The data can be found in two places:

Tables include:

  • rubygems_project_authors (the author(s) listed for each gem/project)
  • rubygems_project_create_dates (earliest known release date for a gem/project)
  • rubygems_project_devdep (development dependencies for each gem/ project)
  • rubygems_project_facts (basic project metadata scraped from project page)
  • rubygems_project_links (the list of links provided by each gem/project, ex: home page, documentation, etc)
  • rubygems_project_owners (the owner(s) listed for each gem/project)
  • rubygems_project_pages (the html and rss where we got this data; one per gem, per datasource_id)
  • rubygems_project_rtdep (runtime dependencies for each gem/project)
  • rubygems_project_versions (each time the gem/project is released, it creates a new version)

Bitcoin-dev, Ubuntu, Perl6, Django, Puppet IRC logs are updated

Thanks to the work of my two summer research assistants Evan Ashwell & Greg Batchelor, the IRC channels for #bitcoin-dev, perl6, #ubuntu, #django, and puppet (#gen, #dev, and #razor) have been updated.

Things to know:

  • These IRC chats are only available on the FLOSSmole MySQL database server (how to get access) and not as flat files. Why? Well, they started out as flat files, so we don't want to just re-host flat archives. The original flat files are available for Puppet (puppetlogs.com), Bitcoin-dev (bitcoinstats.com), Ubuntu (Ubuntu Logs), Perl6 (Perl6 logs), and Django (Django IRC logs).
  • The data model is one day = one datasource id
  • The chat logs have been divided into the following columns (some logs have fewer columns):
    • datasource_id
    • line_num
    • line_message
    • type
    • send_user
    • date_of_entry
    • time_of_entry
    • unix_time
    • last_updated
  • An example row looks like the following:
    • 61835
    • 42
    • ah thanks. I'll search.
    • message
    • arubi
    • 2016-05-28
    • 21:38:00
    • 1464471492.0
    • 2016-06-02 13:03:3
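
If you want to sanity-check the time columns against each other, here is a rough Python sketch using the numeric part of the example row's unix_time. It assumes unix_time is seconds since the epoch in UTC and that date_of_entry and time_of_entry are also UTC, which may not hold for every channel. The function name and the 60-second tolerance are hypothetical choices for this example.

```python
from datetime import datetime, timezone


def entry_matches_unix_time(date_of_entry, time_of_entry, unix_time,
                            tolerance_seconds=60):
    """Return True if the date/time columns agree with unix_time (UTC assumed)."""
    logged = datetime.strptime(
        f"{date_of_entry} {time_of_entry}", "%Y-%m-%d %H:%M:%S"
    ).replace(tzinfo=timezone.utc)
    stamped = datetime.fromtimestamp(float(unix_time), tz=timezone.utc)
    # Logs that only record the minute can be off by up to 59 seconds.
    return abs((stamped - logged).total_seconds()) <= tolerance_seconds


# Values taken from the example row above.
stamp = datetime.fromtimestamp(1464471492.0, tz=timezone.utc)
print(stamp.strftime("%Y-%m-%d %H:%M:%S"))
print(entry_matches_unix_time("2016-05-28", "21:38:00", 1464471492.0))
```

For the example row, the converted timestamp falls within the same minute as time_of_entry, so the check passes under the UTC assumption.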

All paper metadata from OSS 2016 are in FLOSSpapers.org

All metadata & citations for papers from OSS 2016 have been uploaded to FLOSSpapers.org. As I find pre-prints available online I will add those too. If you are an author and you wish to have FLOSSpapers host your pre-print, just email it to me (msquire at elon dot edu) and I'll add it to the site!

Mining Software Repositories '16 paper, slides & data

This weekend I'll be presenting at Mining Software Repositories 2016 in Austin, TX. My talk is in the data sets track, and it is entitled Data Sets: The Circle of Life in Ruby Hosting, 2003-2015 (PDF). Here are the slides. And here are the quick links to the flat data: RF and RG.

RubyGems.org collection, Nov 2015

We have added RubyGems.org data under datasource_id 61240. RubyGems.org is the official gem host for Ruby projects.

The scripts we used to collect this data are available on GitHub and the SQL dumps are available on our data server. Direct database access is also available. Existing database users were given access to this new database on the MySQL server, called 'Rubygems'.

OSS 2016 The 12th International Conference on Open Source Systems

The International Conference on Open Source Systems (OSS) is a long-standing international forum for researchers, practitioners from business and industry, enthusiasts, and students to present and discuss the latest trends, experiences, and concerns in the field of Free/Libre Open Source Software.

The 12th OSS Conference will take place in the city of Gothenburg, 30 May - 02 June 2016.

Call for contributions

IFIP WG 2.13, University of Gothenburg, Chalmers University of Technology

August 2015 Launchpad data

We have added Launchpad data under datasource_id 58458. Launchpad is a repository for projects affiliated with Ubuntu. Summer research assistant Gavan Roth wrote some scripts to collect this data.

--Download the flat files, or
--Access and query this data via the MySQL interface

Here is a query to show some of the data that is available:

SELECT programming_language, COUNT( * )
FROM `lpd_programming_languages`
WHERE datasource_id =58458
GROUP BY 1
ORDER BY 2 DESC ;
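
If you grabbed the flat files instead, the same tally can be sketched in a few lines of Python. This assumes you have already parsed the file into (programming_language, datasource_id) pairs; the rows below are made up, the flat file layout is not shown here, and the function name language_counts is hypothetical.

```python
from collections import Counter

# Made-up rows for illustration: (programming_language, datasource_id)
ROWS = [
    ("Python", 58458),
    ("Python", 58458),
    ("Python", 58458),
    ("C", 58458),
    ("C", 58458),
    ("Java", 58458),
    ("Python", 1),  # different collection, ignored
]


def language_counts(rows, datasource_id=58458):
    """Count rows per language for one datasource_id, most common first."""
    counts = Counter(
        language for language, source in rows if source == datasource_id
    )
    return counts.most_common()


for language, count in language_counts(ROWS):
    print(language, count)
```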

New papers added to FLOSShub

I've uploaded metadata (and when available, actual papers) for some relevant 2015 open source conferences (including the 2015 OSS conference, HICSS, and Mining Software Repositories) to FLOSShub/biblio.

There are now 1589 papers on FLOSShub/biblio. It makes a nice addition/backup/source for Google Scholar and the other larger publishing sites.

Please send information about your own papers if you'd like them listed. I will be happy to take any citation - as long as it is about open source software - and list it. You can send DOI, Bibtex, or just a regular citation in any format.
