Two new data sets

Hi moles! I've got two new datasets for you to play with. These aren't perfect, but they're the start of a new type of dataset for FLOSSmole!

(1) Apache Roles: This dataset stores information about people affiliated with all the subprojects of the Apache Software Foundation, the roles they hold, and the project on which they hold each role. Data sources include Apache web site pages, board meeting minutes, etc. (Pre-print on FLOSShub describing collection, curation, storage, sample queries)

Sample data:

Column            Sample Row
----------------  -----------------------------
Svn_id            jsmith
Real_name         John Smith
Project_name      Apache Axiom
Role_on_project   Committer
Organization      BigCorp
Email             jsmith@bigcorp.com
Web_site          http://www.apache.org/~jsmith
Datasource_id     367
Details           Appendix T

(2) Apache Twitter Screen Names: This dataset stores the verified Twitter screen names of people affiliated with Apache Software Foundation projects. It is useful for matching to emails or source code commits, or for use in tandem with the Apache Roles dataset above. (Pre-print on FLOSShub describing collection, curation, storage, sample queries)

Sample data:

Column               Sample Row
-------------------  --------------
Svn_id               jsmith
Twitter_screen_name  jsmith
Real_name            John Smith
Project_name         Apache Cayenne
Details
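
To illustrate how the two datasets might be used in tandem, here is a minimal sketch that joins roles to Twitter screen names on Svn_id. The table names (apache_roles, apache_twitter) are assumptions and may not match the real dump, the rows are made-up toy data (only jsmith comes from the samples above), and an in-memory SQLite database stands in for the MySQL server. Since the sample rows show the same person attached to different projects in each dataset, the join is on Svn_id only.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server.
# Table names are hypothetical; column names follow the sample data above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE apache_roles (
        svn_id TEXT, real_name TEXT, project_name TEXT,
        role_on_project TEXT, datasource_id INTEGER
    );
    CREATE TABLE apache_twitter (
        svn_id TEXT, twitter_screen_name TEXT, project_name TEXT
    );
    -- Toy rows: 'adoe' is invented for illustration only.
    INSERT INTO apache_roles VALUES
        ('jsmith', 'John Smith', 'Apache Axiom', 'Committer', 367),
        ('adoe', 'Alice Doe', 'Apache Axiom', 'PMC Member', 367);
    INSERT INTO apache_twitter VALUES
        ('jsmith', 'jsmith', 'Apache Cayenne');
""")


def roles_with_twitter(conn):
    """Return every role row with the person's screen name, or None if unknown."""
    return conn.execute("""
        SELECT r.svn_id, r.project_name, r.role_on_project,
               t.twitter_screen_name
        FROM apache_roles r
        LEFT JOIN (
            -- One row per person, even if they are listed under several projects.
            SELECT DISTINCT svn_id, twitter_screen_name
            FROM apache_twitter
        ) t ON t.svn_id = r.svn_id
        ORDER BY r.svn_id, r.project_name
    """).fetchall()


rows = roles_with_twitter(conn)
for row in rows:
    print(row)
```

The LEFT JOIN keeps people who have a role but no verified screen name, which is probably what you want when checking coverage. Switch to an INNER JOIN if you only care about matched accounts.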

Get the MySQL dumps from our FLOSSmole downloads page on Google Code or via direct database access.

Got a cool FLOSS-oriented dataset you want to share? If you wish to donate data to FLOSSmole, we can host it.

FLOSSdata move is complete

Hello moles! I have completed the move of all our data from the Teragrid over to FLOSSdata and it is ready to go.

There are some new things to know:

1. New schemas for 'old', 'sf' and 'udd' data. You're used to seeing those in the 'ossmole_merged' schema, but they've been moved out into their own schemas.

2. Reminder that the 'sf' data is quite old and we recommend that you use SRDA instead.

3. There are some new tables coming for Apache data. More information on these will be forthcoming.

4. GitHub was updated with 3.5 million new repos, which means a lot of possibilities for those of you mining GH! The datasource_id for those is 359.

5. If you still need a username / password on the new system, let me know (megansquire at gmail) and I will set you up.

New MySQL user accounts

Hi moles,

This message is for those of you who are actively using direct database access on the MySQL server at Teragrid/SDSC.

Our direct access database is being moved off Teragrid at SDSC and onto our own servers at flossdata.syr.edu.

To accomplish this move, I'm going to need to re-create our active user accounts on MySQL.

(This is a good opportunity to clean house anyway since there were a number of user accounts on there for people who have moved on or are no longer active.)

Please respond to this email (send to megansquire) IF you still require direct database access, and let me know the username you were using, if possible. I will recreate your user account and let you know the connection details.

Remember the flat files are still available at: http://code.google.com/p/flossmole/downloads/list and no login is required to use these.

Google Code data released for Nov 2012

The datasource ID is 350, and we've got a fresh run of Google Code project names, thanks to Audris. The People collector is also fixed now, so we're getting both the userid and the user name where available. Note the new column in the people table.

Go here to get the data files

November 2012 data released

The November 2012 data has been released to Google Code. Go get it!

Note: we are in the process of copying from the Teragrid, so we will not be putting any more data up there for the time being. More details on direct database access will be forthcoming.

September Launchpad data is released

Huge thank you to our former student and FLOSSmole alumnus Christian Funkhouser for rewriting the Launchpad collector to use the API.

Get the code!

The datasource_id for Launchpad in September is 342!

September 2012 data released

Data files have been released for September 2012. Go check it out on our Google Code downloads page or sign up for direct database access.

Special release notes:
--The Google Code, GitHub, and Launchpad collections are not included this month.
--Alioth is back, and Tigris is back, including emails.

August 2012 data released

Data files have been released for August 2012. Go check it out on our Google Code downloads page or sign up for direct database access.

Special release notes:
--The Google Code, GitHub, and Launchpad collections are not included this month.

Freecode New Project Registrations (1998-2011) and language tags

This chart shows the new project registrations for each year 1998-2011, and what programming language those projects were tagged with.

For example, 2003 was the peak year for new "C" project registrations on Freecode (then called Freshmeat).

(A couple of caveats about the data:
(1) A project could be tagged with one language when it is created and have that tag changed later. For example, a project may have been created as "C" and later switched its tag to "Java". This chart shows whatever the most current language tag is for each project. I have not calculated how often this happens, but I suspect it is not too common.
(2) The languages in the legend are ordered by the total number of times they appear as tags on any project.
(3) Not every project has a language tag.)

Here is the SQL code used to generate the data sets for this graph:

SELECT YEAR(p.date_added) AS year_added,
       COUNT(DISTINCT p.project_id) AS new_projects
FROM fm_projects p
INNER JOIN fm_project_tags t
    ON t.project_id = p.project_id
WHERE p.datasource_id = 316   -- datasource used for this chart
  AND t.datasource_id = 316
  AND t.tag_name = 'C'        -- swap in any language tag
GROUP BY 1
ORDER BY 1;

Substitute your current datasource_id and any valid programming language tag. I used the following:

C
Java
C++
PHP
Perl
Python
JavaScript

These are in order of total number of times they appeared in the sitewide tag list over time.
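
Rather than running the query once per language, the same counts could be pulled in a single pass by grouping on year and tag together. This is only a sketch under stated assumptions: it runs against an in-memory SQLite database with made-up toy rows, so it uses SQLite's strftime('%Y', ...) where the MySQL query above uses YEAR(), and the helper name yearly_counts is hypothetical. The table and column names are taken from the query above.

```python
import sqlite3

# The seven language tags charted in this post.
LANGUAGES = ["C", "Java", "C++", "PHP", "Perl", "Python", "JavaScript"]


def yearly_counts(conn, datasource_id, languages):
    """Count distinct new projects per (year, language tag) for one datasource."""
    placeholders = ", ".join("?" for _ in languages)
    sql = f"""
        SELECT CAST(strftime('%Y', p.date_added) AS INTEGER) AS year_added,
               t.tag_name,
               COUNT(DISTINCT p.project_id) AS new_projects
        FROM fm_projects p
        INNER JOIN fm_project_tags t
            ON t.project_id = p.project_id
           AND t.datasource_id = p.datasource_id
        WHERE p.datasource_id = ?
          AND t.tag_name IN ({placeholders})
        GROUP BY year_added, t.tag_name
        ORDER BY year_added, t.tag_name
    """
    return conn.execute(sql, [datasource_id, *languages]).fetchall()


# Toy data for illustration only; these are not real Freecode rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fm_projects (project_id INTEGER, datasource_id INTEGER, date_added TEXT);
    CREATE TABLE fm_project_tags (project_id INTEGER, datasource_id INTEGER, tag_name TEXT);
    INSERT INTO fm_projects VALUES
        (1, 316, '2003-05-01'), (2, 316, '2003-07-15'),
        (3, 316, '2004-01-10'), (4, 300, '2003-02-02');
    INSERT INTO fm_project_tags VALUES
        (1, 316, 'C'), (2, 316, 'C'), (2, 316, 'Java'),
        (3, 316, 'Java'), (3, 316, 'GPL'), (4, 300, 'C');
""")

counts = yearly_counts(conn, 316, LANGUAGES)
for year_added, tag_name, new_projects in counts:
    print(year_added, tag_name, new_projects)
```

A project tagged with two of the listed languages is counted once under each tag, which matches what running the original query separately per language would give.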

Other top languages, in order of frequency the tag is used:

Unix Shell
XML
SQL
HTML
Ruby
Tcl
