Lots of new data for you to peruse out on our FLOSSmole Data Downloads Page.
Here's what's out there, recently added:
Google Code, March 2010 (GC) - list of all GC projects donated by Audris Mockus (HUGE THANK YOU TO AUDRIS FOR THIS!!)
Freshmeat, February 2010 (FM)
Objectweb, February 2010 (OW)
Rubyforge, February 2010 (RF)
Github, February 2010 (GH)
Free Software Foundation, February 2010 (FSF)
Savannah, February 2010 (SV)
and Sourceforge from December 2009 (SF)
We have another set of bugs to fix with Sourceforge collection this year, 2010, but those are forthcoming. I'm running a collection now. Hopefully the data will be good. We may even have stats this time. Hallelujah.
Also, thanks to my phenomenal undergraduate superstar Steven Norris, Tigris is coming soon!! and Debian after that. We are rocking the repository collection...
After long delay, the December Sourceforge data has been released. You may recall that over summer 2009, SF redesigned their web site which broke many of our crawlers and all of our parsers.
We have re-written these, and with only a few exceptions, have pretty much the same data as we always had.
Here are some release notes:
1. The Datasource_id=206
2. Donors data is not available in the Dec 2009 release. Donors were moved to their own page, so we have to add this to the collection for next time.
3. Statistics data is not available in the Dec 2009 release. We accidentally collected the wrong stats pages, so we had to throw these out and re-write for next time.
4. Status data (alpha, beta, mature, etc) is not available in the Dec 2009 release. This information is still being collected and kept by SF, but we can't find where it's being reported on their web site. If you have any ideas, send them to the mailing list (ossmole-discuss@lists.sourceforge.net).
Files are located at our Google Code page: http://code.google.com/p/flossmole/downloads/list
For those of you with database access on the sdsc server, I'll get these files over there ASAP.
Description
This chart shows the top programming languages used by projects in Sourceforge.
Visualization

SQL Script
SELECT ppl.description, count(DISTINCT ppl.proj_unixname) AS lang
FROM project_programming_language ppl
WHERE ppl.datasource_id = <current>
GROUP BY ppl.description
ORDER BY lang DESC;
Description
This chart shows the top operating systems used by projects in Sourceforge.
Visualization

SQL Script
SELECT pop.description, count(DISTINCT pop.proj_unixname) AS system
FROM project_operating_system pop
WHERE pop.datasource_id = <current>
GROUP BY pop.description
ORDER BY system DESC;
Description
This chart shows the number of projects of each team size listed in Sourceforge.
Visualization
Projects listed as having NULL or 0 developers were disregarded (1432 and 1478 projects, respectively).

SQL Script
SELECT DISTINCT dev_count, count(DISTINCT proj_unixname) AS count
FROM projects
WHERE datasource_id = <current>
GROUP BY dev_count
ORDER BY count DESC , dev_count;
Description
This chart shows the number of new projects in each repository that use the words "Free" and "Open" in project names. (We ran the queries to make this chart in June. This means 2009 was not yet completed, so this explains the apparent drop-off for the 2009 numbers.)
Visualization

SQL Script
Sourceforge:
SELECT year(date_registered) , count(DISTINCT proj_unixname) FROM projects
WHERE proj_unixname LIKE "%free%"
AND datasource_id = <current>
GROUP BY year(date_registered)
ORDER BY year(date_registered) ;
SELECT year(date_registered) , count(DISTINCT proj_unixname) FROM projects
WHERE proj_unixname LIKE "%open%"
AND datasource_id = <current>
GROUP BY year(date_registered)
ORDER BY year(date_registered) ;
Freshmeat:
SELECT year(date_added), count(DISTINCT project_id) FROM fm_projects
WHERE projectname_full LIKE "%free%"
AND datasource_id = <current>
GROUP BY year(date_added)
ORDER BY year(date_added);
SELECT year(date_added), count(DISTINCT project_id) FROM fm_projects
WHERE projectname_full LIKE "%open%"
Description
This chart shows the number of projects at each repository that share URL's.
Visualization
Number of Projects at each Repository that List a Home Page at Another Repository
Description
This chart shows the number of projects at each repository that share project names.
Visualization
Number of Projects at each Repository that Share an Identical Short Project Name


This graph shows the number of short project names shared in common between each pair of projects. For instance, starfish is a project listed on both Sourceforge and Rubyforge. On Rubyforge, it is described as a "tool to make programming ridiculously easy", but on Sourceforge the starfish project is described as a password management application. There are 1367 projects with shared names on Rubyforge and Sourceforge.
For more information on matching project names and URLs, see:
Squire, M. (2009). Integrating projects from multiple open source code forges. International Journal of Open Source Software & Processes, 1(1). January-March 2009. pp. 46-57.
Description
This chart shows the number of NEW projects added to each repository by year.
Visualization

SQL Script
Sourceforge:
SELECT year(date_registered) , count(DISTINCT proj_unixname) FROM projects
WHERE datasource_id = <current>
GROUP BY year(date_registered)
ORDER BY year(date_registered);
Freshmeat:
SELECT year(date_added), count(DISTINCT project_id) FROM fm_projects
WHERE datasource_id= <current>
GROUP BY year(date_added)
ORDER BY year(date_added);
Rubyforge:
SELECT year(date_registered), count(DISTINCT proj_unixname) FROM rf_projects
WHERE datasource_id= <current>
GROUP BY year(date_registered)
ORDER BY year(date_registered);
Objectweb:
SELECT year(date_registered), count(DISTINCT proj_unixname) FROM ow_projects
WHERE datasource_id= <current>
GROUP BY year(date_registered)
ORDER BY year(date_registered);
Description
This chart shows the number of projects that FLOSSmole most recently collected from each repository.
Visualization

SQL Script
SELECT count(DISTINCT proj_unixname) FROM projects
WHERE datasource_id= <current>;
SELECT count(DISTINCT project_id) FROM fm_projects
WHERE datasource_id= <current>;
SELECT count(DISTINCT proj_unixname) FROM rf_projects
WHERE datasource_id= <current>;
SELECT count(DISTINCT proj_unixname) FROM ow_projects
WHERE datasource_id= <current>;
SELECT count(DISTINCT proj_num) FROM fsf_projects
WHERE datasource_id= <current>;