Blogs

January data update

Freshmeat, Objectweb, Rubyforge, Savannah, FSF, Tigris - all done, waiting for release

Github collector is broken; I have a student working on fixing that now. Github suddenly made it hard to seed the initial project list, so we're trying to figure out a way to get the entire corpus of projects. I'm getting flashbacks of when SF (Sourceforge) got really big and made it difficult for everyone to work with.

Google Code is plugging along. We're on the 7th out of 8 collection processes. Won't be too much longer on that.

Debian data is being cleaned for release. Have a meeting with a student tomorrow to see the status of that cleaning but it should be released this weekend.

New features in the hopper: (1) auto-generating DOI information at the time of release so we won't be behind, (2) federated search - I know that is huge, but we'll see how it goes.

Teragrid backed up for December

Hello moles! All Sept-Dec data has been backed up to Teragrid. (I had to re-write the backup scripts because the ones we had no longer worked, plus they took forever anyway. The backup is much faster now so I will be able to do it more frequently.)

You should see the new datasources in there (228-235). This includes Google Code, which is HUGE.

Enjoy!

Google Code Sept 2010 data out

Whew! Google Code is collected, parsed, and released. Backup to Teragrid is happening now (just as soon as I solve this little issue of "disk quota exceeded" - fun!). In the meantime, go to the FLOSSmole Google Code Downloads Page and get your hot fresh data.

Remember, the files marked "datamarts" are SQL code you can use to make your own version of the database. The raw delimited files are marked .txt.bz2. The datasource_id for Google Code this release is 235.
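
If you'd rather peek at one of the raw .txt.bz2 files before loading anything into a database, something like the following might work. This is only a sketch: the function name read_delimited_bz2 is made up for illustration, and the tab delimiter and UTF-8 encoding are assumptions, so check the actual file and adjust.

```python
import bz2
import csv


def read_delimited_bz2(path, delimiter="\t", limit=None):
    """Return rows (lists of strings) from a bz2-compressed delimited text file.

    The delimiter and the UTF-8 encoding are assumptions, not something the
    release notes specify; undecodable bytes are replaced rather than raising.
    """
    rows = []
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary data.
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        for index, row in enumerate(reader):
            if limit is not None and index >= limit:
                break
            rows.append(row)
    return rows
```

Passing limit=5 is a cheap way to look at the first few rows of a very large file without reading the whole thing.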

September 2010 data trickle

Just released Github data for September today. This took about 10 days to collect, parse and release. You can download the files here (along with Freshmeat, Rubyforge, Objectweb, Free Software Foundation, Savannah, Tigris, etc), or wait for the next Teragrid backup if you want direct database access. (Google Code is collecting now, and upon completion of that, I'll do a final September Teragrid backup.)

Code backed up to Teragrid

Hello moles,

We have new data backed up to Teragrid. Here is what is included:

Google Code - metadata on projects, developers, issues, etc. Plus HTML.
Github - metadata. Plus XML.
Launchpad - metadata on projects, groups, developers, wiki. Plus HTML.
Tigris - discussions! messages! project metadata. Plus HTML.

Plus all the other forges you've known and loved for so many years: fm, ow, rf, fsf, sv, etc etc.

Remember to check the datasources table first to get the appropriate run number, as each table contains multiple "collections" from that forge.
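
Here is a rough sketch of that "look up the run first, then filter" pattern. To keep it runnable anywhere it uses an in-memory SQLite database as a stand-in for the real MySQL one, with made-up rows. The forge_name column on datasources and the project_name column are assumptions for illustration; only the datasources table and the datasource_id column on lp_projects come from the posts here.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory stand-in database (illustrative rows only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE datasources (datasource_id INTEGER PRIMARY KEY, forge_name TEXT);
        CREATE TABLE lp_projects (project_name TEXT, datasource_id INTEGER);
        INSERT INTO datasources VALUES (1, 'Launchpad'), (2, 'Launchpad'), (3, 'Github');
        INSERT INTO lp_projects VALUES ('alpha', 1), ('alpha', 2), ('beta', 2);
        """
    )
    return conn


def datasource_ids_for_forge(conn, forge_name):
    """List every collection run recorded for one forge, oldest id first."""
    rows = conn.execute(
        "SELECT datasource_id FROM datasources WHERE forge_name = ? ORDER BY datasource_id",
        (forge_name,),
    ).fetchall()
    return [row[0] for row in rows]


def count_projects(conn, datasource_id):
    """Count projects belonging to a single collection run."""
    row = conn.execute(
        "SELECT COUNT(*) FROM lp_projects WHERE datasource_id = ?",
        (datasource_id,),
    ).fetchone()
    return row[0]
```

Without the datasource_id filter, the same project would be counted once per collection it appears in ('alpha' shows up twice in the demo rows), which is why picking the run number first matters.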

If you need a username, follow the procedures here.

New collection starts today (September) and once that's finished (give it a few weeks) that data will be up in TG also.

Enjoy, and happy digging!

New Google Code Data Released

Hello moles! I've released a new set of Google Code project data to our own downloads page (on Google Code, no less!) - the datasource_id is 226.

This data took over a month to collect. Included are the following:

--project names (info)
--project license, code and content (info)
--project summary (info)
--project description (info)
--project activity level (info)
--who works on what project and what their role is (people)
--what blogs are listed for each project (blogs)
--what links are listed for each project (links)
--what labels are used to describe each project (labels)

Crawlers vs API

Interesting article by some folks at 80Legs about crawling the web versus using an API to gather data. On several occasions we've chosen to use an API rather than crawling. The article pretty much summarizes the limitations around that choice.

Launchpad data released for June 2010

Introducing a new data source: Launchpad data. In this collection, Launchpad has about 19k projects in it and about 34k developers.

mysql> select count(*) from lp_projects where datasource_id=227;
18956

mysql> select count(distinct dev_loginname) from lp_developer_indexes where datasource_id=227;
34051

Available on our Google Code downloads page: Launchpad data

Github data released for May 2010

Data has been released for Github for May 2010. It is on our FLOSSmole Google Code downloads page.

Database Schema updated

The database schema page here on flossmole.org has been updated. I've got a single-page PNG of the schema, and an MWB file for those of you who like MySQL Workbench.
