Direct DB access for FLOSSmole collection available
Hello moles,
I'm excited to give you all a heads up that the entire FLOSSmole database is now available directly via a MySQL server.
We have transferred the database to the NSF TeraGrid Data Central hosting site [1] (based at the San Diego Supercomputer Center). It's a bigger machine, professionally administered, and much better than anything we could offer ourselves. See below for the access procedure.
The process of transferring the database also let us prepare comprehensive datamarts for each datasource in the database. These are mysqldump files which can be used for local access to parts of the database; there are two per datasource, one containing the raw HTML pages and one, substantially smaller, containing just the parsed data points. They will be available shortly as an option for those who want to install a local copy of the DB. That said, we'd be very interested in hearing the reasons people find to do that: we'd like people to share useful transformations of the data, and the Data Central database should be pretty quick.
So now we have three great options for accessing the FLOSSmole data:
1. The traditional monthly flat files
2. Direct MySQL access to the full database at Data Central
3. Comprehensive datamarts for local access
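For option 3, once the dumps are posted, restoring the smaller parsed-data datamart locally should take just the standard MySQL client tools. A sketch, assuming a local MySQL install; the datamart filename is a placeholder, since the actual dump names will be announced when the files go up:

```shell
# Create a local database to hold one datamart.
mysqladmin -u root create flossmole_local

# Load the parsed-data dump (placeholder filename -- real names TBA).
mysql -u root flossmole_local < some_datasource_parsed.sql

# Sanity-check that the tables arrived.
mysql -u root flossmole_local -e "SHOW TABLES"
```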
Database access further info
------------------------------
In order to demonstrate usage to NSF and to monitor run-away queries (hey, I write them myself, often :), interested users need to contact the FLOSSmole project to request a personal username and password, which should not be shared. Other than that simple request, we're not introducing any new AUPs or conditions.
Initially, requesters should join the ossmole-discuss list and email their request to it, along with a preferred username. Using the list lets us review the process if traffic spikes. Turnaround should be no longer than a business day or two (we email the DB admin at Data Central with the request).
OTOH, when we (and hopefully you) publish workflows using the database, we would like them to work 'out of the box', without a potential user needing to request a user/pass first. To enable this, in addition to the full database we are in the process of creating a small demo database with very limited data (~20 rows in each table). Queries against it will run through a single, public, shared login, which we urge people to use when publishing their workflows; once potential users wish to go beyond the sample data, they can request their own user/pass and plug it into the workflow. We're still figuring out the best way to do this (finding 20 projects with total data coverage is actually quite hard :)
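One way a published workflow could handle the demo-login-then-personal-login handoff is to read a personal user/pass from the environment and fall back to the shared demo account when none is set. A minimal Python sketch; every hostname, database name, and credential below is a placeholder (the real values will be announced on the list), and the returned dict is what you'd pass to your MySQL client library of choice:

```python
import os

# All values here are placeholders -- the real hostname, database
# names, and shared demo credentials are not yet announced.
PUBLIC_DEMO = {
    "host": "datacentral.sdsc.edu",   # assumed hostname
    "user": "flossmole_demo",         # assumed shared demo account
    "passwd": "demo",                 # assumed demo password
    "db": "flossmole_demo",           # assumed name of the ~20-row sample DB
}

def connection_params():
    """Return MySQL connection parameters for a published workflow.

    Prefers a personal login taken from the environment; falls back
    to the shared demo account so the workflow runs out of the box.
    """
    user = os.environ.get("FLOSSMOLE_USER")
    if user:
        return {
            "host": PUBLIC_DEMO["host"],
            "user": user,
            "passwd": os.environ.get("FLOSSMOLE_PASS", ""),
            "db": "flossmole",        # assumed name of the full database
        }
    return dict(PUBLIC_DEMO)
```

With this pattern, a reader can run the published workflow immediately against the sample data, then export FLOSSMOLE_USER/FLOSSMOLE_PASS once their personal account arrives, with no edits to the script.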
Hopefully this improves the accessibility of the datasets, and it will likely surface more bugs, both from the migration [2] and within the dataset itself. Please file bugs and documentation requests in the Sourceforge Trackers, although discussing them on this list is always welcome as well.
So, have at it.
--J
[1]: http://datacentral.sdsc.edu/ . The machine (thor) is on Internet2 and has >80G of RAM.
[2]: If anyone wants to chat about ways to confirm data integrity while migrating 300GB+ databases with some very large tables, ping me :) I think we got it sorted, with the much-appreciated help of our Master's student, Vinay Venugopal.