Details about the repository collections
Submitted by megan on June 1, 2015 - 11:21am
Today my new research assistant Gavan & I are performing some maintenance tasks on the database, including a reorganization of the places where the data tables live. Hopefully this will mean that the data is much better organized.
Here is the GitHub summary of what we are doing; a brief summary follows below.
We will leave old copies of the most popular tables for a few days, in order to give everyone time to rework scripts, etc.
Submitted by megan on May 29, 2015 - 10:04pm
Back in the 2000s, the GNU Enterprise (GNUe) project chat logs (and human-created chat log summaries!) were used by several papers in the area of text summarization, especially dialogue summarization.
The reason the GNUe chat logs and summaries were used is that the logs were accompanied by summaries that were compiled periodically (manually) by a human. The summarized chat logs can thus be considered a kind of "gold standard" for what kind of summary a machine summarizer should produce.
Submitted by megan on April 8, 2015 - 1:06pm
Hi moles! New IRC chat logs are now cleaned and stored in the irc database on the FLOSSmole MySQL server, thanks to Andrea Black, one of our intrepid FLOSSmole research assistants. This data is part of an overall IRC collection started by another student, Becca Gazda, last summer.
We now have the following IRC chat histories available:
Apache
--activemq
--aries
--camel
--cxf
--kalumet
--karaf
--servicemix
Submitted by megan on December 31, 2014 - 1:48pm
Submitted by megan on August 1, 2014 - 12:17pm
In my continuing quest to be organized, I've created a new schema to hold just the IRC log data. On the database server (access instructions here), there is a new schema called 'irc' and it includes (for now) Ubuntu logs, Django logs, 7 Apache projects, and the topic lines from Freenode for all channels with 3+ users.
Coming soon: email updates, including Linux Kernel Mailing List (LKML) and more IRC (WordPress, etc.).
Enjoy!
Submitted by megan on July 30, 2014 - 4:22pm
Freecode (formerly Freshmeat), the directory of Free and Open Source Software Projects, is no longer accepting new submissions. As of June 18, their site has this message on top:
The last full scrape of the Freecode RDF files took place in March. The data for the March 2014 Freecode collection is available for download from the FLOSSmole FC data site (or in the MySQL database).
RIP Freecode, née Freshmeat.
Submitted by megan on May 15, 2014 - 1:19pm
The last Rubyforge collection happened yesterday. The datasource_id = 12987. All the data is located on our file downloads site, or in the database (ossmole_merged schema, tables prefixed 'rf', use datasource_id=12987 in your SQL queries).
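For anyone scripting against this final collection, here is a minimal sketch of building such a query in Python. The schema name, the 'rf' prefix, and the datasource_id come from the post above; the function name and the example table name `rf_projects` are hypothetical, and the `%s` placeholder assumes a MySQL driver that uses the "format" parameter style (such as PyMySQL or mysqlclient).

```python
import re

# datasource_id of the final Rubyforge collection, as given in the post.
RUBYFORGE_DATASOURCE_ID = 12987


def rubyforge_query(table):
    """Return (sql, params) selecting the final collection from one 'rf' table."""
    # Table names cannot be bound as query parameters, so validate the name
    # before formatting it into the SQL string.
    if not re.fullmatch(r"rf[A-Za-z0-9_]*", table):
        raise ValueError(f"not a Rubyforge table name: {table!r}")
    sql = f"SELECT * FROM ossmole_merged.{table} WHERE datasource_id = %s"
    return sql, (RUBYFORGE_DATASOURCE_ID,)


if __name__ == "__main__":
    # "rf_projects" is a made-up table name used only for illustration.
    sql, params = rubyforge_query("rf_projects")
    print(sql, params)
```

The returned pair can be passed straight to a DB-API cursor, for example `cursor.execute(sql, params)`.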
RIP Rubyforge! We have been collecting from there for 10 years. Charts and graphs coming soon.
Submitted by megan on March 25, 2014 - 4:47pm
Django is a Python web framework. And of course it is an open source project. I have downloaded the entire collection of IRC logs for this project, starting with the first logs from 2011. The logs have been split into lines, parsed into fields (message, sender, time, date, etc.), and loaded into the ossmole_merged database on our live MySQL server in a table called django_irc.
Each datasource_id represents one day's log file. Right now we have datasource_id 8442-9435.
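As a rough sketch of how those ids can be used, the snippet below counts messages per daily log file. It assumes django_irc has a datasource_id column (implied above, but not confirmed here) and uses an in-memory SQLite table as a stand-in for the live MySQL server so that it runs anywhere. The sample rows and the `message` column name are made up for illustration.

```python
import sqlite3

# Range of datasource_ids quoted in the post; each id is one day's log file.
FIRST_ID, LAST_ID = 8442, 9435

# Count rows (messages) per datasource_id (day) within the range.
# SQLite uses "?" placeholders; MySQL drivers typically use "%s" instead.
QUERY = """
    SELECT datasource_id, COUNT(*) AS messages
    FROM django_irc
    WHERE datasource_id BETWEEN ? AND ?
    GROUP BY datasource_id
    ORDER BY datasource_id
"""


def messages_per_day(conn, first_id=FIRST_ID, last_id=LAST_ID):
    """Return a list of (datasource_id, message_count) tuples."""
    return conn.execute(QUERY, (first_id, last_id)).fetchall()


def make_demo_connection():
    """Build a tiny stand-in table; the real table lives on the MySQL server."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified schema: only datasource_id is taken from the post.
    conn.execute("CREATE TABLE django_irc (datasource_id INTEGER, message TEXT)")
    conn.executemany(
        "INSERT INTO django_irc VALUES (?, ?)",
        [(8442, "hello"), (8442, "hi"), (8443, "any ideas?"), (9999, "out of range")],
    )
    return conn


if __name__ == "__main__":
    for datasource_id, count in messages_per_day(make_demo_connection()):
        print(datasource_id, count)
```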
Submitted by megan on March 11, 2014 - 11:57am
Some new forge data, collected 04-Mar-2014, has been released.
The datasource_ids are as follows:
8079 - freecode
8080 - rubyforge
8081 - objectweb
8082 - savannah
8083 - tigris
8084 - alioth
IRC data:
8085 - 8134: Apache ServiceMix
8135 - 8185: Apache Camel
8186 - 8236: Apache ActiveMQ
8237 - 8287: Apache CXF
8288 - 8338: Apache Aries
8339 - 8389: Apache Kalumet
8390 - 8440: Apache Karaf
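If you need to work backwards from a datasource_id to its source, the list above can be turned into a small lookup. This is only a sketch: the ids and ranges are copied from this post, while the function and variable names are my own.

```python
# Forge collections released on 04-Mar-2014: one datasource_id each.
FORGE_IDS = {
    8079: "freecode",
    8080: "rubyforge",
    8081: "objectweb",
    8082: "savannah",
    8083: "tigris",
    8084: "alioth",
}

# IRC collections: inclusive (first_id, last_id) ranges of datasource_ids.
IRC_RANGES = [
    (8085, 8134, "Apache ServiceMix"),
    (8135, 8185, "Apache Camel"),
    (8186, 8236, "Apache ActiveMQ"),
    (8237, 8287, "Apache CXF"),
    (8288, 8338, "Apache Aries"),
    (8339, 8389, "Apache Kalumet"),
    (8390, 8440, "Apache Karaf"),
]


def source_for(datasource_id):
    """Return the source name for a datasource_id, or None if it is not listed."""
    if datasource_id in FORGE_IDS:
        return FORGE_IDS[datasource_id]
    for first_id, last_id, name in IRC_RANGES:
        if first_id <= datasource_id <= last_id:
            return name
    return None


if __name__ == "__main__":
    for example_id in (8079, 8135, 8440, 8441):
        print(example_id, source_for(example_id))
```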
Submitted by megan on January 22, 2014 - 10:44am
Hello moles! Happy January. Here are some fresh new data sources for your mining pleasure:
1. Freenode channel list and topics (all public channels with 3 or more users). The table is called "fn_irc_channels".
2. Apache ActiveMQ IRC logs (one datasource_id per day, one row per message).
3. Apache Aries IRC logs
4. Apache Camel IRC logs
5. Apache CXF IRC logs
6. Apache Karaf IRC logs
7. Apache Kalumet IRC logs
8. Apache ServiceMix IRC logs
Here is a sample of what the structure looks like for 2-8: