Software Archaeology: GNUe IRC data & summaries
Back in the 2000s, the GNU Enterprise (GNUe) project's chat logs (and human-created chat log summaries!) were used in several papers on text summarization, especially dialogue summarization.
The GNUe logs were attractive because they came with summaries that a human compiled by hand on a regular schedule. Those summaries can thus be treated as a kind of "gold standard" for what a machine summarizer should produce.
Here are some papers that reference the GNUe chat logs or the summaries:
--Zhou & Hovy (2005) Digesting virtual "geek" culture: the summarization of technical internet relay chats
--Elliott & Scacchi (2007) Free Software Development: Cooperation and Conflict in a Virtual Organizational Culture
--Uthus & Aha. Multiparticipant chat analysis: A survey
--Sood, Mohamed, & Varma. Topic-focused summarization of chat conversations
Unfortunately, the group that put together the summaries ("Kernel Traffic") no longer has a web presence, and the summaries and original log files are no longer available at any of the locations those papers link to.
FLOSSmole to the rescue! Here are the files that have been reconstructed, using what we could find on the Wayback Machine (archive.org):
-1-original chat logs for GNUe
-2-original Kernel Traffic GNUe chat summaries
-3-text list of URLs for chat logs, taken from Archive.org (used to build #1 above)
-4-text list of URLs for XML summaries of the chat logs, taken from Archive.org (used to build #2 above)
-5-all source code I used to collect and parse this data, on GitHub
-6-all data, loaded into the FLOSSmole database on a MySQL server (you will need to get a username and password)
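If you want to work with the URL lists (#3 and #4) directly, a small helper like the one below can separate the capture date from the original address. This is only a sketch: it assumes the entries follow the usual Wayback Machine pattern of web.archive.org/web/<14-digit timestamp>/<original URL>, which may not hold for every line in the lists. The function name and the example.org address are made up for illustration.

```python
import re
from datetime import datetime

# Usual Wayback Machine capture URL: /web/<YYYYMMDDhhmmss>[<flag>_]/<original URL>
# The optional two-letter flag (for example "id_") selects a variant of the capture.
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(\d{14})(?:[a-z]{2}_)?/(.+)$"
)


def split_wayback_url(url):
    """Return (capture_time, original_url), or None if the URL does not match.

    Hypothetical helper: assumes the URL uses the standard capture pattern.
    """
    match = WAYBACK_RE.match(url.strip())
    if match is None:
        return None
    capture_time = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return capture_time, match.group(2)


if __name__ == "__main__":
    # Placeholder address, not a real entry from the GNUe URL lists.
    example = "http://web.archive.org/web/20060101000000/http://example.org/gnue/log.txt"
    print(split_wayback_url(example))
```

Lines that do not match come back as None, so they can be set aside and checked by hand rather than silently dropped.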
In the database (#6 in the list above), there are 6 tables:
--GNUeIRCLogs (the log files themselves)
--GNUeSummaryItems (the text & metadata for each weekly summary)
--GNUeSummaryMentions (all the people mentioned in each summary)
--GNUeSummaryPara (the paragraph summary text, links removed)
--GNUeSummaryParaQuote (the text from the logs that is quoted verbatim in the summary)
--GNUeSummaryTopic (the topic that the summarizer assigned to each summary)
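Once you have a connection, a quick sanity check is to count the rows in each of the six tables. The sketch below only relies on the table names listed above; it makes no assumptions about column names, which are not described here. It accepts any Python DB-API 2.0 cursor, so it should work with whichever MySQL driver you use. The in-memory SQLite database in the demo is just a stand-in with empty placeholder tables, not the real FLOSSmole schema.

```python
import sqlite3

# Table names exactly as listed in this post.
GNUE_TABLES = [
    "GNUeIRCLogs",
    "GNUeSummaryItems",
    "GNUeSummaryMentions",
    "GNUeSummaryPara",
    "GNUeSummaryParaQuote",
    "GNUeSummaryTopic",
]


def count_rows(cursor, tables=GNUE_TABLES):
    """Return {table_name: row_count} using any DB-API 2.0 cursor."""
    counts = {}
    for name in tables:
        # Names come from the fixed list above, never from user input,
        # so interpolating them into the SQL string is acceptable here.
        cursor.execute(f"SELECT COUNT(*) FROM {name}")
        counts[name] = cursor.fetchone()[0]
    return counts


if __name__ == "__main__":
    # Stand-in database: empty tables with one placeholder column each.
    conn = sqlite3.connect(":memory:")
    for name in GNUE_TABLES:
        conn.execute(f"CREATE TABLE {name} (placeholder INTEGER)")
    for table, rows in count_rows(conn.cursor()).items():
        print(table, rows)
```

Against the real database you would pass a cursor from your MySQL connection instead of the SQLite one.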