December data released, some updates
December data has been released. But first, some informational updates.
1. Project_indexes (the table where Sourceforge web pages are stored) is now set to use UTF-8 encoding. This addresses the concerns about character sets and corruption in our storage of Sourceforge project home pages.
2. The "forges" table is now in use. We have 5 forges, as follows:
0 - TE - test
1 - SF - Sourceforge
2 - FM - Freshmeat
3 - RF - Rubyforge
4 - OW - Objectweb
The datasources table now shows which forge the datasource is pulled from. The join column between the "forges" and "datasources" tables is "forge_id".
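For anyone who wants to see the join in action, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. Only "forge_id" and the forge and datasource numbers come from this post; the other column names (forge_abbr, forge_name) and the helper function are made up for illustration, and the real schema may well differ.

```python
import sqlite3

# In-memory stand-in for the real database. Only forge_id and the ids
# listed in this post are real; other column names are ASSUMPTIONS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE forges (
        forge_id   INTEGER PRIMARY KEY,
        forge_abbr TEXT,   -- hypothetical column name
        forge_name TEXT    -- hypothetical column name
    );
    CREATE TABLE datasources (
        datasource_id INTEGER PRIMARY KEY,
        forge_id      INTEGER REFERENCES forges(forge_id)
    );
""")
conn.executemany(
    "INSERT INTO forges VALUES (?, ?, ?)",
    [(0, "TE", "test"), (1, "SF", "Sourceforge"), (2, "FM", "Freshmeat"),
     (3, "RF", "Rubyforge"), (4, "OW", "Objectweb")],
)
# The December datasources and their forges, as listed in this post.
conn.executemany(
    "INSERT INTO datasources VALUES (?, ?)",
    [(38, 1), (39, 3), (40, 4), (41, 2)],
)


def forge_for_datasource(conn, datasource_id):
    """Return (abbreviation, name) of the forge a datasource came from."""
    return conn.execute(
        """
        SELECT f.forge_abbr, f.forge_name
        FROM datasources d
        JOIN forges f ON f.forge_id = d.forge_id
        WHERE d.datasource_id = ?
        """,
        (datasource_id,),
    ).fetchone()


print(forge_for_datasource(conn, 38))
```

The same SELECT ... JOIN should carry over to the real tables once the hypothetical column names are swapped for the actual ones.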
3. December data has been released for Sourceforge (SF), Objectweb (OW), Rubyforge (RF), and Freshmeat (FM). Get the files on Sourceforge as follows:
-- All projects, SF, FM, OW, RF link
-- Sourceforge, December link (Datasource_id = 38)
-- Freshmeat, December link (Datasource_id = 41)
-- Rubyforge, December link (Datasource_id = 39)
-- Objectweb, December link (Datasource_id = 40)
4. There was some trouble collecting the statistics pages from Sourceforge this time. It appears that the values for some of the fields in the HTML table on the statistics pages (e.g. "rank", "downloads", etc.) are generated dynamically from SF's own database. When a person (or a spider) pulls up a given stats page in the browser, there is a chance that the page will generate WITHOUT some of the actual stats filled in.
When that happens, a row looks something like this:
<tr bgcolor="#eaecef">
<td>05 Oct 2006</td>
<td></td>
<td>0</td>
<td></td>
<td>906,677</td>
<td></td>
<td></td>
</tr>
Instead of this:
<tr bgcolor="#eaecef">
<td>05 Oct 2006</td>
<td>230</td>
<td>1,535</td>
<td>65</td>
<td>6,442</td>
<td>2 (0)</td>
<td>3</td>
</tr>
To fix this in the future, we've added a routine to the spider that quickly parses each stats page and makes sure that rank and downloads are NOT blank in the HTML. If they are blank, the spider will attempt to pull the page again.
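The check described above could be sketched roughly like this in Python. This is not the spider's actual code, just an illustration using the standard html.parser module. The column positions for rank and downloads, the function names, and the retry count are all assumptions made for the example, and the fetch function is passed in by the caller rather than tied to any particular HTTP library.

```python
from html.parser import HTMLParser


class RowCells(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._row.append("")
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            if self._row:  # skip rows without <td> cells (e.g. headers)
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_td and self._row:
            self._row[-1] += data.strip()


# ASSUMPTION: which cells hold "rank" and "downloads". The post does not
# say, so these indexes are placeholders to adjust for the real table.
REQUIRED_COLUMNS = (1, 2)


def stats_complete(html, required=REQUIRED_COLUMNS):
    """True if every data row has non-blank text in the required cells."""
    parser = RowCells()
    parser.feed(html)
    if not parser.rows:
        return False
    return all(
        len(row) > max(required) and all(row[i] for i in required)
        for row in parser.rows
    )


def fetch_stats(fetch, url, attempts=3):
    """Call fetch(url) until the page passes the check; None if it never does."""
    for _ in range(attempts):
        html = fetch(url)
        if stats_complete(html):
            return html
    return None


# The two sample rows from this post.
BAD_ROW = ('<tr bgcolor="#eaecef"><td>05 Oct 2006</td><td></td><td>0</td>'
           '<td></td><td>906,677</td><td></td><td></td></tr>')
GOOD_ROW = ('<tr bgcolor="#eaecef"><td>05 Oct 2006</td><td>230</td>'
            '<td>1,535</td><td>65</td><td>6,442</td><td>2 (0)</td><td>3</td></tr>')
```

Passing the fetch function in as an argument keeps the check easy to try out on saved pages before pointing it at the live site.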
To fix the bad stats pages for this run (datasource_id=38), we will re-run the 60-day stats and insert those into the db. Because we're re-running these stats pages a few days after the rest of the run, those projects will have slightly different stats dates than the rest of the projects in run #38.
I'm sorry this seems so messy, but it reminds me that what we're doing here IS fairly messy, actually. So, errors within the SF page generation program are like adding insult to injury. :)
5. PLEASE, PLEASE, PLEASE USE THE TEXT FILES IF POSSIBLE, rather than the query tool. If you have a request for a data set that is not provided in the text dumps, let me know and I will run the query for you and add it so that everyone can benefit from it in the future.
- megan's blog