Appendix: Influences on Tag Choice on del.icio.us by Emilee Rader and Rick Wash (CSCW '08)

This web page contains the appendix information for our CSCW '08 paper, Influences on Tag Choice on del.icio.us. Through this online appendix, we provide access to our datasets, additional statistical results, and the code we wrote for our computer model and our analyses.

Statistical Results: Table 1 in the paper is only a summary of our logistic regression results. Our statistics software output much more diagnostic and substantive information, which is available here:

Code: We used the R programming language (as in the R statistics software) for our logistic regression and inter-user agreement analyses, and for our computer models. R is available for all major platforms; you will need to download and install it before running our code. UCLA's Academic Technology Services has an excellent set of instructions and examples for installing and using R.

The following files are available (the top of each file includes information about its use):

Datasets: We will make both of our datasets available to those who would like to replicate and extend our analyses: the data we scraped from del.icio.us and the data generated by our computer models. These datasets are large (over 2GB), and we are unable to host them permanently online. If you are interested, contact us by email to work out transfer arrangements: emilee@gmail.com and rick.wash@gmail.com. Serious inquiries only, please; these are not general-purpose datasets. The data were collected from del.icio.us using a sampling frame specific to our research, and therefore may not be suited to your research questions or objectives. We are providing the computer model data so that you may replicate our analyses exactly; subsequent runs of the model using the same parameters may not produce exactly the same numbers because of random number generation.

Computer Model: Our iConference 2008 paper goes into much greater detail about the specifics of the computer models and our design decisions. In particular, the random number generator was set up as follows:

Database Schema: The R files and Perl script linked above assume a database with the same layout as ours. Here are the schemas (from MySQL):

CREATE TABLE  `delicious`.`jan2007_site` (
  `id` int(11) NOT NULL auto_increment,
  `deliciousID` varchar(200) NOT NULL default '',
  `title` varchar(400) default NULL,
  `url` varchar(500) default NULL,
  `user` varchar(200) default NULL,
  `date` date default NULL,
  `position` int(11) default NULL,
  PRIMARY KEY  (`id`),
  KEY `deliciousID` (`deliciousID`),
  KEY `date` (`date`),
  KEY `position` (`position`),
  KEY `user` (`user`),
  KEY `id_date` (`deliciousID`,`date`)
)
and
CREATE TABLE  `delicious`.`jan2007_tag` (
  `id` int(11) NOT NULL auto_increment,
  `site_id` int(11) default NULL,
  `tag` varchar(200) default NULL,
  `position` int(11) default NULL,
  PRIMARY KEY  (`id`),
  KEY `tag` (`tag`),
  KEY `deliciousID` (`site_id`)
)
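As a rough illustration of how these two tables relate, the sketch below joins each tag row to its bookmark through site_id. It is not taken from the analysis code: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, reproduces only some of the columns with simplified types, and the sample rows and the helper name tags_for_site are invented. It also assumes, as the column names suggest, that each jan2007_site row is one user's bookmark of a site and that position records ordering.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL database (assumption: a subset
# of the columns, with simplified types and no secondary indexes).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jan2007_site (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  deliciousID TEXT NOT NULL DEFAULT '',
  url TEXT,
  user TEXT,
  date TEXT,
  position INTEGER
);
CREATE TABLE jan2007_tag (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  site_id INTEGER,
  tag TEXT,
  position INTEGER
);
""")

# Invented sample rows: two users bookmark the same site.
conn.executemany(
    "INSERT INTO jan2007_site (id, deliciousID, url, user, date, position) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "abc123", "http://example.com/", "alice", "2007-01-05", 1),
        (2, "abc123", "http://example.com/", "bob", "2007-01-06", 2),
    ],
)
conn.executemany(
    "INSERT INTO jan2007_tag (site_id, tag, position) VALUES (?, ?, ?)",
    [(1, "web", 1), (1, "design", 2), (2, "web", 1)],
)


def tags_for_site(db, delicious_id):
    """Return (user, tag) pairs for one site, ordered by bookmark position,
    then by tag position within each bookmark."""
    cursor = db.execute(
        """
        SELECT s.user, t.tag
        FROM jan2007_site AS s
        JOIN jan2007_tag AS t ON t.site_id = s.id
        WHERE s.deliciousID = ?
        ORDER BY s.position, t.position
        """,
        (delicious_id,),
    )
    return cursor.fetchall()


site_tags = tags_for_site(conn, "abc123")
print(site_tags)
```

The join condition t.site_id = s.id is the only link between the two tables, so a tag row with no matching site row would simply be dropped by this inner join.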
For downloads of users' bookmarks, you also need this table:
CREATE TABLE  `delicious`.`jun2007_user` (
  `id` int(11) NOT NULL auto_increment,
  `user` varchar(200) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `user` (`user`)
)
and a site table with an extra count column (it also refers to users by user_id rather than by user name):
CREATE TABLE  `delicious`.`jun2007_site` (
  `id` int(11) NOT NULL auto_increment,
  `deliciousID` varchar(200) default NULL,
  `title` varchar(400) default NULL,
  `url` varchar(500) default NULL,
  `user_id` int(11) NOT NULL default '0',
  `date` date default NULL,
  `count` int(11) default NULL,
  `position` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (`id`),
  KEY `deliciousID` (`deliciousID`),
  KEY `date` (`date`),
  KEY `id_date` (`deliciousID`,`date`),
  KEY `user` (`user_id`)
)
And of course, the standard tag table.
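In the jun2007 layout the site table stores a numeric user_id, so looking up one user's bookmarks by name requires a join through jun2007_user. A minimal sketch of that lookup follows, again using Python's sqlite3 module as a stand-in for MySQL. The count column and the indexes are omitted, and the sample rows and the helper name bookmarks_for_user are hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL (assumption: simplified types,
# only the columns needed for the lookup).
bookmarks_db = sqlite3.connect(":memory:")
bookmarks_db.executescript("""
CREATE TABLE jun2007_user (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  user TEXT NOT NULL
);
CREATE TABLE jun2007_site (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  deliciousID TEXT,
  url TEXT,
  user_id INTEGER NOT NULL DEFAULT 0,
  date TEXT,
  position INTEGER NOT NULL DEFAULT 0
);
""")

# Invented sample rows.
bookmarks_db.executemany(
    "INSERT INTO jun2007_user (id, user) VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)
bookmarks_db.executemany(
    "INSERT INTO jun2007_site (deliciousID, url, user_id, date, position) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("abc123", "http://example.com/", 1, "2007-06-02", 2),
        ("def456", "http://example.org/", 1, "2007-06-01", 1),
        ("abc123", "http://example.com/", 2, "2007-06-03", 1),
    ],
)


def bookmarks_for_user(db, user_name):
    """Return (url, date) pairs for one user name, oldest bookmark first."""
    cursor = db.execute(
        """
        SELECT s.url, s.date
        FROM jun2007_site AS s
        JOIN jun2007_user AS u ON u.id = s.user_id
        WHERE u.user = ?
        ORDER BY s.date
        """,
        (user_name,),
    )
    return cursor.fetchall()


alice_bookmarks = bookmarks_for_user(bookmarks_db, "alice")
print(alice_bookmarks)
```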
Also, there are two more tables that are used for temporary storage (they will be filled by the code):
CREATE TABLE  `delicious`.`logdata_temp` (
  `deliciousID` varchar(200) default NULL,
  `user` varchar(200) default NULL,
  `prefix` varchar(50) default 'jan2007',
  `id` int(11) NOT NULL auto_increment,
  `finished` tinyint(1) NOT NULL default '0',
  PRIMARY KEY  (`id`)
)
and
CREATE TABLE  `delicious`.`logistic_data` (
  `id` int(11) NOT NULL auto_increment,
  `site` varchar(200) NOT NULL,
  `user` varchar(200) NOT NULL,
  `tag` varchar(200) NOT NULL,
  `chosen` tinyint(1) NOT NULL,
  `used.onSite` tinyint(1) NOT NULL,
  `used.byUser` tinyint(1) NOT NULL,
  `fromUserTags` tinyint(1) NOT NULL,
  `fromSiteTags` tinyint(1) NOT NULL,
  `position` int(10) unsigned NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `site` (`site`),
  KEY `user` (`user`),
  KEY `tag` (`tag`),
  KEY `site_position` (`site`,`position`)
)
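To give a sense of the shape of logistic_data, the sketch below loads a few invented rows and tabulates how often a candidate tag was chosen, split by one of the binary indicator columns. It is only a descriptive cross-tabulation in Python's sqlite3 module, not the logistic regression itself, which was run in R. The reading of the columns (chosen marks whether the user applied the tag, used.onSite whether the tag had already been used on that site) is an assumption based on the column names, and chosen_rate_by is a hypothetical helper.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL. SQLite accepts MySQL-style backtick
# quoting, which is needed for the column names that contain a dot.
logit_db = sqlite3.connect(":memory:")
logit_db.execute("""
CREATE TABLE logistic_data (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  site TEXT NOT NULL,
  user TEXT NOT NULL,
  tag TEXT NOT NULL,
  chosen INTEGER NOT NULL,
  `used.onSite` INTEGER NOT NULL,
  `used.byUser` INTEGER NOT NULL
)
""")

# Invented rows: one row per (site, user, candidate tag).
logit_db.executemany(
    "INSERT INTO logistic_data "
    "(site, user, tag, chosen, `used.onSite`, `used.byUser`) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("abc123", "alice", "web", 1, 1, 0),
        ("abc123", "alice", "design", 1, 1, 0),
        ("abc123", "alice", "todo", 0, 0, 1),
        ("abc123", "alice", "python", 1, 0, 1),
    ],
)

ALLOWED_INDICATORS = ("used.onSite", "used.byUser")


def chosen_rate_by(db, indicator):
    """Map each value of a binary indicator column (0 or 1) to the
    fraction of rows with that value where chosen = 1."""
    # Column names cannot be bound as parameters, so check against a
    # fixed whitelist before interpolating into the query.
    if indicator not in ALLOWED_INDICATORS:
        raise ValueError(f"unknown indicator column: {indicator}")
    query = (
        f"SELECT `{indicator}`, AVG(chosen) FROM logistic_data "
        f"GROUP BY `{indicator}` ORDER BY `{indicator}`"
    )
    return dict(db.execute(query).fetchall())


rate_on_site = chosen_rate_by(logit_db, "used.onSite")
print(rate_on_site)
```

A table like this, with one binary outcome and several binary predictors per row, is the usual input shape for a logistic regression, which is presumably why the R code expects it in this form.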