The best way (if you don’t already know) of grabbing the StackOverflow dataset is to use BitTorrent. It’s more forgiving of bad connections, start/stop interruptions, and most clients will let you throttle the bandwidth so you don’t flood your network with traffic. Besides, how often do you get to use BitTorrent at work and rightly tell people it’s for research purposes? You can grab the latest torrent of the data here: https://archive.org/details/stackexchange.
First thing to note: this torrent contains all of the data for the entire StackExchange set of websites, of which StackOverflow is a subset. Your BitTorrent client should show the breakdown of all the files; make sure you exclude non-SO files to reduce how much data you need to download.
The second thing you’ll want is to set up a MySQL server instance on your machine. This guide is written with MySQL in mind; while much of this will be useful for other RDBMS setups, there’s some specific configuration stuff that will only be applicable to MySQL.
Use MySQL’s community installer to install MySQL; you can get it here: https://dev.mysql.com/downloads/windows/installer/. You’ll want the MySQL DB software, as well as the Admin panel and their Notifier. The Admin panel will be useful for just looking at tables and databases, while the Notifier provides an easier way to restart your server.
So far, nothing really important, right? Well, there are two things I learned the hard way:
1. When installing the MySQL DB service, you’ll be given some options on how you want to connect. You’ll generally be fine with a localhost connection. However, you’ll also be given the option of opening up a port in your firewall for the database. Don’t do this unless you need to access the DB from remote machines. Otherwise, if you ever have a problem with the DB afterwards, such as your passwords not working, you’ll always be wondering if someone’s messing with your installation.
2. You need to change some of the default settings in MySQL’s configuration. I’ll update this later with those settings as I’m currently away from my work machine, and need to check exactly what those settings are. Whoops.
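Until I can confirm the exact values, these are the kinds of my.ini settings that usually need raising for an import of this size. The variable names are real MySQL options, but the values below are illustrative guesses, not tuned recommendations:

```ini
# my.ini — illustrative values for a ~100 GB import, not tuned recommendations
[mysqld]
max_allowed_packet = 256M        # large rows (e.g. long post bodies) can exceed the default
innodb_buffer_pool_size = 4G     # a bigger pool speeds index builds; size to your RAM
innodb_log_file_size = 512M      # larger redo logs reduce checkpoint stalls on bulk loads
net_read_timeout = 600           # long-running LOAD statements can trip the default timeouts
net_write_timeout = 600
```

Restart the MySQL service after changing any of these (the Notifier makes that easy).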
Last March, I was reviewing a paper for a workshop on mining software repositories that was part of a research competition: the idea was to see which researchers could come up with the most interesting new analysis of a provided dataset. For this last workshop, researchers were provided with the latest copy of StackOverflow’s data dump in XML form, courtesy of the StackExchange website. It’s a sizeable collection of data: 100 GB uncompressed.
Anyways, one of the papers I reviewed mentioned that they couldn’t figure out how to get the SO data into a database server. I didn’t think much about it at the time, until I started doing some research using data from StackOverflow and tried it myself.
Here’s the thing: if you’re working with databases regularly, the SO dataset probably presents no issues for you to load into your DB of choice, and this post has no value for you. If you’re a researcher who uses DBs as needed, or a student who is still new to this stuff, importing SO’s data is a bit of work. Searching around for help on this topic is frustrating: different people at different times have created tools and import scripts to handle all of this for you, but they all seem to go out of date over time. StackExchange releases a new data dump once a quarter I think, and everything I’ve found to date failed to work on the latest release (March 2015).
I’m going to detail in these next posts what I did, and what worked for me to replicate the StackOverflow database on a locally hosted MySQL server, as well as some of the problems you’re likely to encounter when you try this. Hopefully, this’ll help someone else who was in the same boat as me: just needing to load the data, so they can get on to doing the real work.
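To give a sense of the shape of the work: each dump file (Posts.xml, Users.xml, and so on) is a single root element whose children are `<row>` elements, with every value stored as an XML attribute. So the import boils down to streaming those rows into parameterized INSERTs. Here’s a minimal Python sketch of that idea — the table and column names are placeholders, and you’d feed the output to whichever MySQL driver you prefer (e.g. mysql-connector-python’s `executemany`):

```python
# Sketch: stream rows out of a Stack Exchange dump file (e.g. Posts.xml)
# without loading the whole multi-GB file into memory. Each <row> element
# carries its values as XML attributes.
import xml.etree.ElementTree as ET

def iter_rows(source):
    """Yield each <row> element's attributes as a dict, freeing memory as we go."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # release the parsed element so memory use stays flat

def insert_sql(table, columns):
    """Build a parameterized INSERT statement (table/column names are placeholders)."""
    placeholders = ", ".join(["%s"] * len(columns))
    cols = ", ".join(columns)
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders})"
```

In practice you’d batch the dicts from `iter_rows` into, say, 1000-row chunks and pass them to `cursor.executemany(insert_sql(...), batch)` — but the streaming parse above is the part that keeps a 100 GB file tractable.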
We’ve recruited our allotment of participants, and have successfully completed our study.
Thanks to everyone who volunteered, and especially to those who shared the study with others! I appreciated it.
[Updated: We’ve concluded the study, and are no longer accepting applications.]
I’m currently conducting a study with the School of Computer Science at McGill University on how individuals navigate a product review website when researching a purchasing decision. The goal is to understand how one’s navigation and exploration of a website reflects their priorities and preferences in products; I’m hoping this study will lead to some ideas for systems that anticipate what products, and specifically what kinds of product reviews, are most useful for you.
We’re recruiting adult participants residing in Canada for a study in which we remotely observe, online, how they navigate product reviews and make decisions about competing products they might buy. To volunteer to participate, you will need to be fluent in English and have access to a computer in a quiet room, equipped with a microphone, Skype (http://www.skype.com/), and Internet access.
To volunteer, you’ll be asked to complete a questionnaire describing your interests and level of knowledge with a consumer product category. We’ll be selecting volunteers from the pool of applicants to participate in an online study via Skype that will take at most 90 minutes. If we select you to participate, you will be compensated with a $30.00 CDN Amazon.ca gift card.
If you’re interested in more information, or to volunteer, please contact me (Bradley Cossette, e-mail: bradley DOT cossette AT mail DOT mcgill DOT ca) for more details. A PDF of the official advertisement is also available here: rebapplication_crinvestigation_advertisement.
On September 10th, 2014, I get to finally defend my PhD thesis. That’s a big part of why there hasn’t been much new activity on this site over the last few months.
I’ll write more when it’s over, and I’ve had a chance to process everything.
The last few weeks have been exceptionally busy, as I try to get my final thesis draft done for the examination committee’s review. Between working at McGill during the day, and then working on the thesis on the evenings and weekends, there hasn’t been much time for…well, anything. I’m hoping this is the last week of this.
This is a really cool post by Mike Krahulik of Penny-Arcade. In it, he describes how he borrowed an Oculus Rift virtual reality headset, and had his family try it out, along with his kids’ friends. There are two videos embedded of different people trying out the VR headset on a simulated roller coaster ride. They’re pretty fun to watch.
The comments about kids loving it aren’t really anything new; kids pick up new technology quickly, and can cope with interface issues that adults struggle with (I guess they’ve just learned fewer interaction paradigms than us, and it’s easier for them to come in as a blank slate). It’s the comments about Mike’s Dad that I found most interesting.
Mike describes his Dad as a non-gamer who grew up in a quasi-pre-TV era, and he really LOVED the VR experience. The video of him on a roller coaster is priceless, but I also think it’s a very positive signal about where consumer VR is at today. I’ve always found VR fascinating, but the technical challenges have mainly kept it to specialized applications and cool tech demos. Maybe the Oculus Rift is really poised to bring it into the mainstream. The video of a Dad enthralled by the experience is probably a pretty good indicator.
My wife reminded me today that it was a year ago that we had to evacuate our house in Calgary as the Bow River reached a 100-year high water mark. Our entire neighborhood was flooded that night, with the blocks around our place in Sunnyside the hardest hit.
I remember standing on a bridge over the Bow River in complete denial about what I was seeing, and then rushing back to our house to get as many things as possible out of our basement and main floor and onto the upper levels. We were fortunate that my parents lived in an unaffected area of Calgary, and we were able to stay with them for the following week. By an amazing stroke of luck, our landlord had sold the house we were renting, so we’d lined up a new place to live in July before the flood came. I can’t imagine trying to find accommodation after the flood.
We lost some things, and went through a pretty stressful time of cleaning, salvaging, and moving, but we were spared some of the worst of it. Before, when I saw pictures of flooding in other cities and places, it was a bit of an abstract disaster: it’s just water, right? Now, I’ll never forget the smell of a house post-flood.
For a long while I’ve been interested in law, especially intellectual property and copyright law. I think the interest started around the time the RIAA was beginning to file mass lawsuits against people who were downloading music, the EFF was getting involved in their defense, and Lawrence Lessig was still at Stanford, writing about the problems of permanent copyright. Copyright and IP law in general were interesting because I was planning to be a software developer, and the laws seemed out of touch with what was happening.
Before I started my MSc., this interest led me to seriously consider pursuing a law degree after finishing my BSc., in part because of the above. But mainly, it seemed that the market for software developers was…well, not great in the long term. The long-term pressures of outsourcing, and the companies I knew of that were contracting out development of core products, were not encouraging for a new graduate. A software company I worked for, which was not only the market leader for its flagship product but was also doing VERY well financially, was still discussing outsourcing some development to other countries just before I left. Meanwhile, the law profession seemed to offer a secure future (I mean, look at all those TV shows!): you had to be admitted to the local bar to practice, which limited international competition; very few lawyers were then practicing what’s now referred to as cyber-law, and hardly any had much of a software development background. I thought I could do pretty well there for a while, if I wanted.
Obviously I didn’t go down that road, but the reason I bring this up was this post on Stephen Bainbridge’s blog. He’s a UCLA corporate law professor who sometimes posts things of broader interest (I started reading his site regularly after noticing: (a) he does very good wine reviews; and, (b) he occasionally posts very interesting recipes. No idea on his standing as a lawyer, but the man can cook). The post is part summary, part commentary on another post on legal education in the US.
What was interesting to me was seeing that the law profession in the US is starting to face some of the pressures that software developers have been facing for decades now, particularly the ability to outsource legal work to other countries for less, and websites offering flat fees for piece-meal work. I used to think that the bar admission system in both Canada and the US protected lawyers from this. Apparently not any more.
Recruiting people to take part in your pilot study can be hard sometimes. Usually, if you’ve been working in a lab or office for a while, you’ve built personal connections, so your lab-mates or co-workers are happy to help you out.
If you’re starting in a new place, you don’t necessarily have that. But you do have the opportunity to create a reputation from scratch that will follow you for a while. For example, being the guy who uses gourmet donuts to recruit for pilot studies…
I picked these up from Léché Desserts in Montreal. Wonder how they compare to Jelly Modern back in Calgary?
Strawberry cheesecake and S’mores donuts
Graham marshmallow and double brownie chocolate donuts
Coconut, and I’m not sure what these other two flavours are. But they’re donuts.