Today I participated in the 2020 East Coast Datathon hosted by Citadel and CorrelationOne. The Datathon is an annual competition meant to find undergraduate and graduate students who would be a good fit at Citadel and Citadel Securities. You have to apply to even compete, and you have to complete an hour-long technical test (which I 100% failed), and they invite probably around 10% of the applicants to an all-expenses-paid competition. This year, the East Coast Regional Datathon was being held in New York City, so I signed up and took the test for the fun of it. To my surprise, I received an invitation (which prompted me to force my other friends to also apply and take the test, since they were apparently letting any old person in). I RSVP’d, even though I had zero experience with data analysis, visualization, or wrangling, because, well, there would be free food. I almost didn’t go because my fear of bringing my future team down with my severe lack of knowledge was very, very close to outweighing my desire for free food. Hoang was also visiting this week, and I wanted to spend more time with him before he returned to South Carolina. It was a really difficult decision, but I ended up deciding to participate.
The night before the datathon, Citadel and CorrelationOne hosted a mixer at PJ Clarke’s. There were about 40 extremely smart but slightly asocial undergraduate and graduate students standing around, some dressed in full suits, others in jeans and sneakers (I was in between, wearing khakis and a jacket). A gaggle of them was listening intently to a Citadel engineer, in hopes that if they were good enough at listening, they might get hired on the spot. I was in front of the cheese and crackers table, piling it on. Waiters and waitresses also circled the place, offering fancy raw tuna tacos, avocado toast, sliders, square slices of thin crust pizza, and drinks. All the other participants politely refused the hors d’oeuvres; they were grace, they were beauty. I was shameless; when a waitress came up to me with a plate of avocado toast when my hands were full with plates of food, she actually placed the piece of toast onto my already full plate.
The mixer was a chance for the un-wanted (me) to find a team. I spent some time talking to a PhD student at Columbia, and I offered to be on his team, but he just laughed and didn’t say anything. I eventually met two students, one from NYU and the other from CMU (both graduate students), who recruited me to be on their team. They were not try-hards, they said. They were legitimately there for the good times. And that was perfectly okay with me; I could not have asked for a better team. Henry is studying computational finance at Carnegie Mellon University; Ru just graduated with a degree in data science from NYU. I felt bad for Ru, because she confided that she joined the competition because she had nothing else to do; she could not return home to China because of the flight ban caused by the coronavirus. Halfway through the mixer, the MC easily got our attention (not much talking was going on) and revealed the theme for this year’s datathon: CitiBike data. They would give us loads and loads (gigabytes and gigabytes for you nerds) of data, and we had to pose an interesting question and answer it using the data. We brainstormed a couple of potential questions, and talked about how crazy it would be if we won. At a natural lull in the conversation, Henry and Ru decided that they were going to head home. After they left, I reconnected with the PhD student I talked to first, and he had recruited a fourth teammate. Then I left too.
I didn’t leave to go home though – I went straight to Anne’s apartment, where she was hosting a pregame before going out to a club in K-Town with Hoang. I had a mixed (?) drink of vodka and red bull, neither of which would help me in the datathon, but I finished it anyway. Rules is rules.
The next morning, I woke up after 4 hours of sleep (caffeine at midnight isn’t exactly Nyquil), and took the subway over to Convene, where the datathon was being held. The schedule was as follows:
- 8:00 am – Arrival/Check-in and Breakfast
- 8:30 am – Real Data Released, Hacking begins
- 11:00 am – Lunch served
- 3:30 pm – Final papers due
- 6:00 pm – Results and Goodbyes
We started at 8:30, when the data was released. We were given a couple of years’ worth of CitiBike data (ride duration, start location, end location, subscriber, etc.), the same data for bike sharing platforms in Boston and San Francisco, randomly sampled Yellow Medallion Cab data, Green Cab data, ride-sharing app data, NY MTA (subway) data, demographic data, NYC neighborhood data, and weather data. We had a lot of ideas, and we started playing around with the data, doing some preliminary visualizations to see some simple trends that could help nudge us towards a final research question.
We pivoted many times throughout the 7-hour datathon because our data kept telling us different stories. We started out with an idea I came up with: how much of a substitute is bike sharing for ride-hailing or walking as a solution to the “last mile problem”, and how does weather affect it? We could measure things like cross-correlation and cross-elasticity, and how those numbers change as weather patterns change. However, this was too difficult: we needed to do extremely granular analysis, and besides, it’s impossible to tell what a “last mile” in Manhattan is, since practically everywhere you go is less than a couple of blocks away from a subway station.
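For the curious, here is a minimal sketch of the cross-correlation idea mentioned above. It is not what we actually ran that day. The function names (`pearson`, `cross_correlation`) and the daily ride counts are made up purely for illustration, and it only uses the Python standard library.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def cross_correlation(x, y, lag=0):
    """Correlate x[t] with y[t + lag] for a non-negative lag (in days here)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    return pearson(x[: len(x) - lag], y[lag:])


# Hypothetical daily trip counts for one week (NOT real datathon data).
bike_rides = [120, 150, 90, 60, 130, 160, 80]
hail_rides = [200, 180, 240, 270, 190, 170, 250]

same_day = cross_correlation(bike_rides, hail_rides, lag=0)
print(f"same-day correlation: {same_day:.2f}")
```

A strongly negative same-day value on real data would be consistent with the two modes substituting for each other, but it would not prove it; weather or day-of-week effects could drive both series at once, which is part of why we found this question so hard to answer.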
So we continued wrangling the data, plotting it geographically, looking at interday and intraday CitiBike demand trends, examining how different atmospheric attributes affected demand, and how demand was correlated with almost any factor we could think of.
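To give a flavor of what “intraday” and “interday” demand means, here is a rough sketch that counts trips by hour of day and by day of week. The timestamp format, the `demand_profiles` name, and the sample trips are assumptions for illustration, not the real CitiBike schema or our actual code.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def demand_profiles(start_times, fmt="%Y-%m-%d %H:%M:%S"):
    """Count trips per hour of day (intraday) and per weekday (interday)."""
    by_hour = Counter()
    by_weekday = Counter()
    for raw in start_times:
        started = datetime.strptime(raw, fmt)
        by_hour[started.hour] += 1
        by_weekday[WEEKDAYS[started.weekday()]] += 1
    return by_hour, by_weekday


# Hypothetical trip start times (assumed format, not real data).
trips = [
    "2020-02-03 08:15:00",
    "2020-02-03 08:45:00",
    "2020-02-03 17:30:00",
    "2020-02-08 13:05:00",
]

by_hour, by_weekday = demand_profiles(trips)
print(sorted(by_hour.items()))
print(dict(by_weekday))
```

Plotting counts like these is the kind of quick, preliminary look that can show commuter peaks on weekdays versus flatter weekend usage before you commit to a research question.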
We started compiling the document at around noon – Yisu started with his data and observations, and then put Henry’s discoveries in as well. I was busy trying to get interesting facts from the San Francisco bike sharing data, but it ended up not making it into our report :(. At 1:30, I started looking at the document. Before the competition had even started, I had already designated myself as the teammate who would write and turn in the paper, because English was their second language and I wasn’t about to let my IB education go to waste. I wrote the executive summary, introduction, conclusion, and next steps; I drew conclusions in a concise and digestible manner; I edited and sanity-checked the figures they had thrown in the document; and I formatted the document to make it look professional.
We turned in the paper (which you can read here) at 3:25, and we took a walk around the building to recuperate a little. A couple of minutes later, everyone was finishing up, and many people had confident looks on their faces. I really wanted to climb, so I just up and left at 3:35 – I took the 6 train uptown to 96th and went to Steep Rock Bouldering. A couple of hours later, as I was walking home from the gym, my phone buzzed. It was WeChat – it said “We won” – no exclamations, no punctuation, no caps. I thought I was being trolled. So I continued on my way home and replied “LOL really”, to which they responded with a picture of them holding a huge check with $20,000 written on the amount line. A hurried taxi ride and 15 minutes later, I rushed back into Convene, where my teammates were the only people left. The big check was gone, but they still had a little certificate of congratulations for me. We took a picture, and then we all parted ways again.
With this win, we earned an invitation to the Data Open Championship, where we will be competing for a grand prize of $100,000!
If you are participating in a Data Open, only read the following part:
- To win, you don’t need to do anything fancy. You just need a good question. Spend most of your time formulating an interesting question that can be feasibly answered in 4 hours. Start with this question in mind, and let it guide your data analysis.
- When your data analysis doesn’t go the way you intended, or yields results contrary to your hypothesis, don’t be afraid of pivoting. It’s worse to try to force data into the shape you desperately want it to be in.
- Storytelling is extremely, extremely important. I’m not going to lie – I did probably 10% of the actual data analysis. My teammates were lightyears ahead of me technically. But what I provided to the team was the ability to put their results into a presentable form; a form that people could read and actually draw valuable conclusions from. I was able to connect what we did in 6 hours to the business value it provided for CitiBike. I was able to convey why our question mattered, and even more importantly, why our results mattered. You can have the most brilliant data analysis in the world, and if you don’t present it well, you will not win.
- I truly think having three technically strong people and one good writer who can also understand the technical details is the best team composition. You obviously need the technical details to back up your paper, but technical details with no paper are worth nothing either.