How to Read the Internet (Effectively)
The broad scope of data science is a double-edged sword. Our professional lives span multiple fields, each containing decades of valuable discoveries with new insights added every day. How is a data scientist to stay afloat in this vast knowledge lake? Failing to use the right tool or approach can be costly for projects as well as careers. How do we avoid remaining in our comfortable tech bubbles, ultimately risking obsolescence?
Thankfully, we can solve this problem the same way we solve our programming questions: the internet. The rise of blogs, open-access publications and Massive Open Online Courses (MOOCs) has created a wealth of perpetually updated information, more than we could hope to process in a lifetime. Many of us are already familiar with this content and use it to some extent. Depending on my daily availability, I might flip through Hacker News, read a technical blog post about Principal Component Analysis, then scroll my LinkedIn feed for inspirational quotes. But if my schedule is busy, I can go several days without reading anything. This scattershot approach has several issues:
Occasional skimming (or cramming, as the case may be) is far inferior to learning a little bit each day.
Many important articles may fall through the cracks. If I only check my favorite machine learning blog on Fridays, I might miss a highly relevant overview of a new implementation of Matrix Profile that was posted at the beginning of the week.
Without a process for determining what topics are important, my reading may be extremely inefficient. (Was it worth spending three hours yesterday on that blockchain primer?)
All of these issues can be solved by what I call the “Find-Collect-Process” method. With just 45 minutes per day, I can review hundreds of articles and only focus on the ones that matter, while making sure that few important posts slip through the cracks.
Step 1: Find
The first step is to identify sources of relevant content. I try to remain a “generalist” data scientist and hence follow many different publications, but if you’re an ML engineer, there may only be a handful of academic journals that you care about. Below are some good sources that I’ve found during my career:
1. Company blogs
Many tech organizations run blogs (Medium or otherwise) that highlight interesting work, new technologies in the field or just fun tutorials for beginning data scientists. Some of my favorites are Airbnb Engineering & Data Science, The Netflix Tech Blog, Kaggle’s No Free Hunch and Stitch Fix’s Multithreaded. I’ll also make a shameless plug for my alma mater, Insight Data Science, as well as my current employer, Tech@Target.
2. Data science/analytics platforms
Several sites serve as content platforms, aggregating posts from individuals, academic organizations and companies that don’t run their own blogs. Two great examples are KDnuggets and DataTau. I also love following Hacker News, but the quantity of stories can get overwhelming at times.
3. Individual blogs and everything else
I’ll admit that I don’t use these sources as much, but it can be helpful to get the latest thoughts from leaders in AI, data science and analytics. Brandon Rohrer, Monica Rogati, Hilary Mason and Ben Hamner have all provided valuable insight to me at various points in my career.
Step 2: Collect
Once we’ve identified our sources, the next step is to collect all new content for review. RSS (Really Simple Syndication) is a fantastic invention: By subscribing to a website’s RSS feed, I’m notified every time a new article is released. News aggregators like Feedly allow me to add and capture content in a single place. This is much easier than visiting every website on my list and scrolling through every headline; news aggregators also allow you to “save” articles across multiple days. Find what works for you, but I highly recommend Feedly (it even lets you follow Twitter accounts!).
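As a minimal sketch of what an aggregator like Feedly does under the hood, the snippet below pulls article titles and links out of an RSS 2.0 feed using only the Python standard library. The sample XML and URLs are made up for illustration; in practice you would fetch the real feed with `urllib.request.urlopen`.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 payload standing in for a real feed response.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Data Science Blog</title>
    <item>
      <title>A Gentle Intro to Matrix Profile</title>
      <link>https://example.com/matrix-profile</link>
    </item>
    <item>
      <title>PCA in Five Minutes</title>
      <link>https://example.com/pca</link>
    </item>
  </channel>
</rss>"""

def feed_titles(rss_xml: str) -> list:
    """Return (title, link) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

for title, link in feed_titles(SAMPLE_FEED):
    print(f"{title} -> {link}")
```

A real aggregator does little more than run this loop over every feed you subscribe to and remember which items you’ve already seen.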
Step 3: Process
Now that you’ve completed the first two steps, odds are that you’ll have anywhere from 20 to 200 articles to review per day. (Side note: Go easy on the number of sources when starting out and build over time.)
There are three levels of understanding when it comes to reading an article:
- Title: the headline alone, just enough to know the topic and tool exist.
- Synopsis: a quick skim for the main idea and key takeaways.
- Details: a full read covering methods, code examples and fine points.
For example, my colleague Sean Quinn recently posted an article on this site that we can break down using the above schema:
- Title: “Flottbot: Creating Simple Chatbots in Seconds, Not Days”
- Synopsis: Flottbot is a chatbot framework abstracted away from any specific programming language; you only need to understand YAML.
- Details: the technical inspiration for Flottbot, specific capabilities of the framework, a basic “Hello World!” code example and so on.
Why are these distinctions important? It’s because my ideal level of understanding depends on my role and active project(s). Currently, I focus on time-series anomaly detection, so spending an hour going through every little detail may not be the best use of my time. However, by simply reading the title, I now know that there’s an easy chatbot implementation available should I ever need it, and that I can access the developers directly as we work for the same company.
In contrast, when I worked with a team that wanted to automatically answer questions in their customer chat room, knowing that Flottbot is language-agnostic would have been a huge time saver (one team member knew Ruby, another knew Python, another knew C++, and so on). In that case, understanding the article synopsis would have been more appropriate. If our team had then wanted to evaluate Flottbot as a potential solution (perhaps via a “Discovery” Scrum story), I could have gone back and dived into the details.
The key to this process is to filter judiciously. Each level of the “Find-Collect-Process” filter should remove roughly 90 percent of the content passed through. That is, only 10 percent of all article titles are interesting enough for me to ascertain a synopsis. Of those articles, only 10 percent are relevant enough for me to read the full details. Finally, of those articles, 10 percent will be so good that I’ll save or bookmark them for long-term reference. Thus, I’ll read 100 article titles, skim 10 articles for their synopses, and fully read only one(ish) article per day. Much more manageable! Your percentages may vary, but the idea remains the same: filter, filter, filter!
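The funnel arithmetic above can be sketched directly. This is a toy calculation, not anything from the article itself: the 10 percent pass rate per stage is the rule of thumb described above, and the function name is mine.

```python
import math

PASS_RATE = 0.10  # roughly 10% of items survive each filtering stage

def funnel(collected: int, stages: int = 3) -> list:
    """Items remaining after each of `stages` successive 10% filters.

    Uses ceil so the funnel never rounds a nonzero stage down to zero
    (you always read at least one article).
    """
    counts = [collected]
    for _ in range(stages):
        counts.append(math.ceil(counts[-1] * PASS_RATE))
    return counts

# 100 titles -> 10 synopses -> 1 full read -> 1 bookmarked
print(funnel(100))  # [100, 10, 1, 1]
```

Swapping in your own pass rate per stage is a one-line change if your filtering is stricter or looser than 90 percent.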
Another benefit of reading articles at a less-intensive level of understanding is recall reinforcement. I don’t currently know the inner workings of TensorFlow, but by skimming articles I’ve learned that 1) it’s a powerful deep learning framework produced by Google, 2) there’s an easy-to-use framework called Keras for building models in TensorFlow, and 3) TensorFlow 2.0 will soon be released with many new features (at least as of this writing). This kind of “dictionary knowledge” serves as my own Wikipedia that I can use to form the basis of future project research.
Remember, this should be fun and you should be learning! If you find yourself getting swamped by too many articles, cut down on the number of publications you follow. Source XYZ hasn’t published anything useful in weeks? Get rid of it! Remember, the idea is to consistently put in a small amount of time throughout your career. Good luck!
Andrew Van Benschoten is a lead engineer focused on full-stack data science. He’s a co-developer of the Python package matrixprofile-ts as well as a part-time Kubernetes evangelist.