Web scraping: what it is and how to do it
From mix tapes to playlists: uncovering audio trends with web scraping
Matt Marzillo | June 17, 2015
More and more organizations are using web-scraping technology to capture external data and use it to identify strategic partners and stay ahead of their competitors. Google, Pinterest, and other behemoths of the Internet build bots to gain information on competitors and emerging market technologies.
The use case: uncovering audio trends
We’ve seen incredible advances in the way we consume music over the years—from records to cassettes to CDs to MP3s. In part because I’m an audiophile, and in part because I’m of an age where I’ve used three very different technologies to store and listen to music, I decided to test out web scraping to see what I could learn about the history of audio innovations. The US Patent Office felt like a great place to start.
With the advent of every new technology comes a wave of new inventions, and the US Patent Office has a treasure trove of publicly available data on its website.
The catch? The data isn’t clean. It’s semi-structured text data embedded in thousands of different pages. I built a web-scraping application to parse through the data and discover which type of audio file inventions historically presented the greatest opportunity for monetization. In the process, I also hoped to identify unknown market trends—and maybe even debunk some assumed ones.
I developed a web scraper in Python to collect all text from patent abstracts matching the keywords cassette, compact disc, and mp3. Once the data was extracted, I used the tm package in R to clean the text (including stemming, removing stop words, and standardizing different tenses). I then used the same package to convert the cleaned text into term-document matrices: a tabular form that can be analyzed with many approaches in almost any software. I stuck with R because of my familiarity with it, but one could easily use Python or another advanced analytics program.
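To make the pipeline concrete, here is a minimal Python sketch of the three steps: pulling the abstract out of a page, cleaning the text, and building a term-document matrix. It is not the scraper or the R code used for this analysis. The page layout (an abstract inside a `<p class="abstract">` element), the tiny stop-word list, and the crude suffix-stripping stemmer are all assumptions made for illustration, and every function name is hypothetical. It uses only the standard library and works on HTML strings, so fetching the pages is left out.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class AbstractParser(HTMLParser):
    """Collects text inside <p class="abstract"> (an assumed page layout)."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "abstract") in attrs:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_abstract(html):
    """Return the abstract text from one page, with whitespace normalized."""
    parser = AbstractParser()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())


# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "is", "to", "with"}


def crude_stem(word):
    """Very rough suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        long_enough = len(word) - len(suffix) >= 3
        if word.endswith(suffix) and not word.endswith("ss") and long_enough:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Lowercase, tokenize, drop stop words, and stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def term_document_matrix(documents):
    """Return (terms, matrix): one row per term, one column per document."""
    counts = [Counter(clean(doc)) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


# Two made-up pages standing in for scraped patent pages.
pages = [
    '<html><body><p class="abstract">A cassette holder for storing tapes.</p></body></html>',
    '<html><body><p class="abstract">A portable <b>MP3</b> player with a cassette adapter.</p></body></html>',
]
abstracts = [extract_abstract(page) for page in pages]
terms, matrix = term_document_matrix(abstracts)
for term, row in zip(terms, matrix):
    print(f"{term:<10} {row}")
```

Each row of the matrix is one term and each column is one abstract, so a term that shows up in both abstracts (here, cassette) has a count in both columns.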
Looking first at the number of patents since 1976, I found several notable trends in the volume of patents for each technology. Cassettes had the most patents in the early 80s; CDs exploded in the mid-80s and into the 90s; and MP3 patents picked up steam in the late 2000s. It’s always nice when your first analysis passes the common-sense test.
Word frequency plots are a nice place to start the text analysis. They graph how often each word appears in each category. It’s always a good idea to sort these plots high to low to highlight the influential data points.
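As a rough illustration of a sorted word frequency plot, the Python sketch below counts terms across one category's documents and prints them high to low as a text bar chart. The token lists are invented for the example, not taken from the patent data, and the function names are hypothetical.

```python
from collections import Counter


def top_terms(token_lists, n=10):
    """Count terms across all documents in a category, sorted high to low."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)


def print_frequency_plot(ranked):
    """Print one bar per term; bar length equals the term's count."""
    for term, count in ranked:
        print(f"{term:<10} {'#' * count} {count}")


# Made-up, already-cleaned tokens for three documents in one category.
cassette_docs = [
    ["cassette", "tape", "player"],
    ["cassette", "pump", "patient"],
    ["cassette", "tape", "housing"],
]
print_frequency_plot(top_terms(cassette_docs, n=2))
```

Sorting before plotting is what puts the dominant terms at the top, where they are easiest to spot.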
I noticed that there was an interesting mix of healthcare terms in the cassette plot, with words such as patient and intravenous appearing unexpectedly. This is a good example of why it’s important to conduct a proper text analysis on the scraped data to remove any information that might not be relevant.
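One simple way to screen out off-topic documents is to drop any abstract that contains a term from an exclusion list. This is only a sketch: the filtering method used in this analysis isn't described here, the exclusion list is an assumption built from the two healthcare terms mentioned above, and the sample abstracts are invented.

```python
import re

# Hypothetical exclusion list, based on the healthcare terms noted above.
OFF_TOPIC_TERMS = {"patient", "intravenous"}


def is_on_topic(abstract, off_topic=OFF_TOPIC_TERMS):
    """Return True if the abstract contains none of the off-topic terms."""
    words = set(re.findall(r"[a-z]+", abstract.lower()))
    return not (words & off_topic)


# Made-up abstracts for illustration.
abstracts = [
    "A cassette tape player with auto-reverse.",
    "A cassette pump for delivering intravenous medication to a patient.",
]
audio_only = [a for a in abstracts if is_on_topic(a)]
print(audio_only)
```

A keyword filter like this is blunt, so it is worth spot-checking what it removes before trusting the results.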
Additionally, top words like records, open, and house seem to indicate that many of the patents covered actual audio players. Furthermore, portability terms appeared when I looked closer at the MP3 information. Walkmans and Discmans appear to have been the rage in the 80s and 90s, but the coolness of portable data didn’t take off until MP3s came into the fray.
Even deeper analysis
Clustering words will show us which terms appear commonly together and shed more light on the types of products developed for each technology.
A clustering algorithm was applied to each of the term-document matrices. The results show us that there is some natural clustering among the patents. The dendrograms in the figures above show a visual representation of the clusters. Words that are closer to each other are more likely to be in the same cluster.
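The specific clustering algorithm isn't named here, but dendrograms normally come from hierarchical (agglomerative) clustering, so the Python sketch below assumes that approach. It uses average linkage and Euclidean distance, both of which are assumptions, and a small made-up term-document matrix. It repeatedly merges the two closest clusters and records the height of each merge, which is what a dendrogram draws on its Y axis.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative_merges(terms, rows):
    """Average-linkage clustering of terms.

    Returns a list of (cluster_a, cluster_b, height) in merge order,
    where each cluster is a tuple of terms.
    """
    vectors = dict(zip(terms, rows))
    clusters = [(t,) for t in terms]

    def linkage(c1, c2):
        # Average distance over every pair of terms drawn from the two clusters.
        dists = [euclidean(vectors[a], vectors[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Made-up term-document matrix: one row per term, one column per document.
terms = ["tape", "player", "patient", "intravenous"]
rows = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
]
for left, right, height in agglomerative_merges(terms, rows):
    print(f"{left} + {right} at height {height:.2f}")
```

In this toy data the audio terms merge with each other first, the medical terms merge next, and the two groups only join at the largest height, which is the pattern to look for when reading a dendrogram.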
The higher up the Y axis two branches join, the less frequently those words occur together. With cassettes, I noticed a cluster to the right of the plot that contains words like intravenous, patient, and fluid (those previously seen healthcare terms). Those with clinical experience (or some Googling skills) would know that these patents are connected to cassette pumps, a technology used to deliver IV fluids. Since clustering algorithms help find organic associations in the data, seeing the medical technology grouped together is nice proof that the algorithm is doing its job.
Looking at the CD clusters, we see two clusters near the right of the plot that involve words like cover and surface with case nearby. A review of the patents and the frequency plot confirms that many CD patents are connected to storing and enclosing CDs, as opposed to playing discs.
The most interesting point of the MP3 clusters is the association with portability and integration with cell phones. The cluster near the middle deals with sourcing and connecting the devices to power, which appears to be an emerging source of new patents.
There’s an abundance of information on the web. The trick is building a process that retrieves this information and organizes it in such a way that we can begin to quantify the known and uncover the unknown. Establishing a web scraping and text analytics process requires a little upfront effort, but once in place, these tools can be used to glean information about competitors, the general market, or the progression of emerging technologies. The resulting insights can inform strategic decisions and allow organizations to continuously monitor market trends and stay ahead of competitors.