Natural Language Processing: Keyword Strategy and Trend Analysis on the Cheap
In a previous post, I outlined a few options that developers can use to extract meaningful content from documents, along with some possible use cases. One of the tools I touched upon was Topia Term Extract, a Python utility that extracts key terms from a body of text.
Today, I’ll demonstrate how Topia can be used in conjunction with a few other Python utilities, namely BeautifulSoup 4 and Mechanize, to analyze key terms in an RSS feed or web site. This tutorial assumes that you are reasonably familiar with the command line and general web development.
How to Use
The script takes a list of URLs from a text file and extracts keywords from the text content. Here is an overview of the steps needed to use the script.
- Download the code from the git repository.
- Install all dependencies listed in the README file contained in the download.
- Create a text file (.txt) containing the URLs you would like to process, with one URL on each line.
- Run the Python script from the command line with the options you want.
$ path-to-script/pykw.py -i "file-with-urls.txt" -o "output-file.csv" -c "content region"
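For reference, the input file is just a plain-text list, one URL per line; the addresses below are placeholders:

http://example.com/feed/
http://example.org/rss.xml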
If you are processing an RSS feed, the -c argument will probably be "description", since that is where the bulk of the information in an RSS feed is contained. If you are processing a web page, you can use your browser’s developer tools to inspect the markup and find the element, such as a div class or id, that wraps the main content.
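Under the hood, the pipeline is simple enough to sketch in a few lines. The snippet below is a rough, hypothetical outline of the approach rather than the actual script from the repository: Mechanize fetches the page, BeautifulSoup isolates the content region, and Topia scores the terms. The function name and URL are placeholders.

import mechanize
from bs4 import BeautifulSoup
from topia.termextract import extract

def keywords_for(url, content_region):
    # Fetch the page or feed with Mechanize.
    browser = mechanize.Browser()
    browser.set_handle_robots(False)  # some sites disallow unknown bots via robots.txt
    html = browser.open(url).read()

    # Gather the text from every node matching the -c content region,
    # e.g. every <description> node in an RSS feed.
    soup = BeautifulSoup(html, "html.parser")
    text = " ".join(node.get_text() for node in soup.find_all(content_region))

    # Topia scores candidate terms and returns (term, occurrences, strength) tuples.
    extractor = extract.TermExtractor()
    return extractor(text)

for term, occurrences, strength in keywords_for("http://example.com/feed/", "description"):
    print("%s, %s, %s" % (term, occurrences, strength))

The actual script adds the file I/O, CSV output, and error handling on top of this core loop.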
Suggested Use Cases
The script can be useful for analyzing the content strategy of a web site, trends in the news, or in-demand skills from a job board.
It is particularly advantageous to use URLs of RSS feeds, since feeds follow a consistent format, which allows posts from more than one domain to be processed at a time. Every item in an RSS feed has a title and a description node, both rich with data ripe for harvesting.
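For illustration, a single item in a typical RSS 2.0 feed looks something like this; the title and description nodes carry the text worth mining:

<item>
  <title>Post title goes here</title>
  <link>http://example.com/post</link>
  <description>A summary or the full text of the post.</description>
</item>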
Another possible use case is to take the URLs from a sitemap and parse each page for content. The only limitation is that the pages need to share the same structure. This isn’t a problem if you’re trying to get all link text, but if you’re after just the main article text, it is best to process one domain at a time, since each domain likely uses different structural markup and there is no consistent format like there is with RSS.
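If you go the sitemap route, pulling a URL list out of a standard sitemap.xml takes only a few lines, since every URL in a sitemap sits in a loc node. The helper below is a hypothetical sketch using the same libraries; the sitemap URL is a placeholder.

import mechanize
from bs4 import BeautifulSoup

def urls_from_sitemap(sitemap_url):
    # Standard sitemaps wrap each page URL in a <loc> node.
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    xml = browser.open(sitemap_url).read()
    soup = BeautifulSoup(xml, "html.parser")
    return [loc.get_text().strip() for loc in soup.find_all("loc")]

# Write the list in the one-URL-per-line format the script expects.
with open("file-with-urls.txt", "w") as out:
    for url in urls_from_sitemap("http://example.com/sitemap.xml"):
        out.write(url + "\n")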
If you want to process multiple pages on the same domain, it is probably most helpful to take URLs from the sitemap, if one is available. If not, you can always run a report with a tool like Xenu’s Link Sleuth (Windows only) and build the list that way.
In short, natural language processing is powerful and the possibilities are endless. By building on open source technology, it is also possible to put these tools to work on the cheap.