Natural Language Processing: Keyword Strategy and Trend Analysis on the Cheap

In a previous post, I outlined a few options developers can use to extract meaningful content from documents, along with a few possible use cases. One of the tools I touched upon was Topia Term Extract, a Python utility that can be used to extract key terms from web pages.

Today, I’ll demonstrate how Topia can be used in conjunction with a few other Python utilities, namely BeautifulSoup 4 and Mechanize, to analyze key terms in an RSS feed or web site. This tutorial assumes that you are reasonably familiar with the command line and general web development.

How to Use

The script takes a list of URLs from a text file and extracts keywords from the text content. Here is an overview of the steps needed to use the script.

  1. Download the code from the git repository.
  2. Install all dependencies listed in the README file contained in the download.
  3. Create a text file (.txt) containing the URLs you would like to process. This document should be formatted with one URL on each line.
  4. Run the Python script from the command line with the options you want.
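
To make the file format in step 3 concrete, here is a minimal sketch of how a one-URL-per-line list could be read in Python. The function name read_url_list is hypothetical and is not taken from the script itself, and skipping blank lines and lines starting with "#" is an assumption made for convenience.

```python
def read_url_list(lines):
    """Return the URLs found in an iterable of text lines, one URL per line."""
    urls = []
    for line in lines:
        url = line.strip()
        # Assumption: blank lines and lines starting with '#' are skipped.
        if url and not url.startswith("#"):
            urls.append(url)
    return urls


# In-memory stand-in for the contents of file-with-urls.txt.
sample_lines = [
    "http://example.com/feed.xml\n",
    "\n",
    "# a note to self\n",
    "  http://example.org/rss  \n",
]
print(read_url_list(sample_lines))
```

An open file handle can be passed in place of the list, since iterating over a file also yields one line at a time.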

Basic Usage

$ path-to-script/pykw.py -i "file-with-urls.txt" -o "output-file.csv" -c "content region"

If you are processing an RSS feed, the -c argument will probably be “description” since that is where the bulk of the information in an RSS feed is contained. If you are processing a web page, you can use a tool like

Suggested Use Cases

The script can be useful for analyzing the content strategy of a web site, trends in the news, or in-demand skills from a job board.

It is particularly advantageous to use URLs of RSS feeds, since feeds follow a consistent format, which allows posts from more than one domain to be processed in a single run. Every RSS item carries a title or description node (often both), and these are rich with text ripe for harvesting.
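
As a rough illustration of why the consistent feed format helps, the sketch below pulls the description text out of every item in a small, made-up RSS 2.0 feed using only the Python standard library. This is not how the script itself works: the sample feed and the function names are assumptions, and the simple word count is only a crude stand-in for the term extraction that Topia performs.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up feed used purely for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <description>Feed-level description</description>
    <item>
      <title>First post</title>
      <description>Python makes keyword extraction simple.</description>
    </item>
    <item>
      <title>Second post</title>
      <description>Keyword extraction helps with content strategy.</description>
    </item>
  </channel>
</rss>"""


def item_texts(feed_xml, node="description"):
    """Return the text of the chosen child node from every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [item.findtext(node, default="") for item in root.iter("item")]


def count_words(texts, min_length=4):
    """Count lowercase words of at least min_length letters.

    A naive frequency count, standing in for real term extraction.
    """
    words = re.findall(r"[a-z]+", " ".join(texts).lower())
    return Counter(word for word in words if len(word) >= min_length)


descriptions = item_texts(SAMPLE_FEED)
for word, count in count_words(descriptions).most_common(3):
    print(word, count)
```

Because only item nodes are searched, the feed-level description is ignored. Passing node="title" would collect the post titles instead.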

Another possible use case is to take the URLs from a sitemap and parse each page for content. The only limitation is that the pages need to have the same structure. This isn’t a problem if you’re trying to get all link text, but if you’re trying to get just the main article text, it would be best to process one domain at a time. This is because each domain likely uses different structural markup and there is no consistent format like there is with RSS.

If you want to process multiple pages on the same domain, it is probably most helpful to take the URLs from the sitemap if one is available. If not, you can always run a report using a software tool like Xenu’s Link Sleuth if you are on a PC and create a list that way.
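
If the site publishes an XML sitemap in the standard sitemaps.org format, turning it into a one-URL-per-line list takes only a few lines of Python. The following is a minimal sketch using the standard library; the sample sitemap and the function names are assumptions for illustration and are not part of the script.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap used purely for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about/</loc></url>
</urlset>"""


def sitemap_urls(sitemap_xml):
    """Return every <loc> value listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def to_url_file_text(urls):
    """Format URLs one per line, ready to be saved as a .txt input file."""
    return "\n".join(urls) + "\n"


print(to_url_file_text(sitemap_urls(SAMPLE_SITEMAP)), end="")
```

The printed text can be saved to a .txt file and used as the input list of URLs.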

Conclusion

In short, natural language processing is a powerful way to surface key terms and trends from web content. By using open source technology, it is also possible to use these tools on the cheap.

  • Filed under Development
  • By Ethan Gardner
  • Posted on 7th Jan 2013