Starting text mining with R


Like a lot of Product Managers, I spend a lot of time in Excel alongside tools like Google Analytics. Probably like many people, I find Excel very frustrating. So, having been technical in a previous life, I decided to give R a try. What is R?

R is an open source programming language and software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

After taking a simple intro course, I started to look around for examples of doing more interesting things. From a work point of view, topic modelling seemed really interesting. Unfortunately, a lot of the example code was missing a large step (or two) needed to make it useful!

So I forked the most complete example that I could find on GitHub, called "Text-Mining". This did lead me down a bit of a rabbit hole, but eventually I had an R script that collected data and cleaned it before doing some analysis. Success! I'm almost a data scientist! ;-) The link has all the details of what I did to get started and some pointers on how you can experiment.
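
To give a flavour of the "clean it, then analyse it" step, here is a minimal sketch in base R. It is not the forked script: the three example sentences and the tiny stopword list are made up for illustration, and a simple word count stands in for proper topic modelling.

```r
# Hypothetical input: three short "documents" invented for this example
docs <- c(
  "The product team LOVES the new dashboard!",
  "Dashboard loading is slow; the team wants faster reports.",
  "New reports are great, but the dashboard is slow."
)

# Hypothetical stopword list (a real one would be much longer)
stopwords <- c("the", "is", "are", "but", "a", "an", "and")

# Clean: lower-case, replace punctuation with spaces, collapse whitespace
clean <- tolower(docs)
clean <- gsub("[[:punct:]]", " ", clean)
clean <- trimws(gsub("\\s+", " ", clean))

# Tokenise on spaces and drop stopwords and empty strings
tokens <- strsplit(clean, " ", fixed = TRUE)
tokens <- lapply(tokens, function(x) x[nzchar(x) & !(x %in% stopwords)])

# Analyse: count how often each remaining word appears across all documents
term_freq <- sort(table(unlist(tokens)), decreasing = TRUE)
print(head(term_freq, 5))
```

Real text mining packages do the same kind of thing with far more care, but working through the steps by hand is a good way to see what the cleaning is actually for.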

Looking around RStudio, the possibilities of what I could do with this were interesting, in particular the R Markdown files. The programmatic aspect makes automated report generation a lot nicer than in Excel. I think Excel, with its chart wizards, is good for one-off charting. For repeated processing of similar data, being able to copy, paste, and then tweak in a text editor is much quicker.

With that in mind, the next step was to look at more real-world problems and analysis that I wanted to do. So I hunted for some Google Analytics packages. The first I tried was RGoogleAnalytics. This had issues authenticating, and its guide was completely different from what the Google site showed. After a little more searching I found googleAnalyticsR, and after installing it I managed to get a session. It was at this point that I realised what a faff the Google Analytics API was!


After some work I did manage to get some data out of Google Analytics. The query used a dynamic, relative date range covering the past 200 days. I then grouped the sessions by day and turned the result into a time series, on which I could run the Holt-Winters method to get a prediction of how many sessions we would get in the next week.
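
The forecasting part of that pipeline can be sketched in base R without touching the API. This is an illustration under assumptions, not my actual script: the session counts are simulated (the `weekday_effect` numbers and the two-rows-per-day layout are invented), and a weekly seasonal period of 7 is assumed.

```r
set.seed(1)

# Relative 200-day window ending yesterday
end_date   <- Sys.Date() - 1
start_date <- end_date - 199
dates      <- seq(start_date, end_date, by = "day")

# Simulated raw data: two rows per day (say, two traffic channels),
# with a made-up weekly pattern plus noise
raw <- data.frame(date = rep(dates, times = 2))
weekday_effect <- c(-10, 40, 50, 50, 45, 30, 0)   # Sunday..Saturday, hypothetical
wday <- as.POSIXlt(raw$date)$wday + 1             # 1 = Sunday .. 7 = Saturday
raw$sessions <- round(100 + weekday_effect[wday] + rnorm(nrow(raw), sd = 5))

# Group the sessions by day
daily <- aggregate(sessions ~ date, data = raw, FUN = sum)

# Turn into a time series with a weekly season and fit Holt-Winters
sessions_ts <- ts(daily$sessions, frequency = 7)
fit <- HoltWinters(sessions_ts)

# Predict the next seven days
next_week <- predict(fit, n.ahead = 7)
print(round(next_week))
```

With real data, `raw` would be replaced by whatever the Google Analytics package returns, and it is worth plotting `fit` before trusting the forecast.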

I also had some mind-bending sessions debugging Spanish code and variable names to update an interesting blog post on visualising Google Analytics data. I'm looking to see if the sample code is on GitHub anywhere so I can send a pull request with my updates. Otherwise I'll share my own adaptation with some useful R Markdown reports.

Getting stuck in with the "not quite working" Text Mining code was really useful for learning R. Even though it was a toy-scale problem, it gave me something practical to develop my skills on. I'm almost at the same level in R as I am in Excel now.
