WhatsApp chat analytics: PowerShell Script

This is a post for developers or those who know PowerShell Scripting. To understand the context, please read these two posts… This article and the code is contributed by Raj Chaudhuri – my friend and Microsoft platform expert.

WhatsApp Chat Analysis using Microsoft Word
WhatsApp Chat Analysis using Microsoft Word – Part 2

image

In this article, I will explain a scripting based approach to parsing WhatsApp transcript.

Typical format is like this…

9:22pm, 23 Mar – Person1: How are you

9:23pm, 23 Mar – Person2: I am learning Word search. Do you know it?

9:23pm, 23 Mar – Person1: No. I don’t. You teach me.

I am willing to learn.

I will spend time and effort on it… if you teach me…

9:24pm, 23 Mar – Person1: Sure… I will teach you Smile

The problem

As described earlier, if you parse this using Excel or Word, the data would be like this.

image

We will need to delete rows 6 and 7. But that way, the chat content is lost.

Otherwise, we will have to merge the contents of A6 and A7 with the cell D5… which is also cumbersome – considering that there will be hundreds of such records in a large data set.

In short, it is a compromise.

If Power Query provides a robust regular expressions implementation, this problem can be solved. However, at the time of writing this article, Power Query (the M language) does not support regex.

The PowerShell Script

PowerShell is the simplest way available of writing powerful scripts, with full regular expression capability.

$c = Get-Content .\raj1.txt -Raw
$rx=”(\d{1,2}:\d{2}\w{2}),\s(\d{1,2}\s\w{1,3})\s-\s(\w*):\s”
$cstr =  $c -split $rx
$resarr = @()
for($i=1; $i -lt $res.Length; $i+=4) {
$n=new-object psobject -property @{Time=$res[$i]; Date=$res[$i+1]; Sender=$res[$i+2]; Message=$res[$i+3].Trim() }
$resarr += $n
}
$resarr | export-csv

I am not going to explain the code here. Those who understand PowerShell will know what his happening here easily.

Instructions

Copy the file and save it as a PS1 file. Change the path of the content file. Run the script.

You will see that the multi-row chat entry is properly parsed into a single record.

If you have any queries, do post them as comments.

Raj Chaudhuri

image

Raj is a close friend of mine. He works as an independent consultant, based in Bombay. His areas of expertise include BI, .NET programming and SharePoint platform design and development.

StarStarStar

3 thoughts on “WhatsApp chat analytics: PowerShell Script

  1. Okay, I’ll explain it then.

    The file raj1.txt contains an exported WhatsApp chat, in the format shown in the post. The first line of the script:
    $c = Get-Content .\raj1.txt -Raw
    reads the entire file, including line breaks, into a single string variable called $c. The real magic happens two lines down:
    $cstr = $c -split $rx
    The PowerShell -split operator splits a single string into an array, based on a delimiter. The magic comes from the fact that the delimiter can be a pattern, or a regular expression. In our case, the delimiter is contained in the variable $rx, defined one line above:
    $rx=”(\d{1,2}:\d{2}\w{2}),\s(\d{1,2}\s\w{1,3})\s-\s(\w*):\s”
    This is what it means: in the string being split, look for a pattern of 1 to 2 digits (\d{1,2}) followed by a colon (:) followed by two digits (\d{2}) followed by two letters (\w{2}) followed by a comma and a single space (,\s) followed by a pattern of 1 to 2 digits, a single space and 1 to 3 letters (\d{1,2}\s\w{1,3}) followed by a space, a dash and a space (\s-\s), followed by a pattern of any number of letters (\w*) followed by a colon and a space (:\s).

    So let’s look at the following lines, and split them using the regular expression above:
    9:22pm, 23 Mar – Person1: How are you

    9:23pm, 23 Mar – Person2: I am learning Word search. Do you know it?

    9:23pm, 23 Mar – Person1: No. I don’t. You teach me

    These will split as follows:
    9:22pm – because it matches (\d{1,2}:\d{2}\w{2})
    23 Mar – because it matches (\d{1,2}\s\w{1,3}) and follows the previous pattern and ,\s
    Person1 – because it matches (\w*) and follows the previous pattern and is followed by :
    How are you – because it is what is left over until the next (\d{1,2}:\d{2}\w{2})
    9:23pm – because it matches (\d{1,2}:\d{2}\w{2})
    and so on.

    The rest of the script reads the array four at a time, and produces a CSV file that looks this this:
    “9:22pm”, “23 Mar”, “Person1”, “How are you”
    “9:23pm”, “23 Mar”, “Person2”, “I am learning Word Search”
    etc. This CSV file can then be imported into Power Query to do the analysis.

    Actually, now that I think about this, a CSV file may not be a good idea, because the last field may have embedded line breaks. Doctor, can your Power Query import XML files?

  2. Pingback: Knowledge Pack: Data Analytics | Efficiency 365

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s