One benefit of changing jobs quite a lot is that I have built up an increasingly wide network of people I like and have worked with previously.
A really nice thing about staying in contact with these people is that we are able to help each other out, sharing skills, jobs, jokes etc.
Recently a designer I used to work with asked whether somebody could help write a script to process the exported contents of his ‘question of the week’ Slack channel, which by default gets spat out as a folder filled with JSON files, keyed by date:
https://slack.com/intl/en-gb/help/articles/201658943-Export-your-workspace-data
https://slack.com/intl/en-gb/help/articles/220556107-How-to-read-Slack-data-exports
My response was rapid and decisive:
‘Data munging and a chance to use my favourite JavaScript runtime, Node.js. Sign me up!!!’
First, WTF is data munging?
Data munging, or wrangling, is the process of taking raw data in one form, and mapping it to another, more useful form (for whatever analysis you’re doing).
https://en.wikipedia.org/wiki/Data_wrangling
Personally, I find data wrangling/munging to be pretty enjoyable.
So, as London is currently practicing social distancing because of covid-19, and I have nothing better going on, I decided to spend my Saturday applying my amateur data munging skills to Slack’s data export files.
Steps for data munging
1) Figure out the structure of the data you are investigating. If it is not structured, you are going to have trouble telling a computer how to read it. This is your chance to be a detective. What are the rules of your data? How can you exploit them to categorise your data differently?
2) Import the data into a program, using a language and runtime which allows you to manipulate it in ways which are useful.
3) Do some stuff to the data to transform it into a format that is useful to you. Use programming to do this, you programming whizz you.
4) Output the newly manipulated data into a place where it can be further processed, or analysed.
In my case, the input data was in a series of JSON files, keyed by date (see below), and the output I ideally wanted was another JSON file with an array of questions, along with all of the responses to those questions.
Shiny tools!!!
Given that the data was in a JSON file, and I am primarily a JavaScript developer, I thought Node.js would be a good choice of tool. Why?
- It has loads of methods for interacting with file systems in an OS agnostic way.
- I already have some experience with it.
- It’s lightweight and I can get a script hacked together and running quickly. I once had to use C# to do some heavy JSON parsing and mapping and it was a big clunky Object Oriented nightmare. Granted I’m sure I was doing lots of things wrong but it was a huge ball-ache.
- From Wikipedia, I know that ‘Node.js is an open-source, cross-platform, JavaScript runtime environment that executes JavaScript code outside of a web browser. Node.js lets developers use JavaScript to write command line tools’.
- JavaScript all of the things.
So, Node.js is pretty much it for tools…
So, on to the data detective work. I knew I very likely needed to do a few things:
1) Tell the program where my import files are.
2) Gather all the data together, from all the different files, and organise it by date.
3) Identify all the questions.
4) Identify answers, and link them to the relevant question.
The first one was the easiest, so I started there:
Tell the program where my import files are
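// process.argv[2] is the first argument typed after the script name
const exportFolder = process.argv[2];
if (!exportFolder) {
  console.error(
    "You must provide a path to the slack export folder! (unzipped)"
  );
  // exit with a non-zero code so the shell knows something went wrong
  process.exit(1);
}
const filePath = `./${exportFolder}`;
console.log(
  `Let's have a look at \n${filePath}\nshall we.\nTry and find some tasty questions of the week...`
);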
To run my program, I have to tell it where the folder I’m importing from is. To do that I type this into a terminal:
node questions-of-the-week.js Triangles\ Slack\ export\ Jan\ 11\ 2017\ -\ Apr\ 3\ 2020
In this tasty little snippet, questions-of-the-week.js is the name of my script, and Triangles\ Slack\ export\ Jan\ 11\ 2017\ -\ Apr\ 3\ 2020 is the path to the (unzipped) export folder I’m importing from.
Those weird looking back slashes are ‘escape characters’, which are needed to type spaces into file names etc. when inputting them on the command line on Unix systems. My terminal emulator autocompletes this stuff. I think most do now… So hopefully you won’t have to worry too much about it.
This is also the reason that many programmers habitually name files with-hyphens-or_underscores_in_them.
But basically this command is saying:
‘Use node to run the program “questions-of-the-week.js”, and pass it this filename as an argument’
What are we to do with that file name though?
Node comes with a global object called process, which has a bunch of useful data and methods on it.
This means that in any Node program you can always do certain things, such as investigating arguments passed into the program, and terminating the program.
In the code sample above, we do both of those things.
For clarity, process.argv is an array of command line arguments passed to the program. In the case of the command we put into our terminal, it looks like this:
[
'/Users/roberttaylor/.nvm/versions/node/v12.16.1/bin/node',
'/Users/roberttaylor/slack-export-parser/questions-of-the-week.js',
'Triangles Slack export Jan 11 2017 - Apr 3 2020'
]
As you can see, the first two elements of the array are the location of the node binary, and the location of the file that contains our program. These will be present any time you run a node program in this way.
The third element of the array is the folder name that we passed in, and in our program we stick it in a variable called filePath.
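If you want to poke at this on its own, here is a minimal sketch of the same idea pulled out into a function. getExportPath is a name I have made up for illustration, it is not in the script, and it assumes the folder is always the first thing typed after the script name.

```javascript
// Hypothetical helper (not part of the original script).
// argv[0] is the node binary and argv[1] is the script file,
// so anything the user typed starts at index 2.
function getExportPath(argv) {
  const userArgs = argv.slice(2);
  return userArgs.length > 0 ? userArgs[0] : null;
}

const exportPath = getExportPath(process.argv);
if (exportPath === null) {
  console.log("No export folder given");
} else {
  console.log(`Export folder: ${exportPath}`);
}
```

Taking the array as a parameter, rather than reading process.argv inside the function, just makes it easier to try out with pretend arguments.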
WE HAVE SUCCEEDED IN OUR FIRST TASK. CELEBRATE THIS MINOR VICTORY
Now…
Gather all the data together, from all the different files, and organise it by date
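const fs = require("fs");
const path = require("path");

// each channel in the export gets its own folder
const slackExportFolders = fs.readdirSync(filePath);
const questionOfTheWeek = slackExportFolders.find(
  (f) => f === "question-of-the-week"
);
if (!questionOfTheWeek) {
  console.error("could not find a question-of-the-week folder");
  process.exit(1);
}
// resolve to an absolute path, because require would otherwise
// look relative to this script rather than the current directory
const channelFolder = path.resolve(filePath, questionOfTheWeek);
const jsons = fs.readdirSync(channelFolder);
let entries = [];
jsons.forEach((file) => {
  const jsonLocation = path.join(channelFolder, file);
  entries = [
    ...entries,
    // the date only lives in the file name, so copy it onto each entry
    ...require(jsonLocation).map((i) => ({ ...i, date: file.slice(0, -5) })),
  ];
});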
The Slack channel I am looking at munging is the ‘question of the week’ channel.
When this is exported, it gets exported to a ‘question-of-the-week’ folder.
So first of all I check that there is a question-of-the-week folder. If there is not, I log an error to the console and exit the program.
If the program can find it, then it gets to work gathering all of the data together.
Here we start to see the benefit of using Node.js with JSON. We are writing JavaScript, to parse a file which uses a file format which originally came from JavaScript!
This means that pulling all of this data together is as simple as getting a list of file names with fs.readdirSync.
This gets all of the names of the files under the question-of-the-week folder in an array, which is, you know, pretty useful.
Once we have those file names, we iterate through them using forEach, and pull all of the data from each file into a big array called entries. We can use require to do this, which is very cool. Again, this is because Node and JavaScript like JSON, they like it very much.
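If using require for data files feels a bit too magic, the same load can be sketched with fs.readFileSync and JSON.parse. This is an alternative, not what my script does. The readJsonFile name and the throwaway demo file are invented for the example.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: read the file as text, then parse it as JSON.
function readJsonFile(location) {
  return JSON.parse(fs.readFileSync(location, "utf8"));
}

// Demo: a made-up file standing in for one day of a Slack export.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "slack-export-"));
const demoFile = path.join(demoDir, "2018-06-29.json");
fs.writeFileSync(demoFile, JSON.stringify([{ type: "message", text: "hello" }]));

const messages = readJsonFile(demoFile);
console.log(messages.length, messages[0].text);
```

The practical differences are small for a one-off script: require caches what it loads and resolves relative paths against the script file, while fs resolves them against the directory you ran the command from.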
We know we are likely to need the date that the slack data is associated with, but it is in the file name, not in the data itself.
To solve this, we take the file name and put it into a ‘date’ field, which we insert into each data item using map. The file.slice stuff just takes a file name like 2018-06-29.json and chops the end off it, so it becomes 2018-06-29, without the .json bit.
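Slicing five characters off the end works because every file ends in .json. A slightly more explicit way to say the same thing, sketched here as an alternative, is Node's path.basename, which can strip a named extension. dateFromFileName is a made-up helper name.

```javascript
const path = require("path");

// path.basename drops any directory part, and the optional second
// argument removes a matching extension from the end of the name.
function dateFromFileName(file) {
  return path.basename(file, ".json");
}

console.log(dateFromFileName("2018-06-29.json"));
```

For these file names it gives the same answer as file.slice(0, -5), it just reads a bit more like what it means.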
Coooool we done got some slack data by date. Munging step 2 complete.
Identify all the questions
This is trickier. We need our detective hats for this bit.
I won’t lie, I fucked around with this a lot, and I re-learnt something that I have learned previously, which is that it is really hard to take data that has been created by messy, illogical humans, and devise rules to figure out what is what.
What I ended up with is this. The process of figuring it out involved lots of trial and error, and I know for a fact that it misses a bunch of questions, and answers. However, it probably finds 80% to 90% of the data that is needed. This would take a human a long time to do, so is better than nothing. The remaining 10% to 20% would need to be mapped manually somehow.
const questions = entries
  .filter((e) => e.topic && e.topic.toLowerCase().includes("qotw"))
  .map((q) => ({
    date: q.date,
    question: q.text,
    reactions: q.reactions ? q.reactions.map((r) => r.name) : [],
  }));
‘qotw’ is ‘question of the week’ by the way, in case you missed it.
I find them by looking for Slack data entries that have a topic including ‘qotw’. I then map these entries so they just include the text and date, and I also pull in the reactions (thumbs up, emojis etc.) for the lols.
Now we have an array of questions with information about when they were asked. We’re getting somewhere.
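To see the filter and map doing their thing without a real export, here is a small sketch run against some entries I have made up. Only the field names (topic, text, reactions, date) come from the script. The sample values and the findQuestions wrapper are invented.

```javascript
// Made-up entries, shaped like the ones the script collects.
const sampleEntries = [
  {
    date: "2018-06-29",
    topic: "QOTW: tabs or spaces?",
    text: "set the channel topic: QOTW: tabs or spaces?",
    reactions: [{ name: "thumbsup" }],
  },
  { date: "2018-06-29", text: "Spaces, obviously." },
  {
    date: "2018-07-06",
    topic: "qotw: best biscuit?",
    text: "set the channel topic: qotw: best biscuit?",
  },
];

// Same filter and map as the script, wrapped in a function.
function findQuestions(entries) {
  return entries
    .filter((e) => e.topic && e.topic.toLowerCase().includes("qotw"))
    .map((q) => ({
      date: q.date,
      question: q.text,
      reactions: q.reactions ? q.reactions.map((r) => r.name) : [],
    }));
}

console.log(findQuestions(sampleEntries));
```

The toLowerCase call is what lets both ‘QOTW’ and ‘qotw’ match, and the entry with no topic at all is dropped by the first half of the filter.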
Identify answers, and link them to the relevant question
const questionsWithAnswers = questions.map((question, key) => {
  // Find the date of this question and the next one.
  // We use these to figure out which messages were sent after
  // a question was asked, and before the next one
  const questionDate = new Date(question.date);
  const nextQuestionDate = questions[key + 1]
    ? new Date(questions[key + 1].date)
    : new Date();
  return {
    ...question,
    responses: entries
      .filter(
        (e) =>
          new Date(e.date) > questionDate &&
          new Date(e.date) < nextQuestionDate &&
          e.type === "message" &&
          !e.subtype
      )
      .map((r) => ({
        answer: r.text,
        user: r.user_profile ? r.user_profile.name : undefined,
      })),
  };
});
// put them in a file. the null, 4 bit basically pretty prints the whole thing.
fs.writeFileSync(
  "questions-with-answers.json",
  JSON.stringify(questionsWithAnswers, null, 4)
);
console.log('questions with answers (hopefully...) saved to "questions-with-answers.json"');
This bit is a bit more complex… but it’s not doing anything non-standard from a JavaScript point of view.
Basically just search all the entries for messages which fall after a question being asked, and before the next one, and put them in an array of answers, with the user profile and the message text. Then save to a new JSON file and pretty print it.
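If the ‘null, 4’ bit looks like witchcraft, this little sketch shows the difference it makes. The sample data is made up.

```javascript
const sample = [{ question: "Tabs or spaces?", responses: [] }];

// No extra arguments: everything comes out on one line.
const compact = JSON.stringify(sample);

// The second argument is a 'replacer', which we are not using, hence null.
// The third argument is how many spaces to indent each level by.
const pretty = JSON.stringify(sample, null, 4);

console.log(compact);
console.log(pretty);
```

Both strings parse back to exactly the same data. The pretty one is just nicer for a human to open in an editor.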
We are done! We now have a new JSON file, with an array of questions, and all the answers to each question.
It is worth noting that this approach is far from optimal from an ‘algorithmic’ point of view, as I am repeatedly checking the entire data set.
Thing is, I don’t give a shit, because my dataset is small, and the program runs instantly as it is.
If it started to choke and that became a problem I would obviously improve this, but until that point, this code is simpler to understand and maintain.
More efficient algorithms normally mean nastier code for humans, and until it’s needed, as a nice developer you should prioritise humans over computers.
(sorry, computers)
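For the curious, here is a rough sketch of what a less wasteful version might look like: sort everything once, then walk through the messages a single time. This is not what my script does. The linkResponses name and the sample data are invented, and it assumes every date is a YYYY-MM-DD string so that plain string comparison puts them in the right order.

```javascript
// Compare two items by their "YYYY-MM-DD" date strings.
const byDate = (a, b) => (a.date < b.date ? -1 : a.date > b.date ? 1 : 0);

// Same rule as the script: a message belongs to a question if it is dated
// after that question and before the next one. The last question has no
// upper bound here, where the script uses today's date.
function linkResponses(questions, entries) {
  const result = [...questions].sort(byDate).map((q) => ({ ...q, responses: [] }));
  const messages = entries
    .filter((e) => e.type === "message" && !e.subtype)
    .sort(byDate);

  let qi = 0;
  for (const m of messages) {
    // Advance to the latest question dated strictly before this message.
    while (qi + 1 < result.length && result[qi + 1].date < m.date) qi++;
    const current = result[qi];
    const next = result[qi + 1];
    if (current && current.date < m.date && (!next || m.date < next.date)) {
      current.responses.push({ answer: m.text });
    }
  }
  return result;
}

// Made-up data to try it out.
const sampleQuestions = [
  { date: "2018-06-29", question: "Tabs or spaces?" },
  { date: "2018-07-06", question: "Best biscuit?" },
];
const sampleEntries = [
  { date: "2018-06-29", type: "message", subtype: "channel_topic", text: "topic set" },
  { date: "2018-06-30", type: "message", text: "Spaces" },
  { date: "2018-07-02", type: "message", text: "Tabs" },
  { date: "2018-07-06", type: "message", text: "Sent the day the next question landed" },
  { date: "2018-07-07", type: "message", text: "Hobnob" },
];

console.log(JSON.stringify(linkResponses(sampleQuestions, sampleEntries), null, 2));
```

Notice how much more there is to read and get wrong compared with a filter inside a map, which is rather the point.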
What did we learn?
Slack’s data is quite nicely structured, and is very parseable.
JavaScript is great for manipulating JSON data thanks to its plethora of array manipulation utilities.
You can write a script to automatically categorise Slack export data and put it into a semi-useful state with fewer than 80 lines of code, including witty console output and formatting to a nice narrow width.
This confirms my suspicion that for quick and dirty data munging, Node.js is a serious contender.
If paired with TypeScript and some nice types for your data models, it could be even nicer.
Here is the result of my labours https://github.com/robt1019/slack-export-parser/blob/master/questions-of-the-week.js