The Analytical Dream: Combining Dashboard and Data Stories
Experimenting with LLMs: Making Dashboards Tell a Story
Broadly, there are two ways analytics professionals share their work: dashboards and data stories.
Dashboards require end users to be somewhat domain- and data-savvy to interpret the displayed plots. Data stories, on the other hand, are created directly by the analytics professional and shared through a PPT or a document.
I have personally found data stories to be more engaging and effective than dashboards, but they require more manual work. Every time you want to create a data story, you need to build a hypothesis, conduct analysis, and then share the data signals. Often, dashboards are created from good data stories.
One of my first projects at my last company was to build a CXO dashboard. After almost five months, I was asked a question: “What does the change in this KPI mean for the business? Are we doing well or poorly?” This made me introspect: what if the management didn’t have to ask me (or anyone) this question? Could we build intelligence into the dashboard to highlight whether we are performing well or poorly? Essentially, can it tell a story?
This would require combining the best of both worlds. It should be able to tell a story (highlighting patterns/signals) and function as a dashboard (without needing manual intervention/overhaul every time).
So, I started experimenting with LLMs. Here are some of my notes and learnings:
In-context learning
Prompting becomes crucial for LLMs to perform well. There are two pillars to providing good prompts:
Clear context setting: It is important to have no ambiguity in the prompt. It should clearly state what is good and what is bad for the business if we are tracking an important business metric. Only then will the model be able to capture the essence.
High information/word ratio: The count of words used to convey information is crucial. The goal should be to have high information per word. In most cases, the information fed will be constant, so reducing the word count is important.
Let’s consider this example:
#A Bad prompt:
Mean block hour for cabin crew in Jan 2020 was 98.35
Mean block hour for cabin crew in Feb 2020 was 78.67
Mean block hour for cabin crew in Mar 2020 was 98.89
Mean block hour for cabin crew in Apr 2020 was 77.65
Mean block hour for cabin crew in May 2020 was 84.43
This can be made better like this:
#A Better prompt:
Mean bh for CC in Jan 2020 was 98
Mean bh for CC in Feb 2020 was 79
Mean bh for CC in Mar 2020 was 99
Mean bh for CC in Apr 2020 was 78
Mean bh for CC in May 2020 was 84
bh is block hour, cc is cabin crew.
In both prompts, the information conveyed is identical, but we’ve used fewer words in the second instance, thereby reducing the token count.
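If you want to check the savings concretely, you can count tokens directly. Here is a minimal sketch using OpenAI's tiktoken library; the cl100k_base encoding is an assumption, so pick whichever encoding matches your model:

```python
import tiktoken

# Assumption: cl100k_base is the encoding used by the target model (e.g. GPT-4).
enc = tiktoken.get_encoding("cl100k_base")

bad_prompt = (
    "Mean block hour for cabin crew in Jan 2020 was 98.35\n"
    "Mean block hour for cabin crew in Feb 2020 was 78.67\n"
    # ... remaining months ...
)
better_prompt = (
    "Mean bh for CC in Jan 2020 was 98\n"
    "Mean bh for CC in Feb 2020 was 79\n"
    # ... remaining months ...
    "bh is block hour, cc is cabin crew.\n"
)

print("Bad prompt tokens:   ", len(enc.encode(bad_prompt)))
print("Better prompt tokens:", len(enc.encode(better_prompt)))
```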
No prompts
The idea here is to combine the essence of dashboards and data stories. We want the insights, but at the same time users shouldn't have to ask questions; they should receive the answers directly. One approach is to run the LLM ahead of time: generate the insights in a batch job and store them in the backend. Then, once the user selects a filter, the plot and the accompanying text are shown to them, and the business user knows what's happening. A rough sketch of this flow is shown below.
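Here is a minimal sketch of that pre-compute-and-cache flow, assuming the OpenAI Python SDK; FILTERS, build_prompt, and the JSON-file cache are illustrative placeholders for your own filter values, prompt template, and backend store:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical dashboard filter values.
FILTERS = ["Cabin Crew", "Pilots", "Ground Staff"]

def build_prompt(filter_value: str) -> str:
    # In practice, pull the KPI series for this filter from your warehouse.
    return f"Mean bh for {filter_value} by month ... Can you highlight any trend?"

def generate_insight(filter_value: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(filter_value)}],
    )
    return response.choices[0].message.content

# Batch job: pre-compute insights and store them in the backend (here, a JSON file).
insights = {f: generate_insight(f) for f in FILTERS}
with open("insights_cache.json", "w") as fh:
    json.dump(insights, fh)

# Dashboard side: when the user picks a filter, just read the cached text.
with open("insights_cache.json") as fh:
    print(json.load(fh)["Cabin Crew"])
```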
For example, I generated a dummy prompt to see how GPT-4 performs. I purposefully introduced seasonal peaks in the prompt text to see how the model reacts. The answer it gave was quite good!
#Prompt Sample:
"Mean bh for CC in Jan 2020 was 98
Mean bh for CC in Feb 2020 was 79
Mean bh for CC in Mar 2020 was 99
Mean bh for CC in Apr 2020 was 78
Mean bh for CC in May 2020 was 84
...
....
...
bh means block hour, CC means Cabin Crew. Can you highlight any trend?
"
The answer:
High Variability: The block hours (bh) for Cabin Crew (CC) show significant variability month-to-month. For example, in 2020, the values range from a low of 55 in October to a high of 98 in January and March.
Seasonal Peaks: There are noticeable peaks at the beginning of each year. For instance, January 2020 and January 2021 both have high values (98 and 100, respectively). This could indicate a higher demand for cabin crew hours during these months, possibly due to increased travel during the holiday season and New Year.
Yearly Comparison: Comparing the same months across different years, there is a general increase in block hours from 2020 to 2021. For example, January 2020 had 98 hours, while January 2021 had 100 hours. Similarly, June 2020 had 58 hours, while June 2021 had 95 hours.
Wisdom of the crowd
When all is said and done, LLM outputs are non-deterministic and can be arbitrary; they will sometimes give inconsistent or plainly wrong answers. So it is important to verify the output. There are two ways I can think of to make the outputs more reliable.
The first one is simple: pass the output back to the model along with the original input, ask it via a prompt to check the answer, and only then proceed with it. A minimal sketch is below.
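This verification pass could look something like the following, again assuming the OpenAI Python SDK; the wording of the check prompt is an illustrative assumption:

```python
from openai import OpenAI

client = OpenAI()

def verify_insight(original_prompt: str, insight: str) -> str:
    # Ask the model to re-check the generated insight against the source data.
    check_prompt = (
        "Here is the original data:\n"
        f"{original_prompt}\n\n"
        "Here is an insight generated from it:\n"
        f"{insight}\n\n"
        "Check each claim against the data. Reply VALID if every claim is "
        "supported, otherwise reply INVALID and explain which claim is wrong."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": check_prompt}],
    )
    return response.choices[0].message.content
```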
The second one is a bit more delicate. It borrows the principle behind Random Forests (or the wisdom of the crowd): send the same prompt multiple times, calculate the similarity (or dissimilarity) between all the answers, and assign each answer a score based on that. The pairwise similarity can be computed with any simple NLP technique (a sketch follows after this discussion). From there, let's say we derived a similarity matrix like this:
Here, i1, i2, …, i5 are the iterations of the same prompt. As we can observe, the average similarity score is highest for i1. That tells us that, for the same prompt, i1's response was more similar to the other responses than any of the rest. So i1 is the safest bet, i.e., the "least random" one. i3, on the other hand, might be "the random" response that is totally different from the others.
Even though I have referred to it as "any simple NLP algorithm" here, it will likely need some fine-tuning before it gives the desired output.
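As one concrete (and assumed) choice of that algorithm, here is a minimal sketch that scores several responses to the same prompt by pairwise TF-IDF cosine similarity; the five responses are dummy placeholders standing in for i1–i5:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Dummy responses standing in for i1..i5 from the same prompt.
responses = [
    "Block hours peak in January and dip mid-year.",                    # i1
    "There is a January peak and a mid-year dip in block hours.",       # i2
    "Cabin crew headcount grew steadily over 2020.",                    # i3 (odd one out)
    "January shows the highest block hours; summer months are lower.",  # i4
    "Block hours are highest at the start of the year.",                # i5
]

tfidf = TfidfVectorizer().fit_transform(responses)
sim = cosine_similarity(tfidf)  # 5x5 similarity matrix

# Average each response's similarity to the *other* responses
# (subtract the self-similarity of 1 on the diagonal).
n = len(responses)
avg_sim = (sim.sum(axis=1) - 1) / (n - 1)

best = int(np.argmax(avg_sim))
print("Average similarity per iteration:", np.round(avg_sim, 2))
print(f"Safest bet: i{best + 1} -> {responses[best]}")
```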