TLDR: Anomaly detection over text data is important. Geneea offers it as a service. It is also available in Keboola Connection, for now as a beta version upon request.

Understanding a lot of text

Our clients get a lot of written feedback from their customers: support tickets, emails, or Facebook posts. For some of them, the number of these texts is so large that there is little chance that all of them will be read and handled by a human. And even if they are, it won’t be right away. Interpretor, our NLP engine, analyses these texts in an instant, extracting all the important information.

The results of our NLP analysis can be used in many ways: routing support tickets to the proper person based on their language and the nature of the problem, batching similar problems together for easier support, detecting angry posts, extracting the main themes to support product planning, etc. In this post, we want to focus on anomaly detection, i.e., detection of unusual documents (posts, emails, …) or, more generally, documents that should not be missed.

Southwest computer problem

We cannot share the data of our customers, but we can look at some publicly available data. For example, public Facebook posts. For this blog post we have analysed Facebook posts mentioning Southwest Airlines from 2016, about 35,000 posts in total. Looking at the July data, we can immediately see that something happened between July 20 and July 23.

Looking at the topic tags derived from our text analysis, we can guess what happened.

Inspecting a few posts, for example:

It’s okay SWA, computer glitches happen, I’m just proud to be a SWA A-list Preferred client.

and checking news sites confirms that a large-scale computer problem occurred at Southwest at that time.

So it seems that looking at a simple count of messages to spot anomalies, and then using text analytics to see what is going on, is enough.

However, there are several problems with this:

First, events worth our attention are not always as massive as in this case. Moreover, if you are Southwest, you probably know about a failed computer system without analysing Facebook posts. Often, however, anomalous texts will be buried under a heap of text describing business as usual. For example, a rude restaurant employee kicking out a breastfeeding mother might be an event worth our attention, but it might be witnessed and reported to support by only a handful of other guests.

Second, anomalies might have multiple facets.

And that’s why we need a more sophisticated approach than simply looking at counts of posts.

Anomaly detection, similarity and clustering

For numeric data (credit card transactions, temperature fluctuations, number of retweets, etc.), we can usually spot many problems and oddities by simply looking at their deviation from a long-term average. Obviously, we can do better with time discounting (treating recent events as more important than those in the more distant past), considering multiple variables together instead of separately, taking into account seasonality, etc. But in general, the basic approach still works.
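To make the numeric case concrete, here is a minimal sketch of deviation from a time-discounted average. It is an illustration only, not the method used in our service: the function name `ewma_anomalies`, the parameters `alpha`, `threshold` and `warmup`, and the daily counts are all invented for the example.

```python
import math

def ewma_anomalies(counts, alpha=0.3, threshold=3.0, warmup=5):
    """Return indices of values that deviate strongly from an
    exponentially weighted running mean (recent days weigh more)."""
    mean = float(counts[0])
    var = 0.0
    flagged = []
    for i, x in enumerate(counts[1:], start=1):
        std = math.sqrt(var)
        deviation = x - mean
        # Compare against the estimates built from earlier days only.
        if i >= warmup and std > 0 and abs(deviation) > threshold * std:
            flagged.append(i)
        # Discounted update: older observations fade by a factor (1 - alpha).
        mean += alpha * deviation
        var = (1 - alpha) * (var + alpha * deviation ** 2)
    return flagged

# Hypothetical daily post counts with one obvious spike.
daily_counts = [100, 104, 98, 101, 97, 103, 99, 102, 480, 100]
print(ewma_anomalies(daily_counts))
```

With these made-up counts, only the spike is flagged. Note that in this simple version the spike itself inflates the running variance, so the days right after it are judged more leniently; a more careful version might skip the update for flagged days.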

With textual data, it is more complicated. One cannot simply count the events, as most posts or emails occur just once. Instead, we focus on the meaning of the posts. The technical aspects of this are not important, but in short: we model meaning as a multidimensional space, each post being one point in this space. Points that are closer to each other correspond to posts that have a similar meaning and vice versa. This way we can see that computer glitch, computer problem and computer failure all have roughly the same meaning. Anomalies are then detected as unusual changes in this space.
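The idea of posts as points in a space can be sketched as follows. This is a deliberately crude stand-in, not our actual model: each post becomes a vector of plain word counts, closeness is cosine similarity, and a post is scored as novel when it is far from everything seen before. Word counts cannot tell that glitch and problem mean the same thing, which is exactly what a real meaning model adds. The names `vectorize`, `cosine` and `novelty`, as well as the sample posts, are invented for the example.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Map a post to a sparse vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def novelty(post, history):
    """1 minus the similarity to the closest historical post."""
    vec = vectorize(post)
    return 1.0 - max((cosine(vec, vectorize(h)) for h in history), default=0.0)

# Hypothetical history of ordinary posts and two new posts.
history = [
    "Thanks for the great flight to Denver",
    "Loved the crew on my flight today",
    "Great service on the flight home",
]
usual = "Great flight today, thanks to the crew"
unusual = "The website is down and I cannot check in"

for post in (usual, unusual):
    print(round(novelty(post, history), 2), post)
```

The second post scores much higher than the first because it shares almost no vocabulary with the history, even though one extra post would never show up in a daily count.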

Using this type of analysis, we see that not all of the July 21 posts are unusual.

Beyond that, we can see that the anomalous posts talk about two or three topics. Most of them simply report on the glitch, or complain about the delays and cancellations. But then there is a smaller group of posts complaining about Southwest’s handling of the problem, and finally, an even smaller group complaining about online checking (this might or might not be related to the main problem depending on how much the two systems are integrated).

Anomaly #1 – “Cancelled flights” (629 posts):

Problems persist today for Southwest Airlines – some flights from Bradley and LaGuardia were canceled this morning. (+4 similar)

My flight was cancelled and I need to get to Milwaukee, asap

Multiple delays, canceled flight, a connection delay ahead… All in one day. What’s next southwest?

Flight cancelled. New flight. Problem is I am in St. Louis.

Is flight 924 to Atlanta cancelled?

Anomaly #2 – “Hotel/voucher issues” (80 posts):

Southwest is the “States Worst Airline“! 49 minutes on hold with no answer. 7:30 flight shows departed when it never left. Cancelled at 10:20 and no flights available for two to three days. Sheriff was on site because no hotel rooms offered and tired customers were angry because of horrendous service and even worse communication!

I know things happen – I waited for 7 1/2 hrs to find out my connecting flight had already left and since Southwest was unable to put me in a hotel for the night and my flight was rescheduled to FRIDAY I now get to miss 2 days of work which means no pay. Thanks

Extremely disappointed with the way Southwest is handling this issue. No monetary reimbursement for hotel or rental car costs. Southwest vouchers are by no means an acceptable means of reimbursement.

Anomaly #3 – “Website issues” (18 posts):

I still am having trouble booking a flight on the website. Anyone else? (+2 similar)

I have been on hold for almost 2 hours waiting to speak to someone to help book my ticket! Website is having issues when I try to apply travel funds. 🙁

Hoping you guys extend the sale. I was on the website earlier today but decided to wait until tonight to book. Now the website is down. Really disappointed.
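Splitting anomalous posts into groups like the three above is a clustering problem. The sketch below shows one very simple way to do it, assuming the same crude word-count vectors as a stand-in for a real meaning model: each post joins the most similar existing group if it is similar enough, and otherwise starts a new one. It is not our production algorithm; `group_posts`, the `threshold` value and the shortened sample posts are assumptions made for the example.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Map a post to a sparse vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group_posts(posts, threshold=0.3):
    """Greedy single-pass grouping: each group keeps the summed word
    counts of its members and is compared to new posts by cosine."""
    groups = []  # list of (summed Counter, list of member posts)
    for post in posts:
        vec = vectorize(post)
        best = max(groups, key=lambda g: cosine(vec, g[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= threshold:
            best[0].update(vec)   # add the post's counts to the group vector
            best[1].append(post)
        else:
            groups.append((vec, [post]))
    return [members for _, members in groups]

# Shortened, illustrative posts in the spirit of the examples above.
posts = [
    "My flight was cancelled",
    "Is flight 924 to Atlanta cancelled?",
    "Now the website is down",
    "Website is having issues",
]
for members in group_posts(posts):
    print(len(members), members)
```

The result depends on the order of the posts and on the threshold, which is one reason real systems tend to use more robust clustering methods.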

Under the radar anomalies

In some cases, important posts might be even harder to detect, because no obvious numerical variables are out of their normal range. For example, July 14 looks like an ordinary day. There were 135 posts, which is a little bit more than the average, but nothing unusual. However, when we run our analysis, we can see that 12 posts were identified as anomalous. In all of them, passengers complain about problems with online check-in.

For instance:

I cannot get a boarding pass for today or check in to my flight for tomorrow. Any update? (+1 similar)

I can’t log in to the website or mobile app to check in for my flight tomorrow. What is happening? Can you provide an update?

It’s really difficult to check in, go through security and board the plane when the app is down with an error!!!!

Anomaly detection: one-off, API, Keboola Connection

As mentioned above, we have done anomaly detection as part of one-off analyses for quite some time. What’s new? We are now offering it as an automated service as well, using a REST API, Amazon S3 or SFTP to pass data. Keboola Integration, which is currently in beta, is coming soon. There is one important difference from our standard NLP analysis: we need to store historical data, or at least the selected derived characteristics for each post or ticket, to be able to compare them with the current posts.