Use Clustering to Detect Time Series Anomalies

Bingblackbean
6 min read · Jan 21, 2022

Introduction to Time Series Anomaly Detection

Anomaly detection is a widely discussed topic in many fields. For time series, basic anomaly detection means finding the outliers in data observed over a period of time.

Some anomaly detection scenarios in specific fields are listed here:

  • Financial field: the flash crash of the US stock market
  • Operation & Maintenance field: Computer operating system monitoring and diagnosis
  • Industrial: Industrial Control System and Equipment Diagnostics
  • Internet: Abnormal user operation behavior
  • Medical: ECG diagnostics

Indeed, the definition of an outlier lacks a precise and unified standard. Outliers may be considered a small number of samples (well below 50% of the data) that differ from the “majority” of the data.

Alternatively, an outlier can be defined by a “distance” from some “standard baseline”: if the deviation from the “expected value” is large, the point is abnormal.

How exactly to define “majority samples”, “minority samples”, “standard baseline”, “distance”, “expected value”, etc. is what gives rise to the variety of (unsupervised) anomaly detection algorithms.

  • For univariate time series, you can use the “mean of historical data” as the standard baseline. If the distance between a point and the mean exceeds a predefined threshold, that point is regarded as an anomaly (a minimal sketch of this rule follows this list).
  • For multivariate data (multivariate time series), the dimensionality can be reduced first, and then the same baseline-and-threshold method still applies.
  • It is also possible to forecast the time series and compare the predicted value with the true value. If the deviation is greater than a certain threshold, an anomaly is detected.
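As an illustration of the first bullet, here is a minimal sketch of the mean-plus-threshold rule for a univariate series. The function name mean_threshold_anomalies and the choice of three standard deviations as the threshold are my own assumptions, not part of the original post:

import pandas as pd

def mean_threshold_anomalies(ts: pd.Series, n_std: float = 3.0) -> pd.Series:
    """Flag points whose distance from the historical mean exceeds n_std standard deviations."""
    baseline = ts.mean()          # "standard baseline": mean of historical data
    threshold = n_std * ts.std()  # predefined threshold, expressed here in standard deviations
    return (ts - baseline).abs() > threshold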

Today, I would like to share how to use the clustering method to detect anomalies.

K-Means and Anomaly Detection

K-Means is the best-known and most commonly used clustering algorithm. It clusters data by trying to divide the samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares. It has many advantages:

  • Simple, intuitive, and well established
  • Explainable

In the rest of this post, several ideas for anomaly detection with K-Means will be explored.

Dataset

Here the AI4I 2020 Predictive Maintenance Dataset will be used for exploration.

import pandas as pd

df = pd.read_csv(file)  # `file` is the path to the downloaded CSV
df.index = pd.date_range(start='2020-01-01', periods=df.shape[0], freq='1min')  # assign synthetic 1-minute timestamps
(Figure: dataset preview)

The dataset consists of 10 000 data points stored as rows with 14 features in columns.

  • UID: unique identifier ranging from 1 to 10000
  • product ID: consisting of a letter L, M, or H for low (50% of all products), medium (30%), and high (20%) as product quality variants and a variant-specific serial number
  • air temperature [K]: generated using a random walk process later normalized to a standard deviation of 2 K around 300 K
  • process temperature [K]: generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K.
  • rotational speed [rpm]: calculated from a power of 2860 W, overlaid with normally distributed noise
  • torque [Nm]: torque values are normally distributed around 40 Nm with a σ = 10 Nm and no negative values.
  • tool wear [min]: The quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process.
  • machine failure: a label that indicates whether the machine has failed at this particular data point because any of the dataset's failure modes is true.

It should be highlighted that, in a real scenario, we may not be able to obtain the actual failure labels.

In addition, the original dataset has no timestamps, so I assign them manually.

Option 1: Detect Minority Clusters

Since outliers are a minority, the minority clusters can be treated as outliers/anomalies.

We can cluster the samples and then define the minority class as the outlier class. Based on this idea, any clustering algorithm can be used for this purpose, such as K-Means, BayesianGaussianMixture, etc. (a sketch with BayesianGaussianMixture follows below).
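For example, here is a minimal sketch of the same minority-class idea using scikit-learn's BayesianGaussianMixture. The column selection df.iloc[:, 3:8] mirrors the K-Means code later in the post; the variable names and the choice of 5 components are my own assumptions:

import pandas as pd
from sklearn.mixture import BayesianGaussianMixture

# fit a mixture model and assign each sample to a component
bgm = BayesianGaussianMixture(n_components=5, random_state=0)
labels = pd.Series(bgm.fit_predict(df.iloc[:, 3:8]), index=df.index)

# the component with the fewest members is treated as the anomaly class
minor_component = labels.value_counts().idxmin()
anomaly_flag = labels == minor_component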

To make it easier to compare results, we first define a plotting function that overlays the predicted anomalies and the actual values on the series.

import plotly.graph_objects as go

def plot_anomaly(ts, anomaly_pred=None, anomaly_true=None, file_name='file'):
    """Plot a time series and overlay predicted and/or true anomalies."""
    fig = go.Figure()
    # the raw series as small black markers
    yhat = go.Scatter(
        x=ts.index,
        y=ts,
        mode='markers', name=ts.name, marker={'color': 'black', 'size': 2})
    fig.add_trace(yhat)
    if anomaly_pred is not None:
        # predicted anomalies as green circles
        status = go.Scatter(
            x=anomaly_pred.index,
            y=ts.loc[anomaly_pred.index],
            mode='markers', name=anomaly_pred.name,
            marker={'color': 'green', 'size': 10, 'symbol': 'circle', 'line_width': 2})
        fig.add_trace(status)
    if anomaly_true is not None:
        # true anomalies as red open stars
        status = go.Scatter(
            x=anomaly_true.index,
            y=ts.loc[anomaly_true.index],
            mode='markers', name=anomaly_true.name,
            marker={'color': 'red', 'size': 10, 'symbol': 'star-open', 'line_width': 2})
        fig.add_trace(status)
    fig.write_html(f"{file_name}.html")

We need to create and train a K-Means model; here scikit-learn is used.

from sklearn.cluster import KMeans

kmeans_dist = KMeans(n_clusters=5)  # cluster on the five sensor columns
df['cluster'] = kmeans_dist.fit_predict(df.iloc[:, 3:8])

The second step is to find the class with the smallest number of samples. Of course, you can also treat several of the smallest classes as anomaly classes.

minor_label = df['cluster'].value_counts().idxmin()
df['cluster_flag'] = (df['cluster'] == minor_label)

So far the anomalies have been predicted. We can plot the results and validate them with a confusion matrix as well (a sketch is shown below).
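A minimal sketch of this validation step, using the plot_anomaly helper defined above. It assumes the ground-truth label lives in the 'Machine failure' column and plots against 'Torque [Nm]'; the column choices, series names, and file name are my own assumptions:

from sklearn.metrics import confusion_matrix

# ground truth from the dataset and the prediction from the minority cluster
anomaly_true = df.loc[df['Machine failure'] == 1, 'Machine failure'].rename('true anomaly')
anomaly_pred = df.loc[df['cluster_flag'], 'cluster_flag'].rename('predicted anomaly')

# scatter plot of torque with both anomaly sets overlaid
plot_anomaly(df['Torque [Nm]'], anomaly_pred=anomaly_pred, anomaly_true=anomaly_true,
             file_name='option1_minority_cluster')

# confusion matrix of true vs. predicted anomaly flags
print(confusion_matrix(df['Machine failure'], df['cluster_flag'].astype(int)))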

(Figure: Option 1, predicted vs. true anomalies, scatter plot)
(Figure: Option 1, predicted vs. true anomalies, confusion matrix)

As the result shows, although some anomalies were identified, too many anomalies went undetected.

Option 2: Detect Anomalies from Each Cluster

The above idea may not be reasonable in a real case, because each cluster may represent a different condition, such as an operating condition.

So we have to think differently: within each cluster, if the distance between a sample and the cluster centroid is greater than a certain threshold, the sample may be an anomaly.

To be precise, we can cluster all the data into one class (yes, one class, which has an effect similar to dimensionality reduction) and then check the distance of each sample to the centroid.

Then we define a contamination fraction: e.g., if we estimate that there are around 400 anomalies in the dataset, the 400 points with the largest distance will be marked as anomalies.

contamination_fraction = 400 / 10000

So the second step of the option 1 method has to be modified as follows:

from sklearn.metrics import pairwise_distances

# distance of each sample to the cluster centroid(s); with a single cluster this is simply that distance
df['dist'] = pairwise_distances(df.iloc[:, 3:8], kmeans_dist.cluster_centers_).max(axis=1)
contamination_fraction = 400 / 10000
# flag the samples with the largest distances as anomalies
n_largest = df['dist'].nlargest(n=round(contamination_fraction * df.shape[0])).index
df['dist_flag'] = 0
df.loc[n_largest, 'dist_flag'] = 1

It can be seen that the detection results differ from option 1. If we change the number of clusters, the result can be improved further, but we won't cover the tuning part in this post.

Bonus & Further Improvement Hints

K-Means also has shortcomings that were not covered in the previous sections:

  • For example, it only fits numerical data; discrete (categorical) features need to be encoded first.
  • The value of K is an unstable hyperparameter, even though it can be selected with an elbow curve (a sketch follows this list).
  • Since the initialization of the clusters in K-Means is random, the predicted results are also random. In practice, the problem can be reduced by running the algorithm multiple times and aggregating the results.
  • K-Means assumes isotropic (equally scaled) features, so standardization is still necessary.
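A minimal sketch of the elbow-curve idea mentioned above, using the same sensor columns df.iloc[:, 3:8]; the range of K values, variable names, and output file name are my own choices:

import plotly.graph_objects as go
from sklearn.cluster import KMeans

# inertia (within-cluster sum of squares) for a range of K values
inertias = []
k_values = list(range(1, 11))
for k in k_values:
    inertias.append(KMeans(n_clusters=k, n_init=10).fit(df.iloc[:, 3:8]).inertia_)

# the "elbow" where the curve flattens suggests a reasonable K
fig = go.Figure(go.Scatter(x=k_values, y=inertias, mode='lines+markers'))
fig.update_layout(xaxis_title='number of clusters K', yaxis_title='inertia')
fig.write_html('elbow_curve.html')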

As an example, we improve the pipeline by adding standardization to eliminate the influence of different feature scales, and we can see that the result improves.

from sklearn.preprocessing import StandardScaler
X = pd.DataFrame(StandardScaler().fit_transform(df.iloc[:, 3:8]), index=df.index, columns=df.iloc[:, 3:8].columns)
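The standardized features can then be fed back through the option 2 pipeline. Here is a minimal sketch, reusing contamination_fraction and pairwise_distances from above; the names kmeans_std, dist_std, and dist_std_flag are my own:

# refit the same option-2 pipeline on the standardized features
kmeans_std = KMeans(n_clusters=5)
df['cluster_std'] = kmeans_std.fit_predict(X)
df['dist_std'] = pairwise_distances(X, kmeans_std.cluster_centers_).max(axis=1)
n_largest = df['dist_std'].nlargest(n=round(contamination_fraction * df.shape[0])).index
df['dist_std_flag'] = 0
df.loc[n_largest, 'dist_std_flag'] = 1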
(Figure: improved K-Means result after standardization)

Reference:

  1. dataset: https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset#
  2. K-Means tips:
