The K-means Grouping Method as a Means to Control the Performance of the Production Process

The paper presents a concept of using k-means clustering of objects to control the performance of a production process that runs under variable conditions. The distribution of process performance in production cycles grouped by similarity is the basis for controlling the performance of subsequent production cycles. The practical part of the paper contains an example of calculations carried out according to this concept using the VBA and R languages; it relates to the bolting process in underground mines.


Introduction
Planning of mining production is an extremely complex process. The specific nature of mining production means that not only performance indicators related to the production process or economic factors must be taken into account, but also environmental factors [14]. At the same time, it is the environment, among other things, that ensures that the conditions of the production process are never fully repeatable [1]. Technical conditions related to the equipment, environmental conditions related to the strength of the rock mass and conditions related to the employees' competences may vary, and there may be many other differences, including some that have not yet been identified. Therefore, in addition to standard analyses, attempts can be made to analyse the production process itself and, based on its course, to predict the effects achieved.
Quantitative methods play an increasingly important role in the effective management of a mining company. The analysis of large amounts of data [2] and the modeling and simulation of production processes in mining [9,10] are becoming the basis for rational decision-making, as in the works [4,5,6]. The widespread availability and ease of such calculations result from the enormous development of computer methods of data processing and the associated development of calculation software.
The calculation part of the article presents the concept of using the clustering method for ongoing control of the performance of implemented processes based on the previously collected data.
The adopted concept assumes that an appropriate partition into groups of previously implemented production cycles makes it possible to show differences in the production results achieved in them. In addition, it assumes that an object assigned to a group of similar objects will achieve similar values of production performance indicators to other objects from the same group.
This requires prior clustering of objects into groups that differ from one another, each containing objects similar to each other. The scope of use of taxonomic methods, the possibilities of their application, and the ordering and grouping procedures, selection of representatives, aggregate variable structures, etc., used in taxonomic calculations are widely discussed in the literature; this paper is limited to those that are necessary to present the proposed concept.

Methods and tools used

The k-means method
The adopted concept assumes the clustering of data stored in the computer system. In the numerical example presented later in the article, k-means clustering was used [7]. In this method a predetermined number of clusters is given a priori, and the cluster centers (centroids) are then determined. Initial centroids can be chosen arbitrarily or randomly. In many cases the algorithm is run repeatedly, the results are compared, and the best model is selected. In the next step, the distance of each object to be clustered from the centroids is calculated. Distances can be calculated using various metrics (Manhattan, Euclidean, Chebyshev, etc.), most of which derive from the Minkowski metric (formula 1):

$$d_{ik} = \left( \sum_{j=1}^{m} \left| x_{ij} - x_{kj} \right|^{p} \right)^{1/p} \tag{1}$$

where:
x_{ij}, x_{kj} - values of the j-th identifier for the i-th and k-th entity (i, k = 1, ..., n),
n - number of entities,
m - number of identifiers,
p - natural number.

In practice, the following are used: the Manhattan (city-block) metric (p = 1), the Euclidean metric (p = 2) and the Chebyshev metric (p → ∞). In the latter case, the metric reduces to the formula:

$$d_{ik} = \max_{1 \le j \le m} \left| x_{ij} - x_{kj} \right| \tag{2}$$

The next step of the algorithm is to assign an entity to the cluster whose centroid is nearest to it. Assigning an entity to a cluster changes the position of the cluster's centroid, and calculating its new position is the next step of the algorithm. The above steps are then repeated for the subsequent entities to be clustered.

Submission date: 19-12-2019 | Review date: 28-01-2020
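The distance calculation and the assignment step described above can be sketched in R as follows. This is a minimal illustration, not the code used in the study: the function names minkowski, nearest_centroid and update_centroid, as well as the sample data, are assumptions introduced here.

```r
# Minkowski distance between two feature vectors x and y.
# p = 1: Manhattan, p = 2: Euclidean, p = Inf: Chebyshev.
minkowski <- function(x, y, p = 2) {
  d <- abs(x - y)
  if (is.infinite(p)) max(d) else sum(d^p)^(1 / p)
}

# Index of the centroid (row of `centroids`) nearest to entity x.
nearest_centroid <- function(x, centroids, p = 2) {
  which.min(apply(centroids, 1, minkowski, y = x, p = p))
}

# Sequential update: after adding x to a cluster that already holds
# n entities, the centroid moves to the new mean of the cluster.
update_centroid <- function(centroid, n, x) {
  (centroid * n + x) / (n + 1)
}

# Illustrative data (not taken from the paper).
centroids <- rbind(c(0, 0), c(10, 10))
x <- c(3, 4)

minkowski(x, centroids[1, ], p = 1)    # 7
minkowski(x, centroids[1, ], p = 2)    # 5
minkowski(x, centroids[1, ], p = Inf)  # 4

k <- nearest_centroid(x, centroids)                            # 1
new_centroid <- update_centroid(centroids[k, ], n = 1, x = x)  # 1.5 2.0
```

The three calls to minkowski show how the same pair of points yields different distances depending on p, which is why the choice of metric can change the cluster to which an entity is assigned.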

IT tools
The calculations were made using the R and VBA languages. R is a programming language and environment dedicated mainly to statistical calculations and data mining [3]. It is distributed under the GNU General Public License, so it is free of charge. Furthermore, as open-source software it is constantly being expanded with new libraries of functions available on the Internet in the form of packages. Many users of the language claim that it can be used to write any program, but its biggest advantage for analysts is its support for tasks such as modeling, testing, classification, clustering and time series analysis. Its ability to produce a wide range of high-quality charts is also worth emphasizing [11].
R allows organizing data into various structures, the simplest of which is a vector, i.e. an ordered sequence of values. There are no simpler structures in the R language; a single number is therefore a vector of length 1 [13]. Vectors can be combined into lists or matrices, matrices can be transformed into data frames, and for each of these structures appropriate operators and functions are provided for performing calculations on them. It is also important that the language is constantly evolving. New function libraries are continually published in the open, free package repository CRAN (Comprehensive R Archive Network), a collection of documented function libraries created by users from around the world.
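The data structures mentioned above can be illustrated with a few lines of base R. The variable names and values are arbitrary and serve only as an example.

```r
# A single number is a vector of length 1.
x <- 5
is.vector(x)  # TRUE
length(x)     # 1

# Two vectors of equal length combined into a matrix (3 rows, 2 columns).
durations <- c(12, 30, 45)
counts <- c(1, 2, 3)
m <- cbind(durations, counts)
dim(m)        # 3 2

# The matrix converted into a data frame, and a list holding mixed objects.
df <- as.data.frame(m)
lst <- list(values = durations, label = "cycle")

mean(df$durations)  # 29
```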
The Visual Basic for Applications (VBA) language is a dialect of Visual Basic built into Microsoft Office applications. The success of this language is due to its simplicity, transparency and flexibility, combined with a wide range of possibilities. It is therefore used by people with high programming skills as well as by beginners. Autodesk, a manufacturer of software supporting engineering design, has also noticed the advantages of this language. For its most important product, AutoCAD, from the 2015 version it provides a module that allows Visual Basic to be used with this program [12].
Programming in Visual Basic is event-driven. This means that the program code is called as a result of an event, e.g. pressing a key, clicking a button or selecting from a list. Such an event triggers the procedure that handles it, after which the program returns to waiting for the next initiating event [15]. It is a structured language with sequential processing, whose constructs include constants, variables, arrays, functions, subroutines, conditional statements and loops. After the relevant libraries are attached, Visual Basic can connect to Microsoft SQL Server databases and query them. The responses received from the database are then processed accordingly.

Example of calculation
As part of the research related to the implementation of the smartHUB project, data on the execution of the bolting process were analyzed as one of many tasks. The data were recorded automatically and contained the duration of individual activities related to the bolting process. In the analysed period, 71205 activities were registered within 854 cycles. A segment of the analysed data is shown in Fig. 1. As can be observed, the data contain the duration of the activities that make up the bolting process, and each record has been assigned a shift identification number and a production cycle identification number. Many cycles are carried out during one shift. The activities carried out within a cycle are "Hole Setup", "Transitional delay", "Drilling" and "Anchoring". Cycles are separated by activities that are not part of them (e.g. "Travelling") but are carried out as part of a production shift.

Searching for similarities
First, the repeatability of activities in the production cycle was analyzed. For this purpose, a script was created in VBA that assigned single-letter codes to the individual activities (Fig. 2). Subsequently, these codes were combined into sequences of letters, according to the order in which the activities were carried out within each cycle. The code of the script implementing this combination is shown in Fig. 3.
The result of the script is shown in Figure 4. It clearly shows that the letter arrangements are of different lengths and regularities are difficult to find. This confirms that the cycles are very different.
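The encoding step was implemented by the authors in VBA (Fig. 3). For readers working in R, the same idea can be sketched as below. The sample data, the column names cycle and activity, and the letter codes H, T, D and A are assumptions made for illustration; the actual mapping is the one shown in Fig. 2.

```r
# Hypothetical sample: activities of two cycles in order of execution.
acts <- data.frame(
  cycle = c("C1", "C1", "C1", "C1", "C2", "C2", "C2"),
  activity = c("Hole Setup", "Drilling", "Transitional delay", "Anchoring",
               "Drilling", "Drilling", "Anchoring"),
  stringsAsFactors = FALSE
)

# Assumed single-letter codes for the four activity types.
codes <- c("Hole Setup" = "H", "Transitional delay" = "T",
           "Drilling" = "D", "Anchoring" = "A")
acts$code <- unname(codes[acts$activity])

# Concatenate the codes within each cycle, keeping the original order.
pattern <- tapply(acts$code, acts$cycle, paste, collapse = "")
pat <- data.frame(cycle = names(pattern), pattern = as.vector(pattern),
                  stringsAsFactors = FALSE)

pat$pattern  # "HDTA" "DDA"
```

Each cycle is reduced to one string, so cycles with identical activity sequences can later be found by simple string comparison.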
In order to quantify the degree of repetition of activities within the cycles, identical activity sequences were grouped. This was accomplished by importing the codes into R and then counting the number of repeated sequences. The data were imported from an Excel file using the read.xlsx function from the "openxlsx" library:

    library(openxlsx)
    pat <- read.xlsx("patterns.xlsx", sheet = 1)

The object created in this way contains data in the same arrangement as in Fig. 4. The next step is to group the records with the same code sequence. In each group, the occurrences were counted, resulting in a summary of the number of repeated code sequences. The last step of these calculations is sorting the groups in order of decreasing size:

    library(dplyr)
    # count the cycles sharing each pattern, most frequent first
    patterns <- pat %>%
      group_by(pattern) %>%
      summarize(n = n()) %>%
      arrange(desc(n))
    View(patterns)

From the aggregation shown in Figure 5 it can be read that the most numerous groups contained 4 elements (this corresponds to approx. 0.5% of the cycles). This gives an idea of how much the sequences of activities in the cycles differ from each other.
Another attempt to look for similarities consists in analysing direct sequences of activities. The data were processed into a matrix in which the number of pairs of directly consecutive activities was compiled. As before, a corresponding VBA script was written for this purpose. Such an aggregation gives a more detailed picture of the analyzed cycles by identifying atypical cases, i.e. cases where an activity occurs after an activity that should not precede it. Table 1 shows the matrix in which the sequences of activities in all analysed cycles are listed. The rows of the matrix represent the preceding activities, while the columns contain the following activities. It can be read from Table 1 that the "DRILLING" activity was immediately followed by the "TRANSITIONAL_DELAY" activity in 4860 cases, by the "HOLE_SETUP" activity in 575 cases and by "ANCHORING" in 1257 cases, while in 13 cases it ended the cycle.
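The authors built this matrix with a VBA script. A comparable matrix for a single cycle can be obtained in R as sketched below. The sample sequence and the artificial "END" marker used to record the activity that closes the cycle are assumptions of this sketch.

```r
# Hypothetical activity sequence of one cycle, in order of execution.
seq_acts <- c("HOLE_SETUP", "DRILLING", "TRANSITIONAL_DELAY", "DRILLING",
              "ANCHORING", "HOLE_SETUP", "DRILLING", "ANCHORING")

lev <- c("HOLE_SETUP", "TRANSITIONAL_DELAY", "DRILLING", "ANCHORING")

# Pair each activity with its direct successor; the last activity is
# paired with the "END" marker.
from <- factor(seq_acts, levels = lev)
to <- factor(c(seq_acts[-1], "END"), levels = c(lev, "END"))

# Rows: preceding activity, columns: following activity.
trans <- table(from, to)

trans["DRILLING", "ANCHORING"]           # 2
trans["DRILLING", "TRANSITIONAL_DELAY"]  # 1
trans["ANCHORING", "END"]                # 1
```

Because the factor levels are fixed, the tables of individual cycles have identical dimensions and can be added together to obtain an aggregate for all cycles.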
As already mentioned, Table 1 summarises data from all analyzed cycles. For comparison, Tables 2 and 3 show aggregation from single cycles. It can be noted that they differ in their numerical values, but equally importantly, they differ in the places where the numbers occur. Unlike the cycle A46121946V, the cycle A4614246V started with the activity "HOLE_SETUP" and after the activity "ANCHORING" the activity "HOLE_SETUP" was repeated twice.
The scope of this article does not allow more differences between cycles to be presented; however, they can also be seen in Figure 6, which presents a segment of an aggregation of the activity sequences in subsequent cycles.
In the table whose segment is presented in Figure 6, the columns containing only residual amounts of data (e.g. "ANCHORING-DRILLING") were removed in comparison with Tables 1, 2 and 3. Instead, data on the number of "ANCHORING" activities were added, as well as the total duration of all activities in a cycle ([s]); the quotient of these two quantities was taken as the measure of the performance of individual cycles. For the sake of simplicity, the names of the columns with consecutive activities were shortened (e.g. "DRILLING-ANCHORING" became "DA").
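The performance measure can be illustrated with a short R sketch. The column names A and time_s, the sample values, and the direction of the quotient (anchoring activities per unit of time) are assumptions made for illustration; the text states only that the quotient of the two quantities was used.

```r
# Hypothetical cycle summary.
cycles <- data.frame(
  cycle = c("C1", "C2", "C3"),
  A = c(10, 8, 12),              # number of "ANCHORING" activities
  time_s = c(3600, 2400, 5400)   # total duration of the cycle [s]
)

# Assumed performance measure: anchoring activities per second,
# also expressed per hour for readability.
cycles$eff <- cycles$A / cycles$time_s
cycles$eff_per_hour <- cycles$eff * 3600

cycles
```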
In the data processed in this way, similarities can be observed due to the distribution of consecutive activities as well as their number. This data set was therefore adopted as material for clustering the cycles into different sets of similar cycles.

Data clustering
Data clustering was performed using the k-means method. For this purpose, the kmeans function provided by the R language was used. However, the data first had to be processed in order to eliminate outliers that could significantly affect the quality of the clustering. An example of such an outlier is a cycle lasting one second, which probably appeared in the data set due to an error.
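The text does not state how the limits for outliers were chosen. One common option is Tukey's rule based on the interquartile range, sketched below purely as an assumption; the sample values are hypothetical, and the names ll and ul follow the filter expression used in the calculations.

```r
# Hypothetical performance values; the last one mimics a recording error.
eff <- c(0.0030, 0.0028, 0.0033, 0.0031, 0.0029, 0.0032, 1.0000)

# Assumed rule: values further than 1.5 * IQR from the quartiles are outliers.
q <- unname(quantile(eff, c(0.25, 0.75)))
iqr <- q[2] - q[1]
ll <- q[1] - 1.5 * iqr   # lower limit
ul <- q[2] + 1.5 * iqr   # upper limit

eff_clean <- eff[eff >= ll & eff <= ul]
length(eff_clean)  # 6
```

Other rules (fixed thresholds, percentiles) fit the same pattern; only the way ll and ul are computed changes.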
The subsequent steps in R were as follows:
• loading the data from an Excel file into the object bmat (data shown in Figure 6),
• creating the object eff2, in which outliers are omitted (ll and ul denote the lower and upper limits of accepted performance),

    eff2 <- filter(eff2, eff >= ll, eff <= ul)

• determining the number of groups using the "elbow" method. In this method, the number of groups is plotted on one axis and the sum of squared distances of individual observations from their centroids on the other. The number of groups chosen is the one beyond which adding another group no longer brings a substantial reduction of this sum,
• creating the "elbow" chart.

    library(ggplot2)
    ggplot(data.frame(x = c(0:13), y = z), aes(y = y, x = x)) + geom_point()

Analyzing Chart 2, it was assumed that beyond the tenth point the decrease in the measure is insignificant, so a partition into 10 groups was adopted. As a result of clustering, information is obtained on which of the ten clusters each cycle has been assigned to. At the same time, the kmeans function returns the coordinates of the group centres. These are shown in Table 4. In the next stage of calculations, a column with the numbers of the clusters to which the cycles were assigned was added to the data on which the clustering was based. The eff3 object was created in this way.
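The steps listed above can be reproduced on synthetic data with base R alone. The sketch below is not the authors' script: the data matrix X, the range of k and the nstart value are assumptions, and the data are generated so that three groups are clearly present.

```r
set.seed(1)  # reproducible random starts

# Synthetic stand-in for the cycle data: three well-separated groups
# of 30 points in two dimensions (not the data from the paper).
X <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8 groups.
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)

# Values behind the "elbow" chart: the drop flattens after k = 3.
data.frame(k = ks, wss = round(wss, 1))

# Final clustering with the chosen number of groups.
fit <- kmeans(X, centers = 3, nstart = 20)
fit$centers         # coordinates of the group centres
table(fit$cluster)  # number of objects assigned to each group
```

Plotting wss against ks gives the "elbow" chart; the cluster numbers in fit$cluster can be appended to the data as an additional column, as was done for the eff3 object.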

Control of the performance achieved
The clusters determined above, together with the performance distributions observed in them, can be used for ongoing process control. Thanks to the calculations carried out, the performance in current cycles can be compared with the performance observed previously, while taking into account the type of production cycle. An example of this is given below.
The data describing the sequence of activities in the analyzed cycle are saved as a vector named newdata:

    newdata <- c(0, 10, 2, 0, 7, 0, 0, 17, 8, 10, 20)

Then the predict.kmeans function is called, which returns the number of the cluster to which the new cycle is assigned.
By filtering the data so that they contain only the cycles belonging to this cluster, the mean and median performance in this cluster can be determined and compared with the performance observed in the new cycle. This comparison makes it possible to assess how the observed performance ranks among the others in this group, and whether the cycle can be assumed to have achieved the expected result or whether actions should be taken to investigate the deviation.
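The control step can be sketched in R as follows. Base R's stats package does not supply a predict method for kmeans objects, so the sketch uses a hypothetical helper, assign_cluster, that picks the nearest centre by Euclidean distance; it does not reproduce the predict.kmeans function named above. The data, the column names and the new cycle are synthetic assumptions.

```r
# Hypothetical helper: index of the centre nearest to a new observation.
assign_cluster <- function(fit, newdata) {
  d2 <- rowSums(sweep(fit$centers, 2, newdata)^2)
  unname(which.min(d2))
}

set.seed(1)
# Synthetic historical cycles: two features and the observed performance.
past <- data.frame(
  DA  = c(rnorm(20, 5, 1),  rnorm(20, 15, 1)),
  DT  = c(rnorm(20, 20, 1), rnorm(20, 8, 1)),
  eff = c(rnorm(20, 10, 1), rnorm(20, 14, 1))
)
fit <- kmeans(past[, c("DA", "DT")], centers = 2, nstart = 10)
past$cluster <- fit$cluster

# New cycle: its feature values and its observed performance.
newdata <- c(DA = 14, DT = 9)
new_eff <- 12.5

k <- assign_cluster(fit, newdata)
ref <- past$eff[past$cluster == k]

# Reference statistics of the matching cluster.
c(mean = mean(ref), median = median(ref))

# Share of past cycles in this cluster that performed no better than the new one.
mean(ref <= new_eff)
```

A low share in the last line would indicate that the new cycle performed worse than most comparable cycles, which is the signal the control concept is meant to provide.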

Conclusions
The performance of the production process can depend on many factors. In cases where it is not difficult to determine the impact of external factors on the production process, determining its expected performance is a simple task. In the mining industry, constant repeatability of the production process practically does not occur. A number of groups of external factors (environment, equipment, people) cause its variability. The analysis presented in the article shows that, among the 854 analyzed production cycles, an identical sequence of activities occurred at most four times, so repeatability is essentially incidental. The variability of production cycles results in the variability of their performance. The proposed concept of process performance control takes this variability into account by comparing the performance of a cycle with the performance observed in the group of cycles similar to it.
At the same time, the k-means clustering method and the calculation process using VBA and R languages were presented in practice.