Multi-level and Relevance-Based Parallel Clustering of Massive Data Streams in Smart Manufacturing.

Information Sciences (2021)

Abstract

Parallel implementations of incremental clustering have been proposed to improve the performance of data stream processing in smart factories and to enable real-time anomaly detection, remote diagnosis, and condition-based monitoring of Cyber-Physical Systems. Incremental clustering algorithms iteratively extract and update over time clusters of data points (often called micro-clusters), whose maximum number is bounded. However, controlling the cost of the computational resources used on the distributed architecture remains a challenge for the sustainable processing of massive data streams. In this paper, we present a multi-level parallelization approach for clustering massive data streams, based on a horizontal scaling platform for Big Data processing. In particular, the following levels are considered: (i) a first parallelization level is based on a multi-dimensional model with exploration facets, used to perform a first, coarse-grained partition of data streams according to a divide-and-conquer strategy; (ii) a second parallelization level is based on a buffering mechanism that splits the data stream into portions of data points, which are processed in parallel; (iii) a third parallelization level is defined over the set of micro-clusters that are generated and change over time. The approach is conceived for anomaly detection in smart manufacturing, where the concept of data relevance, defined in terms of distance from critical conditions of the monitored systems, is used to force stronger parallelization (and therefore higher resource usage) only when necessary, that is, when approaching critical conditions. The scalability and efficiency of the approach are evaluated using a real dataset in a smart factory scenario. In particular, experiments demonstrated that when the maximum number of allowed micro-clusters decreases and the buffer size increases, parallelization based on buffering does not ensure good scalability. Additionally, as the number of features (that is, the complexity of the data stream) increases, parallelization based on buffering may present scalability issues. These findings highlight the advantage of tuning the different parallelization levels according to the approach proposed in this paper. © 2021 Elsevier Inc. All rights reserved.
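
To make the first two parallelization levels and the role of data relevance more concrete, the following is a minimal Python sketch. It is not the authors' implementation and does not use Apache Spark. The function names, the `facet_key` field, the linear relevance formula, and the `scale` and `max_workers` parameters are all assumptions introduced for illustration; the abstract only states that relevance is defined in terms of distance from critical conditions. The third level (parallelization over micro-clusters) is not shown.

```python
import math
from collections import defaultdict
from itertools import islice


def partition_by_facet(points, facet_key):
    """Level 1 (sketch): coarse-grained split of the stream by one facet value."""
    groups = defaultdict(list)
    for point in points:
        groups[point[facet_key]].append(point)
    return dict(groups)


def buffer_chunks(points, buffer_size):
    """Level 2 (sketch): cut a (sub)stream into buffers of at most buffer_size points."""
    iterator = iter(points)
    while True:
        chunk = list(islice(iterator, buffer_size))
        if not chunk:
            return
        yield chunk


def relevance(value, critical_value, scale):
    """Hypothetical relevance in [0, 1]: 1 at the critical value, decreasing
    linearly with distance and reaching 0 at a distance of `scale` or more."""
    return max(0.0, 1.0 - abs(critical_value - value) / scale)


def parallelism_for(rel, max_workers):
    """Map relevance to a degree of parallelism between 1 and max_workers
    (an assumed policy: more workers only when close to critical conditions)."""
    return max(1, math.ceil(rel * max_workers))
```

For example, with an assumed critical temperature of 100 and `scale=20`, a reading of 60 gives relevance 0 and a single worker, while a reading of 100 gives relevance 1 and all `max_workers`. In a real deployment, each buffer would be handed to a distributed engine rather than processed in a local loop.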

Keywords

Data stream, Parallel clustering, Big data, Apache Spark, Anomaly detection
