Article Type: Research Paper
Date of acceptance: August 2024
Date of publication: August 2024
DOI: 10.5772/acrt.20240002
Copyright: © 2024 The Author(s). Licensee IntechOpen. License: CC BY 4.0
This research aims to enhance collaborative work systems between humans and robots by exploring “ensemble music.” In an ensemble, it is crucial to adhere to the score, synchronize with the breathing of fellow musicians, and ensure a harmonious performance; this represents one of the most intricate collaborative endeavors achievable by humans. In this study, an examination of various image processing methods for detecting a performer’s movements showed that skeleton detection using MediaPipe is the most appropriate in terms of both the amount of information obtained and the processing speed. Next, a deep neural network that takes the history of MediaPipe’s 3D skeletal coordinates as input was developed to detect the start and end points of a performance. A comprehensive examination of learning and estimation conditions via grid search revealed that the start and end points could be estimated with approximately 70% and 100% accuracy, respectively, when using a history of 10 points, the ReLU activation function, and the L-BFGS optimizer. Additionally, the estimation time was 10 ms or less when the hidden layers had 100 or fewer units. In future work, detection accuracy will be enhanced by incorporating additional training data and assigning greater weights to skeleton points that change significantly.
Keywords: ensemble, cooperative work, neural network, skeleton detection, pure chord
In recent years, research and development of robots that collaborate with humans has progressed against a background of labor shortages in various industries. Currently, collaborative robots are used mainly for machine tending, where compliance with various safety standards [2–4] reduces the need for safety fences [1]. In the coming years, the need for more sophisticated collaborative work systems is expected to rise in order to maintain productivity across sectors as the working-age population shrinks.
This study explores the concept of “ensemble” to enhance collaborative work systems. Here, “ensemble” refers not only to adhering to a predetermined score but also to attuning to each other’s breathing, which is a pinnacle of sophisticated collaboration among humans. By developing a collaborative system that combines human and machine capabilities in this way, we expect wider applications across multiple industries and the emergence of more sophisticated collaborative work systems.
Detecting a performer’s movements is also essential for improvisation support for music beginners [5, 6] and for learning support of performance movements [7]. Furthermore, in a device that controls volume and other functions with hand gestures while playing [8], such detection could allow control to be performed using only the movements that are already necessary for playing.
There are examples of research on systems in which robots play together based on synchronization signals, but these were not designed for performance with people [9, 10]. There is also research on real-time musical synchronization between a human musician and an accompaniment system [11], on an accompaniment system that detects cues from breathing [12], and on a collaboration that transcends time by reproducing the performances of past masters on an automatic piano [13, 14]. However, no attempt has been made to achieve dual coordination in both the temporal and the frequency dimensions, that is, matching the moment at which a human and a performance device start together while seamlessly playing a pure, harmonious chord without fluctuations [15]. So far, we have envisioned a “just intonation concert system” that plays in concert with a human player, prototyped its two main subsystems, the “player motion detection system” and the “real-time volume and pitch control system,” and conducted several operational experiments [16].
This paper describes a performance comparison of the various tracking methods used in the player motion detection system, together with experiments on detecting the start and end timing of a performance using MediaPipe [17] and a neural network. The present paper includes material previously reported in oral presentations [18, 19], but new experimental results have also been added.
OpenCV’s cascade classifier automatically detects human frontal faces and eyes in video frames by matching characteristic brightness patterns (Haar-like features). Obtaining the pixel coordinates of the detected frontal face and eyes makes it possible to use their change history for motion discrimination.
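As an illustration, a minimal frontal-face and eye detection loop with OpenCV’s cascade classifier might look like the following sketch; the cascade file names and camera index are standard OpenCV defaults rather than details taken from the experimental setup.

```python
# Hedged sketch: frontal-face and eye detection with OpenCV's cascade classifier.
# The cascade files and camera index are OpenCV defaults, not the paper's setup.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

cap = cv2.VideoCapture(0)                      # camera index 0 is an assumption
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        # The pixel coordinates (x, y, w, h) and the eye rectangles form the
        # change history that can be used for motion discrimination.
        eyes = eye_cascade.detectMultiScale(gray[y:y + h, x:x + w])
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == 27:            # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```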
The OpenCV tracker is a software tool that lets the user define a region of interest (ROI) in a video frame and then tracks regions with similar characteristics in subsequent frames. Its positional coordinates can be used for motion discrimination in the same way as with the cascade classifier. In this study, KCF and MIL were employed as tracking algorithms.
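A corresponding sketch for ROI tracking with KCF (MIL is analogous) is shown below; depending on the OpenCV build, the tracker constructors live either directly under cv2 or under cv2.legacy, so this is an illustrative sketch rather than the code used in the experiments.

```python
# Hedged sketch: ROI tracking with OpenCV's KCF tracker (MIL is analogous).
import cv2

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
roi = cv2.selectROI("select face", frame)      # user-defined region of interest

# Constructor location varies between OpenCV versions/builds.
tracker = (cv2.TrackerKCF_create() if hasattr(cv2, "TrackerKCF_create")
           else cv2.legacy.TrackerKCF_create())
tracker.init(frame, roi)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    found, box = tracker.update(frame)         # box = (x, y, w, h) in pixels
    if found:
        x, y, w, h = map(int, box)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(1) & 0xFF == 27:
        break
cap.release()
cv2.destroyAllWindows()
```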
MediaPipe is a video analysis library for the detection of various objects. MediaPipe Pose, a human skeleton detection model, can predict the coordinates of a total of 33 points on a human body in a video frame, including the nose, both eyes, both pupils, both eye corners, both ears, both shoulders, both elbows, both wrists, both little fingers, both index fingers, both thumbs, both hips, both knees, both ankles, both heels, and both feet.
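A minimal sketch of obtaining these 33 landmark coordinates with MediaPipe Pose follows; the confidence thresholds match the values used later in this study (0.5 each), while the camera index and loop structure are illustrative defaults.

```python
# Hedged sketch: skeletal detection with MediaPipe Pose (33 landmarks per frame).
import cv2
import mediapipe as mp

pose = mp.solutions.pose.Pose(min_detection_confidence=0.5,
                              min_tracking_confidence=0.5)
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks:
        # 33 landmarks, each with (x, y, z) coordinates.
        coords = [(lm.x, lm.y, lm.z) for lm in result.pose_landmarks.landmark]
        print(len(coords))                     # -> 33
cap.release()
```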
The processing speed and tracking stability were compared for frontal face detection using a cascade classifier, tracking of arbitrary objects (frontal faces of persons) using a tracker (MIL, KCF), and skeletal detection using MediaPipe. A Razer Blade Stealth 13 (CPU: Intel Core i7-1165G7, GPU: GeForce GTX 1650 Ti) was used as the analysis computer.
The camera images were acquired in real time at HD resolution (1280 × 720 px) with a frame rate of up to 30 fps. The programming language used was Python (version 3.8.12; development environment: Anaconda3 for Windows 64-bit). The processing speed was evaluated as the frame rate of the series of processes from reading an image frame from the camera through the tracking process. Tracking stability was evaluated by the presence or absence of tracking failures under the following six conditions (a)–(f):
(a) Facing the camera at a distance of approximately 0.5 m
(b) Face rotated by approximately 45° at a distance of approximately 0.5 m from the camera
(c) Facing the camera at a distance of approximately 0.3 m
(d) Facing the camera at a distance of approximately 0.7 m
(e) Half of the face covered with one hand while facing the camera
(f) The hand removed from the face
Figures 1 and 2 show the tracking status of each method and the time transition of the frame rate (processing speed) of each method, respectively.
Tracking by the cascade classifier accurately detected the frontal face and both eyes (three points in total) at a processing speed of around 10 fps regardless of the distance to the tracking target, but it could not detect oblique faces or a face half covered by one hand.
Tracking by MIL and KCF was possible at processing speeds of around 15 fps and 30 fps, respectively, but the ROI did not change size with the distance to the tracking target, which raises questions about the reliability of the obtained coordinates. In addition, tracking stability was found to be a problem, as the ROI drifted away from the face between conditions (e) and (f).
Table 1. Stability, frame rate, and tracking points.

Software | Frame rate [fps] | Tracking stability | Tracking points
---|---|---|---
Cascade | 10 | Low | 3
MIL | 15 | Middle | 1
KCF | 30 | Middle | 1
MediaPipe | 20 | High | 33
The processing speed of KCF sometimes reached 60 fps but at other times dropped to about 10 fps, so stable processing cannot be expected.
The tracking by MediaPipe was found to be stable at around 20 fps with no tracking failure in any of the cases (a)–(f).
Table 1 shows a comparative evaluation of the above results in terms of processing speed, tracking stability, and number of tracking points (the amount of information that can be acquired). The best balance was achieved by the tracking by MediaPipe. MediaPipe was subsequently adopted as the method for detecting the player’s movements from the camera images.
Figure 3 shows the block diagram of the system for detecting player motion. The audio is captured by the microphone attached to the recorder and then processed through a mixer and an A/D converter. The fundamental tone is extracted using an FFT on a PC. At the same time, the camera captures the player’s movements, including the cues indicating the start or end of the performance. The motion tracking data are analyzed on the PC, and machine learning is used to learn and estimate the hit points (the start and end points of the performance). As described in the previous section, Python was used as the programming language.
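For illustration, the fundamental-tone extraction step might look like the following sketch; the block size, sample rate, and the simple strongest-peak criterion are assumptions made here, not the paper’s exact signal chain.

```python
# Hedged sketch: fundamental-tone estimation of one audio block via FFT.
import numpy as np

def fundamental_hz(samples: np.ndarray, fs: int = 44100) -> float:
    """Return the frequency of the strongest spectral peak in one audio block."""
    windowed = samples * np.hanning(len(samples))   # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

# Example: a 440 Hz test tone in a 4096-sample block.
t = np.arange(4096) / 44100
print(fundamental_hz(np.sin(2 * np.pi * 440 * t)))  # prints a value close to 440
```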
We opted for MediaPipe as our tracking system because of its good frame rate, tracking stability, and large number of tracking points (33 in total). MediaPipe was installed with the pip command, without any special procedure, as described on the official site [17]. MediaPipe includes various functions; the one used in this study is the Pose landmark detection function, which detects the skeletal structure of the face, body, hands, and feet. The two parameters for skeletal detection were both set to 0.5: the minimum confidence score for pose detection to be considered successful and the minimum confidence score for pose tracking to be considered successful. The player motion detection method is shown in Figure 4. Time-series motion is detected with the multilayer perceptron (MLP) classifier of scikit-learn, a well-known machine learning library for Python. The aim is to identify the start of the performance (when the sound transitions from silence to an acoustic state), labeled 1, and the end of the performance (when the sound shifts from an acoustic state to silence), labeled 2, during training; all other instances are labeled 0. When features are input to the input layer, multiple neurons in the hidden layers process them, and the predictive discrimination result is produced at the output layer.
A flow diagram of the training process using the MLP classifier is shown in Figure 5. The 3D coordinates of the skeleton history within the range specified by the frame number are concatenated into a single feature vector and used as the input for training.
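A minimal sketch of this training flow is given below, assuming the feature vector is the flattened history of the 33 landmarks’ 3D coordinates over the chosen number of frames (for 10 frames, 33 × 3 × 10 = 990 features); the array names and dummy data are illustrative only, and the labels follow the scheme above (1 = start, 2 = end, 0 = otherwise).

```python
# Hedged sketch: build history feature vectors and fit scikit-learn's MLPClassifier.
import numpy as np
from sklearn.neural_network import MLPClassifier

N_FRAMES = 10                                  # history length (10, 20, or 30 in the paper)
N_LANDMARKS, N_DIMS = 33, 3

# landmarks: (n_video_frames, 33, 3) MediaPipe coordinates; labels: 0 / 1 / 2 targets.
landmarks = np.random.rand(300, N_LANDMARKS, N_DIMS)   # dummy data for illustration
labels = np.zeros(300, dtype=int)
labels[100], labels[200] = 1, 2                # dummy start and end cues

X, y = [], []
for i in range(N_FRAMES, len(landmarks)):
    window = landmarks[i - N_FRAMES:i]         # last N_FRAMES frames of coordinates
    X.append(window.reshape(-1))               # 33 * 3 * N_FRAMES features
    y.append(labels[i])

clf = MLPClassifier(hidden_layer_sizes=(100, 10), activation="relu",
                    solver="lbfgs", max_iter=1000)
clf.fit(np.array(X), np.array(y))
```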
In the past, only the prediction of the performance start point was reported [19], but this paper describes the results of the prediction of both the performance start and end points. The actions of the player at the start and end of the performance are shown in Figures 6 and 7, respectively. The cue at the start of the performance is almost exclusively a vertical movement, while the cue at the end of the performance includes an elliptical left–right movement. The results of skeletal detection during this operation are shown in a time series in Figure 8. The coordinate axes are positive in the
The parameters of the MLP classifier are examined through a grid search. Optimization methods (Adam, L-BFGS, and SGD) are considered along with activation functions (identity, logistic, ReLU, and tanh). In terms of features, the frame counts for the tracking coordinates are set to 10, 20, and 30 frames, with and without preprocessing (relative coordinate conversion). The coordinates of the player’s entire body are included as 3D coordinates of each joint, with the midpoint of the waist serving as the origin. While the previous report [19] dealt only with the performance start point, this paper deals with both the start point and the end point of the performance. Targets are labeled 1 for the moment when the player initiates the music, 2 for the moment when the player concludes it, and 0 for all other instances. The dataset consists of 20 sessions, with 10 allocated for training and the remaining 10 reserved for prediction. Timing detection is considered accurate when within ±0.5 s. A prediction is a true positive (TP) if the player’s action was predicted at a time when it should have been detected, a false positive (FP) if it was predicted when it should not have been detected, and a false negative (FN) if it was not predicted when it should have been detected. The following equations were used to evaluate accuracy and precision. True negatives (TN) were not used, because the long periods in which the player was either motionless or in continuous motion would make it unreasonable to include them in the calculations.
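The equations themselves are not reproduced here; with TN excluded, they presumably reduce to the standard forms Accuracy = TP / (TP + FP + FN) and Precision = TP / (TP + FP). A grid search of this kind can also be expressed compactly with scikit-learn; the sketch below is an illustrative reconstruction in which the parameter grid mirrors the options listed above, while GridSearchCV itself and the scoring choice are assumptions rather than the authors’ own search loop.

```python
# Hedged sketch: grid search over MLPClassifier settings with scikit-learn.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "hidden_layer_sizes": [(2, 10), (100, 10), (70, 100), (1000, 400)],
    "activation": ["identity", "logistic", "relu", "tanh"],
    "solver": ["adam", "lbfgs", "sgd"],
}
search = GridSearchCV(MLPClassifier(max_iter=1000), param_grid,
                      scoring="precision_macro")
# search.fit(X_train, y_train)   # X_train / y_train built as in the previous sketch
```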
Frame number | Coordinate | Hidden layer | Activation function | Optimizer | Start Acc. [%] | Start Prec. [%] | End Acc. [%] | End Prec. [%] | Predicted time [ms]
---|---|---|---|---|---|---|---|---|---
10 | Absolute | 2, 10 | identity | L-BFGS | 72.7 | 88.9 | 20.0 | 100.0 | 3.3
10 | Relative | 9, 10 | identity | L-BFGS | 71.4 | 71.4 | 0.0 | — | 4.0
10 | Relative | 100, 10 | ReLU | L-BFGS | 70.0 | 100.0 | 100.0 | 100.0 | 5.4
10 | Relative | 100, 90 | identity | L-BFGS | 66.7 | 80.0 | 0.0 | — | 5.2
10 | Relative | 5, 10 | identity | L-BFGS | 66.7 | 66.7 | 0.0 | — | 3.8
10 | Relative | 100, 60 | identity | L-BFGS | 64.3 | 69.2 | 0.0 | — | 4.8
10 | Relative | 20, 100 | identity | L-BFGS | 64.3 | 69.2 | 0.0 | — | 3.6
10 | Relative | 1000, 400 | tanh | L-BFGS | 64.3 | 69.2 | 0.0 | — | 41.7
10 | Relative | 1000, 400 | identity | L-BFGS | 61.5 | 72.7 | 0.0 | — | 29.2
10 | Relative | 70, 100 | ReLU | L-BFGS | 60.0 | 100.0 | 100.0 | 100.0 | 5.3
Frame number | Coordinate | Hidden layer | Activation function | Optimizer | Start Acc. [%] | Start Prec. [%] | End Acc. [%] | End Prec. [%] | Predicted time [ms]
---|---|---|---|---|---|---|---|---|---
10 | Relative | 100, 10 | ReLU | L-BFGS | 70.0 | 100.0 | 100.0 | 100.0 | 5.4
10 | Relative | 70, 100 | ReLU | L-BFGS | 60.0 | 100.0 | 100.0 | 100.0 | 5.3
10 | Relative | 100, 100 | ReLU | L-BFGS | 50.0 | 100.0 | 100.0 | 100.0 | 5.5
10 | Absolute | 200, 1000 | ReLU | L-BFGS | 36.4 | 80.0 | 100.0 | 100.0 | 19.8
10 | Absolute | 10, 3 | tanh | L-BFGS | 30.0 | 100.0 | 100.0 | 100.0 | 3.6
10 | Absolute | 100, 70 | ReLU | L-BFGS | 27.3 | 75.0 | 100.0 | 100.0 | 5.5
20 | Relative | 10, 8 | ReLU | L-BFGS | 21.4 | 42.9 | 100.0 | 100.0 | 5.8
20 | Relative | 100, 70 | ReLU | L-BFGS | 21.4 | 42.9 | 100.0 | 100.0 | 9.7
10 | Absolute | 60, 100 | ReLU | L-BFGS | 20.0 | 100.0 | 100.0 | 100.0 | 5.0
10 | Relative | 200, 1000 | ReLU | L-BFGS | 20.0 | 100.0 | 100.0 | 100.0 | 18.1
However, a conventional method for detecting changes is to differentiate the coordinates over time and detect changes against a fixed threshold. Figure 10 shows a time-differentiated graph of the skeletal coordinates output from MediaPipe for the prediction data. The horizontal axis represents time, the left vertical axis represents the time derivative of the 3D coordinate transition of each skeleton point from ID0 to ID22, and the right vertical axis shows the target label (0, 1, or 2), which indicates the start and end points of the performance. Although the derivative peaks in some areas near the target, it actually captures changes during the swinging motion prior to the hit point, and it is difficult to set a single threshold value. These results also demonstrate the effectiveness of the proposed method of learning the temporal transitions of skeletal coordinates.
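A minimal sketch of this conventional baseline, assuming a fixed threshold on the per-frame derivative of the landmark coordinates, is shown below; the threshold value is arbitrary and only for illustration.

```python
# Hedged sketch: threshold detection on the time derivative of skeletal coordinates.
import numpy as np

def threshold_detector(landmarks: np.ndarray, fps: float, thresh: float = 0.5):
    """landmarks: (n_frames, 33, 3) array; returns indices of frames flagged as cues."""
    velocity = np.diff(landmarks, axis=0) * fps        # per-second derivative
    peak = np.abs(velocity).max(axis=(1, 2))           # largest change in any joint/axis
    return np.nonzero(peak > thresh)[0] + 1            # +1: diff shortens the array by one

# As Figure 10 suggests, the preparatory swing before the cue also exceeds such a
# threshold, which is why a single fixed value is hard to choose.
```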
The following measures can be considered to improve the detection accuracy of the start time of a performance in the future. The first is the addition of training data.
Second, each joint coordinate should be weighted to increase the sensitivity to the vertical motion of the hands and face. Some joints do not fluctuate significantly in response to movement, whereas others fluctuate strongly. In this experiment, all joints were treated with the same weight; increasing the weights of the parts with large fluctuations is expected to improve the accuracy of motion detection.
The third is the addition of breath sounds and acceleration sensors. In the case of wind instruments, a breath sound is generated before the performance begins. If the resolution of the image is insufficient, accuracy can be improved by making the learning multimodal, adding the output of an acceleration sensor to the images used for training.
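As an illustration of the second measure, a possible realization of per-joint weighting is sketched below: the coordinate history of joints expected to move strongly is scaled before the features are fed to the classifier. The weight values and the choice of emphasized joints are assumptions for illustration.

```python
# Hedged sketch: per-joint weighting of the landmark history before classification.
import numpy as np

NOSE_ID, WRIST_IDS = 0, (15, 16)               # MediaPipe Pose landmark indices
weights = np.ones(33)
weights[list(WRIST_IDS)] = 3.0                 # emphasize hand motion (assumed weight)
weights[NOSE_ID] = 2.0                         # emphasize head motion (assumed weight)

def weighted_features(window: np.ndarray) -> np.ndarray:
    """window: (n_frames, 33, 3) slice of landmark history -> weighted feature vector."""
    return (window * weights[None, :, None]).reshape(-1)
```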
To achieve highly cooperative tasks between humans and robots in the future, this paper proposed a method for detecting actions related to musical performances. The following conclusions were drawn.
To track the performer’s movements, we evaluated various methods including 2D image processing and tracking methods such as cascade, MIL, and KCF as well as skeletal detection by MediaPipe. Our evaluation showed that skeletal detection by MediaPipe was the most appropriate method due to its stability and speed performance.
An algorithm was implemented to estimate the start and end points of the performer’s performance by learning and estimating the time transition of the skeletal coordinates detected by MediaPipe using a deep neural network.
An estimation experiment was conducted using 10 sets of training data and 10 sets of test data. The start point of the performance was estimated with an accuracy of approximately 70%, and the end point with an accuracy of 100%.
In the future, the estimation accuracy will be improved and the method will be integrated with the performance system.
This work was supported by JSPS Grants-in-Aid for Scientific Research, Grant Number JP19K04300.
The authors would like to thank alumnus Ryusuke Ishikawa for building the experimental apparatus and conducting the experiments.
The authors declare no conflict of interest.
© The Author(s) 2024. Licensee IntechOpen. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.