Sneak peek at Machine Learning Paradigm in Biomedical Signals-Part I

AKANKSHA PATHAK
3 min read · Mar 31, 2021

To a beginner in machine learning (ML), it might seem that the machine-learning framework is the same across all research problems. However, some aspects are unique to the medical domain, for example the limited amount of data, which forces us to take steps not usually followed in other domains. Data collection in the medical domain is a time-consuming, tedious task, and thus most studies are performed with subject counts below 100; even a count of 200 can be considered a mark of an abundant database. This would appear an insignificant size to those working on natural images or NLP. This article compiles a basic paradigm for the initial building blocks to follow with biosignals, for the ease of performing experiments. For detailed reading, refer to the book Biomedical Signal Analysis by Rangayyan.

  1. Preprocessing: The preprocessing of time-series data should be guided by the physiology of the body organ/source generating the signal. Low-pass, high-pass, or band-pass filtering can be applied, with the range of cut-off frequencies taken from the literature. For example, heart sounds have a spectrum extending up to 1000 Hz, the ECG signal is limited to about 100 Hz, and respiratory/lung sounds can spread up to 2 kHz. Further filtering of power-line interference (the 50 Hz or 60 Hz component, depending on the country/region of data acquisition) can be performed if required. It should be noted that the filtering mentioned above assumes the noise is additive in nature and that the spectra of noise and signal are limited in bandwidth. If the signal and noise are multiplicative or convolved, homomorphic filtering is performed (a logarithm operation makes signal and noise additive). Other filtering targets interferences/noise inherent in the signal itself: a lung sound can act as noise for a heart sound and vice versa, and eye movements can act as interference in EEG signals. Such noises, which are embedded in the signal and share an overlapping spectrum, can be removed using advanced signal-processing methods such as SVD, ICA, or EMD, to name a few. Normalization (min-max, z-score, or absolute-value) is the next step; it confines the signal values throughout the database within a range and removes variation due to a diverse set of subjects. A minimal preprocessing sketch in code is given after this list.
  2. Segmentation/Epoch formation: I would consider this a crucial step for the upcoming phase of feature extraction. It not only lets us increase the size of the database but also determines the basis of feature extraction. Suppose 1 min of ECG is available for each subject. If we decide to use the entire 1 min of ECG for feature extraction, we will have just one instance per subject. Moreover, if the subject count is low, the resulting dataset becomes difficult to model (imagine that with 50 subjects you have just 50 instances to train, test, and validate!). On the contrary, if sixty segments of 1 s each are used, the data size increases 60-fold for training and validation. Selecting small segments raises another question: what is the optimal size/duration of a fragment? Here, one can prefer to select the duration based on physiology, for example 1-2 consecutive heartbeats/cycles in ECG, heart sound, or lung sound. This ensures that we use consistent content across all subjects, and the strategy is followed more for signals whose periodicity varies among subjects. For stimulus-based studies, a constant duration can be used across the entire database, for example ±t sec of data around the point of stimulus. A segmentation sketch is given after this list.
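To make step 1 concrete, here is a minimal preprocessing sketch in Python using scipy. It is only one way to realize the pipeline described above; the sampling rate fs, the filter order, the band edges, and the function name preprocess are illustrative assumptions, not fixed prescriptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

def preprocess(x, fs=500.0, band=(0.5, 100.0), mains=50.0):
    """Band-limit the signal, notch out power-line interference,
    and z-score normalize it. All parameters are illustrative."""
    # Zero-phase Butterworth band-pass with literature-based cut-offs
    # (e.g., ~100 Hz upper limit for ECG, per the spectra quoted above).
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    x = filtfilt(b, a, x)
    # Notch at 50 Hz (or 60 Hz, depending on the acquisition region).
    b, a = iirnotch(mains, Q=30.0, fs=fs)
    x = filtfilt(b, a, x)
    # z-score normalization: confines values within a common range and
    # reduces inter-subject amplitude variation.
    return (x - np.mean(x)) / np.std(x)
```

Zero-phase filtering (filtfilt) is used so the filter does not distort the timing of morphological features such as the QRS complex.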
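And a segmentation sketch for step 2, under the same assumptions (Python/NumPy). Both the fixed window length and the naive R-peak detector below are illustrative; a real pipeline would use a dedicated QRS detector such as Pan-Tompkins for the physiology-based option.

```python
import numpy as np
from scipy.signal import find_peaks

def fixed_epochs(x, fs, win_sec=1.0):
    """Split a recording into non-overlapping fixed-duration epochs:
    60 s of signal with win_sec=1.0 yields 60 instances per subject."""
    n = int(win_sec * fs)
    return x[: (len(x) // n) * n].reshape(-1, n)

def beat_epochs(x, fs, pre=0.25, post=0.45):
    """Physiology-based alternative: one epoch around each detected
    R-peak, so every instance covers the same cardiac content even
    when heart rate varies among subjects."""
    peaks, _ = find_peaks(x, height=np.percentile(x, 99),
                          distance=int(0.4 * fs))
    a, b = int(pre * fs), int(post * fs)
    return np.stack([x[p - a:p + b] for p in peaks
                     if a <= p <= len(x) - b])

# 60 s of toy signal at 500 Hz -> (60, 500) array of 1-s epochs
epochs = fixed_epochs(np.random.randn(30000), fs=500)
print(epochs.shape)  # (60, 500)
```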

In the next article, we will follow up with the classification paradigm in detail. Stay tuned, and feel free to provide feedback in the comments.

AKANKSHA PATHAK

Researcher working in applied machine learning, deep learning, and biomedical signal processing.