
KhaitReproduction

This repository contains the data from doi:10.5061/dryad.jwstqjqf7 and the code from the authors' repository:

  • `folderSummary.py` collates the data from the folder doi_10_5061_dryad_jwstqjqf7__v20230403/PlantSounds and calculates the general statistics that do not require audio processing, such as 'number of sounds'.
  • `helperFunctions.py` is a collection of helpers to:
    • Import from specific directories/wrap functions to import WAV files
    • Repair & pre-process the signals
    • Downsample or process for plotting/visualisations
  • `statFunctions.py` contains `calcE(X)` and similar functions to calculate the statistics of either a single WAV file or a number of them collected into a matrix
  • In `exploration.py`, whole-signal statistics (Energy, Entropy, Dominant Frequency, etc.) are calculated and pairwise plots are made. Quirks of the dataset are also explored, such as identifying recordings that are clipped or contain particularly high/low frequencies
  • [TODO: `svmTrain(X, y)` trains an SVM classifier on the (normalised) WAV data held in X and the labels held in y. It then prints a confusion matrix & plots a visualisation of the result. Use this function on 'basic', 'mfcc' and then 'scatNet' features of the data]
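The `svmTrain` TODO could be sketched with scikit-learn (an assumption on my part: the original template is not reproduced here, the RBF kernel and 70/30 split are my choices, and the plotting step is omitted):

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def svmTrain(X, y):
    """Fit an RBF-kernel SVM on feature matrix X (n_samples, n_features)
    and labels y, then print a confusion matrix on a held-out split.
    (Plotting the result is left out of this sketch.)"""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X_train, y_train)
    cm = confusion_matrix(y_test, clf.predict(X_test))
    print(cm)
    return clf, cm
```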

Summarising The Data

First, we loop through the treatment folders, calculating the following statistics:

  1. Number of recordings per treatment
  2. Number of plants per treatment
  3. Number of recordings per individual plant

While we're at it we will:

  1. Save the data to a .csv
  2. Create a histogram plot of the number of sounds per plant for each treatment
  3. Run a Wilcoxon Rank Sum test for both tomato and tobacco for the hypothesis that 'cut' and 'dry' plants produce the same number of sounds per plant per time period
  4. Run an alternative 'Poisson' model, a parametric alternative to the Wilcoxon test. [TODO: This didn't fit very well as $\sigma^2 \ne \mu$! However, there are alternatives to look into for both zero-inflated and overdispersed ($\sigma^2 \gg \mu$) situations]
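For item 4, a quick way to see the $\sigma^2 \gg \mu$ problem is the dispersion index (variance-to-mean ratio) of the per-plant counts; the counts below are made up for illustration:

```python
import numpy as np

def dispersion_index(counts):
    """Variance-to-mean ratio of per-plant sound counts: ~1 is consistent
    with a Poisson model, while >>1 indicates overdispersion (sigma^2 >> mu)."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

# Made-up, heavy-tailed counts: the kind of data where a Poisson fit fails
counts = [0, 0, 0, 1, 1, 2, 3, 5, 8, 40]
print(dispersion_index(counts))  # well above 1 -> overdispersed
```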

The filename format is `id_{plantID}_{sound}_{soundID}.WAV`. We will check whether the sound IDs are contiguous, then plot a histogram of the number of sounds for each plant in each treatment:

Produced by `folderSummary.py`

While the sound ID always starts from 1 and increases contiguously, the plant ID is not contiguous, as we can see by running `groupby('plant_id').size()` and `groupby('plant_id').max()` for each separate treatment. (See Table 1!)
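A minimal sketch of that check, using hypothetical filenames in the assumed `id_{plantID}_sound_{soundID}.wav` pattern (the field layout is my reading of the format, not confirmed):

```python
import re
import pandas as pd

# Hypothetical filenames following the assumed id_{plantID}_sound_{soundID}.wav pattern
names = ["id_1_sound_1.wav", "id_1_sound_2.wav", "id_1_sound_3.wav",
         "id_4_sound_1.wav", "id_4_sound_2.wav", "id_9_sound_1.wav"]

rows = [tuple(map(int, re.findall(r"\d+", n))) for n in names]
df = pd.DataFrame(rows, columns=["plant_id", "sound_id"])

per_plant = df.groupby("plant_id").size()              # sounds recorded per plant
max_sound = df.groupby("plant_id")["sound_id"].max()   # highest sound ID per plant
assert (per_plant == max_sound).all()                  # sound IDs are contiguous
print(per_plant.size, df["plant_id"].max())            # 3 plants seen, but IDs run up to 9
```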

Table 1:

| experiment | n_recordings_total | n_plants | avg_sounds_per_plant | max_plant_id | corrected_sounds_per_plant |
|---|---|---|---|---|---|
| Empty Pot | 1036 | 1 | 1036 | 0 | 0 |
| Greenhouse Noises | 1378 | 4 | 344 | 3 | 459 |
| Tobacco Cut | 506 | 28 | 18 | 173 | 2 |
| Tobacco Dry | 275 | 35 | 7 | 89 | 3 |
| Tomato Cut | 660 | 43 | 15 | 239 | 2 |
| Tomato Dry | 1622 | 64 | 25 | 119 | 13 |

This raises question number 1:

Q1: "Do the plants indicated by the missing IDs exist, and if so, have they emitted zero sounds? If so, how many plants were tested in each treatment? The tomato and tobacco plant IDs seem as though they might be on a continuous scale: 180 total tobacco plants (90 each of dry and cut) and 240 total tomato plants (120 each of cut and dry), which is a LOT of plants?!"

We would now like to analyse the data in terms of 'rates' of acoustic emissions. In the paper, a Wilcoxon Rank Sum Test is used, which is very easy to execute with scipy. First let's plot histograms to look at the distribution of 'number of acoustic emissions per plant.' Assume that there are plants for which no sounds were recorded, and that the group sizes are 120 for tomatoes, and 90 for tobacco, as mentioned before.

Produced by `folderSummary.py`

Looking at the x-axes, there is much greater spread for tomatoes in the 'Dry' treatment, a similar spread for both tobacco and tomato in the 'Cut' group, and the lowest spread in 'Tobacco Dry'. Initially I ran the Wilcoxon test on this data in its 'raw' form, getting p < 0.05 for tobacco but a very high p (p > 0.5?!!) for the tomatoes. HOWEVER, following Q1, once the data has been padded with zeroes to make up for 'silent' plants, we get a conflicting result: Tomato Cut vs. Dry gives p = 0.0138, and Tobacco Cut vs. Dry gives p = 0.611. This uses the 'auto' method of SciPy's `mannwhitneyu` function (the Mann-Whitney U test is equivalent to the Wilcoxon rank-sum test).
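The zero-padding and test might look like the following sketch (the counts and group sizes below are illustrative, not the real data):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def padded_test(cut_counts, dry_counts, n_cut, n_dry):
    """Wilcoxon rank-sum (Mann-Whitney U) test on per-plant sound counts,
    after padding each group with zeros up to its assumed full group size."""
    cut = np.concatenate([cut_counts, np.zeros(n_cut - len(cut_counts))])
    dry = np.concatenate([dry_counts, np.zeros(n_dry - len(dry_counts))])
    return mannwhitneyu(cut, dry, method="auto")

# Illustrative (made-up) counts; the real group sizes would be 120 / 90
res = padded_test([5, 3, 8, 2], [1, 1, 2], n_cut=10, n_dry=10)
print(res.pvalue)
```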

Signal Statistics

Once we've looked at the overall picture of how many recordings were taken in the dataset, the next thing to look at is the 'Intensity' and dominant frequencies of the signals, since the paper reports average intensities & dominant frequencies for the various groups. The data provided seems to match the dominant frequency result, but the reported Intensities don't appear to match. This could be a mistake in my code, but it does also reveal that many of the recordings are clipped in all groups except 'Tomato Dry', which may help significantly in distinguishing Cut vs. Dry tomato recordings, since clipping introduces high-frequency harmonics.

From (Khait, 2024)

Produced by `exploration.py`
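One simple way to flag clipped recordings is to look for runs of consecutive samples pinned at full scale. This heuristic is a sketch (the threshold and run length are my choices, not the repository's actual check):

```python
import numpy as np

def is_clipped(x, full_scale=1.0, min_run=3):
    """Heuristic: flag a recording as clipped if it contains a run of at
    least `min_run` consecutive samples pinned at full scale. Assumes x is
    a float signal normalised to [-1, 1]; the parameters are guesses."""
    at_rail = np.abs(x) >= full_scale - 1e-6
    run = best = 0
    for hit in at_rail:
        run = run + 1 if hit else 0
        best = max(best, run)
    return best >= min_run

t = np.linspace(0, 1, 1000, endpoint=False)
sig = np.clip(1.5 * np.sin(2 * np.pi * 5 * t), -1.0, 1.0)  # clipped sine
print(is_clipped(sig), is_clipped(0.8 * np.sin(2 * np.pi * 5 * t)))  # True False
```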

Classification

We would like to reproduce the classification algorithm. In the code repository for the paper there is only a template function rather than complete code. There are 3 methods used here for choosing the features to classify on:

  • Basic: Energy $E$, Energy Entropy $H_e$, Spectral Entropy $H_f$, and Maximum (dominant?) Frequency $F_{dom}$
  • MFCC: it's not obvious why this set of features would be used, since they were designed for human speech [TODO: probably don't bother with this!]
  • Scattering Network: an interesting wavelet-based transform, which should provide a featurisation similar to frequency-based statistics like $H_f$ and $F_{dom}$

For Basic feature extraction, the implementations use generic functions from packages like scipy. The packages Antropy/spkit contain some further useful functions [TODO: re-run pairplots using some wacky functions from these packages!].
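A sketch of how energy, spectral entropy, and dominant frequency can be computed with generic scipy/numpy routines (the exact definitions used in the paper are not confirmed, so these are my assumptions):

```python
import numpy as np
from scipy.signal import periodogram

def basic_features(x, fs):
    """Energy, spectral entropy, and dominant frequency of a signal,
    using generic scipy/numpy routines (definitions are assumptions)."""
    x = np.asarray(x, dtype=float)
    energy = np.sum(x ** 2)
    f, pxx = periodogram(x, fs=fs)
    p = pxx[1:] / pxx[1:].sum()   # normalise the PSD (minus DC) to a distribution
    p = p[p > 0]
    spectral_entropy = -np.sum(p * np.log2(p))
    f_dom = f[np.argmax(pxx)]     # frequency bin holding the most power
    return energy, spectral_entropy, f_dom

fs = 8000
t = np.arange(fs) / fs
e, h_f, f_dom = basic_features(np.sin(2 * np.pi * 440 * t), fs)
print(f_dom)  # dominant frequency lands on the 440 Hz bin
```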

Q2: What is Energy Entropy?? I have guessed at its definition as the entropy of the distribution of energy over the signal in time, computed as follows:

  1. Windowed the signal (boxcar) and calculated the Energy (using the existing function)
  2. Cycled over the whole signal (1 step at a time, so large overlap!) repeating the process
  3. Binned the energy values produced into a normalised histogram, discarding the time information
  4. Applied the generic entropy function to the discrete distribution represented by the histogram

The window size chosen was 64 samples, on the basis that 256 was too large to show information in the small 'pulse' which is in the middle of the recording. 64 seemed a reasonable balance. [TODO: another very similar approach would be to take the Hilbert Transform to get the 'Acoustic Entropy Index' described here: Rapid Acoustic Survey For Biodiversity Appraisal]
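The four steps above, with the 64-sample window, might be implemented as follows (a sketch of my guessed definition from Q2, not a confirmed one):

```python
import numpy as np

def energy_entropy(x, win=64, bins=32):
    """Guessed definition (per Q2): entropy of the distribution of
    short-time energy over the signal."""
    x = np.asarray(x, dtype=float)
    # Steps 1-2: boxcar-windowed energy with a one-sample hop (large overlap)
    energies = np.convolve(x ** 2, np.ones(win), mode="valid")
    # Step 3: bin the energies into a normalised histogram, dropping time order
    hist, _ = np.histogram(energies, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    # Step 4: Shannon entropy of the resulting discrete distribution
    return -np.sum(p * np.log2(p))
```

A constant signal gives zero entropy (every window carries the same energy), while a signal with a localised pulse spreads its energy over several histogram bins.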

Basic Features

To get an idea of what the features look like and whether they are going to be immediately useful for decisions, we plot them in pairs for the different treatments:

Produced by `exploration.py`

Q3: It's not clear how the dataset was collected. Is the included data 1 hour of recording for each plant ID? If not, does that mean there are recordings from wetted-up plants in the 'Dry' folders? How many recording sessions does the 'Empty Pot' folder come from? The paper says that no sounds were recorded from the empty pot in >500 hrs of recording, so what are these sounds?

Q4: After using the ScatNet function (presumably in MATLAB?), was there a PCA stage to extract features from the transformed data? It seems as though a wavelet transform wouldn't lower the dimensionality of the data...

About

A partial reproduction of the analysis in https://doi.org/10.1016/j.cell.2023.03.009
