This repository contains the data from doi:10.5061/dryad.jwstqjqf7 and also the code from the authors' repository:
- `folderSummary.py` collates the data from the folder `doi_10_5061_dryad_jwstqjqf7__v20230403/PlantSounds` and calculates the general statistics that do not require audio processing, such as 'number of sounds'
- `helperFunctions.py` is a collection of helpers to:
  - Import from specific directories / wrap functions to import WAV files
  - Repair & pre-process the signals
  - Downsample or process for plotting/visualisations
- `statFunctions.py` contains `calcE(X)` and similar functions to calculate the statistics of either a single WAV file, or a number of them collected into a matrix
- In `exploration.py`, whole-signal statistics (Energy, Entropy, Dominant frequency etc.) are calculated and pairwise plots are made. Quirks of the dataset are also explored, such as identifying recordings that have clipping or contain particularly high/low frequencies
- [TODO: `svmTrain(X, y)` trains an SVM classifier on the (normalised) WAV data held in `X` and the labels held in `y`. It then prints a confusion matrix & plots a visualisation of the result. Use this function on 'basic', 'mfcc' and then 'scatNet' features of the data]
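Since `svmTrain` exists only as a template in the repository, here is a minimal sketch of what it might look like, assuming sklearn-style inputs; the function name `svm_train`, the split ratio, and the RBF-kernel settings are my own choices, not the authors':

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def svm_train(X, y, test_size=0.25, seed=0):
    """Sketch of svmTrain: X is an (n_samples, n_features) matrix of
    normalised features, y the treatment labels. Trains an RBF-kernel SVM
    and prints a confusion matrix on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X_tr, y_tr)
    print(confusion_matrix(y_te, clf.predict(X_te)))
    return clf
```

The same function could then be called three times, once per feature set ('basic', 'mfcc', 'scatNet').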
First we loop through the treatment folders, calculating the following statistics:
- Number of recordings per treatment
- Number of plants per treatment
- Number of recordings per individual plant
While we're at it, we will:
- Save the data to `.csv`
- Create a histogram plot of the number of sounds per plant for each treatment
- Run a Wilcoxon rank-sum test, for both tomato and tobacco, of the hypothesis that 'cut' and 'dry' plants produce the same number of sounds per plant per time period
- Run an alternative 'Poisson' model, which is a parametric alternative to the Wilcoxon test. [TODO: This didn't fit very well, as $\sigma^2 \ne \mu$! However, there are alternatives to look into for both zero-inflated and overdispersed ($\sigma^2 \gg \mu$) situations]
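A quick sanity check on the Poisson assumption is the variance-to-mean ratio of the per-plant counts; the helper name below is my own invention:

```python
import numpy as np

def dispersion_ratio(counts):
    """Variance-to-mean ratio of count data: ~1 for a Poisson fit,
    >>1 indicates overdispersion (sigma^2 >> mu)."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()
```

A ratio well above 1 points towards a negative binomial (or zero-inflated) model instead.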
The filename format is `id_{plantID}_{sound}_{soundID}.WAV`, and we will check whether the sound IDs are contiguous, then plot a histogram of the number of sounds for each plant in each treatment:
While the sound ID always starts from 1 and increases in a contiguous manner, the plant IDs are not contiguous, as we can see by running `groupby('plant_id').size()` and `groupby('plant_id').max()` etc. for each separate treatment. (See table 1!)
| experiment | n_recordings_total | n_plants | avg_sounds_per_plant | max_plant_id | corrected_sounds_per_plant |
|---|---|---|---|---|---|
| Empty Pot | 1036 | 1 | 1036 | 0 | 0 |
| Greenhouse Noises | 1378 | 4 | 344 | 3 | 459 |
| Tobacco Cut | 506 | 28 | 18 | 173 | 2 |
| Tobacco Dry | 275 | 35 | 7 | 89 | 3 |
| Tomato Cut | 660 | 43 | 15 | 239 | 2 |
| Tomato Dry | 1622 | 64 | 25 | 119 | 13 |
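The contiguity check above might look like this in pandas, assuming the filenames have already been parsed into `plant_id` and `sound_id` columns (the column names are my assumption):

```python
import pandas as pd

def check_contiguity(df):
    """Sound IDs starting at 1 are contiguous per plant iff, for every plant,
    the count of sounds equals the maximum sound ID."""
    per_plant = df.groupby("plant_id")["sound_id"]
    n = per_plant.size()   # number of sounds recorded per plant
    mx = per_plant.max()   # highest sound ID per plant
    return bool((n == mx).all())

# Toy example: both plants have sound IDs 1..max with no gaps
df = pd.DataFrame({"plant_id": [3, 3, 7, 7, 7],
                   "sound_id": [1, 2, 1, 2, 3]})
print(check_contiguity(df))  # True
```

The same `size()`/`max()` tables per treatment are what feed Table 1.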
This raises question number 1:
Q1: "Do the plants indicated by the missing IDs exist, and if so, did they emit zero sounds? And if so, how many plants were tested in each treatment? The tomato and tobacco plant IDs look as though they might be on a continuous scale: 180 total tobacco plants (90 each of dry and cut) and 240 total tomato plants (120 each of cut and dry), which is a LOT of plants?!"
We would now like to analyse the data in terms of 'rates' of acoustic emissions. In the paper, a Wilcoxon Rank Sum Test is used, which is very easy to execute with scipy. First let's plot histograms to look at the distribution of 'number of acoustic emissions per plant.' Assume that there are plants for which no sounds were recorded, and that the group sizes are 120 for tomatoes, and 90 for tobacco, as mentioned before.
Looking at the x-axes, there is much greater spread for tomatoes in the 'Dry' treatment, similar spread for both tobacco and tomato in the 'Cut' group, and the lowest spread in 'Tobacco Dry'. Initially I ran the Wilcoxon test on this data in its 'raw' form, getting p<0.05 for tobacco but a really high p (p>0.5?!!) for the tomatoes. HOWEVER, per Q1, once the data has been padded with zeroes to make up for 'silent' plants, we get the conflicting result that for Tomato: Cut vs. Dry -> p=0.0138; and for Tobacco: Cut vs. Dry -> p=0.611. This is using the 'auto' method of SciPy's mannwhitneyu function.
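The zero-padding and rank-sum test just described can be sketched as follows; the helper name and example group size are mine, only `mannwhitneyu` and its `method='auto'` option come from the analysis above:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def padded_test(counts_cut, counts_dry, group_size):
    """Pad each group's per-plant sound counts with zeros for the 'silent'
    plants (up to group_size plants per group), then run the two-sided
    Mann-Whitney U (Wilcoxon rank-sum) test."""
    cut = np.concatenate([counts_cut, np.zeros(group_size - len(counts_cut))])
    dry = np.concatenate([counts_dry, np.zeros(group_size - len(counts_dry))])
    return mannwhitneyu(cut, dry, alternative="two-sided", method="auto")
```

For the real data, `group_size` would be 120 for tomato and 90 for tobacco, as assumed above.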
Once we've looked at the overall picture of how many recordings were taken in the dataset, the next thing to look at is the 'Intensity' and dominant frequencies of the signals, since the paper reports average intensities & dominant frequencies for the various groups. The data provided seems to match the dominant frequency result, but the intensities reported don't appear to match. This looks as though it could be a mistake in my code, but it does also reveal that many of the recordings are clipped in all groups but 'Tomato Dry', which may help significantly in distinguishing Cut vs. Dry tomato recordings as clipping introduces high-frequency harmonics.
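One simple way to flag clipped recordings, assuming 16-bit integer WAV samples; the tolerance and run-length thresholds here are my own guesses, not values from the paper:

```python
import numpy as np

def is_clipped(x, bits=16, tol=2, min_run=3):
    """Flag a recording as clipped if it contains a run of >= min_run
    consecutive samples pinned within `tol` counts of the int16 rails."""
    full = 2 ** (bits - 1)
    pinned = np.abs(np.asarray(x, dtype=np.int64)) >= full - tol
    run = 0
    for p in pinned:
        run = run + 1 if p else 0
        if run >= min_run:
            return True
    return False
```

Requiring a run of pinned samples (rather than a single one) avoids flagging recordings that merely brush full scale once.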
[Figure produced by `exploration.py`]
From (Khait, 2024), we would like to reproduce the classification algorithm. In the code repository for the paper there is only a template function rather than complete code. There are 3 methods used here for choosing the features to classify on:
- Basic: Energy $E$, Energy Entropy $H_e$, Spectral Entropy $H_f$, (dominant?) Maximum Frequency $F_{dom}$
- MFCC: it's not obvious why this set of features would be used, since they have been designed for human speech [TODO: probably don't bother with this!]
- Scattering Network: this is an interesting wavelet-based transform, so it will provide a similar featurisation mechanism to frequency-based statistics like $H_f$ and $F_{dom}$
For Basic feature extraction the implementations have been done using generic functions from packages like scipy. The packages Antropy/spkit contain some useful further functions [TODO: re-run pairplots using some whacky functions from these places!].
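As an example of building these features from generic scipy functions, here is one plausible definition of the spectral entropy $H_f$: the Shannon entropy of the normalised Welch power spectrum. The normalisation by $\log_2 N$ is my own choice, not necessarily the paper's:

```python
import numpy as np
from scipy.signal import welch

def spectral_entropy(x, fs, normalize=True):
    """H_f: Shannon entropy of the Welch power spectrum, treated as a
    probability distribution over frequency bins. If normalize=True the
    result is scaled to [0, 1] by the maximum possible entropy."""
    f, psd = welch(x, fs=fs, nperseg=min(len(x), 256))
    p = psd / psd.sum()
    h = -np.sum(p * np.log2(p + 1e-12))  # small eps guards log2(0)
    return h / np.log2(len(p)) if normalize else h
```

White noise should score close to 1, a pure tone much lower, which is exactly the discriminative behaviour we want from $H_f$.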
Q2: What is Energy Entropy?? I have made a guess at its definition as being the entropy of the distribution of energy over the signal in time, so I have done the following:
- Windowed the signal (boxcar) and calculated the Energy (using the existing function)
- Cycled over the whole signal (1 step at a time, so large overlap!) repeating the process
- Energy values produced are binned into a normalised histogram, ditching the time information
- The generic entropy function is applied to the discrete distribution represented by the histogram
The window size chosen was 64 samples, on the basis that 256 was too large to show information in the small 'pulse' which is in the middle of the recording. 64 seemed a reasonable balance. [TODO: another very similar approach would be to take the Hilbert Transform to get the 'Acoustic Entropy Index' described here: Rapid Acoustic Survey For Biodiversity Appraisal]
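The steps above can be sketched as follows, with the 64-sample window as discussed; the bin count of 50 is my own choice:

```python
import numpy as np
from scipy.stats import entropy

def energy_entropy(x, win=64, bins=50):
    """Guess at H_e: entropy of the distribution of short-time energy.
    Sliding boxcar window (step 1 sample, so heavy overlap), energy per
    window, histogram the energies (discarding time), then Shannon
    entropy (in nats) of the resulting discrete distribution."""
    x = np.asarray(x, dtype=float)
    e = np.convolve(x ** 2, np.ones(win), mode="valid")  # windowed energy
    hist, _ = np.histogram(e, bins=bins)
    p = hist / hist.sum()
    return entropy(p)  # scipy's entropy handles empty bins
```

A constant signal puts all its energy in one bin and scores zero, while a signal whose energy wanders over time spreads across bins and scores higher.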
To get an idea of what the features look like and whether they are going to be immediately useful for decisions, plot them in pairs for the different treatments:
[Figure produced by `exploration.py`]
Q3: It's not clear how the dataset was collected. Does the data included cover 1 hour for each plant ID? If not, does that mean there are recordings from wetted-up plants in the 'Dry' folders? How many recording sessions is the 'Empty Pot' folder from? The paper says that no sounds were recorded from the empty pot in >500 hrs of recording, so what are these sounds?
Q4: After using the ScatNet function (presumably in MATLAB?), was there a PCA stage to extract features from the transformed data? It seems as though a wavelet transform wouldn't lower the dimensionality of the data...






