Genetic algorithm to detect patterns on noisy data

This article shows how a genetic algorithm can detect patterns in noisy data with special constraints: the linear acceleration obtained from smartphone sensors. I explain how I try to improve its coherence with the help of a genetic algorithm, chosen for its ability to adapt.

You can already visualize the result of this implementation via JSFiddle.

The linear acceleration of a motion is the acceleration excluding the component of gravity. This result is usually obtained by sensor fusion (accelerometer, gyroscope, magnetometer, etc.) and by signal processing approaches (Kalman filter, particle filter, etc.). On Android, linear acceleration is accessible through SensorManager with the virtual sensor Sensor.TYPE_LINEAR_ACCELERATION.

The premise: linear acceleration isn’t consistent

Linear acceleration is imprecise, noisy (mostly because of the operations performed over other sensors’ values to compute it) and, above all, inconsistent. Consider the following example, which represents a simple movement of a smartphone at arm’s length:

Linear acceleration data set
Linear acceleration example of a rectilinear movement: the same pattern appears twice.

It’s possible to identify the same repeating pattern in the two turquoise zones, corresponding to the beginning of the movement (zone 1: the phone gains speed and moves) and then to its end (zone 2: the phone loses its speed). However, note the following inconsistencies:

  • The positive and negative accelerations do not cancel each other out: we will therefore try to construct a model in which the total acceleration is zero.
  • The curve is interspersed with very close extreme values: we must model the acceleration as constant over a period to make it easier to work with.


Model construction

In a way, I want to turn my complex, hard-to-exploit movement into a uniformly accelerated rectilinear movement:

Schematisation of our acceleration and model
Acceleration will look like a square signal. a1 and a2 are the areas under the curve.

Thus, I obtain a constant acceleration over a longer period, easier to exploit, but with additional constraints (following the two inconsistencies cited above):

  • a1 = a2, because the sum of the accelerations must be null,
  • a1 and a2 must approach as closely as possible the sum of the accelerations between the terminals.

Let h1 be the height of the first area and h2 that of the second; we have a1 = h1*(x2-x1) = h2*(x3-x2) = a2. There is then an infinity of possible combinations as a function of the variables x1, x2, x3, h1 and h2: no deterministic algorithm is both suitable and reasonably cheap computationally.
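As a minimal sketch (function name assumed, not from the article): since the accelerations must sum to zero, h1 * (x2 - x1) + h2 * (x3 - x2) = 0, so choosing the pivots and h1 pins h2, with h2 taking the opposite sign of h1:

```javascript
// Hypothetical helper: given the pivots and h1, the zero-sum constraint
// h1 * (x2 - x1) + h2 * (x3 - x2) = 0 determines h2.
function deriveH2(x1, x2, x3, h1) {
  return -h1 * (x2 - x1) / (x3 - x2);
}

// Example: pivots 0, 4, 10 and h1 = 3 give a1 = 12, so h2 = -12 / 6 = -2.
const h2 = deriveH2(0, 4, 10, 3); // -2
```

The remaining freedom in x1, x2, x3 and h1 is exactly what the genetic algorithm will explore.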

And for this reason, it’s interesting to use a genetic algorithm to determine the best solution. Additionally, its adaptability will be appreciated in order to take into account the accelerations coming from different sensors of different smartphones, or to be more flexible with the pattern recognition.

A genetic algorithm to find the best solution

The aim of a genetic algorithm is to solve, by taking inspiration from natural evolution, an optimization problem that is hard to approach and depends on many parameters. From a population of individuals (chromosomes), we select those most able to survive and, as the generation number goes up, approach a solution close to the desired result.

Random generation

Every generation is made of 15 individuals. Using the diagram above, I modeled the DNA of our genetic algorithm’s individuals in the following way:

DNA: {
    pivots: [x1: float, x2: float, x3: float],
    h1: float,
    h2: float,
}

When creating a new individual, the three pivots satisfy 0 <= x1 < x2 < x3 <= size-1 (where size is the length of the data set); h1 is drawn such that -max <= h1 <= max, h1 != 0 (given max the absolute maximum value of the data set); and h2 = -h1 * (x2 - x1) / (x3 - x2).
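A sketch of the random creation under these constraints (function and field names are my assumptions, not the article’s code):

```javascript
// Hypothetical random individual creation, following the constraints above:
// 0 <= x1 < x2 < x3 <= size - 1 and -max <= h1 <= max with h1 != 0.
function randomIndividual(size, max) {
  // Draw three sorted, distinct pivots in [0, size - 1] (floats, as in the DNA).
  let x1, x2, x3;
  do {
    [x1, x2, x3] = [0, 0, 0]
      .map(() => Math.random() * (size - 1))
      .sort((a, b) => a - b);
  } while (x1 === x2 || x2 === x3);

  // Draw h1 in [-max, max], rejecting 0.
  let h1 = 0;
  while (h1 === 0) h1 = (Math.random() * 2 - 1) * max;

  // h2 follows from the zero-sum constraint.
  const h2 = -h1 * (x2 - x1) / (x3 - x2);
  return { pivots: [x1, x2, x3], h1, h2 };
}
```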


Selection

Selection is crucial to rapidly converging toward a good solution:

  1. Crossover of the two best individuals if they’re not (or almost not) equivalent (rank selection); otherwise, crossover of the best and a second, random one (roughly uniform selection).
  2. Suppression of the two worst individuals.
  3. Insertion of two new randomly generated individuals.

The insertion of two completely random individuals is important to ensure that the algorithm doesn’t converge towards a wrong solution and get trapped in a local extremum (in addition to mutation).
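The three rules above could be sketched as follows. The gene-mixing crossover and the exact population bookkeeping are my assumptions (the article doesn’t detail them); `fitness` and `randomIndividual` are assumed helpers:

```javascript
// Assumed crossover: each gene is picked from either parent at random,
// then h2 is recomputed so the zero-sum constraint still holds.
function crossover(a, b) {
  const pick = (g1, g2) => (Math.random() < 0.5 ? g1 : g2);
  const pivots = a.pivots
    .map((p, i) => pick(p, b.pivots[i]))
    .sort((u, v) => u - v);
  const h1 = pick(a.h1, b.h1);
  const h2 = -h1 * (pivots[1] - pivots[0]) / (pivots[2] - pivots[1]);
  return { pivots, h1, h2 };
}

// One generation step following the three selection rules above.
function step(population, fitness, randomIndividual) {
  // Lower fitness is better: sort best-first.
  population.sort((a, b) => fitness(a) - fitness(b));
  const best = population[0];
  // 1. Cross the two best if they differ, otherwise the best and a random one.
  const partner = fitness(population[1]) !== fitness(best)
    ? population[1]
    : population[1 + Math.floor(Math.random() * (population.length - 1))];
  const child = crossover(best, partner);
  // 2. + 3. Drop the worst individuals, then insert the child plus two fresh
  // random ones (dropping three keeps the population size constant here —
  // an assumption about the bookkeeping).
  population.splice(population.length - 3, 3);
  population.push(child, randomIndividual(), randomIndividual());
  return population;
}
```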

Mutation

Probability graph for an individual to mutate over a population of 15
Probability graph to mutate for 15 individuals

The mutation probability of an individual is calculated according to its rank, given that the lowest rank is the best: p = (n/nmax)² with n the rank and nmax the maximum rank.

The probability is quadratic to emphasize the need to mutate individuals who have a bad score.

DNA mutation can happen in four different ways, of equal probability each:

  • x1 takes a new value in the interval [0 ; x2[,
  • x2 takes a new value in the interval ]x1 ; x3[,
  • x3 takes a new value in the interval ]x2 ; size-1],
  • h1 is re-drawn randomly in the same way as at creation (and therefore h2 is also modified).
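A sketch of this mutation step (names assumed): the probability follows p = (n/nmax)², one of the four equiprobable mutations is applied, and h2 is recomputed so the zero-sum constraint still holds:

```javascript
// Quadratic mutation probability: p = (n / nmax)^2,
// where n is the rank (lowest rank = best individual).
function mutationProbability(rank, maxRank) {
  return Math.pow(rank / maxRank, 2);
}

// Apply one of the four equiprobable mutations to the DNA.
// `size` and `max` are the data-set length and absolute maximum, as above.
function mutate(dna, size, max) {
  let [x1, x2, x3] = dna.pivots;
  let h1 = dna.h1;
  const choice = Math.floor(Math.random() * 4);
  if (choice === 0) x1 = Math.random() * x2;                   // [0 ; x2[
  if (choice === 1) x2 = x1 + Math.random() * (x3 - x1);       // ]x1 ; x3[
  if (choice === 2) x3 = x2 + Math.random() * (size - 1 - x2); // ]x2 ; size-1]
  if (choice === 3) {
    h1 = 0;
    while (h1 === 0) h1 = (Math.random() * 2 - 1) * max;       // re-draw h1
  }
  const h2 = -h1 * (x2 - x1) / (x3 - x2);                      // keep a1 + a2 = 0
  return { pivots: [x1, x2, x3], h1, h2 };
}
```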


Fitness function

The fitness function determines how well an individual meets the requirements and assigns it a score, which in turn determines its rank. The fitness function must be quick to compute while remaining relevant for identifying the right solutions.

Let inPercentage be the percentage of the area under the curve between x1 and x3 (the terminals), and voidPercentage the percentage of “vacuum” in the areas a1 and a2 between the terminals:

fitness = (100 - inPercentage)^inAreaImportance + voidPercentage^2

inPercentage and voidPercentage calculation

A good individual is indeed an individual with a low fitness score. Thus, the more inPercentage tends towards 100 and voidPercentage towards 0, the more the score decreases and the better the individual’s rank. The exponent inAreaImportance makes inPercentage dominant in the total score: this gives more importance to correctly covering the values between the terminals (inPercentage) in spite of the proportion of void (voidPercentage) that this entails.

This makes it possible to take into account very incoherent curves with missing values that nevertheless represent, as far as possible, a large movement. Overall, my tests showed that inAreaImportance = 6 gave good results.

Additionally, I apply a penalty to the score, fitness *= _UNFIT_COEFF, as well as a non-eligibility attribute, if the individual meets any of the following conditions:

  • a1 = 0 or a2 = 0,
  • x3 - x1 < _MIN_WIDTH, where _MIN_WIDTH is the minimum acceptable width,
  • |h1| > max or |h2| > max, where max is the maximum absolute value of the data set.

Why not directly eliminate the individual rather than assign a penalty? Because a cross with another individual, or its mutation, may result in an individual that has a good score and meets the criteria.
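Putting the fitness formula and the penalty together, a hedged sketch (the constant values are illustrative, not the article’s; inPercentage and voidPercentage are assumed to be computed elsewhere from the data set):

```javascript
const IN_AREA_IMPORTANCE = 6;
const _UNFIT_COEFF = 10;   // illustrative value, not from the article
const _MIN_WIDTH = 5;      // illustrative value, not from the article

// fitness = (100 - inPercentage)^inAreaImportance + voidPercentage^2,
// multiplied by _UNFIT_COEFF (and flagged non-eligible) if any
// constraint is broken. Lower fitness is better.
function score(ind, inPercentage, voidPercentage, max) {
  let fitness = Math.pow(100 - inPercentage, IN_AREA_IMPORTANCE)
              + Math.pow(voidPercentage, 2);
  const [x1, x2, x3] = ind.pivots;
  const a1 = ind.h1 * (x2 - x1);
  const a2 = ind.h2 * (x3 - x2);
  // Penalize (rather than eliminate) individuals breaking the constraints.
  const unfit = a1 === 0 || a2 === 0
             || (x3 - x1) < _MIN_WIDTH
             || Math.abs(ind.h1) > max || Math.abs(ind.h2) > max;
  if (unfit) fitness *= _UNFIT_COEFF;
  return { fitness, eligible: !unfit };
}
```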


It only remains to determine when it is no longer necessary to keep searching for a solution, in other words, when the best individual (the elite) has a good chance of being the best obtainable solution. For this, I use several parameters whose values were determined by trial and error:

  • The elite fulfills the additional criteria described above (eligibility),
  • The elite has gone through more than 15 generations (relevance in time),
  • The elite has sufficient inPercentage and voidPercentage values (respectively above 65% and below 80% – relevance of the result).
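The three criteria above amount to a simple stopping check (field names assumed; thresholds as quoted):

```javascript
// Stop searching when the elite is eligible, has survived more than
// 15 generations, and has inPercentage > 65 and voidPercentage < 80.
function shouldStop(elite, generations) {
  return elite.eligible
      && generations > 15
      && elite.inPercentage > 65
      && elite.voidPercentage < 80;
}
```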

Obtained result

I built a web app that simulates the stream returned by a smartphone’s accelerometer and applies the genetic algorithm in real time (but slowed down). Below is an example of what I get for a certain data set:

Obtained result at the launch of the visualization
The visualization application will ultimately display the individual selected as the best

You can view the simulation via JSFiddle. It’s also possible to dynamically modify, via in-app controls, certain parameters that determine the behavior of the algorithm, as explained above.

Benefits / Conclusion

Overall, the algorithm gives good results when there is good reason to have them, i.e. when the accelerations are exploitable (true positives). Otherwise, the results are erratic and do not correspond to anything (false positives).

The advantage is that by repeatedly running the algorithm on the same data set, the same true positives are obtained several times, while the false positives appear sporadically. Repetition would then allow us to isolate the true positives and strengthen the certainty of obtaining a correct result.
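This repetition idea can be sketched as follows; `run` (one full execution of the algorithm) and `similar` (a tolerance-based comparison of two solutions) are assumed helpers:

```javascript
// Run the algorithm several times on the same data set and keep only
// solutions that recur at least `minOccurrences` times: true positives
// tend to recur, false positives appear sporadically.
function recurringSolutions(run, similar, repetitions, minOccurrences) {
  const solutions = Array.from({ length: repetitions }, () => run());
  return solutions.filter(
    s => solutions.filter(o => similar(s, o)).length >= minOccurrences
  );
}
```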

I’m not sure that a genetic algorithm is entirely relevant for this problem: I think that other heuristics, based on averages, could be just as effective while requiring less computing time. Nevertheless, as explained above, the probabilistic side of the genetic algorithm allows us to isolate the true positives from the false positives.

Finally, the genetic algorithm as I conceived it does double work: linear regression, but also selection of the data worth keeping, and this will become more complicated as the patterns to be detected in the acceleration grow more complex. On the other hand, one could imagine a system with a similar purpose that would first perform a classification to select the correct data, then apply a heuristic algorithm to it.


3 thoughts on “Genetic algorithm to detect patterns on noisy data”

  1. Nice! Isn’t there a risk in case of an absurd value spike (sensor fault)? If it were not taken into account when computing the area, inPercentage would be very low. Yet inPercentage seems very important in the fitness calculation (inAreaImportance = 6 compared with voidPercentage’s ^2); maybe the algorithm would be tempted to produce a more vertical area (h2 >> x3-x2) in order to maximize inPercentage and thus produce a wrong result every time.

    1. Thanks for reading, and for commenting!
      It depends. The fact that the algorithm runs in “real time” complicates things quite a bit.

      If the spike is at the very beginning (for example the first 5 values), then I think the algorithm does indeed find a very vertical solution (#1). Even then, if the height h1 or h2 is very large, it has to be compensated by a proportionally large distance x3-x1, yet x3 and x1 are bounded by the data set.

      If the spike is in the middle of the data set, the algorithm indeed tends to include it and therefore misses a better solution without this spike. However, since the percentage of void increases enormously, the solution is not chosen. So we avoid the damage, since no solution is validated, but we also pick none, which is problematic. Example:

      I think inAreaImportance should be lowered, but a stricter check applied at the final validation.
