This article shows how a genetic algorithm can detect patterns in noisy data under particular constraints: the linear acceleration obtained from smartphone sensors. I explain how I try to improve its coherence with the help of a genetic algorithm, chosen for its ability to adapt.

**You can already visualize the result of this implementation via JSFiddle.**

The linear acceleration of a motion is the acceleration excluding the component of gravity. It is usually obtained by sensor fusion (accelerometer, gyroscope, magnetometer, etc.) combined with signal processing techniques (Kalman filter, particle filter, etc.). On Android, linear acceleration is accessible through `SensorManager` with the virtual sensor `Sensor.TYPE_LINEAR_ACCELERATION`.

## The premise: linear acceleration isn’t consistent

Linear acceleration is **imprecise**, mostly **noisy** due to the operations performed over the other sensors’ values to compute it, and especially **inconsistent**. Consider the following example, which represents a simple movement of a smartphone at arm’s length:

It’s possible to identify the same repeating *pattern*, represented by the two turquoise zones, corresponding to the beginning of the movement (zone 1 – the phone gains a certain speed and moves) and then to its end (zone 2 – the phone loses its speed). However, two inconsistencies appear:

- **The sum of the positive and negative accelerations does not cancel out:** we will therefore try to construct a model where the net acceleration is zero.
- **The curve is interspersed with very close extreme values:** we must model the acceleration as constant over a period to make it easier to work with.

### Model construction

In a way, I want to turn my complex, hard-to-exploit movement into a **uniformly accelerated rectilinear movement**:

Thus, I will obtain a constant acceleration over a longer period, easier to exploit, but with additional constraints (following the two inconsistencies cited above):

- `a1 = a2`, because the sum of the accelerations must be null,
- `a1` and `a2` must approach as much as possible the sum of the accelerations between the terminals.

Let `h1` be the first area’s height and `h2` the second’s; we have `a1 = h1*(x2-x1) = h2*(x3-x2) = a2`. There is then an infinity of possible combinations as a function of the variables `x1, x2, x3, h1` and `h2`: there is no suitable deterministic algorithm with a reasonable computational cost.
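To make the constraint concrete, here is a minimal sketch (the function name is mine, not from the actual implementation) showing that once the pivots and `h1` are fixed, `h2` is entirely determined:

```javascript
// Given pivots x1 < x2 < x3 and the first height h1, the zero-sum
// constraint over the signed areas a1 = h1*(x2-x1) and a2 = h2*(x3-x2)
// leaves no freedom for h2:
function deriveH2(x1, x2, x3, h1) {
  return -h1 * (x2 - x1) / (x3 - x2);
}

// Example: a first area of height 2 over [10, 30] must be balanced
// by an opposite area over [30, 70], hence a height of -1.
const h2 = deriveH2(10, 30, 70, 2); // -2 * 20 / 40 = -1
```

So the search space really has only four free variables (`x1`, `x2`, `x3`, `h1`), which is what the genetic algorithm explores.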

For this reason, it’s interesting to use a genetic algorithm to determine the best solution. Additionally, its adaptability is welcome for taking into account accelerations coming from the different sensors of different smartphones, or for being more flexible with the pattern recognition.

## A genetic algorithm to find the best solution

The aim of a genetic algorithm is to provide a solution, inspired by natural evolution, to an optimization problem that is difficult to approach and depends on many parameters. We select, within a population of individuals (*chromosomes*), those most able to survive; as the generation number goes up, the population approaches a solution close to the desired result.

### Random generation

Every generation is made of 15 individuals. Using the diagram above, I modeled the individuals’ DNA in the following way:

`DNA: { pivots: [x1: float, x2: float, x3: float], h1: float, h2: float }`

When creating a new individual, the three pivots are drawn randomly with `0 <= x1 < x2 < x3 <= size-1` (where `size` is the size of the data set); we also draw `h1` such that `-max <= h1 <= max, h1 != 0`, and set `h2 = -h1 * (x2 - x1) / (x3 - x2)`, given `max` the absolute maximum value of the data set.
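A simplified JavaScript sketch of this creation step (the names are illustrative, not those of the actual implementation):

```javascript
// size: number of samples in the data set; max: its absolute maximum value.
function randomIndividual(size, max) {
  // Draw three distinct pivots in [0, size-1] and order them.
  const picks = new Set();
  while (picks.size < 3) picks.add(Math.floor(Math.random() * size));
  const [x1, x2, x3] = [...picks].sort((a, b) => a - b);

  // Draw h1 in [-max, max] with h1 != 0.
  let h1 = 0;
  while (h1 === 0) h1 = (Math.random() * 2 - 1) * max;

  // h2 is fully determined by the zero-sum constraint.
  const h2 = -h1 * (x2 - x1) / (x3 - x2);
  return { pivots: [x1, x2, x3], h1, h2 };
}
```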

### Selection

Selection is crucial for converging rapidly towards a good solution:

- **Crossover** of the two best individuals if they’re not (or almost not) equivalent (*rank selection*); otherwise, crossover of the best and a second, random one (*almost uniform selection*).
- **Suppression** of the two worst individuals.
- **Insertion** of two new randomly generated individuals.

The insertion of two completely random individuals is important to ensure that the algorithm doesn’t converge towards a wrong solution, trapped in a local extremum (in addition to mutation).

### Mutation

The mutation probability of an individual is calculated according to its rank, given that the lowest rank is the best: `p = (n/nmax)²`, with `n` the rank and `nmax` the maximum rank.

The probability is quadratic to emphasize the need to mutate individuals who have a bad score.

DNA mutation can happen in four different ways, of equal probability each:

- `x1` takes a new value in the interval `[0 ; x2[`,
- `x2` takes a new value in the interval `]x1 ; x3[`,
- `x3` takes a new value in the interval `]x2 ; size-1]`,
- `h1` is re-drawn randomly in the same way as at creation (and therefore `h2` is also modified).
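The rank-based probability and the four equiprobable mutation moves can be sketched as follows (a simplified version with names of my choosing, assuming integer pivots):

```javascript
// p = (n/nmax)^2: the worse the rank, the likelier the mutation.
const mutationProbability = (n, nmax) => Math.pow(n / nmax, 2);

// dna: { pivots: [x1, x2, x3], h1, h2 }; size and max as before.
function mutate(dna, size, max) {
  let [x1, x2, x3] = dna.pivots;
  let h1 = dna.h1;
  const move = Math.floor(Math.random() * 4); // four moves, equal probability
  if (move === 0) {
    x1 = Math.floor(Math.random() * x2);                       // [0 ; x2[
  } else if (move === 1) {
    x2 = x1 + 1 + Math.floor(Math.random() * (x3 - x1 - 1));   // ]x1 ; x3[
  } else if (move === 2) {
    x3 = x2 + 1 + Math.floor(Math.random() * (size - 1 - x2)); // ]x2 ; size-1]
  } else {
    h1 = 0;
    while (h1 === 0) h1 = (Math.random() * 2 - 1) * max;       // re-draw h1
  }
  // h2 always follows from the zero-sum constraint.
  const h2 = -h1 * (x2 - x1) / (x3 - x2);
  return { pivots: [x1, x2, x3], h1, h2 };
}
```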

### Fitness function

The fitness function determines how well an individual meets the requirements and assigns its **score** which eventually determines its rank. The fitness function should be able to be calculated quickly while being relevant for determining what the right solutions are.

Let `inPercentage` be the percentage of area under the curve between `x1` and `x3` (the terminals), and `voidPercentage` the percentage of “vacuum” in the areas `a1` and `a2` between the terminals:

`fitness = (100 - inPercentage)^inAreaImportance + voidPercentage^2`

A good individual is one with a low *fitness score*. Thus, the more `inPercentage` tends towards 100 and `voidPercentage` tends towards 0, the more the score decreases and the better the individual’s rank. The greater `inAreaImportance` is, the more `inPercentage` becomes crucial in the total score: this gives more importance to the values correctly between the terminals (`inPercentage`) in spite of the proportion of void (`voidPercentage`) that this entails.

This makes it possible to handle very incoherent curves with missing values that nevertheless represent, as much as possible, a large movement. Overall, my tests showed that `inAreaImportance = 6` gave good results.
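Expressed directly in code, the fitness formula is simply (a sketch; only the formula itself comes from the article):

```javascript
const IN_AREA_IMPORTANCE = 6; // value found empirically, as noted above

// Both arguments are percentages in [0, 100]; lower result = better individual.
function fitness(inPercentage, voidPercentage) {
  return Math.pow(100 - inPercentage, IN_AREA_IMPORTANCE)
       + Math.pow(voidPercentage, 2);
}
```

A perfect individual (`inPercentage = 100`, `voidPercentage = 0`) scores exactly 0, and the sixth power makes any missing in-area percentage dominate the void term very quickly.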

Additionally, I apply a *penalty* to the score, `fitness *= _UNFIT_COEFF`, as well as a non-eligibility attribute, if the individual meets any of the following conditions:

- `a1 = 0` or `a2 = 0`,
- `x3 - x1 < _MIN_WIDTH`, where `_MIN_WIDTH` is the minimum acceptable width,
- `|h1| > max` or `|h2| > max`, where `max` is the maximum absolute value of the data set.

Why not directly eliminate the individual rather than apply a penalty? Because a cross with another individual, or its mutation, may result in an individual that has a good score and meets the criteria.
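A sketch of this penalty step (the constant names come from the article, but their values and the function shape are assumptions of mine):

```javascript
const _UNFIT_COEFF = 10; // assumed value; the article does not specify it
const _MIN_WIDTH = 5;    // assumed value; minimum acceptable width

// Returns the (possibly penalized) score plus an eligibility flag.
function penalize(individual, score, max) {
  const [x1, x2, x3] = individual.pivots;
  const a1 = individual.h1 * (x2 - x1);
  const a2 = individual.h2 * (x3 - x2);
  const unfit = a1 === 0 || a2 === 0
    || x3 - x1 < _MIN_WIDTH
    || Math.abs(individual.h1) > max
    || Math.abs(individual.h2) > max;
  return { score: unfit ? score * _UNFIT_COEFF : score, eligible: !unfit };
}
```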

### Choice of the best

It only remains to determine when it is no longer necessary to keep searching for a solution or, in other words, when the best individual (*the elite*) has a great chance of being the best obtainable solution. For this, I use several parameters whose values were determined by trial and error:

- The elite fulfills the additional criteria described above (**eligibility**),
- The elite has gone through more than 15 generations (**relevance in time**),
- The elite has sufficient `inPercentage` and `voidPercentage` values (respectively above 65% and below 80% – **relevance of the result**).
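The three criteria above boil down to one predicate (a sketch; the field names are assumptions):

```javascript
// elite: best individual of the current generation, carrying its stats.
function isFinalSolution(elite) {
  return elite.eligible               // meets the penalty-free criteria
      && elite.generations > 15       // has survived long enough
      && elite.inPercentage > 65      // captures enough of the curve...
      && elite.voidPercentage < 80;   // ...without too much vacuum
}
```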

## Obtained result

I built a web app that simulates the stream returned by a smartphone’s accelerometer and applies the genetic algorithm in real time (but slowed down). Below is an example of what I get for a certain dataset:

You can view the simulation via JSFiddle. It’s also possible to dynamically modify, via in-app controls, certain parameters that determine the behavior of the algorithm as explained above.

### Benefits / Conclusion

Overall, the algorithm gives good results when there is good reason to have them, i.e. when the accelerations are exploitable (**true positives**). Otherwise, the results are erratic and do not correspond to anything (**false positives**).

The advantage is that by repeatedly running the algorithm on the same data set, the same true positives are obtained several times, while the false positives appear sporadically. Repetition would then allow us to isolate the true positives and strengthen the certainty of obtaining a correct result.

I’m not sure that a genetic algorithm is completely relevant for this problem; I think other heuristics, based on averages, could be just as efficient while requiring less computing time. Nevertheless, as explained above, the probabilistic side of the genetic algorithm allows us to isolate the true positives from the false positives.

Finally, the genetic algorithm as I conceived it does double work: linear regression, but also selection of the data worth keeping, and the latter will become more complicated as the patterns to be detected in the acceleration grow more complex. Alternatively, one could imagine a system with a similar purpose that performs a **classification beforehand** to select the correct data and then applies a heuristic algorithm to it.

Nice! Isn’t there a risk in case of an absurd value spike (sensor defect)? If it were not taken into account in the area calculation, inPercentage would be very low. And since inPercentage seems very important in the fitness calculation (inAreaImportance = 6 compared to the ^2 of voidPercentage), perhaps the algorithm would be tempted to produce a more vertical area (h2 >> x3-x2) to maximize inPercentage, and thus produce a wrong result every time.

Thanks for reading, and for commenting!

It depends quite a bit. The fact that the algorithm runs in “real time” complicates things considerably.

If the spike is at the very beginning (for example the first 5 values), then I think the algorithm does indeed find a very vertical solution (#1). Even then, if the height h1 or h2 is very large, it has to be compensated by a proportionally large distance x3-x1, and x3 and x1 are bounded by the data set.

If the spike is in the middle of the data set, the algorithm does tend to include it and, as a result, misses a better solution without the spike. However, since the void percentage increases enormously, that solution is not chosen. So in a way we avoid the worst, because no solution is validated, but we don’t pick any either, which is problematic. Example: https://jsfiddle.net/franpapers/djno9uag/9/

I think inAreaImportance should be lowered, but with a stricter check applied at the final validation.

Great post.