This article shows the usage of a genetic algorithm to perform a linear regression subjected to specials constraints over a linear acceleration obtained by smartphone sensors. I explain how I try to improve its coherence helped by a genetic algorithm, relevant for its ability of adaptation.
You can already visualize the result of this implementation via JSFiddle.
The linear acceleration of a motion is equivalent to the acceleration excluding the component of gravity. This result is often obtained by sensor fusion (accelerometer, gyroscope, magnetometer, etc.) and by the use of many technics (Kalman filter, particle filter, etc.). With Android, linear acceleration is accessible through SensorManager
with the virtual sensor Sensor.TYPE_LINEAR_ACCELERATION
.
The premise : linear acceleration isn’t consistent
Linear acceleration is imprecise, mostly noisy due to the performed operations over others sensors values to compute it, and especially is inconsistent. Consider the following example which represents a simple movement of a smartphone at arm’s length:
It’s possible to identify the same repeating pattern represented by the two turquoise zones, corresponding to the beginning of the movement (zone 1 – the telephone gets a certain speed and moves) and then to its end (zone 2 – the telephone loses its speed). However, here are the following inconsistencies:
 The sum of the positive and negative accelerations does not cancel each other out: we will therefore try to construct a model where the acceleration is zero.
 The curve is interspersed with very close extreme values: we must model the acceleration constantly over a period to work better with.
MODEL Construction
In a way, I want to change my complex and hardtoexploit movement into a uniformly accelerated rectilinear movement:
Thus, I will obtain a constant acceleration over a longer period and easier to exploit, but with additional constraints (following the two inconsistencies cited above):
a1 = a2
because the sum of the accelerations must be null,a1
anda2
must approach as much as possible the sum of the accelerations between the terminals.
Let h1
be first area’s height and h2
respectively the second, we have a1 = h1*(x2x1) = h2*(x3x2) = a2
. There is then an infinity of possible combinations as a function of the variables x1, x2, x3, h1
and h2
there is no suitable deterministic algorithm and has a reasonable computational cost.
And for this reason, it’s interesting to use a genetic algorithm to determine the best solution. Additionally, its adaptability will be appreciated in order to take into account the accelerations coming from different sensors of different smartphones, or to be more flexible with the pattern recognition.
A genetic algorithm to find the best solution
The aim of a genetic algorithm is to provide a solution inspired by natural evolution in order to solve an optimization problem difficult to approach and which depends upon many parameters. Then we select into a population of individuals (chromosomes) those who are the most able to survive and as the generation number goes up, approach a solution close to the desired result.
Random generation
Every generation is made of 15 individuals. Using the diagram above, I modeled the individuals DNA of our genetic algorithm in the following way:
DNA: { pivots: [x1: float, x2: float, x3: float], h1: float, h2: float, }
When creating a new individual, the three pivots are max <= h1 <= max, h1 != 0
with 0 <= x1 < x2 < x3 <= size1
(where size
is the of the data set), we also have h1
such that max <= h1 <= max, h1 != 0
and h2 = h1 * (x2  x1) / (x1  x0)
given max
the absolute maximum value of the data set.
Selection
Selection is crucial to rapidly reaching towards a great solution:
 Crossover of the two individuals if they’re not (or almost not) equivalent (rank selection), otherwise crossing the best and a second random one (almost uniform selection).
 Suppression of the two worst individuals.
 Insertion of two new randomly generated individuals.
The insertion of two completely random individuals is important to ensure that the algorithm doesn’t converge towards a wrong solution and be trapped in a local extremum (additionally to mutation).
Mutation
The mutation probability of an individual is calculated according to its rank, given that the lowest rank is the best: p = (n/nmax)²
with n
the rank and nmax
the maximum rank.
The probability is quadratic to emphasize the need to mutate individuals who have a bad score.
DNA mutation can happen in four different ways, of equal probability each:
x1
takes a new value in the interval[0 ; x2[
[0 ; x2[

x2
takes a new value in the interval]x1 ; x2[
]x1 ; x2[

x3
takes a new value in the interval]x2 ; size1]
]x2 ; size1]

h1
is redrawn randomly in the same way as the creation (and thereforeh2
is also modified).
Fitness function
The fitness function determines how well an individual meets the requirements and assigns its score which eventually determines its rank. The fitness function should be able to be calculated quickly while being relevant for determining what the right solutions are.
Let inPercentage
the percentage of area under the curve between x1
and x3
(the terminals), voidPercentage
the percentage of “vacuum” in the areas a1
and a2
between the terminals:
fitness = (100  inPercentage)^inAreaImportance + voidPercentage^2
A good individual is indeed an individual who possesses a low fitness score. Thus, more inPercentage
tends towards 100 and voidPercentage
tends towards 0 plus the score decreases, the more the individual gets a better rank. The greater the inPercentage
greater the inPercentage
becomes crucial in the total score: this allows giving more importance to the values correctly between the (inPercentage
) in spite of the proportion of void (voidPercentage
) that this entails.
This makes it possible to take into account very incoherent curves with missing values but which represent, as much as possible, a large movement. Overall, my tests pointed out that inAreaImportance = 6
was a value that allowed good results.
Additionally, I apply a penalty on the score such as fitness *= _UNFIT_COEFF
as well as an attribute of noneligibility if the individual does not meet the following conditions:

a1 = 0
ora2 = 0
, 
x3  x1 < _MIN_WIDTH
where_MIN_WIDTH
is the minimum acceptable width, h1 > max
h1 > max
orh2 > max
h2 > max
wheremax
is the maximum absolute value of the data set.
Why not directly eliminate the individual rather than assign a penalty? It is possible that a cross with another individual or his mutation results in an individual who has a good score and meets the criteria.
CHOICE OF THE BEST
It only remains to determine when it is no longer necessary to continue to seek a solution or in other words, when the best individual (the elite) has a great chance of being the best solution obtainable. For this, I use several parameters whose values have been determined after trial and error:
 The elite fulfills the additional criterias described above (eligibility)
 The elite has gone through more than 15 generations (relevance in time),
 The elite has sufficient
inPercentage
andvoidPercentage
values (respectively above 65% and below 80% – relevance of the result).
Obtained result
I built a webapp which allows to simulate the flow returned by the accelerometer of a smartphone and to apply the genetic algorithm in real time (but slowed down). Below is an example of what I get for a certain dataset:
You can view the simulation via JSFiddle. It’s also possible to dynamically modify certain parameters via inapp controls that determine the behavior of the algorithm as explained above.
Benefits / Conclusion
Overall, the algorithm gives good results when there is good reason to have good results, i.e. when the accelerations are exploitable (true positives). Otherwise, the results are very hazardous and do not correspond to anything (false positives).
The advantage is that by repeatedly repeating the algorithm on the same data set, the same true positive ones are obtained several times, while the false positives appear sporadically. Repetition would then allow us to isolate the true positives and to support the certainty of obtaining a correct result.
I’m not sure that a genetic algorithm is completely relevant for a linear regression, I think that other heuristics based on average can be just as efficient in requiring less computing time. Nevertheless, as explained above, the probabilistic side of the genetic algorithm allows us to isolate the true positive from false positives.
Finally, the genetic algorithm as I conceived it does double work: linear regression but also selection of data that is worth it, and this will be more complicated given the complexity of the patterns to be detected in the acceleration. On the other hand, it would be possible to imagine a system with a similar purpose that would perform a classification beforehand to select the correct data and then apply a heuristic algorithm on it.
Sympa ! Il n’y a pas un risque en cas d’un pic de valeur absurde (défaut capteur) ? Si on ne le prenait pas en compte dans le calcul de l’aire, inPercentage serait très bas. Or inPercentage semble très important dans le calcul de la fitness (inAreaImportance = 6 comparé au ^2 de voidPercentage), peut être que l’algo serait tenté de produire une aire plus verticale (h2 >> x3x2) afin de maximiser inPercentage et donc de produire un résultat faux à tous les coups.
Merci d’avoir lu, et d’avoir commenté !
Ça dépend pas mal. Le fait que l’algorithme soit exécuté “realtime” complique pas mal les choses.
Si le pic est tout début (par exemple les 5 premières valeurs) alors je pense qu’en effet l’algorithme trouve une solution très verticale (#1). Et encore, si la hauteur h1 ou h2 est très importante, alors il faudra compenser par une distance x3x1 proportionnellement importante, or x3 et x1 sont bornés par le set de données.
Si le pic est au milieu du set de donnée, en effet l’algorithme tend à l’inclure et du coup, ne prend pas en compte une meilleure solution sans ce pic. Par contre, vu que le pourcentage de vide augmente énormément, la solution n’est pas choisie. Donc en soit on évite la casse car on ne valide aucune solution, mais on n’en prend aucune non plus, ce qui est problématique. Exemple : https://jsfiddle.net/franpapers/djno9uag/9/
Je pense qu’il faudrait baisser inAreaImportance mais appliquer un contrôle plus stricte sur la validation finale.
Great post.