Uniformly randomized forests and detection of social contribution irregularities.

Authors
Publication date
2014
Publication type
Thesis
Summary
In this thesis, we present an application of statistical learning to the detection of irregularities in social security contributions. Statistical learning aims to model problems in which a relationship, generally non-deterministic, links a set of variables to the phenomenon one seeks to evaluate. An essential aspect of this modeling is the prediction of unknown occurrences of the phenomenon from data already observed. For social security contributions, the problem rests on the postulate that a relationship exists between the contribution declarations made by companies and the outcomes of the controls carried out by the collection agencies. The control inspectors certify that a certain number of declarations are correct or incorrect and, where necessary, notify the companies concerned of an adjustment. Through a model, the learning algorithm "learns" the relationship between the declarations and the results of the controls, and then produces an evaluation of all the declarations not yet controlled. The first part of the evaluation labels each declaration as regular or irregular, with an associated probability. The second part estimates the expected adjustment amount for each declaration. Within the URSSAF (Union de Recouvrement des cotisations de Sécurité sociale et d'Allocations Familiales) of Île-de-France, and in the framework of a CIFRE (Conventions Industrielles de Formation par la Recherche) contract, we have developed a model for detecting irregularities in social security contributions, which we present and detail throughout the thesis. The algorithm runs under the open source software R. It is fully operational and was tested under real conditions during 2012. Guaranteeing its properties and results requires probabilistic and statistical tools, and we discuss the theoretical aspects that accompanied its design.
In the first part of the thesis, we give a general presentation of the problem of detecting irregularities in social contributions. In the second part, we address the detection itself, through the data used to define and evaluate the irregularities. In particular, we show that the available data alone are sufficient to model the detection. We also present a new random forest algorithm, named "uniformly random forest", which constitutes the detection engine. In the third part, we detail the theoretical properties of uniformly random forests. In the fourth part, we take an economic point of view on the case where the irregularities in social contributions are intentional, in the context of the fight against concealed work. In particular, we are interested in the link between the financial situation of firms and social security fraud. The last part is devoted to the experimental and real-world results of the model, which we discuss. Each chapter of the thesis can be read independently of the others, and some notions are repeated to make the content easier to explore.
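The two-part evaluation described in the summary (a probability of irregularity, then an expected adjustment amount) can be illustrated with a minimal sketch. It is written in Python for illustration, whereas the thesis algorithm runs under R, and it is not the thesis's implementation. Everything in it is an assumption: the toy data, the names `fit_stump`, `fit_forest` and `evaluate`, the use of one-split trees, and the choice to draw each cut-point uniformly between the minimum and maximum of a randomly chosen feature, which is only one plausible reading of "uniformly random".

```python
import random

def fit_stump(rows, rng):
    """Fit a one-split tree. Each row is (features, is_irregular, adjustment)."""
    sample = [rng.choice(rows) for _ in rows]      # bootstrap sample of the controls
    j = rng.randrange(len(sample[0][0]))           # feature chosen at random
    values = [r[0][j] for r in sample]
    cut = rng.uniform(min(values), max(values))    # assumed: cut-point drawn uniformly

    def summarize(part):
        part = part or sample                      # an empty leaf falls back to the sample
        n = len(part)
        # (fraction of irregular declarations, mean adjustment amount)
        return sum(r[1] for r in part) / n, sum(r[2] for r in part) / n

    left = summarize([r for r in sample if r[0][j] <= cut])
    right = summarize([r for r in sample if r[0][j] > cut])
    return j, cut, left, right

def fit_forest(rows, n_trees=50, seed=0):
    """Build an ensemble of independently randomized one-split trees."""
    rng = random.Random(seed)
    return [fit_stump(rows, rng) for _ in range(n_trees)]

def evaluate(forest, features):
    """Return (probability of irregularity, expected adjustment) for one declaration."""
    leaves = [left if features[j] <= cut else right
              for j, cut, left, right in forest]
    prob = sum(leaf[0] for leaf in leaves) / len(leaves)
    amount = sum(leaf[1] for leaf in leaves) / len(leaves)
    return prob, amount

# Hypothetical controlled declarations: the first feature drives irregularity,
# the second is noise; irregular declarations carry an adjustment of 100.
controlled = [((x, x % 3), int(x >= 10), 100.0 if x >= 10 else 0.0)
              for x in range(20)]
forest = fit_forest(controlled)

# Evaluate two declarations that have not been controlled yet.
p_low, a_low = evaluate(forest, (0, 1))
p_high, a_high = evaluate(forest, (19, 1))
print(f"low-risk declaration:  p={p_low:.2f}, expected adjustment={a_low:.1f}")
print(f"high-risk declaration: p={p_high:.2f}, expected adjustment={a_high:.1f}")
```

Averaging the leaf summaries over many randomized trees is what turns individually crude splits into a graded score, so uncontrolled declarations can be ranked by probability or by expected amount.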