I like empirical analysis and I like econometric techniques. But sometimes I wonder about the data I am applying these techniques to. A month ago we were conducting a large-scale survey in West Africa. One day we interviewed a household with 17 members, and the interview took the whole day; with a household of 5 members it would have taken about two hours. Interviewers are often paid per questionnaire. Guess what incentives that creates. I have read of a survey in South Africa where entire households were invented, which was discovered only years later when researchers tried to revisit the households.
In the social sciences, empirical studies have increased tremendously in recent years, and survey data (especially from developing countries) are often the underlying data source. Whereas a lot of effort has gone into econometric techniques that deal with identification biases during data analysis, surprisingly little has been done to improve our understanding of data collection, or in other words of interviewer-dependent data.
And here comes Benford’s law:
Newcomb (1881) and later Benford (1938) observed that the leading digits in naturally occurring lists of numbers are not uniformly distributed but follow a logarithmic distribution, P(d) = log10(1 + 1/d) for d = 1, …, 9, a regularity that was proved much later by Hill (1995). Most people are unaware of this relationship (now you know) and are hence not good at fabricating data sets: humans tend to produce a roughly uniform distribution of first digits. Benford’s law can therefore be used as a diagnostic tool to screen large data sets for irregularities, and it has been widely applied to detect tax, scientific, and election fraud.
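As a rough sketch of how such screening works (an illustrative Python example of the general idea, not the specific procedure used in any of the studies mentioned here), one can compare the observed leading-digit frequencies of a data set against Benford’s expected frequencies using a chi-square statistic:

```python
import math
from collections import Counter

def benford_expected():
    """Benford's law: P(d) = log10(1 + 1/d) for leading digits 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """First significant digit of a nonzero number (scientific notation trick)."""
    return int(f"{abs(x):e}"[0])

def benford_chi_square(values):
    """Chi-square statistic of observed leading digits vs. Benford's law.

    Large values (e.g. above ~15.5, the 5% critical value with 8 degrees
    of freedom) suggest the digits deviate from Benford's distribution.
    """
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in benford_expected().items())

# Powers of 2 are a classic Benford-conforming sequence ...
conforming = [2 ** k for k in range(1, 200)]
# ... while a uniform spread of first digits (what fabricators tend
# to produce) deviates strongly.
uniform = list(range(1, 10)) * 100

print(benford_chi_square(conforming))  # small
print(benford_chi_square(uniform))     # large
```

In practice one would run such a test per interviewer or per survey region rather than on the pooled data, since fabrication is typically local to a few questionnaires.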
I recently came across an interesting paper that I think is worth further exploration. Judge and Schechter (2009) propose using Benford’s law to analyze the quality of survey data (in developing countries). More research is certainly necessary to find out when Benford’s law can and cannot detect data manipulation. But in any case, I think survey data quality deserves more systematic attention than it has received in the past, whether through statistical detection techniques or through a better understanding of interviewers’ incentives and their interest in “true” figures.