Johannes' blog

Generating training data from statistics

ChatGPT is a powerful tool that can extract key insights from lengthy PDFs or tables of information. One recent application showed me that it can do more than summarize. My wife was working on a data science project that required a dataset of crimes in a particular country. Plenty of aggregate statistics were available, but there was no case-by-case database to analyze.

We gave ChatGPT all of the statistical information and asked it to generate a sample dataset, with the required attributes, whose distributions match those statistics. The resulting Python script used weighted random sampling, with the weights depending on the type of crime.
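
To give an idea of what weighted randomization that depends on the crime type can look like, here is a minimal sketch. It is not the script ChatGPT produced for us. The crime categories, the `time_of_day` attribute, every percentage, and the helper names `weighted_choice` and `generate_cases` are made-up placeholders for illustration; in practice the weights would come from the published statistics.

```python
import random

# Hypothetical placeholder shares, NOT real crime statistics.
CRIME_TYPE_WEIGHTS = {"theft": 0.6, "assault": 0.3, "fraud": 0.1}

# Hypothetical distribution of a second attribute, conditional on crime type.
TIME_OF_DAY_WEIGHTS = {
    "theft": {"day": 0.7, "night": 0.3},
    "assault": {"day": 0.35, "night": 0.65},
    "fraud": {"day": 0.9, "night": 0.1},
}


def weighted_choice(rng, weights):
    """Pick one key from `weights` with probability proportional to its value."""
    labels = list(weights)
    return rng.choices(labels, weights=[weights[label] for label in labels], k=1)[0]


def generate_cases(n, seed=None):
    """Generate n synthetic case records as a list of dicts."""
    rng = random.Random(seed)  # a fixed seed makes the dataset reproducible
    cases = []
    for case_id in range(1, n + 1):
        # Draw the crime type first, then draw the dependent attribute
        # from the distribution that belongs to that crime type.
        crime_type = weighted_choice(rng, CRIME_TYPE_WEIGHTS)
        time_of_day = weighted_choice(rng, TIME_OF_DAY_WEIGHTS[crime_type])
        cases.append(
            {"case_id": case_id, "crime_type": crime_type, "time_of_day": time_of_day}
        )
    return cases


if __name__ == "__main__":
    for case in generate_cases(5, seed=42):
        print(case)
```

The important design choice is the order of the draws: the crime type is sampled first, and each further attribute is sampled from a distribution chosen by that type. With enough generated cases, the shares in the synthetic dataset approach the weights that were fed in.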