Machine learning depends on data, which can come from many sources, such as websites, databases, and CRM systems. The goal of this project was to build a tool that collects anonymized user interactions from websites. Its key features are:
- Single-click deployment and build process
- Schemaless data collection mechanism
- Asynchronous implementation
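To illustrate what schemaless collection can mean in practice, here is a minimal sketch. It is not the project's actual code: the names `TrackedEvent`, `buildEvent`, and `anonymousId` are assumptions made for this example. Only a small envelope is fixed, while the event properties are free-form, so new event types need no schema change.

```typescript
// Hypothetical event shape: the envelope is fixed, the properties are free-form.
type EventProperties = Record<string, unknown>;

interface TrackedEvent {
  name: string;        // e.g. "page_view" or "button_click"
  timestamp: string;   // ISO 8601 string, set when the event is built
  anonymousId: string; // random identifier, not derived from personal data
  properties: EventProperties;
}

// Builds an event without validating the properties against any schema.
function buildEvent(
  name: string,
  anonymousId: string,
  properties: EventProperties = {},
  now: Date = new Date(),
): TrackedEvent {
  return {
    name,
    timestamp: now.toISOString(),
    anonymousId,
    properties: { ...properties }, // shallow copy so later caller changes do not leak in
  };
}

// Two events with completely different property sets share the same envelope.
const pageView = buildEvent("page_view", "a1b2c3", { path: "/pricing" });
const click = buildEvent("button_click", "a1b2c3", { buttonId: "signup", variant: 2 });
```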
It was possible to reverse engineer most of the features that high-end tracking solutions such as Mixpanel, Google Analytics, or Amplitude offer. Most of the tracking functionality is already available or can be added easily. Thanks to the single-click deployment and automated build process, setup takes 5 to 10 minutes. Because this is open source software, there are no license fees to pay, and the cost savings are significant. Companies own the collected data and do not need to ship it to external providers, which improves privacy. Users can opt out of tracking, and no personally identifiable information is sent anywhere. Because the tracker is loaded asynchronously, it does not block page loading, so the performance impact is minimal.
Almost all of the functionality that high-end tracking tools offer is included in this solution. The tracker was written in such a way that it works in every browser. Users who do not wish to be tracked can opt out.
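The opt-out behavior could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the class `OptOutTracker`, the key name `tracker_opt_out`, and the `KeyValueStore` interface are hypothetical. The idea is that the opt-out flag is checked before anything is sent.

```typescript
// Minimal storage contract; in a browser, window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

const OPT_OUT_KEY = "tracker_opt_out"; // hypothetical key name

// In-memory store so the sketch also runs outside a browser.
class MemoryStore implements KeyValueStore {
  private data = new Map<string, string>();
  getItem(key: string): string | null { return this.data.get(key) ?? null; }
  setItem(key: string, value: string): void { this.data.set(key, value); }
  removeItem(key: string): void { this.data.delete(key); }
}

class OptOutTracker {
  private store: KeyValueStore;
  private send: (eventName: string) => void;

  constructor(store: KeyValueStore, send: (eventName: string) => void) {
    this.store = store;
    this.send = send;
  }

  optOut(): void { this.store.setItem(OPT_OUT_KEY, "1"); }
  optIn(): void { this.store.removeItem(OPT_OUT_KEY); }
  isOptedOut(): boolean { return this.store.getItem(OPT_OUT_KEY) === "1"; }

  // Returns true if the event was sent, false if it was dropped due to opt-out.
  track(eventName: string): boolean {
    if (this.isOptedOut()) return false;
    this.send(eventName);
    return true;
  }
}
```

In a browser, `new OptOutTracker(window.localStorage, sendFn)` would persist the choice across page loads, where `sendFn` stands for whatever function delivers events to the tracking endpoint.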
This solution was tested on production systems. It is a robust working prototype. Some other aspects need to be considered in the future:
- Some browsers automatically delete first-party cookies that were set via JavaScript after 7 days (for example, Safari with Intelligent Tracking Prevention). Therefore, the cookie should be set server-side. Cookies set by the server in an HTTP response are generally not subject to this limit.
- No batching of events and no offline tracking. Currently, every call is sent to the tracking endpoint individually, which is not efficient. A better solution is to send events in batches. The same queue could hold events while the client is offline and send them once the connection returns.
- The tracking code is not hosted on a CDN (content delivery network). To decrease latency, the tracking code should be served from a CDN.
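The batching idea can be sketched as a small queue that flushes once it reaches a size limit. This is a minimal sketch, not part of the current tracker: `EventBatcher`, `maxBatchSize`, and the `sendBatch` callback are hypothetical names, and the transport itself is left to the caller.

```typescript
type BatchSender<T> = (batch: T[]) => void;

class EventBatcher<T> {
  private queue: T[] = [];
  private readonly maxBatchSize: number;
  private readonly sendBatch: BatchSender<T>;

  constructor(maxBatchSize: number, sendBatch: BatchSender<T>) {
    if (maxBatchSize < 1) throw new Error("maxBatchSize must be at least 1");
    this.maxBatchSize = maxBatchSize;
    this.sendBatch = sendBatch;
  }

  // Queues an event and flushes automatically once the batch is full.
  add(event: T): void {
    this.queue.push(event);
    if (this.queue.length >= this.maxBatchSize) this.flush();
  }

  // Sends everything queued so far as one batch; does nothing if the queue is empty.
  flush(): void {
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = []; // start a fresh queue before handing the batch to the sender
    this.sendBatch(batch);
  }

  // Number of events waiting to be sent.
  get pending(): number {
    return this.queue.length;
  }
}
```

In a browser, `flush()` would typically also be called on a timer and when the page is hidden, for example from a `visibilitychange` listener with `navigator.sendBeacon` as the transport. For offline tracking, the same queue could simply be held until the connection returns.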
Due to time constraints, the code is not yet covered by tests. In future versions of the project, all of the aspects mentioned above will be improved or developed further.