As I’m running this site and a couple of other sites, I’ve always wanted to see how many people have viewed them. So I researched how other people have handled it. But most of the information you will find recommends either using an external analytics service or an open-source solution you can host yourself.
I wasn’t happy with either of them. An external service means that details about visitors are passed on to that service. And I don’t want to do that because I try to respect the privacy of every visitor and not bother them with any consent forms.
Self-hosting was an option, and I thought about it as I’m already hosting some services myself. But it’s just some simple page view numbers I want to see. Having a full service that has to run and be maintained is additional work I don’t want to take on.
So, I needed another solution. And, I already had it in my mind for some time: Utilizing the web server logs for my site.
Server logs for the win – or not?
The first thought was that it should be quite easy to just parse the entries and group them by date. This should give me a good overview of how many times a page was viewed on a specific date.
But when I took a first look into the logs, two questions immediately came up:
- How can I identify page requests only so that I don’t include assets like CSS and images?
- How can I exclude spam requests?
I realized that it wasn’t as easy as I had thought in the beginning. However, I wanted to stick with the approach of analyzing the logs to get the information I need.
So, I came up with a new idea: A special URL that is only requested by pages and can be easily detected in the logs.
Simple page view tracking was born
The solution I had in mind was fairly simple and required only a small modification of the website itself plus some tooling to collect the information from the logs. So I came up with a simple snippet which I added to my site:
<script>
  if (window.fetch) {
    // Report the current page as a page view; silently ignore any errors.
    fetch(`/_p?l=${encodeURIComponent(window.location.href)}`, { method: 'POST' }).catch(() => {});
  }
</script>
It does a `POST` request to the path `/_p` and contains the current location as a query parameter. To prevent 404 errors, I’ve added an empty file called `_p` to my site so that the server responds with a success status instead of a “Not Found” status. I’ve chosen to use a `POST` request because it’s not cached by the browser and is easy to spot in the logs.
Now it was time to analyze the logs. In the beginning, I did that with some shell magic, searching for the relevant strings in the logs and doing some primitive parsing of the results. Later I worked on a small CLI that takes the logs as input and prints the page views per date on the console. (As soon as I’ve cleaned it up a little bit, I’ll publish the CLI.)
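To give a rough idea of what such a script does, here is a minimal sketch in Node.js. It is not the actual CLI; it assumes the server writes the widely used combined log format (the file name `pageviews.js` and the exact regular expression are just my choices for this example) and simply counts the tracking requests per date:

```js
// pageviews.js – minimal sketch, not the actual CLI.
// Assumes the access log uses the common "combined" format, e.g.:
//   <ip> - - [10/Oct/2024:13:55:36 +0000] "POST /_p?l=... HTTP/2.0" 200 0 "-" "<user agent>"
const fs = require('fs');
const readline = require('readline');

const logFile = process.argv[2];
if (!logFile) {
  console.error('Usage: node pageviews.js <access.log>');
  process.exit(1);
}

const viewsPerDate = new Map(); // "10/Oct/2024" -> count

const rl = readline.createInterface({
  input: fs.createReadStream(logFile),
  crlfDelay: Infinity,
});

rl.on('line', (line) => {
  // Only count the tracking requests, nothing else.
  if (!line.includes('POST /_p')) return;

  // Extract the date part of the "[10/Oct/2024:13:55:36 +0000]" timestamp.
  const match = line.match(/\[(\d{2}\/\w{3}\/\d{4}):/);
  if (!match) return;

  viewsPerDate.set(match[1], (viewsPerDate.get(match[1]) || 0) + 1);
});

rl.on('close', () => {
  for (const [date, count] of viewsPerDate) {
    console.log(`${date}\t${count}`);
  }
});
```

Running `node pageviews.js access.log` then prints one line per date with the number of page views.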
I was quite happy with the solution and extended it a little bit to let me see where the visitors came from. That is mostly relevant for some of the sites, as there are references to them out there. The part I added was quite simple as well, as you can see in the final snippet.
<script>
  if (window.fetch) {
    const url = new URL(window.location.href);
    // Prefer an explicit ?ref= parameter over document.referrer and
    // remove it from the URL that gets reported.
    let referrer = document.referrer;
    if (url.searchParams.has('ref')) {
      referrer = url.searchParams.get('ref');
      url.searchParams.delete('ref');
    }
    fetch(`/_p?l=${encodeURIComponent(url.href)}&r=${encodeURIComponent(referrer)}`, { method: 'POST' }).catch(() => {});
  }
</script>
After that, I’ve modified the CLI as well so that it handles the new parameter and outputs the referrer information if available.
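The referrer handling boils down to reading the `r` parameter from the logged request path. Here is a rough sketch of that part, under the same combined log format assumption as before (again not the actual CLI code):

```js
const fs = require('fs');

// Sketch of the referrer counting, not the actual CLI code.
function countReferrers(lines) {
  const referrers = new Map();
  for (const line of lines) {
    if (!line.includes('POST /_p')) continue;

    // The request shows up in the log as "POST /_p?l=...&r=... HTTP/x.x".
    const request = line.match(/"POST (\S+) HTTP/);
    if (!request) continue;

    // Resolve the path against a dummy origin so URL can parse the query.
    const referrer = new URL(request[1], 'http://localhost').searchParams.get('r');
    if (referrer) {
      referrers.set(referrer, (referrers.get(referrer) || 0) + 1);
    }
  }
  return referrers;
}

console.log(countReferrers(fs.readFileSync(process.argv[2], 'utf8').split('\n')));
```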
I have to note that parsing the logs is not the most efficient way to do it. But as I’m not running a huge site with thousands of visitors per day, it’s totally fine for me. And if I ever need to scale it up, I can still optimize the parsing process by splitting the logs into smaller chunks and processing them in parallel, or do something else.
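For illustration only, such a parallel version could look roughly like this: one worker thread per (rotated) log file, with the per-date counts merged at the end. This is a hypothetical sketch, not something I actually run:

```js
// parallel-pageviews.js – hypothetical sketch of parallel log processing.
// Each worker counts the page views per date in one log file and the
// main thread merges the partial results.
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
const fs = require('fs');

if (isMainThread) {
  const files = process.argv.slice(2); // e.g. access.log access.log.1 ...
  const countFile = (file) =>
    new Promise((resolve, reject) => {
      const worker = new Worker(__filename, { workerData: file });
      worker.on('message', resolve);
      worker.on('error', reject);
    });

  Promise.all(files.map(countFile)).then((partials) => {
    const total = new Map();
    for (const partial of partials) {
      for (const [date, count] of partial) {
        total.set(date, (total.get(date) || 0) + count);
      }
    }
    for (const [date, count] of total) console.log(`${date}\t${count}`);
  });
} else {
  // Worker: count the "POST /_p" requests per date in a single file.
  const counts = new Map();
  for (const line of fs.readFileSync(workerData, 'utf8').split('\n')) {
    const match = line.includes('POST /_p') && line.match(/\[(\d{2}\/\w{3}\/\d{4}):/);
    if (match) counts.set(match[1], (counts.get(match[1]) || 0) + 1);
  }
  parentPort.postMessage(counts);
}
```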
In the end, this approach not only gives me some basic information about my website traffic without a huge system behind it, but also respects the users’ privacy.