Nginx-Based Web Analytics (Without the Analytics)
At the start of the year I set up a small server for external web applications and, since it had enough storage available, I moved my Jekyll blog there. I renewed my personal domain, set up a local user with limited access to the server, and deployed the blog.
I mostly treat this blog as a writing hobby and a kind of thought-process log. It helps me relax, and it also pushes me to think and reflect on most of the topics you see here (and a few hidden drafts). Measuring traffic wasn’t a priority, and neither was invading user privacy with tracking scripts, so I never added any analytics or user-engagement tools.
That said… it is interesting to know if anyone is actually reading this stuff.
Why not “real” analytics?
Jekyll supports Google Analytics (among many other tracking systems), and GA is the obvious choice if you want to measure web traffic. It makes perfect sense for corporate pages where you care about funnels, conversions, A/B tests, and sales attribution.
But this isn’t that.
For a small personal blog, Google Analytics would be massive overkill. More importantly, I’m not particularly interested in feeding even more behavioral data into big data platforms. They already know plenty — they can leave my small digital corner alone.
Sure, I could write a small JavaScript snippet that sends events to a backend endpoint (Flask, FastAPI, pick your poison). That would work. But it’s extra moving parts, extra maintenance, and it almost inevitably leads to cookies, identifiers, and “just one more metric”.
I wanted something simple, local, cookie-free, and boring. So, being a lazy developer, I decided to work with what I already had: the web server.
Logs as a data source
This blog is served by Nginx. That means traffic information is already sitting there, quietly, in the access logs. No scripts, no beacons, no JavaScript required. All we need to do is parse them.
Let’s assume the default Nginx combined log format and start small: count hits to blog posts under /posts, ignore obvious bots, and see what falls out.
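For reference, a single line in that log looks roughly like the fabricated example below, and awk’s default whitespace splitting is what the scripts rely on:

# A made-up line in the default nginx "combined" access log:
#
#   203.0.113.7 - - [12/Jan/2025:10:15:32 +0000] "GET /posts/back-to-blogging/ HTTP/1.1" 200 5123 "https://duckduckgo.com/" "Mozilla/5.0 (X11; Linux x86_64) ..."
#
# With awk's default field splitting:
#   $1 -> client IP ($remote_addr)
#   $7 -> request path (the middle token of "$request")
#   the quoted User-Agent sits at the end of the line
#
# Quick sanity check that the fields line up, before writing anything fancier:
sudo zcat -f /var/log/nginx/joaocicio.com.access.log* | awk '{ print $1, $7 }' | head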
NBP KPI Script:
#!/usr/bin/env bash
set -euo pipefail

# -------- CONFIG --------
DOMAIN="joaocicio.com"
LOG_DIR="/var/log/nginx"
POST_PATH_REGEX="^/posts/"
BOT_REGEX="bot|spider|crawl|slurp|facebookexternalhit|monitor"
MAX_RESULTS=50
# ------------------------

LOG_PATTERN="${LOG_DIR}/${DOMAIN}.access.log*"

if ! ls $LOG_PATTERN >/dev/null 2>&1; then
    echo "Error: no nginx logs found for ${DOMAIN}"
    exit 1
fi

sudo zcat -f $LOG_PATTERN | awk -v post_re="$POST_PATH_REGEX" -v bot_re="$BOT_REGEX" '
BEGIN {
    IGNORECASE = 1
}
$0 !~ bot_re && $7 ~ post_re {
    count[$7]++
}
END {
    for (p in count)
        printf "%8d %s\n", count[p], p
}
' | sort -nr | head -n "$MAX_RESULTS"
Output:
$ ./nbp_v1.sh
41 /posts/a-strategic-look-at-the-russo-ukrainian-war/
38 /posts/a-minimal-backup-system-using-standard-linux-tools/
36 /posts/the-battle-of-thermopylae/
26 /posts/connecting-multiple-clusters-with-opnsense-and-wireGuard/
19 /posts/cordynet-a-mycelium-based-network/
18 /posts/checklist-nis2/
17 /posts/sobre-a-nis2/
8 /posts/aristotlian-ethics/
8 /posts/ansible-script-for-basic-server-security/
8 /posts/alien-earth-review/
5 /posts/unexpected-company-on-the-montain-trail/
5 /posts/on-the-rise-of-stoicism/
5 /posts/arcane-review/
5 /posts/alien-romulus-review/
4 /posts/goodbye-vms-hello-ubuntu-server/
3 /posts/back-to-blogging/
2 /posts/the-past-10-years/
1 /posts/the-past-10-years/%7Burl%7D
1 /posts/goodbye-vms-hello-ubuntu-server/%7Burl%7D
1 /posts/back-to-blogging/%7Burl%7D
1 /posts/alien-romulus-review/%7Burl%7D
1 /posts/alien-earth-review/%7Burl%7D
And it works. We get a list of posts ordered by total hits. We also immediately spot some odd URLs ending in %7Burl%7D (a URL-encoded {url}), likely a templating or link-generation artifact. I’ll just ignore them for now.
This already answers a basic question: which posts are being requested the most?
But total hits alone don’t tell the full story.
From hits to unique clients
In the next iteration, I extended the script to track not only total views, but also unique clients, based on the client IP address. While not fully accurate (because of NAT, VPNs, etc.), unique client IPs are the best approximation I can get to “unique clients” from server logs alone. And that’s fine. The goal here is not precision — it’s signal.
NBP KPI Script v2:
#!/usr/bin/env bash
set -euo pipefail

# -------- CONFIG --------
DOMAIN="joaocicio.com"
LOG_DIR="/var/log/nginx"
LOG_PATTERN="${LOG_DIR}/${DOMAIN}.access.log*"
POST_PATH_REGEX="^/posts/"
BOT_REGEX="bot|spider|crawl|slurp|facebookexternalhit|monitor"
MAX_RESULTS=50
# ------------------------

if ! ls $LOG_PATTERN >/dev/null 2>&1; then
    echo "Error: no nginx logs found for ${DOMAIN} (${LOG_PATTERN})"
    exit 1
fi

sudo zcat -f $LOG_PATTERN | awk -v post_re="$POST_PATH_REGEX" -v bot_re="$BOT_REGEX" '
BEGIN {
    IGNORECASE = 1
}
$0 !~ bot_re && $7 ~ post_re {
    path = $7
    ip   = $1

    views[path]++

    key = path SUBSEP ip
    if (!(key in seen)) {
        seen[key] = 1
        uniq[path]++
    }
}
END {
    for (p in views) {
        printf "%d\t%d\t%s\n", views[p], (p in uniq ? uniq[p] : 0), p
    }
}
' \
| sort -t $'\t' -k1,1nr \
| head -n "$MAX_RESULTS" \
| awk 'BEGIN { print "VIEWS\tUNIQUE_IPS\tPATH" } { print }' \
| column -t
Output:
$ ./jcicio-uniques.sh
VIEWS UNIQUE_IPS PATH
41 7 /posts/a-strategic-look-at-the-russo-ukrainian-war/
38 7 /posts/a-minimal-backup-system-using-standard-linux-tools/
36 11 /posts/the-battle-of-thermopylae/
26 15 /posts/connecting-multiple-clusters-with-opnsense-and-wireGuard/
19 8 /posts/cordynet-a-mycelium-based-network/
18 15 /posts/checklist-nis2/
17 9 /posts/sobre-a-nis2/
8 6 /posts/alien-earth-review/
8 7 /posts/ansible-script-for-basic-server-security/
8 8 /posts/aristotlian-ethics/
5 4 /posts/unexpected-company-on-the-montain-trail/
5 5 /posts/alien-romulus-review/
5 5 /posts/arcane-review/
5 5 /posts/on-the-rise-of-stoicism/
4 3 /posts/goodbye-vms-hello-ubuntu-server/
3 3 /posts/back-to-blogging/
2 2 /posts/the-past-10-years/
1 1 /posts/alien-earth-review/%7Burl%7D
1 1 /posts/alien-romulus-review/%7Burl%7D
1 1 /posts/back-to-blogging/%7Burl%7D
1 1 /posts/goodbye-vms-hello-ubuntu-server/%7Burl%7D
1 1 /posts/the-past-10-years/%7Burl%7D
Now we can see something more interesting: posts with fewer total hits can still attract more unique clients, while others get repeated views from a smaller audience. Already, this tells a better story than raw hit counts.
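If you want to make that contrast explicit, a small optional post-processing step can add a views-per-unique-IP ratio to the script’s output (assuming it is saved as ./jcicio-uniques.sh, as in the run above):

# Append a VIEWS_PER_IP column to the table produced above.
./jcicio-uniques.sh | awk '
    NR == 1 { print $0, "VIEWS_PER_IP"; next }   # extend the header row
    { printf "%s %.1f\n", $0, $1 / $2 }          # total views / unique IPs
' | column -t

A ratio close to 1 means mostly one-off visitors; a higher ratio means a smaller audience coming back to the same post.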
Filtering for “real” browsers
In a third iteration, I tightened the filter a bit more by dropping obvious bots via User-Agent matching, keeping only browser-like User-Agents, and normalizing URLs to avoid duplicates caused by trailing slashes or query strings.
Again, this is best-effort filtering, not a guarantee. Plenty of bots pretend to be browsers, and some real clients don’t look like browsers at all. This is more an attempt to reduce noise than to achieve correctness.
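Before trusting those regexes too much, it helps to eyeball what is actually hitting the posts. A quick one-liner for that (again assuming the combined log format, where the User-Agent is the last quoted field):

# Most common User-Agents requesting /posts/, useful for tuning
# BOT_REGEX and BROWSER_REGEX against real traffic.
sudo zcat -f /var/log/nginx/joaocicio.com.access.log* \
    | awk -F'"' '$2 ~ /\/posts\// { print $6 }' \
    | sort | uniq -c | sort -nr | head -n 20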
NBP KPI Script v3:
#!/usr/bin/env bash
set -euo pipefail

# -------- CONFIG --------
DOMAIN="joaocicio.com"
LOG_DIR="/var/log/nginx"
LOG_PATTERN="${LOG_DIR}/${DOMAIN}.access.log*"

# Count only blog posts
POST_PATH_REGEX="^/posts/"

# Drop obvious bots
BOT_REGEX="bot|spider|crawl|slurp|facebookexternalhit|monitor"

# Keep only browser-like User-Agents
BROWSER_REGEX="Mozilla|Chrome|Safari|Firefox|Brave"

MAX_RESULTS=50
# ------------------------

if ! ls $LOG_PATTERN >/dev/null 2>&1; then
    echo "Error: no nginx logs found for ${DOMAIN}"
    exit 1
fi

sudo zcat -f $LOG_PATTERN | awk \
    -v post_re="$POST_PATH_REGEX" \
    -v bot_re="$BOT_REGEX" \
    -v browser_re="$BROWSER_REGEX" '
BEGIN {
    IGNORECASE = 1
    OFS = "\t"
}

# nginx combined log format assumptions:
#   $1 = client IP
#   $7 = request path
$0 !~ bot_re &&
$0 ~ browser_re &&
$7 ~ post_re {
    path = $7
    ip   = $1

    # normalize URLs
    sub(/\?.*$/, "", path)   # strip query string
    sub(/\/$/, "", path)     # strip trailing slash

    views[path]++

    key = path SUBSEP ip
    if (!(key in seen)) {
        seen[key] = 1
        uniq[path]++
    }
}

END {
    for (p in views) {
        printf "%d\t%d\t%s\n", views[p], uniq[p], p
    }
}
' \
| sort -t $'\t' -k1,1nr \
| head -n "$MAX_RESULTS" \
| awk 'BEGIN { print "VIEWS\tUNIQUE_IPS\tPATH" } { print }' \
| column -t
Output:
VIEWS UNIQUE_IPS PATH
41 7 /posts/a-strategic-look-at-the-russo-ukrainian-war
38 7 /posts/a-minimal-backup-system-using-standard-linux-tools
36 11 /posts/the-battle-of-thermopylae
26 15 /posts/connecting-multiple-clusters-with-opnsense-and-wireGuard
19 8 /posts/cordynet-a-mycelium-based-network
18 15 /posts/checklist-nis2
17 9 /posts/sobre-a-nis2
8 6 /posts/alien-earth-review
8 7 /posts/ansible-script-for-basic-server-security
8 8 /posts/aristotlian-ethics
5 4 /posts/unexpected-company-on-the-montain-trail
5 5 /posts/alien-romulus-review
5 5 /posts/arcane-review
5 5 /posts/on-the-rise-of-stoicism
4 3 /posts/goodbye-vms-hello-ubuntu-server
3 3 /posts/back-to-blogging
2 2 /posts/the-past-10-years
1 1 /posts/alien-earth-review/%7Burl%7D
1 1 /posts/alien-romulus-review/%7Burl%7D
1 1 /posts/back-to-blogging/%7Burl%7D
1 1 /posts/goodbye-vms-hello-ubuntu-server/%7Burl%7D
1 1 /posts/the-past-10-years/%7Burl%7D
The result? Pretty much the same distribution. Not great, not terrible — which is actually reassuring. It means bot noise wasn’t massively skewing the earlier results.
Summary
Now, I don’t think I need to state this, but just for clarity’s sake: this is not an analytics platform. It will not provide demographics, session tracking, funnels, attribution, or behavioral analysis. It is not meant to compete with, or be an alternative to, Google Analytics, Plausible, or any other tool in that space. And that’s intentional.
This is a lightweight traffic measurement tool that gives you a rough but useful picture of traffic on your website, without cookies, JavaScript, or tracking pixels, and without user identifiers beyond what already exists in server logs. It just tells you about page requests, counted and summarized.
And in doing so, it does answer simple questions:
- Are people hitting my posts?
- Which ones get read the most?
- Do some posts attract more distinct clients than others?
And, for a personal blog, I think that’s more than enough.
Possible evolutions (without betraying the spirit)
If I ever feel like taking this a step further, there are obvious and still privacy-respecting additions I could explore: exporting results as CSV, using cron to generate daily, weekly, or monthly summaries, correlating traffic sources using Referer headers, or even enriching IPs with GeoIP at a very coarse level.
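As a taste of the first two, here is a minimal sketch of the CSV-plus-cron idea. The wrapper name, output directory, and install path are all made up, and it assumes the v3 script is saved as /usr/local/bin/nbp_v3.sh:

#!/usr/bin/env bash
# nbp_report.sh -- hypothetical wrapper: run the v3 script and archive the
# result as a dated CSV, so cron can accumulate daily snapshots over time.
set -euo pipefail

OUT_DIR="/var/lib/nbp"      # made-up location for the reports
STAMP="$(date +%F)"         # e.g. 2026-02-01

mkdir -p "$OUT_DIR"

# Turn the space-aligned table into a plain CSV: views,unique_ips,path
/usr/local/bin/nbp_v3.sh \
    | awk 'NR > 1 { print $1 "," $2 "," $3 }' \
    > "${OUT_DIR}/posts-${STAMP}.csv"

# Example crontab entry (root's crontab, since the script reads /var/log/nginx):
#   5 0 * * * /usr/local/bin/nbp_report.sh

Weekly or monthly summaries then become a matter of concatenating and re-aggregating those CSVs.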
At that point, you could build a very basic view of traffic trends, while keeping it local and still cookie-free. But for now? This is good enough. I’ll leave that rabbit-hole for some other time.
Final thoughts
Is it pretty? No. Do I know the demographics? Also no. Do I know traffic sources? Not really. But does it do what I wanted it to do? Absolutely!
Poor man’s analytics, built with shell scripts and log files. No personal invasion, no behavioral tracking, no silent surveillance. Just hits.
Clean, as every privacy-respecting tool should be.