As described yesterday, I’ll use the logs of my web server to monitor for technical problems, such as wrong links, plus misuse and attacks. Let me share some basic info about this logging business for full disclosure.
What Gets Logged
Basically, with each click your web browser sends data to the web server that is technically required to make things happen:
- the web page (or, more generally, the resource) you want to see (get),
- the web page you’re coming from, the so-called referrer, i.e. the identifier (URI) of the page where you clicked the link (unless you typed the address yourself),
- information about your browser,
- your IP address (your current unique address on the Internet), as otherwise the poor server would not know where to send the requested page, and
- technical gibberish required to establish a sane technical communication.
All of this information, or part of it, depending on the web server’s settings, is written into a log file. As mentioned, these logs can help to identify technical problems, or to detect misuse and attacks. So, simply to make the Internet work, you leave behind a trace, even without any trackers on the web pages.
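To make the list above a bit more tangible, here is a minimal Python sketch of what such a request looks like as text. The function name `build_get_request`, the host, the path and the browser string are all made up for illustration; a real browser sends more header lines than these. Note that the IP address does not show up here, as it is not part of the request text.

```python
def build_get_request(host: str, path: str, referer: str, user_agent: str) -> str:
    """Assemble the text of a minimal HTTP/1.1 GET request (illustrative only)."""
    lines = [
        f"GET {path} HTTP/1.1",       # the resource you want to see
        f"Host: {host}",              # which site on the server is meant
        f"Referer: {referer}",        # the page where you clicked the link
        f"User-Agent: {user_agent}",  # what the browser says about itself
        "Connection: close",
    ]
    # Each header line ends with CRLF; an empty line closes the header block.
    return "\r\n".join(lines) + "\r\n\r\n"


# All values below are hypothetical placeholders.
request = build_get_request(
    host="www.example.com",
    path="/posts/some-post/",
    referer="https://www.example.com/",
    user_agent="ExampleBrowser/1.0",
)
print(request)
```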
For illustration, here’s the request sent from my web browser when I clicked on the link for the post “Tracking Fireworks” on the home page. Check out the first line with GET, and the lines Referer[1] and User-Agent; these are the data items relevant for the user, as outlined above:[2]
And here is how this request got logged on my nginx server in Frankfurt (it’s one single long line, wrapped at arbitrary places). I am sure you’ll recognise the data from the request, now including the IP address:
This is the type and amount of data you leave behind on any web site, or its corresponding server, to be precise. So let’s think about the possibilities this minimum data set might have for doing some tracking and profiling anyway.
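If you want to pick such a log line apart yourself, the following is a rough Python sketch. It assumes the server writes nginx’s predefined “combined” format, which may not match the settings of any particular server. The sample line is invented (documentation-range IP address, example.com, a fictitious browser), and the names `COMBINED`, `parse_line` and `SAMPLE` are mine.

```python
import re
from typing import Optional

# Assumed layout (nginx "combined" format):
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the fields of one log line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


# Hypothetical log line, not taken from a real server.
SAMPLE = (
    '203.0.113.7 - - [01/Jan/2020:12:00:00 +0100] '
    '"GET /posts/some-post/ HTTP/1.1" 200 5120 '
    '"https://www.example.com/" "ExampleBrowser/1.0"'
)

entry = parse_line(SAMPLE)
for name, value in entry.items():
    print(f"{name:>10}: {value}")
```

The fields map directly onto the list further up: requested resource, referrer, browser information and IP address.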
Tracking and Profiling Based on Logs?
Let’s ponder the question of whether we can be tracked based on that technical information alone.
For starters, the logging data sits with the infrastructure providers, not with any data-harvesting and profiling companies. In contrast, data “phoned home” by trackers accumulates directly on the computers of the tracking companies. Of course, if the tracking company is also an infrastructure provider, they would have access to these logs. I cannot say how much, say, Amazon is in the tracking business, using the logs from their massive infrastructure. But I think it’s much easier to collect data using tracker scripts in web pages than using logs, and you get much more data this way.
An important aspect here is that trackers leave behind cookies, small pieces of information that uniquely identify you. The tracking script always “knows” it’s you. With logs, you don’t have cookies. Without a cookie that identifies you it’s not possible to exactly trace your steps across multiple clicks, across different websites, and along a timeline – each log entry is a single event.
Of course, the IP address plus the browser information and referrer data together allow some guessing, but if two people are surfing the web from a single IP address, say, in a household, or an office, there’s no easy way to keep them apart – and profile – without cookies. Or if you use your mobile phone at home via WiFi, then on the road via LTE, then in a coffee shop and then in the office using their WiFi, you’ll have different IP addresses at each location, so again, no easy way of identifying you as one individual without a cookie stored on your phone.
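The guessing game described above can be sketched in a few lines of Python. Everything here is hypothetical: the entries, the names, and the `who` column, which is ground truth a server never gets to see. The function `guess_visitors` simply treats each combination of IP address and browser string as one “visitor”.

```python
from collections import defaultdict

# (ip, user_agent, path, who) with "who" being invisible to the server.
ENTRIES = [
    ("203.0.113.7", "ExampleBrowser/1.0", "/", "Alice at home"),
    ("203.0.113.7", "ExampleBrowser/1.0", "/posts/a/", "Bob at home"),
    ("198.51.100.23", "ExamplePhone/2.0", "/posts/b/", "Carol on LTE"),
    ("192.0.2.44", "ExamplePhone/2.0", "/posts/c/", "Carol in a coffee shop"),
]


def guess_visitors(entries):
    """Group requested paths by (IP address, browser string)."""
    visitors = defaultdict(list)
    for ip, user_agent, path, _who in entries:
        visitors[(ip, user_agent)].append(path)
    return dict(visitors)


guessed = guess_visitors(ENTRIES)
for key, paths in guessed.items():
    print(key, paths)
```

Alice and Bob collapse into a single “visitor” because they share an IP address and use the same browser, while Carol is split in two because her phone changed networks. Without a cookie, the grouping is wrong in both directions.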
So, bottom line, server logs keep some basic information about your on-line activities, and these logs can of course be analysed. However, they are server-specific, and a far cry from the power and possibilities of trackers and their cookies embedded on each and every web page.
I am not denying that more sophisticated tracking based on server logs is possible in principle. Imagine a company with lots of computing power, large databases, and maybe machine learning (think pattern learning and matching): they might be able to extract more information using correlation and analysis over longer periods, and deduce some patterns and basic profiles this way. But again, just embedding trackers everywhere is so much easier for them.
In case some lawmakers crack down on trackers embedded in webpages in the future (which is highly unlikely at this point, given that data harvesting, profiling and the connected ads are a multi-billion business), we’ll have to learn and adapt again.
Browser Information
Note that the information the browser sends to the server about itself is not reliable: each browser is free to send whatever it wants, and the user can change it as well.[3] See this screenshot of the Safari browser on the Mac: it can even pretend to be a Microsoft Edge browser on Windows, even though I am not sure any decent browser would want to appear as that… just kidding, Edge is quite good, at least compared to Internet Explorer.
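The same holds for any program that talks to a web server, not just browsers. As a small sketch using Python’s standard `urllib`, here is a request object prepared with an arbitrary browser string. The string in `CLAIMED` is invented and not a real Edge identifier, and nothing is actually sent over the network.

```python
from urllib.request import Request

# Made-up browser string; a client can claim to be anything it likes.
CLAIMED = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleEdge/1.0"

# Only builds the request object; no connection is opened.
req = Request("https://www.example.com/", headers={"User-Agent": CLAIMED})

# urllib stores header names via str.capitalize(), hence "User-agent".
print(req.get_method(), req.full_url)
print(req.get_header("User-agent"))
```

The server has no way to verify this value; it simply logs whatever string arrives.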
Maybe Apple and the other browser developers will start to include less detailed information about the browser and platform in the future, in order to make the identification of single users more difficult. Unless… trackers and cookies get legally limited somehow, and the browser information is subverted to include some cookie-like identifying data elements. Ayo.
I think the browser info is even sort of a relic of former times, when websites had to know which browsers they were “serving”, in order to cope with their technical differences. Websites that strive to be compatible with older browsers still have lots of browser-specific code (CSS, JavaScript, even HTML), especially for Internet Explorer. With the advent of more and more browsers that adhere to the formal specs, this won’t be necessary anymore. Hopefully. Unless Google finds another twist to monopolise a non-standard format.
As of today, one can iron out most differences using a normalising CSS file. Yay.
[1] Yes, Referer, not Referrer. The spec reads: The Referer[sic] request-header field allows the client to specify, for the server’s benefit, the address (URI) of the resource from which the Request-URI was obtained (the “referrer”, although the header field is misspelled.)
[2] The current IP address is not part of the HTTP message itself, it’s used on a lower protocol layer to establish the connection, if memory serves.
[3] And of course users can cloak their IP address using VPN technology.