The WSJ has this rather interesting story up on The Internet...
The Journal's study shows the extent to which Web users are in effect exchanging personal data for the broad access to information and services that is a defining feature of the Internet.
In an effort to quantify the reach and sophistication of the tracking industry, the Journal examined the 50 most popular websites in the U.S. to measure the quantity and capabilities of the "cookies," "beacons" and other trackers installed on a visitor's computer by each site. Together, the 50 sites account for roughly 40% of U.S. page-views.
Well, yes and no.
Let's quantify a few things and define some terms.
A cookie. A cookie is an alphanumeric code. It is of no value to anyone except the party that sets it. That party (the web site) uses it to link you to some action - for example, it can be an authentication token, so when you come back to a site the site "knows you." It can link a shopping cart (held on that site, say, Amazon.com) to you, so if you close a browser window and come back, it's still there. But the cookie itself doesn't carry any useful information - rather, it is a bookmark into that useful information. Note that a cookie, by itself, is not a password. For example, a cookie on my forum might be "1020939395912391912959". Can you derive a password from that, or even a login ID? Not directly. If you were to send it (and it was a valid cookie), however, you would discover that it indeed "belongs" to a given user, in that you would be signed in as that person. Cookies are typically drawn from a very sparse space to prevent guesses from being effective - that is, a site might generate a random 64-bit or 128-bit number to use as the cookie. As such, the odds that a random guess actually corresponds to something useful are close to zero.
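To make the "bookmark" idea concrete, here is a minimal sketch in Python. It is not how any particular site does it; the `sessions` table, the function names and the user "alice" are all made up for illustration. The point is only that the cookie is a random key into data held on the server, and that a random guess almost never lands on a live key.

```python
import secrets

# Hypothetical server-side session table: cookie value -> user record.
# The useful information lives here, on the site, not in the cookie.
sessions = {}

def issue_cookie(user_id):
    """Create a random 128-bit token and remember which user it belongs to."""
    token = secrets.token_hex(16)  # 16 random bytes = 128 bits, as 32 hex chars
    sessions[token] = {"user_id": user_id, "cart": []}
    return token

def lookup(cookie):
    """Return the record the cookie points at, or None for an unknown value."""
    return sessions.get(cookie)

def guess_probability(active_sessions, bits=128):
    """Chance that a single random guess hits any live session."""
    return active_sessions / 2 ** bits

cookie = issue_cookie("alice")
print(lookup(cookie)["user_id"])          # the site "knows you"
print(lookup("1020939395912391912959"))   # an unknown value maps to nothing
print(guess_probability(1_000_000))       # even with a million live sessions
```

Even assuming a million signed-in users at once, a single guess against a 128-bit space succeeds with a probability on the order of 10^-33.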
A "beacon" or "silent file". These are typically zero-size (or nearly so) images. Their purpose is simply to identify that you visited a given page where they appear, and record that with your IP address. That recording comes out of the setting site's access logs.
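As a rough illustration of what "comes out of the access logs" means, the sketch below pulls beacon requests out of log lines written in the widely used "combined" log layout. The log lines, the `/pixel.gif` path and the `beacon_hits` helper are invented for the example; a real site's log layout and beacon name will differ.

```python
import re

# The common "combined" access-log layout. The sample lines below are invented.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def beacon_hits(lines, beacon_path="/pixel.gif"):
    """Return (ip, page) pairs for every request that fetched the beacon image."""
    hits = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        # Ignore any query string when comparing the requested path.
        if m and m.group("path").split("?")[0] == beacon_path:
            # The referer field names the page the beacon was embedded in.
            hits.append((m.group("ip"), m.group("referer")))
    return hits

sample = [
    '203.0.113.7 - - [01/Aug/2010:10:00:00 -0400] "GET /pixel.gif HTTP/1.1" '
    '200 43 "http://example.com/article.html" "Mozilla/5.0"',
    '203.0.113.7 - - [01/Aug/2010:10:00:01 -0400] "GET /style.css HTTP/1.1" '
    '200 1200 "http://example.com/article.html" "Mozilla/5.0"',
]
print(beacon_hits(sample))
```

All the beacon yields here is an IP address and the page it sat on, which is the same thing the publisher's own log already records for the page itself.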
Cookies are not a concern in and of themselves. They store nothing, and unless you submit some information the only thing a cookie does is track your presence. That is, it may identify your unique computer to some site, so it knows you were there "X" number of times. If you want to sign into a site that requires you to provide a password of some sort, you pretty much need to allow cookies to function, because the web's protocol is stateless - that is, once the page you're viewing has downloaded there is no persistent connection between you and the web site. As such, without something like a cookie, when you click something the site has no way to know that you are who you were, so to speak.
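Here is a small simulation of that statelessness, again only a sketch. The `handle_request` function, the `visitor` cookie name and the `X-Visit-Count` header are made up for the example; each call stands in for one independent page request, and the only thing tying one call to the next is the cookie the browser sends back.

```python
import secrets
from http.cookies import SimpleCookie

visits = {}  # hypothetical server-side store: cookie value -> visit count

def handle_request(headers):
    """Handle one independent request and return the response headers."""
    jar = SimpleCookie(headers.get("Cookie", ""))
    if "visitor" in jar and jar["visitor"].value in visits:
        # A known cookie came back: this is a returning visitor.
        token = jar["visitor"].value
        response = {}
    else:
        # No (or unknown) cookie: the site cannot tell who this is, so it starts fresh.
        token = secrets.token_hex(16)
        visits[token] = 0
        response = {"Set-Cookie": f"visitor={token}"}
    visits[token] += 1
    response["X-Visit-Count"] = str(visits[token])
    return response

first = handle_request({})                                 # first visit, no cookie yet
second = handle_request({"Cookie": first["Set-Cookie"]})   # browser returns the cookie
third = handle_request({})                                 # cookies disabled: a stranger again

print(first["X-Visit-Count"], second["X-Visit-Count"], third["X-Visit-Count"])
```

The third request shows what happens with cookies turned off: the site has nothing to link it to the earlier two, so it treats you as a brand-new visitor.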
A "beacon" is similarly rather useless, other than to identify that you viewed a specific page. Again, the utility of such knowledge is questionable, but more importantly if you viewed it, why do you mind that the person who published it knows you viewed it?
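The third category, however, is a problem: code that a site loads and runs on your machine, such as a browser helper, toolbar or other persistent application. And this is where I would draw the line.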
For example, if you order a product it's obvious you did. The merchant knows, and he has a right to use that fact, absent some agreement otherwise. But what if you start to type in an order, then abandon it? Does he have a right to know that?
Well, with some of these tools, he does know that. More importantly, if he's got a browser helper or other persistent application on your machine, he can then potentially find out that you bought a competitor's product - or are looking at one.
That's a potential problem.
More insidious would be outright nasty things, such as a keystroke logger. While I've yet to see anything like this "in the wild", it is not impossible if you can get someone to load a toolbar or other similar extension.
I find most of this sort of article intentionally alarming, and frankly, misleading. There's nothing wrong with a publisher knowing you looked at their material, any more than there is with a bookstore keeping records of your purchases if you give them identifying information, and using that to mail you coupons or something similar. In that regard the online world is exactly like the offline one.
There's a gray area, though, when you have a site like eBay that allows dozens of other companies to stick tracking devices (mostly beacons) on its site. That's a potential problem, because now any firm that buys that space suddenly gets a copy of what you were looking at or doing, and yet you have no relationship with that company.
It would be easy to say "ban that!" but doing so means the advertising model of the web is destroyed. For example, The Market Ticker runs Google "Adsense" ads on the right sidebar. Google thus gets information when you view a Ticker, and they in fact can see what the Ticker was about. I have no control over what they do with this information once they have it. Note that all they're getting in this case is that you looked, but still.....
The forum, likewise, displays ads on the top bar and, for non-donors, interleaved with messages. Again, Google can "see" what the link is you're reading. If you're signed in they can't see beyond the URL (since they aren't signed in) but if you're not on an area that requires authentication they can see the entire contents of the page - just as you do. Do they look and analyze the content on the page? I presume so.
Can you separate this out? Not really, unless you want to make display ads - that is, the payment for eyeballs (not clicks or purchases) - unlawful. In addition, disallowing the advertiser from seeing the content associated with his ad would prohibit the targeting of those ads to the relevant content. You'll note that if you read an article about Barack Obama on The Ticker, the ads you see have to do with him. This is only possible because Google can see what you're reading. Without that, targeting becomes impossible to accomplish.
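A toy sketch shows why targeting depends on seeing the page. This is emphatically not how Google or any real ad network works; the `ADS` inventory, the keywords and the `pick_ad` function are invented. It simply picks the ad whose keywords overlap most with the page text, and has nothing to go on when the text is withheld.

```python
import re

# Hypothetical ad inventory: ad name -> keywords the advertiser wants to match.
ADS = {
    "political-books": {"obama", "election", "congress"},
    "brokerage": {"stocks", "market", "bonds"},
}

def pick_ad(page_text):
    """Return the ad whose keywords overlap most with the page, or None."""
    if not page_text:
        return None  # nothing visible, so nothing to target against
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    best, best_score = None, 0
    for name, keywords in ADS.items():
        score = len(words & keywords)
        if score > best_score:
            best, best_score = name, score
    return best

print(pick_ad("Barack Obama addressed Congress today"))
print(pick_ad(None))  # content withheld from the advertiser
```

With the page text available the political ad wins; with it withheld the function can only return nothing, or in practice an untargeted ad.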
Where I draw the line and am willing to sound the alarm is when a site tries to load an application without your explicit permission at the time the application is loaded or launched, and without an explicit description of exactly what it stores, what it sends, to whom and why.
Indeed, I'd go so far as to call such an act theft, since your computer is your property, and by going to a web page you are not giving permission, implicitly or otherwise, for code to be loaded and executed on your computer.
If I caught an advertising agency (e.g. Google) doing that on my sites, they'd be toast. Instantly. What I don't know, because The Journal doesn't tell us, is whether they actually detected this sort of thing - and if so, on which sites they did.
If we need regulation it's on the third category of web activity - that should be absolutely unlawful without the explicit consent of the consumer, who first is apprised of exactly what is being monitored, transmitted and stored, how it will be used and exactly who is going to get the data, with explicit and strict penalties for either non-disclosure or lying.
In short, the problem with the "study" the Journal did is that it doesn't distinguish between content on the host's machine (that is, what you view, which can be gleaned from ordinary web analytics off the log files of the web server) and actual loaded code that is literally spying on you.
Specifically, it is the unauthorized transmission of information - that is, transmission of your data not as a result of clicking something, not as a result of submitting text in a box or form, but rather automatically just because you mouse over or key something - without submitting it - that is problematic.
I'd like to see a follow-up to this, and have so noted to the article's authors....