Expect to see more online data scraping, thanks to a misinterpreted court ruling

In a case involving LinkedIn, a US appellate court has come to an obvious conclusion: scraping publicly-visible online data and content doesn't violate The Computer Fraud and Abuse Act. What does it mean? That's where things get interesting.


A US appellate court, in a case involving LinkedIn, recently ruled that scraping publicly-visible information does not violate The Computer Fraud and Abuse Act.

This decision (ZDNet's take is here) has a reality component and a perception component. In reality, the ruling is delightfully narrow and unlikely to have much of a legal impact. As for the perception part, that's where enterprise Web chiefs and their IT colleagues are likely to suffer plenty of headaches. The same is true of enterprise marketing execs (but most of them deserve it).

Reality: The ruling didn’t say that web-scraping from competitors is legal. It merely said that it didn’t violate this specific law. It might violate other criminal laws and certainly some civil laws, but the panel only ruled on what was presented to it, as it should.

But the perception of most people, egged on by misleading headlines that the court gave a legal greenlight to all scraping, is that the practice is now legal and scrapers can proceed aggressively. Even though the court said nothing of the kind, it’s easy to predict that this will fuel an increase in scraping.

How much of an increase? Well, it won’t likely be a big increase. Why? Because the kind of people who steal content through scraping are not exactly holding back when it comes to the law. It’s not as though there are a ton of marketers who wanted to scrape but judiciously held back until the courts ruled on scraping’s legality.

That said, the misinterpretation of this ruling will encourage scrapers to do a lot more scraping. 

What can and should IT do about that? Given that these are generally publicly-visible pages, it’s a problem. There are few technical methods to block scrapers that wouldn’t cause problems for the site visitors the enterprise wants.
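To make that trade-off concrete, here is a minimal sketch of one common blocking technique: a sliding-window rate limiter keyed on the client's address. Everything in it is hypothetical and for illustration only, including the `SlidingWindowLimiter` class, the `count_blocked` helper, the threshold of three requests per 10 seconds, and the sample addresses. It is not how LinkedIn or any particular site handles scrapers.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical limiter: at most `max_requests` per client per window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps a client identifier (assumed here to be an IP address)
        # to the timestamps of its recently allowed requests.
        self._hits = defaultdict(deque)

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: this request is blocked
        hits.append(now)
        return True


def count_blocked(limiter: SlidingWindowLimiter, client_id: str, timestamps) -> int:
    """Replay a series of request times and count how many get blocked."""
    return sum(not limiter.allow(client_id, t) for t in timestamps)


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)

    # A naive scraper hammering from a single address is throttled...
    print(count_blocked(limiter, "203.0.113.7", [0, 1, 2, 3, 4]))

    # ...but so are three real readers who share one office address
    # and each load two pages within the same window.
    print(count_blocked(limiter, "198.51.100.9", [0, 1, 2, 3, 4, 5]))
```

The limiter cannot tell the two cases apart. A scraper that slows down or rotates addresses stays under the threshold, while an office or campus that reaches the site through one shared address gets blocked. Loosen the limit and the scraper gets through; tighten it and the paying visitors are the ones who complain.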

Years ago, I was managing a media outlet that was making a huge move to premium content, meaning that readers would now have to pay for selected premium stories. We ran into a problem. We couldn’t allow people to freely share premium content, as we needed people to buy those subscriptions. 

That meant that we blocked cut-and-paste and specifically blocked someone from saving the page as a PDF. But that meant that those pages also couldn’t be printed. (Saving as PDF is really printing to PDF, so blocking PDF downloads meant blocking all printers.) It took just a couple of hours before new premium subscribers screamed that they paid for access and they need to be able to print pages and read them at home or on a train. After quite a few subscribers threatened to cancel their paid subscriptions, we surrendered and reinstated the ability to print. (And our fears were confirmed; PDFs of our premium content started appearing all over the place.)

That dilemma is similar to fighting scraping efforts. And most web people will quickly conclude that just accepting the scrapers is probably the best call.

Getting back to the LinkedIn case, I would argue that even citing The Computer Fraud and Abuse Act was a massively wrong-headed argument from LinkedIn. A better argument, though perhaps one equally unlikely to win, would be copyright violation.

LinkedIn’s particulars make that argument tough. Unlike a media outlet (such as "Computerworld"), LinkedIn doesn’t pay money to create excellent content. The overwhelming majority of the content being scraped is what LinkedIn customers individually write for free. Can LinkedIn even argue with a straight face that it legitimately owns all of the information in my resume, which I posted on my page on LinkedIn?

If LinkedIn paid me to post comments and messages and work history details, then maybe it could argue ownership. But that’s not what they do. 

However, do users expect material they post on LinkedIn to appear only on LinkedIn? More to the point, do those users have any realistic expectations that it will stay put? I, like a lot of reporters, have often gone to a LinkedIn page to check biographical information on a source or to double-check a person's professional background for a column or post I'm writing. Does anyone challenge my right to do so?

And where exactly should the line be drawn on what constitutes scraping? Is referencing one title scraping? How about four prior titles from one person, or 10? Or if it's information on more than 100 people? That’s a problem, because if LinkedIn decides to not worry about small data references, it undermines its ability to pursue the big ones.

This is where we get into the public space argument. If I post something sensitive about myself in a public forum on a large discussion site, do I have a reason to expect privacy? (Actually, I might because no one cares what I think, but I digress.) If I had wanted something kept quiet, I wouldn’t have publicly posted it.

One of the more interesting uses journalists have for LinkedIn is reviewing the details of someone’s experience. Why? Because we know that a lot of coders and other technical talent will massively overshare, detailing what they did on projects for their employer, including lots of highly sensitive information about the systems they worked on, applications their employer purchased, and even unannounced security holes they fixed.

The only legal consequence is that their companies could fire them for disclosing internal information. But the coder who posted it has no cause of action. It was their choice.

In short, I think we can all expect more scraping and content-stealing, and IT will sadly find that it really can't do much to stop it.

Copyright © 2022 IDG Communications, Inc.
