Please don't steal this Web content

By Elinor Mills on 02 August 2007

Tags: aggregator | content | copyright | plagiarism | scraper | steal | site | google | post | blog

Lorelle VanFossen is passionate. An author, travel writer and nature photographer, she also has a popular blog about, well, blogging. Her pet peeve is online plagiarism, which she encounters nearly every day.

"It's one of my favourite subjects," she said. "I make my living from my writing and when people take it because they are ignorant of copyright laws or think that because it's on the Internet it's free, it makes me really mad. It's stealing content, in my mind."

This isn't the kind of plagiarism where a lazy college student copies sections of a book or another paper. This is automated digital plagiarism where software bots can copy thousands of blog posts an hour and publish them verbatim onto Web sites where contextual ads next to them can generate money for the site owner.

Such Web sites are known in Web parlance as "scraper sites" because they effectively scrape the content off blogs, usually through the RSS and other feeds those blogs publish.

VanFossen's Lorelle on WordPress is an authority on the Internet for blogging do's and don'ts. One of the no-nos is using content from other sites without getting permission.

VanFossen has several ways of checking to see if other sites have scraped her posts. She puts full links to her other articles in her posts, so that when one of her posts is republished on another Web site it links back to her own story and she can see the "trackback."

She has set up Google Alerts with her byline so that she will get notifications any time Google comes across a news site or blog with a reference to her. She also does a keyword search for her name on Google search, Google blog search and Technorati. In addition, she uses a WordPress plug-in that allows her to insert a digital fingerprint, a series of unrelated words, into her posts that she can search on in case her byline is stripped.
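
The fingerprint idea can be illustrated with a short Python sketch. It is not the WordPress plug-in's actual code, which the article does not describe; the word list, the secret and the function names are all assumptions made for illustration. The sketch derives a repeatable phrase of unrelated words for each post and checks whether a copied page still carries it.

```python
import hashlib

# Hypothetical word list: any set of unrelated words would do.
WORDS = ["lantern", "quartz", "meadow", "anvil",
         "harbor", "thimble", "cobalt", "walnut"]


def make_fingerprint(post_id: str, secret: str, n_words: int = 4) -> str:
    """Derive a repeatable phrase of unrelated words for one post."""
    digest = hashlib.sha256(f"{secret}:{post_id}".encode("utf-8")).digest()
    # Each digest byte picks one word, so the same inputs give the same phrase.
    return " ".join(WORDS[b % len(WORDS)] for b in digest[:n_words])


def add_fingerprint(post_text: str, fingerprint: str) -> str:
    """Append the fingerprint phrase to the end of a post."""
    return f"{post_text}\n\n{fingerprint}"


def _normalise(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def contains_fingerprint(page_text: str, fingerprint: str) -> bool:
    """Report whether a page still carries the fingerprint phrase."""
    return _normalise(fingerprint) in _normalise(page_text)
```

A phrase like this survives when a scraper strips the byline, which is why it remains searchable. It would not survive a scraper that rewrites or truncates the text.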

Invariably, VanFossen comes across her posts on other sites. If she hasn't had a problem with the site before, she will send the site publisher an e-mail asking them to please not use her content without her permission. If she doesn't get a response or she has had problems with the site in the past, she sends a "cease-and-desist" letter that informs the owner that they are violating her copyright and warns them she will take legal action under the Digital Millennium Copyright Act (DMCA) unless they remove her content.

VanFossen also contacts the company that hosts the Web site, as well as advertisers on that site and search engines, providing the necessary evidence via mail or fax as required. "The DMCA puts the onus on advertisers, Web hosts and search engines to remove copyright violations," she said. "I have a form letter I use."

In December, Michelle Leder, editor of Footnoted.org, used a cease-and-desist order to get her content taken off a site that was continuously republishing her posts. "Even the post I wrote about him stealing my content was posted on his site," she said with a laugh.

"It wasn't the issue of money," Leder added. "When other people's business model is based on stealing content that's a significant problem."

One site that offers a free service for tracking copyrighted content online is CopyScape. About 200,000 Web site owners use the free service every month and thousands pay for a higher level service, said Gideon Greenspan, chief technology officer of Indigo Stream Technologies, which offers the service.

There are many aggregator Web sites that collect content from a variety of sources, often related to a specific topic area, like real estate or cars, around which they can serve contextual ads. While some of the sites reproduce entire blog posts or articles from other sites, others offer just headlines, the first paragraph or a few paragraphs. Many include attribution and a link back to the original article. But providing attribution does not preclude a copyright violation, experts say.

In defence of scrapers
While most publishers of scraper sites stay underground, Michael Gray, a search optimisation consultant who runs GrayWolf's SEO Blog, outed himself as a Web scraper in a blog post about a year ago.

"I've moved away from this. It wasn't worth the time and effort of doing it," he said in a recent interview. He said he aggregated "snippets" of other people's content so he could flesh out his sites and make money off Google ads.

Gray also downplayed the significance of scraping. "Bloggers have a tendency to overreact to things and make mountains out of molehills," he said.

Gray said his sites fell under the Fair Use provision of the DMCA, which allows people limited use of a copyrighted work without having to get permission. But the nature of the use should be noncommercial, said Dennis Kennedy, an information technology lawyer knowledgeable about intellectual property issues.

"It's extremely difficult to track down the people doing this. And even then you're probably not going to be able to establish jurisdiction if they are outside the U.S.," he said. "It could be more expensive than it's worth and you have to show damages."

Pretty much any site that puts out an RSS feed is going to get scraped, said Jonathan Bailey, Web master of Plagiarism Today. Typically, it's the same people sending out the herbal Viagra junk e-mail, he said.

"The black hat SEOs (search engine optimisers) are doing this to build up Google juice (improve search engine ranking) or display Google AdSense ads," Bailey said.

Not only do scraper bots allow people to grab thousands of posts an hour, but there is software that can disguise the copied text by replacing certain words with synonyms, such as "feline" instead of "cat," Bailey said. This makes it harder for bloggers to track their scraped content.
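
Synonym swapping defeats exact-phrase searches, but a copy altered this way still shares most of its word sequences with the original. One common way to measure that overlap is to compare overlapping word triples ("shingles") using the Jaccard index. The Python sketch below is an illustration of that general technique only, not a tool mentioned by anyone quoted here, and its function names and sample sentences are assumptions.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    if not union:
        # Both texts are shorter than k words, so nothing can be compared.
        return 0.0
    return len(sa & sb) / len(union)


original = "my cat sat on the warm mat near the door all day"
spun = "my feline sat on the warm mat near the door all day"
unrelated = "stock prices fell sharply after the quarterly earnings report"

print(similarity(original, spun))       # high: only one word was swapped
print(similarity(original, unrelated))  # no shared word triples
```

A single swapped word changes only the few triples that contain it, so the score stays high. Heavier rewriting would lower it, and any threshold for flagging a copy would have to be chosen by trial.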

The scraped site can even appear on Technorati before the original content, he said. And in some cases, images are getting scraped and hotlinked back to the original site, thus depriving that site of bandwidth and costing them money, he added.

Some people point the finger at Google. "They've been slow to shutter a lot of these accounts. It's in their best interest to keep them open for as long as they can," said Bailey.

"Google should do something about this," said Footnoted.org's Leder. "The entire revenue model for these sites is based on Google ads misdirecting content."

But Google has worked to cut back on the problem of Web spam over the last year, said Matt Cutts, a senior software engineer at Google.

"It's true, people can scrape very easily. But it's also much harder to spam than it has been in the past," he said. "For months and months, we've kicked people out of AdSense because they violated our quality guidelines."

Sites being scraped can report it to Google using the tools section on Google's Webmaster Central site and by clicking on an "Ads by Google" ad, Cutts said.

For sites that syndicate their content through feeds, putting a link to the original source article at the top or the bottom with wording to the effect of "this article was originally printed here," will help ensure that Google's search engine displays the original item and not a reproduction on a scraped site, he said.
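
As a rough sketch of that advice, the Python function below appends an attribution line with a link to the original article to the HTML body of a feed item before it is published. The function name, the sample URL and the exact wording of the notice are assumptions for illustration. How a given blogging platform exposes its feed content will vary.

```python
from html import escape


def with_attribution(item_html: str, title: str, url: str) -> str:
    """Append a link back to the original article to a feed item's HTML."""
    notice = (
        '<p>This article was originally printed here: '
        f'<a href="{escape(url, quote=True)}">{escape(title)}</a></p>'
    )
    return f"{item_html}\n{notice}"


# Hypothetical post body and address, for illustration only.
body = "<p>Ten tips for better blogging.</p>"
print(with_attribution(body, "Ten tips", "https://example.com/tips?id=1&ref=feed"))
```

Because the notice travels inside the feed item itself, a scraper that republishes the item verbatim also republishes the link back to the source.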

Not every blogger is worried about scraper sites, though. Om Malik, executive editor of GigaOM.com, said he doesn't waste time going after scrapers. "The reason is that there are so many of those sites like (the Lernaean) Hydra's head, kill one and more pop up," he said.


newbie
06/08/2007 03:29 PM

She can use a watermark (visible or invisible) to protect the images. It can also serve as evidence that the images belong to her. Easy.
