Dictionaries and Copyright Law
Disclaimer: Before starting, I want to make it clear that I am not scraping dictionaries, nor do I plan to. I am part of a project to create an open-source Korean dictionary, and I want to understand the law so we don’t run into any trouble. The data for the project so far comes from freely published Korean government data and manually written definitions, so we should be in the clear. I’m also interested in what it means to “own” some fundamental parts of language.
There are hundreds of commercially produced and (I assume) copyrighted dictionaries in many languages. Dictionaries are extremely useful not only to language learners but also to developers of language learning tools. However, these commercial dictionaries are out of reach of most developers because of copyright restrictions or exorbitant licensing costs.
I wonder how far copyright protection on a dictionary actually extends.
The process of creating a new dictionary from an existing one could be broken into two parts.
- Scrape the data.
- Reformat the data to the extent that it is not a copy of the original.
The first is the obvious grey area. “Making unauthorised copies” would probably cover it, but viewing the dictionary itself creates a copy, and the other party has no way of knowing whether you store that copy or not.
What would be the difference between the following two methods of creating the dictionary:
- Manually: Humans write definitions of the words they know, use existing dictionaries to look up the ones they don’t, and write their own new, unique definitions.
- Programmatically: Existing dictionaries are used as input, and paraphrasing, language models and bilingual texts are used to generate new, unique definitions.
The first seems completely natural. That’s how new dictionaries get created all the time. The other dictionaries are used for reference, but the definitions are clearly the creations of the new authors, and as such (I assume) there is no copyright infringement.
However, the second seems a lot greyer. The same rewriting and paraphrasing has taken place, but because it was automated, it seems less clear who owns the copyright.
Does anyone have any answers to this? Past experiences?