Crowd-sourcing annotation tasks
Crowd-sourcing annotation tasks on online platforms.
To annotate a small amount of data, you can simply do it on your own computer; for this, the default interface is designed to be a perfect match. At the other end of the spectrum, some annotation tasks are just too big and need to be crowd-sourced. Amazon Mechanical Turk (mturk) is the perfect service for that. It has two sides: a "requester" defines a set of tasks, while a "worker" performs those tasks. Workers are paid by requesters through the mturk service.
Mturk is built around the concept of a "HIT" (Human Intelligence Task) as the unit of work. The simplest quantification in our case is one image ↔ one HIT. In the following, I will first describe how to set up accounts on mturk, then how to use this annotation application for the HITs.
Defining "real" tasks on mturk is obviously going to cost you money. Fortunately, mturk also has testing equivalents of the normal requester and worker environments, called sandbox environments. So the first thing to do is to register for a requester sandbox account and a worker sandbox account.
As a requester, create a new project by going to the "Create" tab and clicking on "New Project".
Go to the "Other" category and click on "Create Project >>".
In the tab "(1) Enter Properties", fill in the fields as you wish, but be sure to set a "Reward per assignment", and answer "No" to the question "Require that Workers be Masters to do your HITs" so that you can test your own HITs. Once done, click the bottom "Design Layout" button to validate and move to the next tab. You will get a page looking like this one.
Now click on the "Source" button to edit the source code, and replace the code with the following one. The structure of this chunk of code is explained in the following sections.
Now your editing area should look like the following.
If the "Source" button is not shadowed, it means that mturk hit a bug. This happens sometimes, due to a sign-out or other mysterious mturk events. If your editing area looks like this, with the "Source" button shadowed and the code pasted in, we can continue. Click the "Source" button again to leave the source mode. You should now see something like the following, which is perfectly normal.
Now click on the bottom "Preview" button. It moves you to the third tab, and you should get a notification like "Your project was successfully saved." along with a preview of what your HIT will look like. The preview should be empty. Again, this is normal, don't panic ;). If you open your browser's JavaScript console, you will even see a runtime error of the kind "SyntaxError: missing } after ...". This error is due to the mturk templating system; you will understand the reason in the next section. Now simply click the bottom "Finish" button and you are done with the project template. You should be back on the "Create" tab with a new project entry looking like the following.
Now that our template is ready, we can generate all our HITs from a single CSV file. Click on the "Publish Batch" button to create a batch of HITs. Load a CSV file containing the following text (with no extra spaces! be careful).
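For illustration, such a CSV could look like the sketch below. Only the `img_width` column name appears later on this page; the other column names and the image URLs are hypothetical placeholders, so use whichever variables your own template actually declares.

```csv
img_url,img_width,img_height
https://example.com/photos/cat-01.jpg,640,480
https://example.com/photos/cat-02.jpg,800,600
```

Each data row will produce one HIT, with the column values substituted into the template.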
Mturk will process the CSV file, match its entries to our template HTML, and generate template previews that should be correct this time around.
Hit (pun intended) the bottom-right "Next" button. On the next page, adjust the batch name and other fields, and hit "Publish". That's it! Your tasks are now published and should be available to sandbox workers in roughly a minute. In the "Manage" tab, you can now follow the progress of your HITs.
Now that your batch is published, simply connect to the worker sandbox site and look for your tasks. I'll let you explore on your own.
Please read the Getting started page first to understand this section.
Just like the normal version of this app, the mturk version can be configured to display only the tools you need for your HITs. Let's take a second look at the template HTML file.
There are four important parts in this document:

- the `<input>` element
- the `#mturk_form` display style
- the `const img` variable
- the `const flags` variable
This whole template is embedded inside a form, in an iframe on the mturk worker website. The id of the form provided by mturk is `mturk_form`. When our application takes control of the iframe (`const app = Elm.Main.fullscreen(flags);`), it leaves the form and our script tags aside and starts a new DOM hierarchy. So the DOM ends up with a structure like the following:
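As a rough sketch (only the `mturk_form` and `annotation-data` ids come from this page; the rest of the markup is illustrative, not the exact structure mturk generates):

```html
<body>
  <!-- Form provided by mturk, hidden by our #mturk_form style -->
  <form id="mturk_form">
    <input value="" id="annotation-data"/>
  </form>
  <!-- Our script tags, left aside by the Elm application -->
  <script src="ports-mturk.js"></script>
  <!-- New DOM hierarchy created by the Elm application -->
  <div>
    <!-- ... the annotation interface ... -->
  </div>
</body>
```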
Therefore our `<input value="" id="annotation-data"/>` is located inside the form, and when the form is submitted, the content of the input `value` will be submitted to mturk. The id `annotation-data` is thus important since it is used inside our `ports-mturk.js` script to update the value when clicking on the "Submit" button. The style `#mturk_form { display: none }` is applied because we do not want to clutter the interface with the predefined mturk form; we want to manage the whole interface in our Elm application.
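To make this concrete, here is a plausible sketch of what a script like `ports-mturk.js` could do on "Submit". Only the `annotation-data` id, the form id, and the form submission come from this page; the `serializeAnnotations` helper and the annotation shape are hypothetical.

```javascript
// Turn the app's annotation state into the string stored in the hidden input.
// The annotation shape here is a made-up example, not the app's real format.
function serializeAnnotations(annotations) {
  return JSON.stringify(annotations);
}

// On "Submit": copy the serialized annotations into the mturk form input,
// then submit the form so mturk records the worker's answer.
function submitToMturk(annotations, doc) {
  const input = doc.getElementById("annotation-data");
  input.value = serializeAnnotations(annotations);
  doc.getElementById("mturk_form").submit();
}
```

The key point is that mturk never sees the application's internal state, only whatever string ends up in the input's `value` at submission time.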
Each HIT corresponds to a different image to work on. If you remember, our CSV file contains the following entries:
Those `img_width`, ... correspond exactly to the HTML template variables used in the definition of the `img` variable:
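Conceptually, mturk's batch processing does a textual substitution of `${...}` template variables with the values of the matching CSV columns, row by row. The tiny sketch below mimics that idea; it is an illustration of the mechanism, not mturk's actual implementation, and the `img_url` column name is hypothetical.

```javascript
// Replace each ${name} placeholder in the template with row[name],
// mimicking how mturk fills the HTML template from one CSV row.
function fillTemplate(template, row) {
  return template.replace(/\$\{(\w+)\}/g, (match, name) =>
    name in row ? String(row[name]) : match
  );
}

const template = 'const img = { url: "${img_url}", width: ${img_width} };';
const row = { img_url: "https://example.com/cat.jpg", img_width: 640 };
// fillTemplate(template, row) yields valid JavaScript for this HIT.
```

This also explains the "SyntaxError" seen in the template preview earlier: before substitution, a placeholder like `${img_width}` is not valid JavaScript.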
When starting the Elm application, we do the following:
In case you were wondering: yes, the "normal" application and the "mturk" version are actually the same application, started with different "flags". The "normal" application, introduced on the Getting started page, is started with the flags:
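As a sketch, the two start-ups could differ only in their flags, along the lines of the snippet below. Only the `mturkMode` field is named on this page; the other field names and the config content are hypothetical placeholders.

```javascript
// Hypothetical flags sketch; only mturkMode is named on this page.
const normalFlags = { mturkMode: false, config: "" };

const mturkFlags = {
  mturkMode: true,
  // Multi-line config string, in the same format as a config
  // you would load manually in the normal application.
  config: `{ "classes": [], "annotations": [] }`,
};
```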
The only thing that `mturkMode` does is remove the buttons to load a config and load images, and replace the export (image) button with a textual "Submit" button more familiar to mturk workers.
In conclusion, any config that you can use in the "normal" application can also be used here, by simply putting it inside the multi-line config string. Refer to the Getting started section to choose the configuration that best suits your needs.
The mturk website is accessed through a secured SSL layer over HTTPS (not HTTP). So every request happening in our iframe must also be sent over an HTTPS connection. This has two implications.
All image addresses provided in your CSV file must be HTTPS addresses.
In case our website (https://annotation-app.pizenberg.fr) is down, the script tags with JavaScript files hosted on our website (Main.js, elm-pep.js, ports-mturk.js) will fail to load, and your workers won't see anything. To be safe, you can host those files yourself at an address of your choosing, as long as it is served over HTTPS.
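For example, the original script tags could be swapped for self-hosted copies along these lines (the `https://my-server.example.com` address is a placeholder for your own HTTPS host):

```html
<!-- Self-hosted copies of the three scripts, served over HTTPS -->
<script src="https://my-server.example.com/annotation/Main.js"></script>
<script src="https://my-server.example.com/annotation/elm-pep.js"></script>
<script src="https://my-server.example.com/annotation/ports-mturk.js"></script>
```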