
Boundaries before the scrape
Five fields that turn consent into a practice
Draft a community-defined boundary statement for a real corpus — before any model touches it.
Reading
Most extractive harms happen quietly, at the moment a corpus is collected. By the time the corpus is inside a model, the agreements that should have governed its use are absent. Tang names the inversion plainly: the better approach is not to scrape first and ask questions later.
Boundaries are most useful when they are specific and signed before collection. Five fields are usually enough: what may be used, for what purpose, under whose review, with what benefit returned to the source, and with what right to revise, reject or withdraw.
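One way to keep the five fields specific is to hold them in a small record that can report which answers are still blank. The following is a minimal sketch, not a standard schema: the class name, the field names and the sample answers are all hypothetical and chosen only to mirror the five questions above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class BoundaryStatement:
    """One statement per corpus. Field names are illustrative assumptions."""
    what_may_be_used: str = ""
    purpose: str = ""
    reviewer: str = ""                 # a named person, not a team or a role
    benefit_returned: str = ""
    revise_reject_withdraw: str = ""   # how the source can change its mind

    def unanswered(self) -> List[str]:
        """Return the names of fields that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical draft for an imagined place-name list.
draft = BoundaryStatement(
    what_may_be_used="Place-name list only; no audio recordings.",
    purpose="Spell-checking in a community mapping tool.",
    reviewer="",  # nobody has agreed to review yet
    benefit_returned="Corrected list returned to the source community.",
    revise_reject_withdraw="",
)

print(draft.unanswered())  # ['reviewer', 'revise_reject_withdraw']
```

A blank field is not a failure of the draft. It marks the conversation that still has to happen before collection starts.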
These fields are not a legal contract. They are a practice. They become real when a named human is willing to be the reviewer, the benefit is something the community recognises as a benefit, and the withdrawal pathway is something a working engineer can actually execute.
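A withdrawal pathway that an engineer can execute usually depends on every record carrying a note of where it came from. The sketch below assumes a hypothetical `Record` type with a `source_id` field and a hypothetical `apply_withdrawals` helper; neither comes from the lesson, and the sample data is invented. It covers only the corpus itself, before any model has been trained on it.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    record_id: str
    source_id: str  # hypothetical identifier for the contributing person or group
    text: str


def apply_withdrawals(
    records: Iterable[Record], withdrawn_sources: Iterable[str]
) -> Tuple[List[Record], List[Record]]:
    """Split records into those that may stay and those that must be removed."""
    withdrawn = set(withdrawn_sources)
    kept: List[Record] = []
    removed: List[Record] = []
    for record in records:
        (removed if record.source_id in withdrawn else kept).append(record)
    return kept, removed


# Invented example corpus with two contributing sources.
corpus = [
    Record("r1", "family-a", "transcript text"),
    Record("r2", "family-b", "transcript text"),
    Record("r3", "family-a", "transcript text"),
]

kept, removed = apply_withdrawals(corpus, ["family-a"])
print([r.record_id for r in kept])     # ['r2']
print([r.record_id for r in removed])  # ['r1', 'r3']
```

Returning the removed records as well as the kept ones lets the reviewer confirm exactly what was taken out. If the corpus has no per-record source identifier, that gap is worth writing into the withdrawal field.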
“Begin with community-defined boundaries: what may be used, for what purpose, under whose review, with what benefits and with what right to revise, reject or withdraw.”
Practise
Exercise
Fill in the boundary worksheet
- 01 Pick a real corpus you or a colleague is involved with, such as a recordings archive, a list of place-names, or a set of interview transcripts.
- 02 Open the boundary worksheet (linked below) and complete the five fields. Use short sentences. If a field is hard to answer, that is the most important answer to write down.
- 03 If you can, send the draft to one person from the community the corpus represents and ask them to redline it. Then revise: the first draft is never the final one.
Knowledge check
Which of these is NOT one of the five community-defined boundary fields?
When should boundaries be set?