U of T team brings Hansard, one of Canada’s most important documents, into the digital age
Since 1880, every word spoken in Canada’s parliamentary debates has been transcribed and recorded into a massive document called Hansard. To put the size and scale of Hansard in perspective, reading the entire document at a pace of about a novel a day would take 66 years.
But this rich, historical document was becoming less usable every year because of its size – until a group of University of Toronto political scientists, computer scientists and historians decided to intervene.
“You have this really unrivalled historical resource that is accumulated over time, and by virtue of its size and magnitude, is impenetrable,” says Christopher Cochrane, associate professor of political science at U of T Scarborough. “That is the status of Hansard prior to digitization.”
In 2013, Cochrane teamed up with two postdoctoral researchers, two PhD students and Graeme Hirst, professor of computer science at U of T Scarborough, to create LiPaD.
LiPaD has digitized and made searchable Canada’s parliamentary debates dating back to 1901. It also created and designed a website to make the documents more accessible to the public, a project headed by PhD student Tanya Whyte.
“Making these data very clearly accessible, very clearly searchable and opening it to everybody basically takes something that was becoming of little use because of its size and makes its use as enormous,” says Cochrane.
With a click, users can also find more information on parliamentarians, such as their party affiliation and gender. The site is continually adding more information on members, including demographic profiles and election outcomes.
Christopher Cochrane, associate professor of political science at U of T Scarborough, says LiPaD puts the usefulness of Canada's Hansard on par with its size (photo by Ken Jones)
The process began with Canadiana, a non-profit heritage coalition, which scanned every page of Hansard and posted the images online. But as pictures instead of text, the documents could not be searched with keywords.
The good news for the LiPaD team is they did not have to physically scan the documents, but there were other challenges. Many of the documents, some more than a century old, were physically damaged with specks, bits of dirt or smudges from printing. This made it hard for optical character recognition (OCR) programs, which convert written or printed words into text a computer can read, to correctly register the contents of the pages.
The quality of the documents, particularly stray specks, made it difficult to read French words. OCR settings that allowed French accent marks would also confuse specks for accents. Meanwhile, OCR settings that read only English had trouble reading genuine French accent marks. LiPaD is currently only available for English proceedings, but Cochrane says the team is interested in eventually offering the French proceedings as well.
The OCR would often err with English as well. Hirst says a common stumbling point was in the standard parliamentary phrase “Hon. member,” short for Honourable member. If the “H” was even slightly obscured or broken, the computer would misread the term as “lion member.”
“That’s an easy one to fix, because obviously we would expect there to be zero occurrences of ‘lion member,’” Hirst says. “But it illustrates the kind of low quality that we were up against all over the place, including ones that weren’t so easy to fix.”
To remedy this, Kaspar Beelen, now an assistant professor at the University of Amsterdam, created several rules, allowing the computer to recognize common mistakes and giving it instructions to fix them.
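A rule of this kind can be sketched in a few lines of Python. This is only an illustration of the general idea, not the LiPaD team's actual code: the rule list, the regular expression and the function name `correct_ocr` are all assumptions, and only the “lion member” example comes from Hirst's description.

```python
import re

# Hypothetical rule table: each entry pairs a pattern for a known OCR
# misreading with its correction. Only the "lion member" case is taken
# from the article; a real rule set would be far larger.
RULES = [
    # "lion member(s)" is almost certainly a misread "Hon. member(s)".
    (re.compile(r"\blion\.?\s+(members?)\b", re.IGNORECASE), r"Hon. \1"),
]


def correct_ocr(text: str) -> str:
    """Apply every correction rule, in order, to a piece of OCR output."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(correct_ocr("The lion member for Halifax has the floor."))
```

Because the pattern requires the word “member” right after “lion,” an ordinary mention of a lion elsewhere in the debates would be left alone.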
The massive amount of publicly accessible data, which can be downloaded in multiple formats, is a powerful tool for future work.
“If you present the world with an interesting data set, people will find ways to use it that you yourself never thought of,” Hirst says. “I hope that there are people out there doing that with LiPaD right now.”
Ludovic Rheault, now an assistant professor of political science at U of T, joined the project in 2014 and began conducting applied research projects using the data from LiPaD. In one paper, published in 2016, he used the data to study how the language parliamentarians use in debates can indicate anxiety levels.
Rheault says the intersection of computer science, political science and language represents the most appealing thing he found about LiPaD – the opportunity to work with an interdisciplinary team.
“To grow as a citizen and a researcher, having the ability to look at what people do in other disciplines often times makes you realize that, ‘Oh, I was completely blind or oblivious to this solution or a particular problem,’” he says. “It helps you change the way you see problem-solving in general.”
The project has received funding from the Social Sciences and Humanities Research Council, the Natural Sciences and Engineering Research Council, and the Digging into Data initiative.