Data

I am committed to making all the data I harvest, curate and analyse available for others to explore and extend.

My major data project to date is “To be continued . . .: The Australian Newspaper Fiction Database,” which presents extensive bibliographical data and text files for over 21,000 novels, novellas and short stories discovered in 19th and early 20th century Australian newspapers. You can use the database to:

  • Explore: search by author, title, newspaper, keyword, nationality and more to discover fiction
  • Correct: access Trove to read and correct story text or edit the database
  • Add: identify and add new instalments and stories you find in the digitised newspapers
  • Export: download a story you have corrected or generate textual and bibliographical data for research

Other available datasets include:

AustLit datasets: Why have I made them available?

These datasets represent months of work collecting and collating the information in AustLit on Australian novels, and significantly expanding this information to enable empirical analysis of major research questions in Australian literary studies. There are still many other research areas that could be investigated using these datasets. Why, then, am I making them publicly available, rather than continuing to analyse them, and publish the results, myself?

Part of the answer to this question is that there is far more information in these datasets than one person could hope to explore and interpret in a lifetime. I hope, in making them available, to increase the likelihood that others will feel motivated to analyse and interpret this information to enhance our understanding of the history of the Australian novel.

Making these datasets freely available also contributes to what I see as fundamental methodological imperatives of quantitative literary scholarship: openness, testability and accountability. No one would publish a work of literary criticism about a text that no one else has access to. Readers of that criticism need to read the text to consider whether they agree with the interpretations offered. Likewise, the ‘source texts’ – the datasets – used in quantitative literary studies need to be available, so that others can explore and query the nature of the data and the interpretations presented and, in so doing, assess the arguments made and, if necessary, challenge them.

How were these datasets created?

The first two datasets are based on data in AustLit, supplemented by further research. I extracted the records for these datasets using the following steps:

  1. Performing guided searches in AustLit, asking for Type – ‘single work’ – and Form – ‘novel’ – records for particular year ranges;
  2. Displaying these results as tagged text (NOTE: during the period when I created and updated these datasets – January 2007 to December 2011 – AustLit would not display more than 999 records as tagged text; as long as this remains the case, those wishing to extract data via this process will need to design searches that return fewer than 1,000 results);
  3. Copying and pasting these results into a text file;
  4. Using command-line tools in a terminal to group the data, and then copying and pasting the results into Excel.
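
The 999-record limit noted in step 2 means each search has to cover a year range small enough to stay under it. The following is a minimal Python sketch of one way to plan such ranges, not a record of how the searches were actually designed. It assumes you already know roughly how many records each year returns; the per-year counts are invented for illustration, and `plan_year_ranges` is a hypothetical helper, not part of AustLit.

```python
def plan_year_ranges(counts_by_year, limit=999):
    """Group consecutive years into ranges whose total record count stays within `limit`.

    counts_by_year: dict mapping year -> number of records (illustrative values only).
    Returns a list of (start_year, end_year) tuples.
    """
    ranges = []
    start = None   # first year of the range currently being built
    prev = None    # most recent year added to that range
    running = 0    # records accumulated in that range so far
    for year in sorted(counts_by_year):
        count = counts_by_year[year]
        if count > limit:
            raise ValueError(f"{year} alone returns {count} records; narrow the search further")
        if start is None:
            start, running = year, count
        elif running + count > limit:
            # Adding this year would exceed the limit, so close the range and start a new one.
            ranges.append((start, prev))
            start, running = year, count
        else:
            running += count
        prev = year
    if start is not None:
        ranges.append((start, prev))
    return ranges


# Invented counts, purely to show the behaviour.
example_counts = {1990: 400, 1991: 500, 1992: 300, 1993: 650, 1994: 200}
print(plan_year_ranges(example_counts))
# → [(1990, 1991), (1992, 1993), (1994, 1994)]
```

A single year that exceeds the limit raises an error, because no year range can fix that; the search would need narrowing in some other way.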

This process left me with Excel files that initially included the type, title, author, year of publication, publisher and genre/s for Australian novels first published between 1830 and 1899 and between 1945 and 2009. I then added information to the datasets as my research developed and specific questions emerged (for full descriptions of the content of these datasets see below).

The third, fourth and fifth datasets were created with Dr Tara Murphy, who works in the Schools of Information Technologies and Physics at the University of Sydney, and research assistant Jonathan Hutchinson, then an Honours student in the School of Information Technologies. Directed by Tara, Jonathan wrote a script that automatically extracted the ‘works about’ Australian novelists from AustLit, and then Tara analysed the results to produce these datasets. The third dataset – ‘Critical attention to Australian novelists overall, 1945 to 2009’ – shows the overall results of this extraction, and lists, for each year in this period, the first fifty Australian novelists ranked in order of the number of ‘works about’ they received. The fourth and fifth datasets show these results, from 1950 to 2009, for ‘works about’ published in, respectively, newspapers and academic journals. Identification of publications as newspapers or academic (peer-reviewed) journals was done manually (NOTE: regarding academic publications, titles were categorised retrospectively, based on whether they were peer-reviewed in 2007).

What information is contained in the different datasets?

  1. Australian Novels, 1830 to 1899
  • TYPE – a description of the novel based on the forms in which it was published, and their order, including: ‘Book Only’ (for titles only published in book form); ‘Serial Only’ (for titles only published in serial form); ‘Serial then Book’ (for titles published first as serials then as books); ‘Book then Serial’ (for titles published first as books then as serials); ‘Same Year Book and Serial’ (when serial and book versions of the novel came out in the same year); ‘Book Also Serial’ (in the one case where I could not determine whether the book preceded, or came out in the same year as, the serial); and ‘Serial Much Later Book’ (when book publication followed many decades after initial serialisation).
  • 1ST PUBLISHED – the year when the novel was first published
  • 1ST BOOK – where relevant, the year the novel was first published in book form
  • 1ST SERIAL – where relevant, the year the novel was first published in serial form
  • TITLE – the title of the Australian novel
  • AUTHOR – the author/s of the Australian novel
  • GENDER – the author’s gender, based on the listings in AustLit and, where none was included, on further research. To avoid mis-categorising pseudonymous authors I never assumed that an author was male or female based on their name. In cases where I could not determine if an author was male or female, or when the novel was written by two or more authors of different genders, I categorised the author’s gender as ‘Unknown’.
  • IMPRINT – the publisher’s name, as listed in AustLit. When the novel was published as a book, I listed the book publisher’s imprint, regardless of whether the book was first serialised (this is in accordance with AustLit’s listing of publishers); when there was no book publisher I listed the periodical publisher.
  • BOOK PUBLISHER – some companies publish books under more than one name; this distinction is not so relevant in the nineteenth as in the late twentieth century; however, where relevant, this column identifies the company behind the different imprints. In some cases I have standardised the listing in this column, including when the title was jointly published by two or more book publishers (listed as ‘Joint publishing agreement’); when the title was self-published (listed as ‘The Author’) or published by an individual (listed as ‘An individual publishing another individual’s work’). When the novel was only published in serial form, I listed the BOOK PUBLISHER as ‘N/A’.
  • PERIODICAL PUBLISHER – the name of the periodical where the novel was serialised. When the title was only published in book form, I listed the PERIODICAL PUBLISHER as ‘N/A’.
  • DETAILS – this column contains my notes on the book or periodical publishers, gathered from AustLit and other sources. In cases where I was relatively sure but not certain of details about a publisher I indicated this with a question mark after the description. For titles that were published by two or more publishers, the detail field is marked ‘Joint’.
  • BOOK PUBLISHER NATION – the ‘nation’ (I use the term Australia for all the colonies) where the book was first published.
  • PERIODICAL PUBLISHER NATION – the ‘nation’ (again, I refer to the colonies as Australia) where the serial was first published.
  • PLACE – the city or town where the book was published (except for titles jointly published by companies from different ‘nations’).
  • BOOK VOLUME – for those Australian novels that were published in book form, lists whether they were published in part issue, or as one, two or three volumes. When the novel was only published in serial form I listed the BOOK VOLUME as ‘N/A’.
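
The relationship between the TYPE column and the 1ST BOOK and 1ST SERIAL years can be sketched in Python as follows. This is an illustration of the definitions above, not the procedure used to build the dataset: `classify_type` is a hypothetical helper, and it derives only the five labels that follow directly from the two years. ‘Book Also Serial’ and ‘Serial Much Later Book’ rest on case-by-case judgement (no cut-off is given for ‘many decades’), so the sketch does not attempt them.

```python
def classify_type(first_book=None, first_serial=None):
    """Derive a TYPE label from the 1ST BOOK and 1ST SERIAL years.

    Pass None for a form in which the novel was never published.
    'Book Also Serial' and 'Serial Much Later Book' are judgement calls
    in the dataset and are deliberately not produced here.
    """
    if first_book is None and first_serial is None:
        raise ValueError("a novel needs at least one publication year")
    if first_serial is None:
        return "Book Only"
    if first_book is None:
        return "Serial Only"
    if first_book == first_serial:
        return "Same Year Book and Serial"
    if first_serial < first_book:
        return "Serial then Book"
    return "Book then Serial"


# Years below are invented for illustration.
print(classify_type(first_book=1885))                     # Book Only
print(classify_type(first_book=1890, first_serial=1888))  # Serial then Book
```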
  2. Australian Novels, 1945 to 2009
  • TYPE – a description of the novel based on the form in which it was originally published, including: ‘Book’, ‘Book Section’, ‘Edited Book’ and ‘Serial Novel’.
  • TITLE – the title of the Australian novel
  • AUTHOR – the author/s of the Australian novel
  • GENDER – the author’s gender, based on the listings in AustLit and, where none is included, on further research. To avoid mis-categorising pseudonymous authors I never assumed that an author was male or female based on their name. In cases where I could not determine if an author was male or female, or when the novel was written by authors of different genders, I categorised the author’s gender as ‘Unknown’.
  • YEAR – the year in which the novel was first published.
  • IMPRINT – the publisher’s name, as listed in AustLit.
  • PUBLISHER – some companies publish books under more than one name; where relevant, this column identifies the company behind the different imprints. In some cases I have standardised the listing in this column, including when the title was jointly published by two or more book publishers (listed as ‘Joint’) and when the title was self-published (listed as ‘The Author’).
  • NATION – the country where the novel was first published.
  • DETAILS – this column contains my notes on publishers, gathered from AustLit and other sources.
  • SUBSIDY – titles where the cost of publication has been subsidised by the author are indicated with ‘Yes’.
  • PLACE – the city or town where the book was first published.
  • GENRE – ‘Yes’ or ‘No’ indicates whether the title is ascribed a genre in AustLit.
  • GENRE 1 / GENRE 2 / GENRE 3 – lists the genres allocated to the title in AustLit.
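
The GENDER rule used in both novel datasets (a single known gender shared by all authors, otherwise ‘Unknown’) can be expressed as a short Python sketch. The labels ‘Male’ and ‘Female’ are assumptions for illustration; only ‘Unknown’ is named in the descriptions above. Treating a co-authored title with one undetermined author as ‘Unknown’ is also an assumption.

```python
KNOWN_GENDERS = {"Male", "Female"}  # assumed labels; the dataset's own values may differ


def categorise_gender(author_genders):
    """Return the shared gender if every author has the same known gender, else 'Unknown'.

    author_genders: one entry per author; use None (or any other value)
    when the gender could not be determined.
    """
    genders = set(author_genders)
    if len(genders) == 1 and genders <= KNOWN_GENDERS:
        return genders.pop()
    return "Unknown"


print(categorise_gender(["Female"]))          # Female
print(categorise_gender(["Male", "Female"]))  # Unknown
print(categorise_gender([None]))              # Unknown
```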
  3. Critical Attention to Australian Novelists Overall, 1945 to 2006

This dataset is comprised of 103 columns: the first lists the year (from 1945 to 2006); the rest of the columns are in pairs: the first pair lists the ‘total’ number of ‘works about’ Australian novelists extracted from AustLit for that year; for the remaining pairs, the first column lists the number of ‘works about’ received and the second lists the author who received those works.

  4. Critical Attention to Australian Novelists in Newspapers, 1950 to 2006

This dataset is comprised of 103 columns: the first lists the year (from 1950 to 2006); the rest of the columns are in pairs: the first pair lists the ‘total’ number of ‘works about’ Australian novelists extracted from AustLit for that year that were published in newspapers; for the remaining pairs, the first column lists the number of ‘works about’ identified in newspapers and the second lists the author who received those works.

  5. Critical Attention to Australian Novelists in Academic (Peer-Reviewed) Journals, 1950 to 2006

This dataset is comprised of 103 columns: the first lists the year (from 1950 to 2006); the rest of the columns are in pairs: the first pair lists the ‘total’ number of ‘works about’ Australian novelists extracted from AustLit for that year that were published in academic (peer-reviewed) journals; for the remaining pairs, the first column lists the number of ‘works about’ identified in academic journals and the second lists the author who received those works.
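
The 103-column layout shared by these three datasets (one year column, one ‘total’ pair, then fifty count-and-author pairs) can be unpacked with a short Python sketch. `split_ranking_row` is a hypothetical helper, and the sample row is synthetic. Because the description does not say how the two cells of the ‘total’ pair are arranged, the sketch returns that pair untouched rather than guessing. Skipping pairs with an empty cell is also an assumption, made in case some years list fewer than fifty authors.

```python
def split_ranking_row(row):
    """Split one 103-cell row into (year, total_pair, ranked).

    total_pair is returned as the raw two cells; ranked is a list of
    (count, author) tuples for the remaining fifty pairs, with pairs
    containing an empty cell skipped (an assumption).
    """
    if len(row) != 103:
        raise ValueError(f"expected 103 cells, got {len(row)}")
    year = int(row[0])
    # Cells 1..102 form 51 consecutive pairs.
    pairs = [(row[i], row[i + 1]) for i in range(1, 103, 2)]
    total_pair = pairs[0]
    ranked = [
        (int(count), author)
        for count, author in pairs[1:]
        if str(count).strip() and str(author).strip()
    ]
    return year, total_pair, ranked


# Synthetic row: the names and counts are placeholders, not real data.
sample_row = ["1945", "120", "Total"]
for rank in range(50):
    sample_row += [str(50 - rank), f"Author {rank + 1}"]

year, total_pair, ranked = split_ranking_row(sample_row)
print(year, total_pair, ranked[:2])
```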

As I discuss in Reading by Numbers, the capacity of these three final datasets to indicate critical attention to Australian novelists is inhibited by the original search parameters used. We defined an ‘Australian novelist’ as anyone who had written at least one ‘Australian novel’. As a result, certain authors – such as David Williamson, Dorothy Hewett, Judith Wright and Les Murray – appear near the top of the critical rankings. However, this position arises from discussion of their poetry or plays, not their novels. In an effort to improve the capacity of the data to indicate authors known as novelists, we removed from the results all authors with only one novel: it is those modified results that appear here.
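
The removal of single-novel authors can be illustrated with a minimal Python sketch. The source does not describe how the filter was implemented, so this is one plausible approach rather than the method actually used. `drop_single_novel_authors` is a hypothetical helper, the author names and counts are placeholders, and it assumes a list with one author entry per novel record is available for counting.

```python
from collections import Counter


def drop_single_novel_authors(ranking, novel_authors):
    """Keep only ranked authors who have more than one novel.

    ranking: list of (works_about_count, author) tuples.
    novel_authors: one author name per novel record (assumed input).
    """
    novels_per_author = Counter(novel_authors)
    return [
        (count, author)
        for count, author in ranking
        if novels_per_author[author] > 1
    ]


# Placeholder data, not drawn from the datasets.
ranking = [(40, "Author A"), (25, "Author B"), (10, "Author C")]
novel_authors = ["Author A", "Author B", "Author B", "Author C", "Author C", "Author C"]
print(drop_single_novel_authors(ranking, novel_authors))
# → [(25, 'Author B'), (10, 'Author C')]
```

In this example ‘Author A’ drops out despite having the highest count, which is the same side effect that removes an author known for a single, widely discussed novel.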

For the purpose of my analysis in Reading by Numbers, this modification created another problem: namely, although I knew from our original analysis of the data that Helen Darville/Demidenko had received a large amount of critical attention (relating to her novel, The Hand that Signed the Paper) she was no longer included in the results. To avoid missing this author, when analysing the results for critical attention to Australian novelists, I manually counted the ‘works about’ Darville/Demidenko (overall, in newspapers and academic journals) to include her in the rankings for these different categories.

Copyright 2017. These datasets are shared under the Creative Commons Attribution 4.0 International License, meaning they can be used, adapted and added to, as long as the original source is cited.

 

 
