These data were acquired by The Library as part of a new data acquisition initiative, piloted in Spring 2016, and are jointly hosted and managed by the D-Lab as a partner in that initiative.

All of these data carry restrictions on access, use, and distribution.

Access can be provided to Berkeley affiliates who agree to comply with the relevant Data Use Agreement. To apply for access, complete the survey linked under the Data Use Agreement entry for each resource.

Here is a sample of available resources. For a full list, follow the links on the Library Guide to Text Mining and Computational Text Analysis.


India - National Sample Survey - Consumer Expenditure Survey Series

India Annual Survey of Industries, 2008-2013

San Francisco Chronicle Archive

Los Angeles Sentinel Archive

Airbnb data for Six Metro areas, 2014-2016

NYSE ReTrac ProTrac EOD, 2007-2016

Corpus of Contemporary American English (COCA)


 


Title: India - National Sample Survey - Consumer Expenditure Survey Series

Data Use Agreement: < Apply for Access Here >

Description: Surveys of expenditures incurred by households on the consumption of goods and services during the reference period, as well as enterprises owned by households and used by their members during the reference period. Similar surveys are conducted in the United States and other countries to track the consumption and spending patterns of the population and growth patterns for consumer goods.

Detailed metadata can be found at the International Household Survey Network for:

(Schedule 1.0, Type 1) 1983, 1987-1988, 1993-1994, 1999-2000, 2004-2005, 2005-2006, 2006-2007, 2007-2008, 2009-2010, 2011-2012

(Schedule 1.0, Type 2) 2009-2010, 2011-2012

(Schedule 10.0) 1983, 1987-1988, 1993-1994, 1999-2000, 2004-2005, 2009-2010, 2011-2012

Producers: Central Statistics Office (Industrial Statistics Wing) - Ministry of Statistics and Programme Implementation, Government of India

Universe: Households and personal enterprises in India, with exceptions for inaccessible areas.

Geographic coverage: India. The survey covers the whole of the Indian Union except (i) interior villages of Nagaland situated beyond five kilometres of the bus route and (ii) villages in Andaman and Nicobar Islands which remain inaccessible throughout the year.

Time Period: 1983 - 2012, Assorted Years, Annual Surveys

Unit of Observation: Households

Smallest Geographic Unit:


Title: India Annual Survey of Industries, 2008-2013

Data Use Agreement: < Apply for Access Here >

Description: The Annual Survey of Industries (ASI) is the principal source of industrial statistics in India. It provides statistical information to assess changes in the growth, composition and structure of the organised manufacturing sector, which comprises activities related to manufacturing processes, repair services, gas and water supply, and cold storage. The survey is conducted annually.

Detailed metadata and documentation can be found at the International Household Survey Network for:

2009-2010, 2010-2011, 2011-2012, 2012-2013

Producers: Central Statistics Office (Industrial Statistics Wing) - Ministry of Statistics and PI, Government of India

Citation:

Smallest Geographic Unit:

Geographic Coverage: India. The geographical coverage of the Annual Survey of Industries, 2008-2009 has been extended to the entire country except the states of Arunachal Pradesh, Mizoram and Sikkim and Union Territory of Lakshadweep.

Time Period: 2009 - 2013, Annual

Unit of Observation: The unit of enumeration is the establishment (a factory, workshop, or undertaking), although a consolidated return is permitted for establishments in the same state and industry that share common ownership.

Universe: The survey covers factories registered under Sections 2m(i) and 2m(ii) of the Factories Act, 1948, i.e. those factories employing 10 or more workers using power, and those employing 20 or more workers without using power. The survey also covers bidi and cigar manufacturing establishments registered under the Bidi & Cigar Workers (Conditions of Employment) Act, 1966, with coverage as above. All electricity undertakings engaged in generation, transmission and distribution of electricity registered with the Central Electricity Authority (CEA) were covered under ASI irrespective of their employment size. Certain servicing units and activities, such as water supply, cold storage, and repairing of motor vehicles and other consumer durables like watches, are covered under the Survey. Servicing industries like motion picture production and personal services like laundry services, job dyeing, etc. are also covered under the Survey, but their data are not tabulated, as these industries do not fall under the scope of the industrial sector defined by the United Nations.

Data Types:


Title: San Francisco Chronicle Archive

Data Use Agreement: < Apply for Access Here >

Description: ProQuest Historical Newspaper data for the San Francisco Chronicle, 1865-1922, OCR'ed content (results from automated Optical Character Recognition; quality varies). See ProQuest for a descriptive advertisement.

Producer:

Distributor: ProQuest

Time Period: 1865 - 1922

Unit of Observation: Text Corpora from scanned newspapers

Universe: All San Francisco Chronicle content.


Title: Los Angeles Sentinel Archive

Data Use Agreement: < Apply for Access Here >

Description: ProQuest Historical Newspaper data for the Los Angeles Sentinel, 1934-2005, OCR'ed content (results from automated Optical Character Recognition; quality varies). See ProQuest for a descriptive advertisement.

Producer:

Distributor: ProQuest

Time Period: 1934 - 2005

Unit of Observation: Text Corpora from scanned newspapers

Universe: All Los Angeles Sentinel content.


Title: Airbnb data for Six Metro areas, 2014-2016

Data Use Agreement: < Apply for Access Here >

Description: Airdna analytics and reports are based on Airbnb data gathered from information publicly available on the Airbnb website, and track the performance of Airbnb listings each day with occupancy rates and revenue data. Airdna machine learning technology imputes blocks of unavailable dates observed on Airbnb's platform as either booked by a customer or blocked by the host. Methodology described here.

 

Producer:

Distributor: Airdna

Time Period:  August 2014 - 2016

Unit of Observation:

Universe: Airbnb listings and imputed revenues from six areas: San Francisco Bay Area; New York City; Chicago; Portland, OR; Los Angeles; and Houston.


Title: NYSE ReTrac ProTrac EOD

Data Use Agreement: < Apply for Access Here >

Description: At approximately 8:00 PM each day, a file is made available that contains a summary of the ReTrac activity during the day for each stock and identifies the volume of retail buy and sell shares executed on the NYSE. At approximately 8:00 PM each day, a second file is made available that summarizes all program trading activity for each stock and separates the amount executed in index arbitrage program trades from the amount executed in all non-index arbitrage program trading. More information here.

Producer: NYSE Market Data

Distributor: NYSE Market Data

Time Period:  Jan 2007 - Dec 2016

Unit of Observation: Daily Summary Statistics of Trading

Universe: NYSE Daily Trades summary


Title:  Corpus of Contemporary American English (COCA)

Data Use Agreement: < Apply for Access Here >

Description: The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English, and the only large and balanced corpus of American English. COCA is probably the most widely-used corpus of English, and it is related to many other corpora of English that we have created, which offer unparalleled insight into variation in English.

The corpus contains more than 520 million words of text (20 million words each year 1990-2015) and it is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts.
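As a quick sanity check on the figures quoted above, the following minimal Python sketch recomputes the nominal corpus size and the per-genre share. The function and variable names are illustrative, and the even per-year split within each genre is an assumption drawn from the "equally divided" wording, not a documented property of the distributed files.

```python
# Nominal figures taken from the description above; names are illustrative only.
WORDS_PER_YEAR = 20_000_000
FIRST_YEAR, LAST_YEAR = 1990, 2015
GENRES = ["spoken", "fiction", "popular magazines", "newspapers", "academic texts"]


def corpus_breakdown(words_per_year, first_year, last_year, genres):
    """Return nominal word counts implied by the description."""
    years = last_year - first_year + 1           # inclusive year range
    total_words = words_per_year * years         # nominal corpus size
    words_per_genre = total_words // len(genres)
    # Assumption: each genre is also evenly represented within each year.
    words_per_genre_per_year = words_per_year // len(genres)
    return {
        "years": years,
        "total_words": total_words,
        "words_per_genre": words_per_genre,
        "words_per_genre_per_year": words_per_genre_per_year,
    }


breakdown = corpus_breakdown(WORDS_PER_YEAR, FIRST_YEAR, LAST_YEAR, GENRES)
for name, value in breakdown.items():
    print(f"{name}: {value:,}")
```

Under these assumptions, 26 years at 20 million words each gives the 520 million total quoted in the description, or roughly 104 million words per genre.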

More information here.

Producer: Mark Davies, Brigham Young University

Distributor: Mark Davies, Brigham Young University

Time Period: 1990 - 2015

Unit of Observation:

Universe: