Sept. 12, 2018
By: Michael Feldman
Google has added a Dataset Search app to its empire of web-based tools, making it easier for researchers, analysts, and anyone else with an affection for data to access publicly available repositories.
According to Google AI research scientist Natasha Noy, the new search tool enables anyone to find relevant datasets based on input keywords. Unlike domain-specific tools such as Data.gov, which work on specific datasets curated by the provider, Dataset Search operates more generically, directing you to the site where the data is hosted.
For this all to work smoothly, Google has encouraged dataset providers to adhere to a common set of guidelines that describe their data in a way that is friendly to search engine technology. Noy writes:
“These guidelines include salient information about datasets: who created the dataset, when it was published, how the data was collected, what the terms are for using the data, etc. We then collect and link this information, analyze where different versions of the same dataset might be, and find publications that may be describing or discussing the dataset. Our approach is based on an open standard for describing this information (schema.org) and anybody who publishes data can describe their dataset this way.”
For those not familiar with the schema.org standard, it provides a vocabulary for describing structured data on the Internet (or even in offline repositories). It was initially developed by Google, Microsoft and Yahoo. Undoubtedly, one of the reasons Google chose this standard for its Dataset Search is that it’s already being used by over 10 million sites.
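To give a sense of what such a description might look like, here is a minimal Python sketch that builds a schema.org Dataset record and wraps it in the script element commonly used to embed JSON-LD in a web page. The property names (name, description, creator, datePublished, license, measurementTechnique, url) come from the schema.org vocabulary; the helper functions, the dataset, and every value shown are hypothetical and chosen only for illustration. It is not a statement of what Google requires.

```python
import json


def build_dataset_jsonld(name, description, creator, date_published,
                         license_url, collection_method, landing_page):
    """Return a schema.org Dataset description as a plain dict.

    The fields mirror the items listed in the quote above: who created
    the dataset, when it was published, how the data was collected, and
    the terms for using it.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "creator": {"@type": "Organization", "name": creator},
        "datePublished": date_published,
        "license": license_url,
        "measurementTechnique": collection_method,
        "url": landing_page,
    }


def to_script_tag(metadata):
    """Wrap the metadata in a JSON-LD script element for an HTML page."""
    body = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical example values; none of these describe a real repository.
example = build_dataset_jsonld(
    name="Example Coastal Temperature Readings",
    description="Hourly sea-surface temperatures from a fictional buoy network.",
    creator="Example Oceanographic Institute",
    date_published="2018-09-01",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    collection_method="Moored buoy sensors, sampled hourly",
    landing_page="https://data.example.org/coastal-temperature",
)

if __name__ == "__main__":
    print(to_script_tag(example))
```

A data provider would place the resulting element in the HTML of the dataset's landing page, where a crawler that understands schema.org can read it.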
Noy claims that most datasets in the environmental and social science domains, as well as those provided by many government agencies and news organizations, are visible to the new tool. Some initial examples include datasets from NASA, NOAA, Harvard’s Dataverse, and the Inter-university Consortium for Political and Social Research (ICPSR). As other dataset owners add schema.org support to their sites, more repositories will become accessible.
However, the Dataset Search experience is not nearly as intelligent or as frictionless as Google’s generic search. A search for “TOP500 supercomputers” did not yield a pointer to our site and its datasets (we’ll have to fix that). In this case, the top result returned was “Rolling Stone’s Top 500 Albums” and the only relevant result returned was an outdated TOP500 list curated by figshare.com. The more general problem is that once you navigate to the source, you’re in the hands of the data provider and whatever cockamamie interface they’ve come up with to access their repository.
Nonetheless, it’s easy to see the potential of this tool, especially for the research community, where public datasets are commonplace. It is particularly valuable at a time when machine learning and other high-end data mining applications are creating new opportunities for scientists, analysts, journalists, and even entrepreneurs who can find a way to monetize freely available data.
Of course, for Google, Dataset Search is another potential source of ad revenue and other schemes to monetize user data. At the same time, it’s completely faithful to the company’s stated mission to “organize the world’s information and make it universally accessible and useful.”
The tool is currently in beta release and can be accessed here.