Desktop Scientometrics
ESRC Centre for Science, Technology, Energy and Environment Policy,
Abstract
Advanced scientometric tools are moving from the realm of the privileged few with access to mainframe and minicomputers to the desktop of researchers equipped with personal computers. This shift is due not only to the decreasing cost and technological advances in PCs but also to the ready availability of a powerful multitasking operating system, a versatile text processing language and easy access to the Internet. Furthermore, the latest releases of PC software, such as Microsoft Excel, make it possible to develop graphical user interfaces into complex bibliometric data for a wide spectrum of researchers and policy analysts. Recent developments in digital communication, in particular tools to access the Internet via the World Wide Web, will give even greater flexibility to those researchers wishing to make their scientometric data available to a diverse international audience. This paper examines how the BESST project developed a Desktop Scientometric environment using public domain, hardware independent software, prototyped a graphical user interface to provide easy access to UK sectoral level bibliometric data, and gives a glimpse into future developments.

1. Introduction
Scientometrics is a unique research area and scientometricians are a unique breed of scientist who endeavour to quantify national and international systems of innovation to help policy shapers weave the political and economic climate required to nurture an R&D community in the hope of deriving long term economic and social benefits. I shall focus on the bibliometric aspects of scientometrics. Bibliometrics, in general, is the art of exploring publication databases in search of indicators that reflect research productivity and quality as well as the interactions between individuals, groups, institutions, sectors and nations.

In 1993 SPRU initiated the Bibliometric Evaluation of Sectoral Scientific Trends, better known as the BESST Project 1.
This project explores the publication trends from six UK institutional sectors and the collaboration patterns between them. The data source is article, note and review publication types indexed in the 1981-1994 Science Citation Index and extracted from tapes purchased from the Institute for Scientific Information. I shall explain how we used the art of Desktop Scientometrics to manipulate, unify, analyse and display publication and collaboration dynamics in the UK science system using a desktop PC equipped with three public domain software tools: Linux (a Unix operating system), Perl (a powerful text processing language) and emacs (a programmable line editor). Also, I will explain a prototype Scientometric Graphical Interface designed to provide researchers and analysts with an easy-to-use tool to explore sectoral level data 2. Finally, I will explore the expanding boundaries of desktop computing - Internet, World Wide Web, JAVA and distributed databases - and explain how these will fundamentally change the nature of scientometrics and further drive new developments in desktop scientometrics. Until recently, the manipulation of large bibliographic databases, that is more than a few tens of thousands of publications, was out of reach of most scientometricians for three primary reasons:
2. the need to have access to a mainframe or minicomputer with a large complement of main memory and hard disk space; and
3. a limited selection of programming languages suitable for processing text quickly and efficiently.

2. The BESST Project
In Phase I, the BESST Project explored the relationships within the UK scientific community using publication indicators derived from the 1981-1991 SCI. More specifically, the objectives were:
2. to map the changes in patterns of inter- and intra-sectoral collaboration in different scientific fields;
3. to investigate changes in the patterns of international collaboration with UK sectors; and
4. to use the data to investigate policy-relevant questions.
2. develop a graphical interface to provide user-friendly access to sectoral level data; and
3. explore in detail the publication and collaboration activity of UK industry.

Recall that we want to examine publication time trends by sector in each scientific field and to explore changing patterns in intra-sectoral, inter-sectoral and international collaboration. Now, let me ask: "how many tables and graphs does it take to reasonably portray publication and collaboration dynamics in the UK science system?" Remember that each table can be normalised in a number of different ways. For example, we may want to express the publication trends by sector in a science field as a percentage of UK publications or simply as a percentage of publications in the field. Or we may wish to express the number of industry-education collaborative papers (i.e. papers that list institutions in the education and industry sectors) as a percentage of industry's papers or as a percentage of education's domestic collaborations. There are thousands of such tables and each one tells a story. However, there are only two of us, Diana and myself, currently exploring the BESST database and between us we can only tell a few of them. We want to make our data available to other researchers so they can tell their own stories, or to analysts who wish to answer specific questions about the UK science base.

We decided to prototype an easy-to-use graphical interface into the BESST sectoral data, complete with on-line documentation, using Excel 5.0. I shall describe the graphical interface and database, which allows the user to explore about 2,000 graphs and tables, before describing the details of the methodology we used to produce the database. Our long term objective is to make the tables and graphs accessible over the net from an interactive World Wide Web server that interrogates an Oracle database.
In more detail, the BESST Database was built in Excel 5.0 and is composed of a database workbook, a templates workbook and a graphical interface written in Visual Basic for Applications and supplied as an Excel Add-In. The database is read-only and the templates workbook, which contains table layout templates but no data, is password protected. This has been done to maintain the integrity of the workbooks. The Add-In cannot be opened because it contains the compiled version of the software for the graphical interface. The graphical interface is supplied with a complete on-line help system.

4. Desktop Scientometric Tool Kit
Now, I shall explain how we unified, manipulated and analysed the SCI data. First, a little background. Before the project was initiated SPRU had already purchased the 1981-1989 SCI on tape from ISI. We obtained funds to update the data through 1991 in Phase I and to 1994 in Phase II, pay the salaries of 1.5 people for 2 years and purchase a PC.

In summary, we needed to extract about 500,000 article, note and review SCI publication types from 600 MB of tape data, isolate the names of institutions listing a UK address, manually unify them to a set of standard names and link the standard names back to the original dataset. However, we faced a dilemma. About the time the project began, the University of Sussex computing services started to experience a tremendous increase in demand on its facilities, so large computing resources were scarce. Furthermore, considerable experience had shown us that although Microsoft products running under Windows were adequate for analysing paper, citation and collaboration counts, they were not suitable for manipulating large text databases. So we decided to build a Desktop Scientometric ToolKit on a personal computer that consisted of three main components: a multitasking operating system, a text processing language and a programmable line editor. I shall briefly describe each of these in turn.
Linux Operating System
Linux 4 is an independent implementation of a POSIX compliant 386 based Unix operating system freely distributed under the GNU Public License 5. Furthermore, much of the GNU project Unix freeware has been bundled in with the Linux distribution. We gambled on Linux since it contained most of the standard string manipulation tools needed to manage large datasets. We purchased a 66 MHz 486 with 32 MB RAM, a 1 GB SCSI hard disk with tape drive and installed Linux. This is a slow and small machine compared to today's 150 MHz Pentium. Recently we installed a new version of Linux from CD-ROM in less than a day. The documentation is better than in older versions but it still requires that the user have some hardware and Unix system administration knowledge.

Perl Language
Most bibliometricians who work with the ISI tape data write their text processing software either using the built-in string manipulation functions in their relational database or using conventional programming languages like Fortran, Pascal and C. Database languages, although ideal for simple text processing tasks, can be difficult to use for certain mathematical problems. For example, it is difficult to write a routine using a relational database's SQL to produce co-occurrence matrices. Conventional languages are ideal for handling mathematical processes but have poor string handling ability.

However, in the late 1980s Larry Wall at the NASA Jet Propulsion Laboratory developed a script writing language for system administrators called Perl 6, the Practical Extraction and Report Language. This language was created specifically for manipulating large files and makes extensive use of the string routines that are inherent in the Unix environment. Perl is now considered to be a 'must know' language for text manipulators and Web site maintainers and developers.
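To make the co-occurrence example concrete, here is a minimal Perl sketch of how such a matrix might be counted with nested associative arrays. It is an illustration only, not the BESST code: the input layout (one list of sector labels per paper), the sector label "hospital" and the function name cooccurrence are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical input: one list of sector labels per paper.
my @papers = (
    [ "education", "industry" ],
    [ "education", "hospital", "industry" ],
    [ "education" ],
);

# Count how often each unordered pair of sectors appears on the same paper.
# The result is a hash of hashes: $count{$a}{$b} with $a lt $b.
sub cooccurrence {
    my (@paper_list) = @_;
    my %count;
    for my $paper (@paper_list) {
        # Remove duplicate labels, then sort so each pair has one canonical order.
        my %seen;
        my @unique  = grep { !$seen{$_}++ } @$paper;
        my @sectors = sort @unique;
        for my $i (0 .. $#sectors) {
            for my $j ($i + 1 .. $#sectors) {
                $count{ $sectors[$i] }{ $sectors[$j] }++;
            }
        }
    }
    return \%count;
}

my $matrix = cooccurrence(@papers);
for my $row (sort keys %$matrix) {
    for my $col (sort keys %{ $matrix->{$row} }) {
        printf "%s x %s: %d\n", $row, $col, $matrix->{$row}{$col};
    }
}
```

Only the upper triangle is stored because the pairs are unordered; a paper listing a single sector contributes nothing to the matrix.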
Perl is a 'glue' language, that is, it can be used as a stand alone programming language or to glue the power of programming languages like C or Fortran to a database like Oracle or Microsoft Access. It is also distributed free under the GNU Public Licence. Perl has a rich collection of text handling, pattern matching and array and database manipulation functions. Without getting too technical, let me illustrate its versatility for bibliometric analysis. I will briefly describe two functions that were invaluable for constructing the BESST database.

Splitting and joining strings is a common text processing function. Assume we have a line of text, perhaps an SCI corporate address, that is composed of five primary fields each separated by a slash ( / ). For example, consider the corporate address from an SCI record that has been assigned to the variable $ISI_line (note the $, which is commonly used as the first character in a Perl variable):
$ISI_line = "QUEENS UNIV BELFAST, CTR MED BIOL,DEPT PHARM, 97 LISBURN RD/BELFAST BT9 7BL/ANTRIM/NORTH IRELAND/"
SPLIT, a Perl function, divides a string into an array of strings delimited by a pattern of characters and has the form
Thus, the Perl command splits the string stored in expression $ISI_line into pieces separated by a slash ( / ) and stores the result in the array @ISI_field as follows:
$ISI_field[0] = "QUEENS UNIV BELFAST,CTR MED BIOL,DEPT PHARM,97 LISBURN RD"
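The split command itself has not survived in this copy of the text, so the following is a sketch of what such a call plausibly looks like, using Perl's standard split(/PATTERN/, EXPR) form on the address quoted above. The printing loop is added purely for illustration.

```perl
use strict;
use warnings;

# The corporate address quoted in the text; fields are separated by "/".
my $ISI_line = "QUEENS UNIV BELFAST, CTR MED BIOL,DEPT PHARM, 97 LISBURN RD/BELFAST BT9 7BL/ANTRIM/NORTH IRELAND/";

# The slash is escaped because it is also the pattern delimiter.
# By default split discards trailing empty fields, so the final "/"
# does not produce an extra empty element.
my @ISI_field = split(/\//, $ISI_line);

# Print each element with its index.
for my $i (0 .. $#ISI_field) {
    printf "\$ISI_field[%d] = \"%s\"\n", $i, $ISI_field[$i];
}
```

With this input the array holds four elements: the street-level address, "BELFAST BT9 7BL", "ANTRIM" and "NORTH IRELAND". Unlike the listing in the text, the spaces after the commas in the first element are preserved, since split only acts on the slash.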
JOIN, the inverse of split, can be used to join an array of strings into one string. It is a laborious and time-consuming task to write general functions like these using conventional programming languages. Fortunately, Perl has a rich set of functions of this type.
Associative arrays are the most valuable data structure in Perl. Unlike a conventional array, which is indexed by a number, an associative array is indexed by a string. For example, an associative array might look like the following:
$array{"UNIV SUSSX"} = "UNIV SUSSEX"
$array{"IMPERIA COLL"} = "UNIV LONDON,IMPERIAL COLL"
The index into the second element of this
associative array is the string "IMPERIA COLL" which contains the string "UNIV
LONDON,IMPERIAL COLL". The latest release of Perl allows arrays with multiple
associative indices. An associative array is useful for holding a thesaurus of
variant and standard institutional names in order to produce unified
bibliometric data.
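A thesaurus of this kind might be used roughly as follows. This is a minimal sketch rather than the project's actual code: the hash name %thesaurus, the functions unify_name and unify_list, the slash-delimited input and the name "UNIV LEEDS" are all assumptions made for the example. The two thesaurus entries are the ones quoted in the text.

```perl
use strict;
use warnings;

# Thesaurus held in an associative array: variant name => standard name.
my %thesaurus = (
    "UNIV SUSSX"   => "UNIV SUSSEX",
    "IMPERIA COLL" => "UNIV LONDON,IMPERIAL COLL",
);

# Return the standard name for a variant, or the name unchanged
# when the thesaurus has no entry for it.
sub unify_name {
    my ($name) = @_;
    return exists $thesaurus{$name} ? $thesaurus{$name} : $name;
}

# Split a slash-delimited list of names, unify each one,
# and join the results back into a single string.
sub unify_list {
    my ($line) = @_;
    my @names = map { unify_name($_) } split(/\//, $line);
    return join("/", @names);
}

my $unified = unify_list("UNIV SUSSX/IMPERIA COLL/UNIV LEEDS");
print "$unified\n";
```

Names without a thesaurus entry pass through untouched, which makes it easy to spot variants that still need manual unification.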
% UNIV LONDON,IMPERIAL COLL:U
Essentially, our tool kit is complete.
Using Linux, Perl and emacs on a personal computer we were able to manipulate,
unify and analyse 500,000 publications (about 8% of the SCI data) published over
fourteen years so that we could derive the BESST sectoral level data.
The reduced bibliometric database was designed for easy use and to be
imported into a relational database. Each publication record in the database
begins with a unique six character ISI article identifier which
links it back to the original SCI data set. In addition, the number of
citations from the publication year to 1994 has been incorporated to facilitate
citation analysis. The format of a unified publication record is as follows:
ISI code:Journal name:number institutions:number authors:tape year:source year:publication type:volume:start page
The following example lists two
publications extracted from the 1981 data set:
NF9SEO:CARIES RES:3:1:81:81:N:0015:0070
NK1K1A:BR J HAEM:11:5:81:81:A:0047:0133
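Because the fields are colon-delimited, a record of this form can be read with a single split. The sketch below assumes the nine-field layout exactly as listed above; the hash key names and the function parse_record are illustrative choices, not part of the BESST software.

```perl
use strict;
use warnings;

# Key names are assumptions; their order follows the record layout in the text.
my @FIELDS = qw(isi_code journal n_institutions n_authors
                tape_year source_year pub_type volume start_page);

# Parse one colon-delimited record into a hash reference.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty fields so the count check is reliable.
    my @values = split(/:/, $line, -1);
    die "expected " . scalar(@FIELDS) . " fields, got " . scalar(@values) . "\n"
        unless @values == @FIELDS;
    my %record;
    @record{@FIELDS} = @values;    # hash slice: pair each key with its value
    return \%record;
}

my $rec = parse_record("NF9SEO:CARIES RES:3:1:81:81:N:0015:0070");
printf "%s %s type=%s volume=%s page=%s\n",
    $rec->{isi_code}, $rec->{journal}, $rec->{pub_type},
    $rec->{volume}, $rec->{start_page};
```

Volume and start page are kept as strings so that their leading zeros survive, which matters if the values are later written back out or used as keys.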
One year of bibliometric data in this
reduced form occupies about 4 megabytes (or less than 1 megabyte in compressed
format). Although there is a fair amount of redundancy in this data structure,
it can be eliminated when the data are imported into a relational database.
Also, we have established a new relationship
with a supercomputing group at the University of Southampton. They have a
23-node machine, each node composed of an RS6000 with 128 Mbytes DRAM. This machine
has 4 Gbytes of hard disk and operates at 250 Mflop/s (double
precision, peak). It runs a version of Unix optimised for a parallel machine and
it has DB2, IBM's new parallel relational database. We hope to use this machine
in co-operation with some Southampton researchers to explore the dynamics
embedded in citation networks.
Finally, we are adding two new tools
to our ToolKit: a relational database and an interactive WWW database server. To
do this we purchased a 64 bit DEC AlphaStation with 256 MB RAM and 2 x 5 GB fast
wide SCSI hard drives. We are installing the Oracle relational database with Web
server access. This machine will provide greater speed and better utilities for
exploring the BESST data and allow us to prototype a Web-based graphical user
interface.
As you can see from our desktop we
have access to an arsenal of computers and so far our ToolKit functions on most
of the machines we use. The boundaries of Desktop Scientometrics have
expanded.
We dream of two things. First, we dream of
building an interactive graphical interface to assist social scientists and
policy analysts from all disciplines to probe the dynamic signature of national
systems of innovation as reflected in publication records. This dream is rich
with interesting research problems. For example, how does one visualise research
co-operation dynamics between 6000 institutions? We are not sure but we have
some ideas for techniques for visualising the collaboration dynamics between six
UK institutional sectors. Exploring collaboration networks is not as difficult a
problem as examining citation networks but even this may challenge the frontiers
of data visualisation and data mining techniques. We realise that this is an
ambitious dream but we believe it is achievable. To accomplish this dream we
will need help from people skilled in computing and mathematical methods.
Our second dream is grander. We dream of expanding the BESST database
to explore as many national systems of innovation as possible, starting
with the European Union. A project of this nature would require pan-European
co-operation involving the whole of the European bibliometric community.
Can you imagine the possibilities? We could explore the various European
national systems of innovation in a comparative manner. We could examine
the dynamic patterns of interaction between these systems as the European
Community evolved through the 1980s and 1990s. However, there are many
complex issues that would surround a project of this kind such as Intellectual
Property Rights, quality control and data security. We believe that with
a good measure of cooperative effort, a reasonable level of funding and
a lot of hard work from the talented resources in the European bibliometric
community a EuroBESST could be constructed within three to five years.
Ah, but that is just a dream and Desktop Scientometrics is a virtual reality!
2. A demonstration of the BESST database
and graphical interface was presented at the conference. A copy of the database
and interface can be obtained from the authors.
3. For example, the unique article numbers that are supplied on the ISI
tape version of the SCI are not available on the CD-ROM version. Thus,
there is no easy way to link a subset of papers derived from CD-ROM to the
original CD-ROM articles.
4. Linux runs on an IBM PC compatible with
an ISA or EISA bus and a 386 or higher processor, Apple Macintosh, Amiga, MIPS,
Sun workstations and DEC AlphaStations. The Linux kernel was written by Linus
Torvalds from Finland and by other volunteers. Most of the programs running
under Linux are generic UNIX freeware, many of them from the GNU project.
Linux is available from the following
anonymous FTP sites:
5. Project GNU is organized as part of the
Free Software Foundation, Inc. The Free Software Foundation has the following
goals: 1) to create GNU as a full development/operating system. 2) to distribute
GNU and other useful software with source code and permission to copy and
redistribute.
6. PERL was written by Larry Wall at the
NASA Jet Propulsion Laboratories. It is usually bundled in with most Linux
distributions and is available for anonymous FTP from
9. This is a
powerful multiprocessor machine running 12 SuperSPARC processors each with
2 megabytes cache. It has 768 megabytes of memory and 139 gigabytes of fast
access disc space.