
Systemic bibliometric indicators for the knowledge-based economy

Paper presented at OECD workshop on New Indicators for the Knowledge-Based Economy

Paris, June 19-21, 1996

Diana Hicks & Sylvan Katz
d.hicks@sussex.ac.uk
j.s.katz@sussex.ac.uk
STEEP Centre
Science Policy Research Unit
University of Sussex
Falmer, Brighton, BN1 9RF

September, 1996




Introduction

To understand economies and societies becoming increasingly knowledge-based, we need to understand the structure and use of the stock of knowledge. Scientific and technical knowledge has traditionally been validated and distributed through publishing. Happily, from the point of view of developing indicators, publishing leaves a long-lived paper trail that can be used as a proxy for the stock of knowledge. Papers are particularly valuable as the basis for indicators because they not only represent an increment to publicly available knowledge (indicating output), they can also be graded by impact (though not - alas - quality), and they contain traces of links between institutions. Jointly authored papers reflect collaborative research which is one indicator of links between researchers (Katz & Martin, 1996) in, for example, industry and universities. The references and citations in papers indicate use of research by others enabling analysis of, for example, the extent to which industry relies on domestic versus foreign sources of knowledge (Hicks et al., 1994). Potentially, the publishing archive might even reveal the movement of researchers among institutions and sectors. Thus bibliometric indicators can track the institutional linkages crucial to realising spillovers and the possibly strong multiplier between public institution research and commercial industrial development (OECD, 1992, p. 127). Systemic bibliometric indicators allow us to examine the dynamics of the research-based system of knowledge development, distribution and acquisition enabling us to map the structure of knowledge resources in the economy and society as a whole.

This paper examines the potential for systemic analysis of publishing activity to provide new indicators for the knowledge-based economy. It first examines the state of the art in bibliometric indicators published as regular series. It then proposes that current policy concerns with national innovation systems would be better served by what we call "systemic" bibliometric indicators. Succeeding sections define systemic indicators and give examples of their use from our recent work. Finally, the paper examines the possibility of producing regularly published systemic indicators: first, data sources and their advantages and disadvantages are discussed, followed by the means and difficulties of setting up indicators of this type.

The state of the art in bibliometric indicators

For many years, bibliometric indicators have been regularly published by the US National Science Foundation in its science and technology indicators (National Science Board). The European Union's first science and technology indicators report included bibliometric indicators and its successor will as well (European Commission, 1994). Bibliometric indicators thus have a firm foundation within science and technology indicators, alongside patent and R&D expenditure data, and this provides a good basis from which the state of the art can be extended.

The existing indicators are, in the main, compiled at the national level. For each country of interest several statistics are generated: the nation's rate of publishing, the rate at which its researchers collaborate internationally and the extent to which its papers are cited. Each of these characteristics can be examined across nine or ten main science fields (biology, physics, chemistry, etc.) and changes over time can be tracked for more than a decade. Using these indicators, policy makers can assess whether the quantity and quality of their country's output is increasing or decreasing relative to that of other countries.

Data concerning the internal workings of national systems are more limited. The NSF has incorporated one table of sectoral publication and citation counts. The EU report included a table listing the largest publishing institutions in various countries. The paucity of regularly published indicators examining the internal workings of national science is unfortunate given current interest in national systems of innovation (Nelson, 1993; Lundvall, 1992). Such analysis focuses our attention on the role played in innovation by a nation's economic and social institutions and the interactions between them and with foreign institutions. Indicators produced at the national level cannot inform about institutions or sectors. Existing indicators at the sectoral level, such as OECD expenditure data, cannot track interactions between institutions. Systemic indicators are needed.

Systemic bibliometric indicators

We define systemic indicators to be sectoral level indicators, based on institutional data, that are comprehensive, cover a time series and track interactions between institutions. Sectoral indicators based on institutional level data are those where data are disaggregated below the national level, but not to the departmental or individual level. Institutional level data unification (see below) is needed even if results are to be reported at the sectoral level, in order to permit institutional interactions to be tracked, to ensure accuracy and to enable systemic analysis, that is, analysis of small as well as large organisations. Comprehensive indicators include all institutions, not just the biggest. Often studies of innovation at the institutional level, whether of companies or public sector laboratories, have looked at large institutions. Thus we can end up believing that the British science system comprises Oxford, Cambridge, Imperial College, ICI, Glaxo-Wellcome and GEC. Understanding the role of these institutions is important because they are so large; however, they have been relatively well studied because they are so visible. To complete our knowledge of the British system we need to understand the role and status of the other 5,900 institutions that have published scientific papers in the UK since 1981. Comprehensive indicators are the only means to see and hence appreciate the diversity in modern knowledge production systems.

Systemic indicators are also longitudinal, which is an aspect of being comprehensive. A one year snapshot of the system may seem to be an economical way to obtain most of the information. However, in some ways, the first year of data is the most expensive to generate. The most recent years of data cost more, and once the system is in place to produce one year of systemic indicators, only more research assistant time is needed to do the job thoroughly and generate a decade or more of data. One year of data leaves ambiguity and open questions, whereas a decade or more of data enable accurate interpretations of the present. The effect of policy on systems remains an open question; with systemic data, the extent of path dependence in the system, and thus the scope for policy action, can be probed. With long time series the balance between self organisation and policy management can be investigated. Furthermore, because science systems are so complicated, their trajectories appear to be rather stable over time, so trends can be projected to predict the future.

Systemic indicators also track interactions between researchers as evidenced in collaborative papers. In producing jointly authored scientific papers, researchers exchange tacit and embodied elements of knowledge. In fact these elements are most effectively exchanged in networks based on long term relationships between experts, such as those that result in collaboration (OECD, 1992, pp. 70-71). Bibliometric indicators can track these interactions over time and across an organisation or sector or nation. This enables us to ask questions such as: with whom does industry collaborate more than expected? How is this changing over time? How does this differ by industrial sector? It has enabled us to identify the weakening links between industry and hospitals in the UK. No other indicator or research method can provide such a longitudinal overview of institutional links in knowledge production.

Systemic bibliometric indicators track the dynamic system generating and diffusing scientific and technical knowledge through publishing. They map one facet of the structure and circulation of knowledge resources throughout the economy and society. Scientific and technical knowledge is advanced not just by firms but also by universities, hospitals, government laboratories and non-profits. Scientific and technical knowledge is applied not just by firms, but also by hospitals. All these institutions publish, allowing us to glimpse research activity wherever it takes place. Bibliometric indicators allow us to see some of the complementarities, synergies and exchanges manifested in research collaboration. Finally, they indicate how much an institution's or sector's published research output is used by others, and who is using what. Decades of stable bibliometric indicators can be constructed allowing the evolution of the system to be understood.

What bibliometric indicators do not indicate

Unfortunately, bibliometric indicators cannot perfectly capture all knowledge production in a society or inform us of its quality. As with any indicator, they fall short of the ideal in several ways. First, papers represent the output of laboratory-based activity. Desktop innovation, whether information or software related, is not published to the same degree. Thus, these data leave this large and growing segment of knowledge production invisible.

Second, there is not a one to one matching between publication output and R&D expenditure. University faculty have incentives to publish while industrial researchers do not. Publication takes second place to screening and secrecy or appropriation in industrial and military research and to production of maps, reference works or service to industry in some government research. On the other hand, our data indicate that papers are produced from settings where no formal R&D is recorded by statisticians. Thus publication output by no means equates to R&D activity. Rather publishing equates to producing publicly available, research-based, codified knowledge.

Such information is but one component of knowledge which also has tacit and material elements. The codified element has the advantage of being easily distributed and so diffuses far and wide. Thus papers help diffuse knowledge by conveying useful information but this is not all; they also act as signals. Neither the material nor tacit components of knowledge can be communicated in a publication. However, a paper describing research points to these other elements and thus indicates that the authors possess certain tacit knowledge, materials and devices. Readers learn the area in which the researchers work, the names of the materials used, the techniques used to manipulate them, and the astute reader assesses the technical quality of the work. Readers are alerted to the existence of underlying tacit knowledge, skills, substances and so on possessed by the authors. Published papers thus point to unpublishable resources, so papers indicate both the production of new information and presence of scientific and technical capability residing in tacit knowledge, skills, materials and devices (Hicks, 1995).

Third, bibliometric indicators do not represent all publishing. The indicators are based on one American produced database, the Science Citation Index (for reasons explained below). Although the SCI is international in coverage, it has a certain amount of American bias. It contains more minor US journals than minor European journals, and non-English language journals are not as comprehensively indexed. The SCI also does not go into great depth in the trade and technical literature. The 3000 or so SCI journals were selected in the first instance because they have a high international impact. Indeed, coverage of the database has been criticised because the criteria for the inclusion of second-rank journals are inconsistent and applied fields are not well covered (European Commission 1994, 33-34). In addition, only articles, notes and reviews are counted in bibliometric indicators, because they are most likely to report substantial research results and be peer reviewed; discussions, letters, editorials and meeting abstracts are excluded. From the non-US perspective then, bibliometric indicators represent only international level, predominantly English language, higher impact, peer-reviewed, publicly available scientific and technological research output.

Finally, citation counts cannot tell us about the "quality" of a piece of research. Ideally, we would like to be able to know which work is of high quality and which is not. Citation counts can only give us a rough indication of the impact research has had on work that follows. However, since knowledge is produced by communities (Kuhn, 1962), impact is precisely what counts. As Latour says:

There is something still worse, however, than being either criticised or dismantled by careless readers: it is being ignored. Since the status of a claim depends on later users' insertions, what if there are no later users whatsoever? This is the point that people who never come close to the fabrication of science have the greatest difficulty in grasping. They imagine that all scientific articles are equal and arrayed in lines like soldiers, to be carefully inspected one by one. However, most papers are never read at all. No matter what a paper did to the former literature, if no one else does anything with it, then it is as if it never existed at all. You may have written a paper that settles a fierce controversy once and for all, but if readers ignore it, it cannot be turned into a fact; it simply cannot. You may protest against the injustice, you may treasure the certitude of being right in your inner heart; but it will never go further than your inner heart; you will never go further in certitude without the help of others. Fact construction is so much a collective process that an isolated person builds only dreams, claims and feelings, not facts. (Latour, 1987, pp. 40-41)

Thus bibliometric indicators are not perfect, but they do permit us to examine several key facets of an important part of knowledge production in modern society. The next section discusses several examples of systemic bibliometric indicators drawn from our recent work to suggest the types of analysis that are possible.

Examples

The bibliometric evidence used here is drawn from the project on the Bibliometric Evaluation of Sectoral Scientific Trends (BESST) at the Science Policy Research Unit (SPRU). BESST was supported by the UK Office of Science and Technology, Department of Trade and Industry, Department of Health, Medical Research Council, Engineering and Physical Sciences Research Council, and the Economic and Social Research Council. This project analysed 14 years of UK natural science output as indexed in the Science Citation Index (SCI) at the sectoral level, adhering to de facto standards in the bibliometric community (Katz et al. 1995). Information on all papers indexed in the SCI and listing a UK address from 1981 to 1994 was purchased from the Institute for Scientific Information on tape, and three document types were extracted - articles, notes and reviews - as these tend to report original, substantive research results. UK addresses on the one-half million papers we processed were assigned to one of approximately 5,900 unified institutional names, and each institution was assigned to one of six institutional sectors. Thus, for every paper produced in the UK during the period and indexed in the Science Citation Index, we know with which institutions its authors were affiliated. Because the database is completely unified, we can identify every collaborative paper produced. Institutionally collaborative papers are defined as those listing addresses from different institutions. Papers listing different department names within one institution, or different sites of one company, are not considered to involve an institutional collaboration.
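
The definition of an institutionally collaborative paper can be illustrated with a minimal sketch. It assumes that address unification has already been done, so each paper carries a list of unified institution names; the paper identifiers and the records below are hypothetical and are not drawn from the BESST database.

```python
# Hypothetical records: each paper maps to the unified institution names
# resolved from its address list (illustrative data only).
papers = {
    "paper-1": ["University of Sussex", "University of Sussex"],  # two departments, one institution
    "paper-2": ["University of Sussex", "ICI"],                   # two different institutions
    "paper-3": ["ICI"],
}


def is_institutionally_collaborative(institutions):
    """True when the addresses resolve to more than one unified institution."""
    return len(set(institutions)) > 1


collaborative = {
    paper_id
    for paper_id, institutions in papers.items()
    if is_institutionally_collaborative(institutions)
}
print(sorted(collaborative))
```

Because the test is applied to unified names, a paper listing two departments of one university or two sites of one company is not flagged.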

Collaborative papers were "whole counted", meaning that the figures reported for a sector are based on counts of papers that list an address from at least one institution in that sector. The figures should be interpreted as the number or percentage of papers in which a sector participated. This straightforward and intuitive method of interpreting publication figures facilitates analysis of the collaborative component of national scientific output by making it visible. In tables produced by whole-counting, all papers are counted once in the national total, but papers that involve a collaboration between sectors are tallied in two or more sectoral counts. For example, a paper that lists a university address and a company address contributes one to the national total, one to the university total and one to the company total. The arithmetical consequence of this is that figures for two sectors cannot be added together, because the sum would "double count" papers produced in collaborations between institutions in the two sectors.
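
Whole counting as described above might be sketched as follows. The input format, one list of sector labels per paper, and the four example papers are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical input: for each paper, the sectors of the unified
# institutions listed in its addresses.
paper_sectors = [
    ["Educational"],
    ["Educational", "Industry"],  # cross-sector collaboration
    ["Medical", "Medical"],       # two hospitals, same sector
    ["Industry"],
]


def whole_count(paper_sectors):
    """Return (national total, per-sector counts) under whole counting."""
    sector_counts = Counter()
    for sectors in paper_sectors:
        # A paper counts once for every distinct sector it lists.
        sector_counts.update(set(sectors))
    return len(paper_sectors), sector_counts


national_total, sector_counts = whole_count(paper_sectors)
print(national_total, dict(sector_counts))
```

In this toy example the sector counts sum to five although there are only four papers, which is the double counting that prevents sectoral figures from being added together.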

The portrait that emerged from the study was that of a dynamic, diverse and adaptive UK science base. In the next century British scientific research will be more interdisciplinary and will be performed mostly in distributed networks. Over 5000 institutions, large and small, generate fundamental knowledge, skills and capabilities across the biomedical, physical and environmental sciences. Industry and hospitals are integral to this system and help funnel this knowledge into application. About 60% of the UK papers list non-academic addresses, and of these non-academic institutions about 60% are hospitals or companies. (Also about 60% of UK publications in these journals name an academic institution.) Clearly, universities have no monopoly on publishing research. Research is published by a large, diverse network of organisations. These organisational networks form two interacting groups of institutions. One group is highly visible and well-studied (industry, higher education and government) and the other group is less visible and less well understood (hospitals, medical research institutes, research council laboratories and non-profit). Thus, our bibliometric study suggests that the UK scientific community is composed of two sets of users (sites of application) and sources of technical opportunity (sites of innovation).

We begin by looking at the most basic indicator in the database, the number and percentage of papers published (participated in) by each sector in each year. All sectors participated in more papers in 1994 than in 1981, and the total number of papers increased by 38%. The educational, medical and non-profit sectors increased their share of UK publishing between 1981 and 1994, while research councils and government laboratories lost share and industry held steady. Education (including schools and colleges) is the single largest sector, but it does not comprise the entire science system.


Table 1
Number and Percentage of UK papers by institutional sector

Number of papers published


Sector            1981  1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994  Total
Educational      18461 18926 19118 19065 20589 20145 20668 20580 21186 22135 23022 24912 25687 27686 302180
Medical           7576  7771  8006  8277  9290  9212  9344  9686 10242 10737 10688 11412 11417 11634 135292
Research Council  3569  3673  3936  3704  4041  4269  3959  3804  3857  4123  4054  4345  4398  4368  56100
Industry          2547  2582  2710  2606  2771  2682  2660  2936  3001  3160  3180  3412  3351  3380  40978
Government        1397  1260  1340  1236  1322  1355  1272  1255  1258  1348  1384  1457  1490  1523  18897
Non-profit         484   556   570   632   621   633   628   683   708   844   848   958   937  1018  10120
Unknown            108   140   136   142   135   104    92   110   114    91   162   179   156   217   1886
Total            31167 31746 32425 32142 34875 34355 34227 34581 35677 37165 37866 40503 41001 43050 500780



Percentage of UK publications
Sector            1981  1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994  Total
Educational       59.2  59.6  59.0  59.3  59.0  58.6  60.4  59.5  59.4  59.6  60.8  61.5  62.6  64.3   60.3
Medical           24.3  24.5  24.7  25.8  26.6  26.8  27.3  28.0  28.7  28.9  28.2  28.2  27.8  27.0   27.0
Research Council  11.5  11.6  12.1  11.5  11.6  12.4  11.6  11.0  10.8  11.1  10.7  10.7  10.7  10.1   11.2
Industry           8.2   8.1   8.4   8.1   7.9   7.8   7.8   8.5   8.4   8.5   8.4   8.4   8.2   7.9    8.2
Government         4.5   4.0   4.1   3.8   3.8   3.9   3.7   3.6   3.5   3.6   3.7   3.6   3.6   3.5    3.8
Non-profit         1.6   1.8   1.8   2.0   1.8   1.8   1.8   2.0   2.0   2.3   2.2   2.4   2.3   2.4    2.0
Unknown            0.3   0.4   0.4   0.4   0.4   0.3   0.3   0.3   0.3   0.2   0.4   0.4   0.4   0.5    0.4
Total            109.4 110.0 110.5 111.0 111.2 111.8 112.8 112.9 113.1 114.2 114.5 115.2 115.7 115.7  112.9
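
As a worked check on how the percentages in Table 1 relate to the counts, the sketch below recomputes a few sector shares and the growth in total output. The counts are copied from the 1981 and 1994 columns of Table 1; the function and variable names are illustrative choices, not part of the BESST project.

```python
# Counts copied from Table 1 (1981 and 1994 columns, three sectors only).
papers_1981 = {"Educational": 18461, "Medical": 7576, "Industry": 2547}
papers_1994 = {"Educational": 27686, "Medical": 11634, "Industry": 3380}
total_1981, total_1994 = 31167, 43050


def share(count, total):
    """Percentage of national papers in which a sector participated."""
    return round(100 * count / total, 1)


shares_1981 = {s: share(n, total_1981) for s, n in papers_1981.items()}
shares_1994 = {s: share(n, total_1994) for s, n in papers_1994.items()}

# Growth in the national total between the first and last year, in percent.
growth = round(100 * (total_1994 / total_1981 - 1))

print(shares_1981, shares_1994, growth)
```

Each share is taken against the national total, in which every paper is counted once, so under whole counting the sector shares can sum to more than 100%.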


Table 2 displays the number and percentage of institutionally collaborative papers for all UK sectors. The table indicates the strong and steady rise in research collaboration in the UK system. Every sector saw strong growth in the share of output produced collaboratively.

Table 2
Number and Percentage of institutionally collaborative UK papers

Number of collaborative papers


Sector            1981  1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994  Total
Educational       5728  6050  6244  6598  7263  7537  8111  8341  8834  9763 10562 12183 12626 13814 123654
Medical           3016  3101  3269  3466  3887  3924  4116  4296  4674  4997  5102  5749  5942  6108  61647
Research Council  1315  1418  1573  1622  1685  1887  1961  1964  2035  2303  2348  2646  2703  2851  28311
Industry           899   979  1080  1045  1156  1173  1340  1426  1522  1705  1809  1976  2063  2125  20298
Government         432   404   442   453   485   527   520   565   568   685   709   772   815   833   8210
Non-profit         204   252   274   329   328   329   358   391   412   506   512   647   606   718   5866
Unknown             40    45    46    66    60    43    48    57    61    46   102   103    98   141    956
Total             8659  9088  9537 10059 10970 11375 12058 12567 13417 14732 15674 17904 18418 19814 184272

Percentage of Sector Papers
Sector            1981  1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994  Total
Educational         31    32    33    35    35    37    39    41    42    44    46    49    49    50     41
Medical             40    40    41    42    42    43    44    44    46    47    48    50    52    53     46
Research Council    37    39    40    44    42    44    50    52    53    56    58    61    61    65     50
Industry            35    38    40    40    42    44    50    49    51    54    57    58    62    63     50
Government          31    32    33    37    37    39    41    45    45    51    51    53    55    55     43
Non-profit          42    45    48    52    53    52    57    57    58    60    60    68    65    71     58
Unknown             37    32    34    46    44    41    52    52    54    51    63    58    63    65     51
Total               28    29    29    31    31    33    35    36    38    40    41    44    45    46     37
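
The "Percentage of Sector Papers" figures can be reproduced by dividing each sector's collaborative count (Table 2) by its total count (Table 1). The sketch below does this for four sectors in 1994. The numbers are copied from the two tables, and the variable names are illustrative.

```python
# 1994 figures copied from Table 1 (all papers) and Table 2 (collaborative papers).
all_papers_1994 = {
    "Educational": 27686,
    "Medical": 11634,
    "Research Council": 4368,
    "Industry": 3380,
}
collab_papers_1994 = {
    "Educational": 13814,
    "Medical": 6108,
    "Research Council": 2851,
    "Industry": 2125,
}

# Share of each sector's own output that was institutionally collaborative.
collab_share_1994 = {
    sector: round(100 * collab_papers_1994[sector] / all_papers_1994[sector])
    for sector in all_papers_1994
}
print(collab_share_1994)
```

Note that the denominator here is the sector's own output, not the national total, which is why these percentages differ in kind from those in Table 1.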


Figure 1 illustrates this growth in collaborative publishing, comparing it with the slight decline in non-collaborative output. Both numbers and shares of papers are displayed for each year, and each set of data is accompanied by regression lines (both linear and exponential) carried forward to identify the point at which the lines will cross. When the lines cross, well before the year 2000, more than 50% of UK scientific papers will be produced by researchers collaborating across two or more institutions. Collaboration increased steadily during the 1980s (in particular international collaboration has been increasing at a constant rate for over 20 years). This suggests that internal dynamic processes are reshaping the British research system, so that soon more research will be produced in institutional networks than is produced by institutions alone. Collaboration will become the rule not the exception.
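
The projection can be illustrated with a simplified sketch. It fits only a linear trend, and only to the rounded collaborative shares in the "Total" row of Table 2, whereas Figure 1 uses both numbers and shares with linear and exponential fits; it therefore approximates the idea and does not reproduce the figure. The helper name `linear_fit` is an illustrative choice.

```python
# Collaborative share of all UK papers, 1981-1994, in percent
# (the "Total" row of the percentage panel of Table 2).
years = list(range(1981, 1995))
collab_share = [28, 29, 29, 31, 31, 33, 35, 36, 38, 40, 41, 44, 45, 46]


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = linear_fit(years, collab_share)

# Collaborative and non-collaborative shares sum to 100%,
# so the two trend lines cross where the collaborative share reaches 50%.
crossing_year = (50 - intercept) / slope
print(round(slope, 2), round(crossing_year, 1))
```

Under these assumptions the fitted share rises by roughly one and a half percentage points per year and reaches 50% in the second half of the 1990s, consistent with the statement that the lines cross well before the year 2000.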

Networked knowledge production is often seen as one characteristic of the knowledge-based society. Gibbons et al. have argued that socially dispersed knowledge production is another such characteristic. They argue that the initiation of mass higher education after World War II produced large numbers of PhDs, not all of whom could find jobs in academia. Therefore, PhDs are found working all through society, dispersing research capability in the process. They see:

An increase in the number of potential sites where knowledge can be created; no longer only universities and colleges, but non-university institutes, research centres, government agencies, industrial laboratories, think-tanks, consultancies, in their interaction (Gibbons et al. 1994, 6).

Others see the research system as becoming more concentrated, pointing to very powerful forces of competition based on excellence that are endogenous to science and that lead to concentration of resources over time. In addition, closer management of research is leading to even more concentration to achieve administrative efficiency, economies of scale and division of labour. 'Selectivity' is increasingly part of managing science. "What selectivity has come to mean in practice is a systematic policy of concentrating research activity into a smaller number of more specialised units" (Ziman 1994, 156).

We can examine bibliometrically whether the UK science system is becoming more dispersed or more concentrated. Our data permit us to look at several facets of the concentration/dispersion phenomenon. We can analyse the types of institutions publishing research to see whether a more varied set of institutions is participating in the system. We can also examine the number of institutions publishing to look for increases. Finally, we can look at concentration measures used by economists to see if research production is becoming more evenly distributed amongst institutions.

We begin by examining the types of UK institutions that publish, or rather the types of institutions that publish articles, notes or reviews in journals included in the Science Citation Index. As discussed earlier, the SCI as analysed here represents international peer-reviewed science - the domain of academics. Why then do 50% of papers list the addresses of non-educational institutions? Why do hospitals and firms account for 63% of the institutions that averaged one or more papers per year? Why did ICI, Wellcome, SmithKline Beecham and BT each contribute more than 1000 papers over the 1980s? Whatever the answers, that so much international-level, peer-reviewed scientific knowledge in the UK is produced outside the university sector, or in collaboration with institutions other than universities, suggests that universities do not have a monopoly on "academic" research. Medical institutions, industrial laboratories, research council and other government laboratories and non-profit institutes collectively seem to be as important as universities in the modern UK research system. Many of these institutions, such as local councils and police forces, are not normally thought of as contributing to Britain's scientific research output.

We tend to think that scientific papers are produced by large, well-known institutions. Participation in the science base is in fact much broader, even though the large institutions do account for most of the output. We can explore the breadth of participation in the system by counting the number of institutions publishing and by breaking this count down into large and small publishers. If we use the number of papers published by an institution as a measure of "publishing size", we can also ask: which sectors are characterised by large publishers and which by small?

Table 3 sets out to answer these questions. For each sector, the first two columns report the number of papers published from 1981 to 1994 and the percentage of UK output this represents. The next column reports the number of unique institutional names publishing in each sector and the last column reports the number and percentage of 'big' publishers. A 'big' publisher is here defined as one publishing 10 or more papers from 1981 to 1994. Educational institutions account for most of the papers - 60%. However, industry accounts for the largest share of publishing institutions - 44% - while medical institutions account for 50% of the 1278 institutions publishing more than 10 papers. We can also see that research council laboratories are primarily 'big' institutions. In contrast, about 90% of the companies publishing papers during this period produced fewer than 10.
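
The institution counts described for Table 3 could be derived along the following lines. This is a sketch under assumptions: the record layout, the institution names and the paper counts are hypothetical, and only the threshold of 10 or more papers is taken from the text.

```python
from collections import defaultdict

BIG_THRESHOLD = 10  # 'big' publisher: 10 or more papers over 1981-1994

# Hypothetical (institution, sector, papers published 1981-1994) records.
records = [
    ("University A", "Educational", 5200),
    ("Hospital B", "Medical", 45),
    ("Company C", "Industry", 3),
    ("Company D", "Industry", 12),
    ("Company E", "Industry", 1),
]


def sector_profile(records, threshold=BIG_THRESHOLD):
    """Per sector: number of publishing institutions and number of 'big' ones."""
    profile = defaultdict(lambda: {"institutions": 0, "big": 0})
    for _name, sector, n_papers in records:
        profile[sector]["institutions"] += 1
        if n_papers >= threshold:
            profile[sector]["big"] += 1
    return dict(profile)


profile = sector_profile(records)
print(profile)
```

In this toy data industry has the most publishing institutions but only one 'big' publisher, which mirrors the kind of contrast the table is designed to show.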

Gibbons et al. not only point to this variety in the research system, they also claim it has been increasing. To investigate this, we can ask whether the number of participants in the research system has been increasing. Figure 2 displays for each sector the number of institutions publishing for each year 1981 to 1994. In the graph, publishing institutions are classified into four categories according to whether they published on average 1 paper per year, 2-10 papers, 11 to 100 or more than 100. (Only Oxford and Cambridge published more than 1000 papers per year.) The graph reveals that in this area, sectors have very different characteristics. For example, many hospitals or companies published just one paper or less per year, while the educational sector is more evenly balanced, with universities publishing more than 100 papers per year, new universities publishing 11-100 papers per year and schools and colleges publishing 1-10 papers per year.

The graph shows that the number of publishing institutions did increase in all sectors but one. The numbers on the beginning and end of each sector's graph report the number of publishing institutions in the first and last years. If we take the difference between these numbers as a measure of the size of the change, the number of industrial publishers increased by 60%, and the number of medical publishers increased by 39%. The number of non-profit publishers increased by 34%. The number of government institutions publishing varied over the decade, though the number in the final year was 25% higher than in the first year. The exception to the increase in number of publishing institutions is Research Councils whose numbers decreased by 14%. The decrease is consistent with concentration as the number of medium sized (11-100) publishers decreased and the number of large (>100) publishers increased.

The increasing number of publishing institutions lends support to the idea that research production is becoming more dispersed. Only the concentration among research council laboratories lends support to the idea that more management of research is leading to consolidation. The weight of evidence favouring dispersion reinforces the point that academic research forms only half the publishing system in the UK today. In fact, it forms the static half. The number of publishing institutions of other types, such as companies, hospitals and non-profits has grown.

The number of publishing institutions is not by itself an adequate measure of concentration, however. Concentration has two dimensions: the number of organisations and the size inequality between them. The distribution of research production across institutions has always been uneven, with a few producing a great deal, and a large number producing not much. While forces both traditional and internal, and new and external, produce this distribution, some suggest that the distribution might be becoming more even (Gibbons et al.).

Figure 2 displayed the number of institutions publishing, qualitatively conveying differences in institutional publishing size. To analyse concentration more quantitatively, and examine whether there were any changes over the decade, we will examine measures of concentration (Table 4). As there are various measures of concentration, none ideal in all circumstances, we present two here, both of which reflect the two dimensions of concentration (Davies et al. 1988, 79-86). The first measure is the number of institutions publishing 25%, 50% etc. of the papers, which is easy to understand and conveys the nature of the tail in the distribution for each sector. The second measure is the Herfindahl index, which, being a single number, makes it possible to assess change in concentration over time. This index is calculated by summing the squares of each institution's share of the sector's publications (its maximum value is one). For example, if there were two institutions in a sector producing 95% and 5% of the papers respectively, the Herfindahl index would be: 0.95² + 0.05² = 0.905. In a sector of size n, concentration reaches a minimum when all institutions publish the same number of papers. At this minimum, the Herfindahl index would equal 1/n where n is the number of institutions. By reversing this logic, each sector can be seen to be as concentrated as a sector of 1/Herfindahl index number of equal sized institutions. This number of institutions is reported in the last line of the tables. There are three tables in the figure. The first two report a three yearly average, in 1981-1983 and 1992-1994 respectively. The final table reports the difference between the figures in the first two tables. The columns are ordered by how concentrated the sectors were in the 1992-1994 period.
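
As an illustration only, the two concentration measures described above can be sketched in a few lines of Python. The function names are hypothetical and the input is assumed to be a simple list of paper counts, one per institution in a sector; this is not the software used to produce Table 4.

```python
def herfindahl(paper_counts):
    """Sum of the squared publication shares of the institutions in a sector."""
    total = sum(paper_counts)
    if total <= 0:
        raise ValueError("the sector must contain at least one paper")
    return sum((count / total) ** 2 for count in paper_counts)


def equivalent_institutions(paper_counts):
    """Number of equal sized institutions that would give the same index (1 / H)."""
    return 1.0 / herfindahl(paper_counts)


def institutions_for_share(paper_counts, share):
    """Smallest number of institutions, taken largest first, that together
    publish at least `share` (for example 0.25 or 0.5) of the sector's papers."""
    total = sum(paper_counts)
    running = 0
    for rank, count in enumerate(sorted(paper_counts, reverse=True), start=1):
        running += count
        if running >= share * total:
            return rank
    return len(paper_counts)


# The two-institution example from the text: shares of 95% and 5%.
example_index = herfindahl([95, 5])
```

For the example in the text, `herfindahl([95, 5])` gives 0.905 (up to floating point rounding), and a sector of four institutions with equal output gives 0.25, i.e. 1/n.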

The tables indicate that the sectors vary somewhat in the degree of concentration. The non-profit sector is the most concentrated while the medical sector is the least concentrated. Sector concentrations changed somewhat with four sectors becoming more concentrated: research councils, non-profit, educational and medical and two becoming less concentrated: government and industry. In 1981, industry was more concentrated than education, but in 1994 the reverse was true. Thus there is no uniform trend towards concentration or dispersion.

Data sources

There are many databases indexing the scientific and technical literature: Chemical Abstracts, Medline, Biosis, Forestry Abstracts, Physics Abstracts to name but a few. Bibliometric indicators are primarily based on one: the Science Citation Index (SCI) produced by the Institute for Scientific Information (ISI) in Philadelphia, USA. This section explains why the SCI is so heavily relied upon and its advantages and disadvantages.

The advantages are, first, coverage. All science fields are indexed in the SCI - a necessity if one is looking at whole research systems. In addition to being comprehensive, SCI coverage is unambiguous because every item from every journal is indexed. Coverage in other databases is ambiguous for indicator purposes because although they include all items from core journals, only items considered relevant to the subject of the database are included from other journals. There are 40,000-50,000 journals; of these the Institute for Scientific Information considers 10,000-12,000 to have some minimum quality. More than 90% of the citations in these journals are to a more limited set of 3,000-4,000 journals. These are indexed in the SCI. Thus, the SCI covers literature seen as important by researchers. Furthermore, the SCI's wide use for indicators means that its coverage has been well studied.

The second advantage is that all addresses listed on the paper are included in the SCI - a necessity for studying institutional output as collaboration is so extensive. Only first addresses are included in other databases, and so papers on which an institution's address was not listed first cannot be credited to the institution. This source of error is substantial and growing as the rate of institutional collaboration increases. Only the first address is needed to contact authors of a paper, so first address listing is not a problem from the perspective of the main database users, scientists searching for literature. From the policy perspective, the address that happens to be listed first is a social artefact and not of great policy interest in comparison to the total output of the institution. Of course, only if all addresses are listed can collaboration be studied.

The third advantage is that references are included in the SCI and only the SCI. Citations are a partial indicator of the impact research has had on succeeding work. Citation counts are such a useful adjunct to policy analysis that almost by themselves their presence would justify using the SCI for policy analysis.

Coverage and cost are the disadvantages of the SCI. Because it indexes all science, its coverage of any single area will not be as deep as that of specialist databases such as Medline, Chemical Abstracts, or Biosis. However, a higher percentage of an institution's papers in, for example, chemistry may often be found in the SCI than in Chemical Abstracts because the SCI lists all addresses (Russell et al. 1995). Thus, more comprehensive subject coverage does not necessarily equate to superior retrieval for institutions.

The database is also relatively costly to use, being produced by a private company. In comparison, patent databases are produced by government agencies and thus the American data are available for the media cost. Any large scale development of bibliometric indicators would have to budget several hundred thousand dollars to obtain the data, which would be usable under a license subject to copyright and intellectual property restrictions.

Method for producing systemic bibliometric indicators

Indicators can be produced from a database at various levels: the database as a whole, nations, institutions, departments or individuals. Movement from one level to the next level down entails an increase in difficulty and in the computational requirement of more than an order of magnitude because the databases were set up to serve scientists searching for literature, not policy analysis. "Raw" databases are suitable for some types of analysis, for example by journal or by year, because journal names and years are controlled terms kept standard, so simply counting occurrences of particular journal names in a particular year suffices. Unfortunately, these easy counts of databases are of little policy interest. National indicators, being of more interest, are well established, as mentioned earlier. However, they can only be produced today because many years of development were undertaken. Originally country names were not standardised because they were not crucial to the database users, scientists searching for literature. Thus natural variety and errors meant that fairly sophisticated searching was needed to count, for example, all UK papers (i.e. from England, Scotland, Wales, Northern Ireland, UK, or Britain but not New England, New South Wales etc.). However, country names are now standardised and the techniques for producing reliable national counts are well known. Thus national counts can be produced.
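
To give a flavour of the kind of searching involved, the sketch below counts UK papers in unstandardised address strings using Python regular expressions. It is an illustration under stated assumptions, not the procedure used historically: the term lists contain only the examples named in the text (plus the form NORTH IRELAND, which appears in the address example later in this paper), the function names are hypothetical, and a real search would need a much longer list of exclusions (NEW BRITAIN, for instance, is not handled).

```python
import re

# Country terms that should count as UK (from the examples in the text).
UK_PATTERN = re.compile(
    r"\b(ENGLAND|SCOTLAND|WALES|NORTHERN IRELAND|NORTH IRELAND|UK|BRITAIN)\b"
)
# Place names that contain a UK term but are not in the UK.
NOT_UK_PATTERN = re.compile(r"\bNEW (ENGLAND|SOUTH WALES)\b")


def is_uk_address(address):
    """True if the address mentions a UK term outside the excluded place names."""
    text = address.upper()
    cleaned = NOT_UK_PATTERN.sub(" ", text)  # blank out the false matches first
    return UK_PATTERN.search(cleaned) is not None


def count_uk_papers(papers):
    """Count papers with at least one UK address.

    `papers` is assumed to be an iterable of lists of address strings,
    one list per paper; a paper is counted once however many UK addresses it has.
    """
    return sum(1 for addresses in papers if any(is_uk_address(a) for a in addresses))
```

The word boundaries in the pattern keep UKRAINE from matching UK, which is the sort of natural variety the paragraph above refers to.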

In June 1992, the Science Policy Research Unit at the University of Sussex launched the Bibliometric Evaluation of Sectoral Scientific Trends (BESST) project. Its aim was to advance the state of the art by moving down one level, from the national to the institutional, on a substantial slice of the world literature (8%). More specifically, the objectives were (a) to determine the share of national scientific output in various scientific fields contributed by different institutional sectors (e.g. universities, industry, research councils, government laboratories, hospitals, etc.), (b) to map the changes during the 1980s in patterns of inter- and intra-sectoral collaboration in different scientific fields, (c) to investigate changes in the patterns of international collaboration with UK institutions, and (d) to use the data to investigate policy-relevant questions.

In order to meet these objectives variations of each institutional name recorded in the SCI had to be unified to a standard name and then each standard name had to be assigned to an institutional sector. This problem involved the manipulation of hundreds of megabytes of original SCI bibliographic text data, the development of techniques to construct a thesaurus of variant and standard institutional names and the design of software to use the thesaurus to produce a unified data set.

Before the process of constructing the institutional names thesaurus could begin so that the data could be unified, an intermediate data set was produced from the raw SCI tape data. Only the journal, authors and corporate addresses were retained while such things as the title, cited references, page and volume numbers were ignored. However, the unique six character ISI article identifier was retained to provide a link back to the original data. This new data set was used to construct the institutional name thesaurus. Each corporate address in the intermediate data set contained five major fields separated by a slash but since we were only interested in UK publications, only four of these fields were extracted (the fifth field typically contains the zip code for US institutions). An example of the contents of each field is listed below:


Field  Contents               Example

1      Institute, dept, etc.  QUEENS UNIV BELFAST,CTR MED BIOL,DEPT PHARM,97 LISBURN RD

2      City-postcode          BELFAST BT9 7BL

3      County                 ANTRIM

4      Country                NORTH IRELAND


The corporate address list was generated using the following criteria:

1. Only corporate addresses from article, note and review publication types containing the keywords England, Scotland, Wales or North Ireland in the country field were selected.

2. The county field was not used.

3. The city name was extracted from the city-postcode field where possible.

4. In general only the first, or the first and second sub-field from the first field was used. However, if certain keywords such as MRC, AFRC, NERC, or SERC (or a variation of a keyword) appeared anywhere within the first field, all four sub-fields were used when possible.
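
The criteria above might be sketched as follows. This is a minimal illustration under several assumptions that are ours, not taken from the project software: the address is a single string with fields separated by slashes and sub-fields by commas, the postcode pattern is a rough approximation, the function and constant names are hypothetical, and the filter on publication type is left out.

```python
import re

RESEARCH_COUNCIL_KEYWORDS = ("MRC", "AFRC", "NERC", "SERC")
UK_COUNTRIES = ("ENGLAND", "SCOTLAND", "WALES", "NORTH IRELAND")
# Rough, assumed pattern for a trailing UK postcode such as "BT9 7BL".
POSTCODE = re.compile(r"\s+[A-Z]{1,2}\d[A-Z\d]?(\s+\d[A-Z]{2})?$")


def parse_address(raw, subfields_to_keep=2):
    """Return (institution_name, city) for a UK address, or None otherwise."""
    fields = [f.strip() for f in raw.upper().split("/")]
    if len(fields) < 4:
        return None
    institute, city_postcode, _county, country = fields[:4]

    # Criterion 1: keep only addresses with a UK keyword in the country field.
    if not any(name in country for name in UK_COUNTRIES):
        return None

    # Criterion 2: the county field is read but never used.

    # Criterion 3: strip the postcode, where one is recognised, to get the city.
    city = POSTCODE.sub("", city_postcode).strip()

    # Criterion 4: keep the leading sub-fields, or all four for research councils.
    subfields = [s.strip() for s in institute.split(",")]
    words = re.findall(r"[A-Z]+", institute)
    if any(keyword in words for keyword in RESEARCH_COUNCIL_KEYWORDS):
        keep = 4
    else:
        keep = subfields_to_keep
    return ",".join(subfields[:keep]), city
```

Applied to the example address above, joined with slashes, the sketch returns the institution name QUEENS UNIV BELFAST,CTR MED BIOL and the city BELFAST. Variations of the research council keywords, which the criteria mention, would need a longer keyword list.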

A list of 'dirty' institutional names ordered alphabetically by city and institutional name was created from the intermediate data set. The laborious unification process involved manually working through this list and assigning each 'dirty' institutional name to a 'clean', definitive name. The result was a thesaurus of 'clean' institutional names (prefaced with a %), each assigned to a sector, and linked to all the associated variants or 'dirty' institutional names. The thesaurus was sorted alphabetically by city (prefaced with a #), the general layout being as follows:

  • # City name
  • % Clean name:sector
    • 1st dirty variant
    • 2nd dirty variant
    • ..
    • nth dirty variant

The following example is taken from the 1991 institutional thesaurus.

  • # LONDON
  • % ACER CONSULTANTS LTD:I
    • ACER CONSULTANTS LTD
  • % BETHLEM ROYAL HOSP & MAUDSLEY HOSP:S
    • BETHLEHEM ROYAL HOSP
    • BETHLEM MAUDSLEY HOSP
    • MAUDSLEY & BETHLEM ROYAL HOSP
    • BETHLEM ROYAL & MAUDSLEY HOSP
    • MAUDSLEY HOSP & INST PSYCHIAT
    • BETHLEM ROYAL HOSP & MAUDSLEY HOSP
  • % UNIV LONDON,IMPERIAL COLL:U
    • IMP COLL
    • IMPERIA COLL
    • IMPERIAL COLL
    • IMPERIAL COLL CTR ENVIRONM TECHNOL
    • IMPERIAL COLL SCI & MED
    • IMPERIAL COLL SCI & TECHNOL & MED
    • IMPERIAL COLL SCI MED & TECHNOL
    • UNIV LONDON,IMPERIAL COLL
    • UNIV LONDON,IMPERIAL COLL SCI TECHNOL & MED
    • UNIV LONDON IMPERIAL COLL SCI & TECHNOL
    • IMPERIAL COLL SCI TECHNOL & MED
    • IMPERIAL COLL SCI TECHNOL & MED
    • UNIV LONDON IMPERIAL COLL SCI & TECHNOL
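
A thesaurus in this layout is straightforward to read by machine. The sketch below is an illustration only: it assumes the thesaurus is available as plain text lines without the bullet marks shown above, and the function name and the choice of a dictionary keyed by city and variant are our assumptions rather than a description of the project software.

```python
def parse_thesaurus(lines):
    """Map (city, dirty_name) to (clean_name, sector).

    Lines starting with '#' open a city, lines starting with '%' open a
    clean name with its sector after the last colon, and every other
    non-empty line is a dirty variant of the current clean name.
    """
    mapping = {}
    city = clean = sector = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            city = line[1:].strip()
            clean = sector = None  # a new city resets the current clean name
        elif line.startswith("%"):
            entry = line[1:].strip()
            clean, separator, sector = entry.rpartition(":")
            if not separator:  # no sector given on this line
                clean, sector = entry, ""
        elif city is not None and clean is not None:
            mapping[(city, line)] = (clean, sector)
    return mapping


# A fragment of the London example above, without bullet marks.
sample = [
    "# LONDON",
    "% ACER CONSULTANTS LTD:I",
    "ACER CONSULTANTS LTD",
    "% UNIV LONDON,IMPERIAL COLL:U",
    "IMP COLL",
    "IMPERIA COLL",
]
thesaurus = parse_thesaurus(sample)
```

Keying on the city as well as the variant reflects the ordering of the thesaurus by city, so that identical variant names in different cities are not confused.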

The 1981 thesaurus was produced first and took about 4 weeks to prepare. It contained about 2500 clean institutional names and 4300 variant names in 840 cities and towns. This thesaurus was then used to pre-process the 1982 data resulting in a semi-clean thesaurus which reduced the manual procedure by about 30%. By the time the 1994 thesaurus was pre-processed using the 1981-93 thesauruses, the manual task had been reduced by 50-60% and unification took about 10 days. In total, over 30,000 variants were processed to yield 5,900 publishing institutions.
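
The saving from pre-processing comes from the fact that variants already unified in earlier years need not be examined again. A minimal sketch of that step, assuming the earlier thesauruses have been merged into one dictionary from (city, dirty name) to (clean name, sector), might look as follows; the function name and data layout are hypothetical.

```python
def preprocess(new_names, known):
    """Split this year's (city, dirty_name) pairs into those already unified
    in an earlier thesaurus and those still needing manual review."""
    resolved = {}
    pending = []
    for key in new_names:
        if key in known:
            resolved[key] = known[key]
        else:
            pending.append(key)
    return resolved, pending


# Illustrative data only: one variant seen in an earlier year, one not.
known = {("LONDON", "IMP COLL"): ("UNIV LONDON,IMPERIAL COLL", "U")}
this_year = [("LONDON", "IMP COLL"), ("LONDON", "IMPERIA COLL")]
resolved, pending = preprocess(this_year, known)
share_resolved = len(resolved) / len(this_year)
```

Only the names left in `pending` would go forward to the manual procedure.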

Some general rules were adopted for unifying the UK institutional names; they are listed below:

1. Each clean institution was assigned to one of the following sectors: medical, educational, research council (see point 4 below), industry, government (national and local), non-profit and unknown.

2. In a few cases, it was not possible to determine what 'clean' institutional name to assign to a 'dirty' name but it was possible to infer from a keyword(s) in the 'dirty' name what sector it belonged to. In such cases, a clean institutional name of unknown was used and given an appropriate sector designator.

3. In the UK, many public sector organisations were privatised during the 1980s. It is difficult to determine whether the publication under examination was submitted to a journal for publication before, during or after privatisation. Therefore, all of these organisations (e.g. research associations, water authorities, etc.) were assigned to the industry sector for the entire period.

4. UK research councils support both intramural research institutions located on university campuses or hospitals and research groups located within university or hospital departments. Distinguishing the two was difficult. An effort was made to assign all publications containing the keywords MRC, AFRC (or ARC), NERC, and SERC to the appropriate research council institution. However, if an MRC paper did not contain the keyword UNIT or contained the keyword GRP (abbreviation for group) it was assigned to the institution whose name appeared in the first sub-field (usually a hospital or university). All AFRC publications that did not designate an AFRC institution but contained the keywords UNIT or GRP were likewise assigned to the institution named in the first sub-field (usually a university).

5. Implicit in the unification procedure was the assumption that we should not attempt to second guess an author's institutional affiliation. Therefore, papers that named two institutions in either the first sub-field or the first and second sub-fields (e.g. HOSP X & Y; or UNIV X, HOSP Y; or HOSP Y, UNIV X) were assigned to the institutional name that appeared first. For example, if the first sub-field contained the name of a university and the second sub-field the name of a hospital, the paper was assigned to the university. Similarly, if a hospital name appeared first and a university name second then the paper was assigned to the hospital.

Difficulties of regularly producing systemic bibliometric indicators for a range of countries

The difficulty of unifying name variants has several implications for any attempt to regularly publish institutional level indicators for several countries. Firstly, it will be expensive. The cost of data combined with the labour and capital expenditure for computers adds up to a substantial sum, greater than that available, for example, for social science research under the EU Framework programme. Secondly, ongoing unification will be needed, a process requiring two weeks for the UK (excluding computer processing time). The first round, in which software and procedures would be developed, training undertaken, and a decade or more of data established, would take three to four years. Thirdly, central quality control mechanisms would be needed. The complexity and high manual component mean that all work must be checked for consistency to ensure compliance with agreed unification conventions and to eliminate the inevitable errors. Quality control is essential if data are to be consistent across countries and over time - i.e. if the data are to be usable. Such checking should be performed centrally to ensure that data from all countries are of the same quality. This suggests that international co-ordination and guidance are essential.

A second class of difficulties is conceptual. First, the relationship between addresses and institutions is not entirely straightforward. The technique assumes that addresses indicate the institutional affiliation of authors. This may not be true. For example, in France the address of a researcher may be a university but the institutional affiliation may be CNRS. The address "Cavendish Laboratory" is often given, meaning "Cambridge University, Physics Department". Alternatively, independent institutes may be located on university campuses; for example, the consulting company "Institute for Employment Studies" is in the same building as SPRU, which is a department of the University of Sussex.

Second, institutions change, but time series data assume they remain the same. Some universities in the UK have had three names in the last 10 years. Government laboratories have been privatised and consolidated. Companies merge, demerge and acquire.

Third, an institution may not always be clearly assigned to one sector. Indicators would be developed at the sectoral level and the technique assumes that institutions can be assigned to one of the following sectors: medical, educational, research council, industry, non-profit or government. In the UK new institutions seem to be appearing that get core funding from several sources - governmental, industrial and charity for example. These institutions transcend the sectoral boundaries as traditionally defined. Luckily, few exist at the moment. The most pervasive problem in institutional and sector assignment is medical schools. Clinical researchers have dual university-hospital affiliations; there are two streams of funding and medical schools (in the UK at least) can be departments of universities or hospitals. Separating the two is not just a problem of bibliometric method; clinicians themselves are not clear about which stream of money paid for what. In the US, this has never proved a problem. Research hospitals are components of universities. In the UK however, calling National Health Service (NHS) hospitals "universities" is inaccurate and discounts the large (if hitherto invisible) contribution made to the UK science base by NHS research funding. We resolved the dilemma with the following rules, which are based on the principle that we do not second guess the author of the paper:

1. As we unified to the institutional not the departmental level, medical schools as departments were unified to their institution - hospital or university.

2. If an author lists hospital and university addresses on one line, as one address, the paper was assigned to the first affiliation.

3. If an author lists hospital and university affiliations as two separate addresses on two lines, the paper was counted as collaborative between the hospital and university.
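
The second and third rules can be expressed compactly. The sketch below is illustrative only and rests on an assumption of ours about the data layout: each paper is represented as a list of address lines, and each address line as the list of institutions named on it, in the order the author gave them. The function name is hypothetical.

```python
def credit_paper(address_lines):
    """Return (credited_institutions, is_collaborative) for one paper.

    Each address line is credited to the first institution named on it
    (one line, one address: first affiliation only). A paper whose lines
    credit more than one distinct institution counts as collaborative.
    """
    credited = []
    for institutions in address_lines:
        if not institutions:
            continue
        first = institutions[0]  # never second guess the author's ordering
        if first not in credited:
            credited.append(first)
    return credited, len(credited) > 1


# Hospital and university on one line: credited to the hospital alone.
one_line = credit_paper([["HOSP Y", "UNIV X"]])
# Hospital and university on two lines: a collaborative paper.
two_lines = credit_paper([["HOSP Y"], ["UNIV X"]])
```

The same paper can therefore be counted quite differently depending on how its authors chose to write their addresses, which is the point of the principle stated above.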

The conceptual difficulties of unification, namely complex and changing institutional structures and multiple sector affiliations, have several consequences for multi-national indicator development. First, the process will only be possible in countries where addresses reflect institutional affiliation to a reasonable degree. Second, national experts must oversee unification. Only local knowledge brought to bear on institutional complexity will produce sound data. Third, no single sector classification will suit all countries. At this point the best solution would seem to be two levels of sector classification. A more detailed level designed to meet national policy interest and an internationally negotiated higher level aggregation designed for international indicator use.

Finally, the project would seem to need governance by an international institution. International co-ordination is required to produce compatible national datasets and to raise the amount of money needed. Consistency checks and quality control need to be performed at the international level, as mentioned earlier, as does regular indicator production. In addition, if a master copy of the database could be housed somewhere, it could be made available to all interested researchers. In this case, the database would form an infrastructure useful not just for high level indicators, but also for a wide range of international analyses and evaluation studies. Breadth of access would increase the returns on the investment required to produce the database.

Summary

By producing the database foundation for systemic bibliometric indicators, the science policy community would enter the emerging area of data mining. The time is ripe to do so. The technology is available, the need is clear and the techniques have been demonstrated. If such a database were constructed as an open-access infrastructural support for indicator, evaluation, analytical and policy work, the community could use it to the full - advancing indicator, data mining and visualisation technology in the process.

References

Davies, S., B. Lyons, H. Dixon, and P. Geroski. 1988. Economics of Industrial Organisation. Harlow: Longman.

European Commission. 1994. The European Report on Science and Technology Indicators 1994. EUR 15897. Luxembourg: Office for Official Publications of the European Communities.

Gibbons, M., C. Limoges, H. Nowotny, S. Schwartzman, P. Scott and M. Trow. 1994. The New Production of Knowledge. London: Sage.

Hicks, D.M., T. Ishizuka, P. Keen & S. Sweet, 1994, Japanese Corporations, Scientific Research and Globalisation. Research Policy, 23: 375-384.

Hicks, D. 1995. Published Papers, Tacit Competencies and Corporate Management of the Public/Private Character of Knowledge. Industrial and Corporate Change. 4:401-424.

Katz, J. S., D. Hicks, M. Sharp, and B. R. Martin. October 1995. The Changing Shape of British Science. STEEP Special Report No. 3. Brighton, UK: Science Policy Research Unit.

Katz, J. S. and D. Hicks. 11 May 1995. Questions of Collaboration. Nature. 375:99.

Katz, J.S. and B.R. Martin. 1996. What is Research Collaboration? Research Policy. In Press.

Kuhn, T.S. 1962. The Structure of Scientific Revolutions, Chicago: University of Chicago Press.

Latour, B. 1987. Science in Action. Milton Keynes: Open University Press.

Leclerc, M., Y. Okubo, L. Frigoletto and J-F Miquel. 1992. Scientific Co-Operation Between Canada and the European Community. Science and Public Policy. 19:15-24.

Lundvall, B.A. (ed.) 1992. National Systems of Innovation. London: Pinter.

Luukkonen, T. et al. 1992. Understanding Patterns of International Scientific Collaboration. Science, Technology and Human Values. 17:101-126.

Narin, F. and E.S. Whitlow. May 1990. Measurement of Scientific Co-Operation and Coauthorship in CEC-Related Areas of Science. Vol. 1. Report to the Commission of the European Communities. New Jersey: CHI Research.

National Science Board, Science & Engineering Indicators, Washington DC, USGPO, various years.

Nelson, R.R (ed.) 1993. National Innovation Systems. New York: Oxford University Press.

OECD. 1992. Technology and the Economy: the Key Relationships. Paris: OECD.

Russell, J.M., A.Ma. Rosas & R. Arvanitis. 1995. "Institutional production cutting across disciplinary boundaries," Proceedings of the Fifth Biennial Conference of the International Society for Scientometrics and Informetrics. Medford, NJ: Learned Information Inc. pp. 485-493.

Schubert, A. and T. Braun. 1990. International Collaboration in the Sciences. Scientometrics. 19:3-10.

Ziman, J. 1994. Prometheus Bound: Science in a Dynamic Steady State. Cambridge: Cambridge University Press.


Figure 1: Collaboration


Figure 2: Number of institutions publishing each year, 1981-94, by sector


Table 3: Number of institutions in each sector


Table 4: Concentration measures