Saturday, February 13, 2010

DBMSs

The area of database management systems is a microcosm of computer science in general. The issues addressed and the techniques used span a wide spectrum, including languages, object-orientation and other programming paradigms, compilation, operating systems, concurrent programming, data structures, algorithm theory, parallel and distributed systems, user interfaces,
expert systems and artificial intelligence, statistical techniques, and
dynamic programming. We cannot go into all these aspects 
of database management in one place, but we hope to
give the reader a sense of the excitement in   
this rich and vibrant discipline.
Johannes Gehrke
    Cornell University
 Ithaca, New York, USA                                                     
Raghu Ramakrishnan
   University of Wisconsin
  Madison, Wisconsin, USA


  Has everyone noticed that all the letters of the word database are typed with
the left hand? Now the layout of the QWERTY typewriter keyboard was
designed, among other things, to facilitate the even use of both hands.
It follows, therefore, that writing about databases is not only
unnatural, but a lot harder than it appears.
                                                                      ---Anonymous


FOUNDATIONS



OVERVIEW OF DATABASE SYSTEMS

A database is a collection of data, typically describing the activities of one or more related organizations. For example, a university database might contain information about the following:
  • Entities (things in the modeled world, represented as drawing elements in an entity-relationship diagram) such as students, faculty, courses, and classrooms.
  • Relationships between entities, such as students' enrollment in courses, faculty teaching courses, and the use of rooms for courses.
A database management system, or DBMS, is software designed to assist in maintaining and utilizing large collections of data. The need for such systems, as well as their use, is growing rapidly. The alternative to using a DBMS is to store the data in files and write application-specific code to manage it. The use of a DBMS has several important advantages, as we will see.


MANAGING DATA

   Not surprisingly, many decisions about how to use a DBMS for a given application depend on what capabilities the DBMS supports efficiently. Therefore, to use a DBMS well, it is necessary to also understand how a DBMS works.
   Many kinds of database management systems are in use, but here we concentrate on relational database systems (RDBMSs), which are by far the dominant type of DBMS today. The following questions are addressed here:
  1. Database Design and Application Development: How can a user describe a real-world enterprise (e.g., a university) in terms of the data stored in a DBMS? What factors must be considered in deciding how to organize the stored data? How can we develop applications that rely upon a DBMS?
  2. Data Analysis: How can a user answer questions about the enterprise by posing queries over the data in the DBMS?
  3. Concurrency and Robustness: How does a DBMS allow many users to access data concurrently, and how does it protect the data in the event of system failures?
  4. Efficiency and Scalability: How does a DBMS store large datasets and answer questions against this data efficiently?

A HISTORICAL PERSPECTIVE

From the earliest days of computers, storing and manipulating data have been a major application focus.
  1. The first general-purpose DBMS, designed by Charles Bachman at General Electric in the early 1960s, was called the Integrated Data Store. It formed the basis for the network data model, which was standardized by the Conference on Data Systems Languages (CODASYL) and strongly influenced database systems through the 1960s. Bachman was the first  recipient of ACM's Turing Award (the computer science equivalent of a Nobel Prize) for work in the database area; he received the award in 1973.
  2. In the late 1960s, IBM developed the Information Management System (IMS) DBMS, used even today in many major installations. IMS formed the basis  for an alternative data representation framework called the hierarchical data model.   
  3. The SABRE system for making airline reservations was jointly developed by American Airlines and IBM around the same time, and it allowed several people to access the same data through a computer network. Interestingly, today the same SABRE system is used to power popular Web-based travel services such as Travelocity.
  4. In 1970, Edgar Codd, at IBM's San Jose Research Laboratory, proposed a new data representation framework called the relational data model. This proved to be a watershed in the development of database systems: It sparked the rapid development of several DBMSs based on the relational model, along with a rich body of theoretical results that placed the field on a firm foundation. Codd won the 1981 Turing Award for his seminal work. Database systems matured as an academic discipline, and the popularity of relational DBMSs changed the commercial landscape. Their benefits were widely recognized, and the use of DBMSs for managing corporate data became standard practice.
  5. In the 1980s, the relational model consolidated its position as the dominant DBMS paradigm, and database systems continued to gain widespread use. The SQL query language for relational databases, developed as part of IBM's System R project, is now the standard query language. SQL was standardized in the late 1980s, and the current standard, SQL:2008, was adopted by the American National Standards Institute (ANSI) and International Organization for Standardization (ISO). Arguably, the most widely used form of concurrent programming is the concurrent execution of database programs (called transactions). Users write programs as if they are to be run by themselves, and the responsibility for running them concurrently is given to the DBMS. James Gray won the 1999 Turing award for his contributions to database transaction management.
  6. In the late 1980s and the 1990s, advances were made in many areas of database systems. Considerable research was carried out into more  powerful query languages and richer data models, with emphasis placed on supporting complex analysis of data from all parts of an enterprise. Several vendors (e.g., IBM's DB2, Oracle 8, Informix2, UDS--a network model database management system by Siemens) extended their systems with the ability to store new data types such as images and text, and to ask more complex queries. 
  7. Specialized systems have been developed by numerous vendors for creating data warehouses, consolidating data from several databases, and for carrying out specialized analysis. An interesting phenomenon is the emergence of several enterprise resource planning (ERP) and management resource planning (MRP) packages, which add a substantial layer of application-oriented features on top of a DBMS. Widely used packages include systems from Baan, Oracle, PeopleSoft, SAP, and Siebel. These packages identify a set of common tasks (e.g., inventory management, human resources planning, financial analysis) encountered by a large number of organizations and provide a general application layer to carry out these tasks. The data is stored in a relational DBMS and the application
    layer can be customized to different companies, leading to lower overall costs for the companies, compared to the cost of building the application layer from scratch. 
  8. Most significant, perhaps, DBMSs have entered the Internet Age. While the first generation of websites stored their data exclusively in operating  systems files, the use of a DBMS to store data accessed through a Web browser is becoming widespread. Queries are generated through  Web-accessible forms and answers are formatted using a markup language  such as HTML to be easily displayed in a browser. All the database vendors are adding features to their DBMS aimed at making it more suitable for deployment over the Internet. Database management continues to gain importance as more and more data is brought online and made ever more accessible through computer networking. 
  9. Today the field is being driven by exciting visions such as multimedia databases, interactive video, streaming data, digital libraries, a host of scientific projects such as the human genome mapping effort and NASA's Earth Observation System project, and the desire of companies to consolidate their decision-making processes and mine their data repositories for useful information about their businesses (data mining). Commercially, database management systems represent one of the largest and most vigorous market segments. Thus the study of database systems could prove to be richly rewarding in more ways than one!
      
    FILE SYSTEMS VERSUS A DBMS

    To understand the need for a DBMS, let us consider a motivating scenario:
    A company has a large collection (say, 500 GB) of data on employees, departments, products, sales, and so on. This data is accessed concurrently by several employees. Questions about the data must be answered quickly, changes made to the data by different users must be applied consistently, and access to certain parts of the data (e.g., salaries) must be restricted.
    We can try to manage the data by storing it in operating system files. This approach has many drawbacks, including the following:
    • We probably do not have 500 GB of main memory to hold all the data. We must therefore store data in a storage device such as a disk or tape and bring only relevant parts into main memory for processing as needed.
    • Even if we have 500 GB of main memory, on computer systems with 32-bit addressing we cannot refer directly to more than about 4 GB of data (with 64-bit addressing, the limit rises to roughly 18,446.74 PB, i.e., 16 EiB). We have to program some method of identifying all data items.
    • We have to write special programs to answer each question a user may want  to ask about the data. These programs are likely to be complex because of the large volume of data to be searched.
    • We must protect the data from inconsistent changes made by different users accessing the data concurrently. If applications must address the details of such concurrent access, this adds greatly to their complexity.
    • We must ensure that data is restored to a consistent state if the system crashes while changes are being made.
    • Operating systems provide only a password mechanism for security. This is not sufficiently flexible to enforce security policies in which different users have permission to access different subsets of the data.
    A DBMS is a piece of software designed to make the preceding tasks easier. By storing data in a DBMS rather than as a collection of operating system files, we can use the DBMS's features to manage the data in a robust and efficient manner.
    As the volume of data and the number of users grow (hundreds of gigabytes of data and thousands of users are common in current corporate databases), DBMS support becomes indispensable.

    ADVANTAGES OF A DBMS

    Using a DBMS to manage data has many advantages:
    • Data Independence: Application programs should not, ideally, be exposed to details of data representation and storage. The DBMS provides an abstract view of the data that hides such details.
    • Efficient Data Access: A DBMS utilizes a variety of sophisticated techniques to store and retrieve data efficiently. This feature is especially important if the data is stored on external storage devices.
    • Data Integrity and Security: If data is always accessed through the DBMS, the DBMS can enforce integrity constraints. For example, before inserting salary information for an employee, the DBMS can check that the  department budget is not exceeded. Also, it can enforce access controls that govern what data is visible to different classes of users.
    • Data Administration: When several users share the data, centralizing the administration of data can offer significant improvements. Experienced professionals who understand the nature of the data being managed, and how different groups of users use it, can be responsible for organizing the data representation to minimize redundancy and for fine-tuning the storage of the data to make retrieval efficient.
    • Concurrent Access and Crash Recovery: A DBMS schedules concurrent accesses to the data in such a manner that users can think of the data as being accessed by only one user at a time. Further, the DBMS protects users from the effects of system failures.
    • Reduced Application Development Time: Clearly, the DBMS supports  important functions that are common to many applications accessing data in the DBMS. This, in conjunction with the high-level interface to the data, facilitates quick application development. DBMS applications are also likely to be more robust than similar stand-alone applications because many important tasks are handled by the DBMS (and do not have to be debugged and tested in the application).
     Given all these advantages, is there ever a reason not to use a DBMS? Sometimes, yes. A DBMS is a complex piece of software, optimized for certain kinds of workloads (e.g., answering complex queries or handling many concurrent requests), and its performance may not be adequate for certain specialized applications. Examples include:
    • Applications with tight real-time constraints, or with just a few well-defined critical operations for which efficient custom code must be written.
    • Applications that need to manipulate the data in ways not supported by the query language. In such a situation, the abstract view of the data presented by the DBMS does not match the application's needs and actually gets in the way. As an example, relational databases do not support flexible analysis of text data (although vendors are now extending their products in this direction).
    If specialized performance or data manipulation requirements are central to an application, the application may choose not to use a DBMS, especially if the added benefits of a DBMS (e.g., flexible querying, security, concurrent access, and crash recovery) are not required. In most situations calling for large-scale data management, however, DBMSs have become an indispensable tool.


    DESCRIBING AND STORING DATA IN A DBMS

       The user of a DBMS is ultimately concerned with some real-world enterprise, and the data to be stored describes various aspects of this enterprise. For example, there are students, faculty, and courses in a university, and the data in a university database describes these entities and their relationships.
    A data model is a collection of high-level data description constructs that hide many low-level storage details. 
       A DBMS allows a user to define the data to be stored in terms of a data model. Most database management systems today are based on the relational data model, which we focus on in this text.

       While the data model of the DBMS hides many details, it is nonetheless closer to how the DBMS stores data than to how a user thinks about the underlying enterprise.
    A semantic data model is a more abstract, high-level data model that makes it easier for a user to come up with a good initial description of the data in an enterprise.
    These models contain a wide variety of constructs that help describe a real application scenario. A DBMS is not intended to support all these constructs directly; it is typically built around a data model with just a few basic constructs, such as the relational model.
    A database design in terms of a semantic model serves as a useful starting point and is subsequently translated into a database design in terms of the data model the DBMS actually supports.
    An Example of Poor Design: The relational schema for Students (see below) illustrates a poor design choice; you should never create a field such as age, whose value is constantly changing. A better choice would be DOB (for date of birth); age can be computed from this. We continue to use age in our examples, however, because it makes them easier to read.
    A widely used semantic data model called the entity-relationship (ER) model allows us to pictorially denote entities and the relationships among them.

    The Relational Model

    In this section we provide a brief introduction to the relational model. The central data description construct in this model is a relation, which can be thought of as a set of records.
    A description of data in terms of a data model is called a schema.
    In the relational model, the schema for a relation specifies:
    1. its name, 
    2. the name of each field (or attribute or column), and 
    3. the type of each field. 
    As an example, student information in a university database may be stored in a relation with the following schema:
     Students( sid: string, name: string, login: string, age: integer, gpa: real)
       The preceding schema says that each record in the Students relation has five fields, with field names and types as indicated. An example instance of the
    Students relation appears in Figure 1.1. Each row in the Students relation is a record that describes a student. The description is not complete (for example, the student's height is not included) but is presumably adequate for the intended applications in the university database. Every row follows the schema of the Students relation. The schema can therefore be regarded as a template for describing a relation (Students, in our case).

       We can make the description of a collection of students more precise by specifying integrity constraints, which are conditions that the records in a relation must satisfy. For example, we could specify that every student has a unique sid value. Observe that we cannot capture this information by simply  adding another field to the Students schema.
    Thus, the ability to specify  uniqueness of the values in a field increases the accuracy with which we can describe our data. The expressiveness of the constructs available for specifying integrity constraints is an important aspect of a data model.
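As a sketch of how such a constraint might be declared and enforced, the following uses Python's built-in sqlite3 module as a stand-in relational DBMS (the sample student values are invented for illustration). The uniqueness of sid is stated once, in the schema, and the DBMS rejects any record that would violate it:

```python
import sqlite3

# In-memory database; the table follows the Students schema from the text.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Students (
        sid   TEXT PRIMARY KEY,   -- integrity constraint: sid values are unique
        name  TEXT,
        login TEXT,
        age   INTEGER,
        gpa   REAL
    )
""")
conn.execute("INSERT INTO Students VALUES ('53666', 'Jones', 'jones@cs', 18, 3.4)")

# A second row with the same sid violates the constraint and is rejected.
try:
    conn.execute("INSERT INTO Students VALUES ('53666', 'Smith', 'smith@ee', 19, 3.8)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(duplicate_allowed)  # False: the DBMS enforced uniqueness of sid
```

Note that the constraint lives in the schema, not in application code: every program that inserts into Students gets the check for free.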

    Other Data Models

    In addition to the relational data model (which is used in numerous systems, including IBM's DB2, Informix, Oracle, Sybase, Microsoft's Access, FoxBase, Paradox, Tandem, and Teradata), other important data models include the hierarchical model, the network model, the object-oriented model, and the object-relational model.
       While many databases still use the hierarchical and network models, and systems based on the object-oriented and object-relational models are gaining acceptance in the marketplace, the dominant model today is the relational model.
       Indeed, the object-relational model, which is gaining in popularity, is an effort to combine the best features of the relational and object-oriented models, and a good grasp of the relational model is necessary to understand object-relational concepts.


    Levels of Abstraction in a DBMS

    The data in a DBMS is described at three levels of abstraction, as illustrated in Figure 1.2. The database description consists of a schema at each of these three levels of abstraction:
    • the conceptual, 
    • physical, and 
    • external.
    A data definition language (DDL) is used to define the external and conceptual schemas. We discuss the DDL facilities of the most widely used database language, SQL.
    All DBMS vendors also support SQL commands to describe aspects of the physical schema, but these commands are not part of the SQL language standard. 
    Information about the conceptual, external, and physical schemas is stored in the system catalogs (DBMS management structures). We discuss the three levels of abstraction in the rest of this section.


    Conceptual Schema
    The conceptual schema (sometimes called the logical schema) describes the stored data in terms of the data model of the DBMS.
    In a relational DBMS, the conceptual schema describes all relations that are stored in the database. In our sample university database, these relations contain information about:
    • entities, such as students and faculty, and about 
    • relationships, such as students' enrollment in courses. 
    All student entities can be described using records in a Students relation, as we saw earlier. In fact, each collection of entities and each collection of  relationships can be described as a relation, leading to the following conceptual schema:

    Students ( sid: string, name: string, login: string,
               age: integer, gpa: real)
    Faculty  ( fid: string, fname: string, sal: real)
    Courses  ( cid: string, cname: string, credits: integer)
    Rooms    ( rno: integer, address: string, capacity: integer)
    Enrolled ( sid: string, cid: string, grade: string)
    Teaches  ( fid: string, cid: string)
    Meets_In ( cid: string, rno: integer, time: string)
    


    The choice of relations, and the choice of fields for each relation, is not always
    obvious, and the process of arriving at a good conceptual schema is called conceptual database design.


    Physical Schema
    The physical schema specifies additional storage details. Essentially, the physical schema summarizes how the relations described in the conceptual schema are actually stored on secondary storage devices such as disks and tapes. 
       We must decide what file organizations to use to store the relations and create auxiliary data structures, called indexes, to speed up data retrieval operations.
       A sample physical schema for the university database follows:
    • Store all relations as unsorted files of records. (A file in a DBMS is either a collection of records or a collection of pages, rather than a string of characters as in an operating system.)
    • Create indexes on the first column of the Students, Faculty, and Courses relations, the sal column of Faculty, and the capacity column of Rooms.
    Decisions about the physical schema are based on an understanding of how the data is typically accessed. The process of arriving at a good physical schema is called physical database design.
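The second bullet of the sample physical schema can be sketched as follows, again using sqlite3 as a stand-in DBMS (the index names are invented). A DBMS exposes its physical-design decisions to the query optimizer, which we can observe via the query plan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Faculty (fid TEXT, fname TEXT, sal REAL)")

# Indexes from the sample physical schema: the first column of Faculty,
# and the sal column.
conn.execute("CREATE INDEX idx_faculty_fid ON Faculty(fid)")
conn.execute("CREATE INDEX idx_faculty_sal ON Faculty(sal)")

# The query planner can now use idx_faculty_sal for range queries on salary,
# instead of scanning the whole file of records.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT fname FROM Faculty WHERE sal > 50000"
).fetchall()
print(plan)  # the plan detail mentions idx_faculty_sal, not a full scan
```

This illustrates physical data independence in miniature: the query text never mentions the index, so indexes can be added or dropped without changing applications.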


    External Schema
    External schemas, which usually are also in terms of the data model of  the DBMS, allow data access to be customized (and authorized) at the level  of individual users or groups of users. 
    Any given database has exactly one  conceptual schema and one physical schema because it has just one set of  stored relations, but it may have several external schemas, each tailored to a  particular group of users.
    Each external schema consists of a collection of one or  more views and relations from the conceptual schema. A view is conceptually  a relation, but the records in a view are not stored in the DBMS. Rather, they are computed using a definition for the view, in terms of relations stored in the  DBMS.
    The external schema design is guided by end user requirements. For example,  we might want to allow students to find out the names of faculty members  teaching courses as well as course enrollments. This can be done by defining  the following view:
    Courseinfo( cid: string, fname: string, enrollment: integer)
    A user can treat a view just like a relation and ask questions about the records  in the view. Even though the records in the view are not stored explicitly, they are computed as needed.
    We did not include Courseinfo in the conceptual schema because we can compute Courseinfo from the relations in the conceptual schema, and to store it in addition would be redundant. Such redundancy, in addition to the wasted space, could lead to inconsistencies. 
    For example, a tuple may be inserted into the Enrolled relation, indicating that a particular student has enrolled in some course, without incrementing the value in the enrollment field of the corresponding record of Courseinfo (if the latter also is part of the conceptual schema and its tuples are stored in the DBMS).
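A minimal sketch of such a view, using sqlite3 and invented sample rows (the real Courseinfo view would be written in the DBMS's SQL dialect): the view's records are never stored, so inserting into Enrolled cannot leave a stale enrollment count behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
    CREATE TABLE Faculty  (fid TEXT, fname TEXT, sal REAL);
    CREATE TABLE Teaches  (fid TEXT, cid TEXT);
    CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

    -- Courseinfo is a view: its rows are computed on demand, not stored.
    CREATE VIEW Courseinfo AS
        SELECT c.cid, f.fname,
               (SELECT COUNT(*) FROM Enrolled e WHERE e.cid = c.cid) AS enrollment
        FROM Courses c
        JOIN Teaches t ON t.cid = c.cid
        JOIN Faculty f ON f.fid = t.fid;

    INSERT INTO Courses  VALUES ('CS564', 'Database Systems', 3);
    INSERT INTO Faculty  VALUES ('F1', 'Ramakrishnan', 90000);
    INSERT INTO Teaches  VALUES ('F1', 'CS564');
    INSERT INTO Enrolled VALUES ('S1', 'CS564', 'A');
    INSERT INTO Enrolled VALUES ('S2', 'CS564', 'B');
""")

# Users query the view exactly as if it were a stored relation.
rows = conn.execute("SELECT * FROM Courseinfo").fetchall()
print(rows)  # [('CS564', 'Ramakrishnan', 2)]
```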


    Data Independence

    A very important advantage of using a DBMS is that it offers data independence.
    That is, application programs are insulated from changes in the way the data is structured and stored. Data independence is achieved through use of the three levels of data abstraction; 
    in particular, the conceptual schema and the external schema provide distinct benefits in this area.

    Relations in the external schema (view relations) are in principle generated on demand from the relations corresponding to the conceptual schema. If the  underlying data is reorganized, that is, the conceptual schema is changed, the definition of a view relation can be modified so that the same relation is computed as before (In practice, they could be precomputed and stored to speed up queries on view relations, but the computed view relations must be updated whenever the underlying relations are updated.). For example, suppose that the Faculty relation in our university database is replaced by the following two relations:
    Faculty_public (fid: string, fname: string, office: integer)
    Faculty_private (fid: string, sal: real)
       Intuitively, some confidential information about faculty has been placed in a separate relation and information about offices has been added. The Courseinfo view relation can be redefined in terms of Faculty_public and Faculty_private, which together contain all the information in Faculty, so that a user who queries Courseinfo will get the same answers as before.
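The mechanics of this can be sketched with sqlite3. For brevity a simpler hypothetical view (FacultyNames, listing fid and fname) stands in for Courseinfo; the sample rows are invented. The conceptual schema changes, the view is redefined, and queries against the view return the same answers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Original relation, and a view defined over it.
    CREATE TABLE Faculty (fid TEXT, fname TEXT, sal REAL);
    INSERT INTO Faculty VALUES ('F1', 'Codd', 95000);
    CREATE VIEW FacultyNames AS SELECT fid, fname FROM Faculty;
""")
before = conn.execute("SELECT * FROM FacultyNames").fetchall()

conn.executescript("""
    -- Reorganize the conceptual schema: split Faculty as in the text.
    CREATE TABLE Faculty_public  (fid TEXT, fname TEXT, office INTEGER);
    CREATE TABLE Faculty_private (fid TEXT, sal REAL);
    INSERT INTO Faculty_public  VALUES ('F1', 'Codd', 101);
    INSERT INTO Faculty_private VALUES ('F1', 95000);
    DROP VIEW  FacultyNames;
    DROP TABLE Faculty;

    -- Redefine the view so users see exactly what they saw before.
    CREATE VIEW FacultyNames AS SELECT fid, fname FROM Faculty_public;
""")
after = conn.execute("SELECT * FROM FacultyNames").fetchall()

print(before == after)  # True: view users are shielded from the reorganization
```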

       Thus, users can be shielded from changes in the logical structure of the data, or changes in the choice of relations to be stored. This property is called logical data independence.

       In turn, the conceptual schema insulates users from changes in physical storage details. This property is referred to as physical data independence. The conceptual schema hides details such as how the data is actually laid out on disk, the file structure, and the choice of indexes. As long as the conceptual  schema remains the same, we can change these storage details without altering applications. (Of course, performance might be affected by such changes.)


    QUERIES IN A DBMS

    The ease with which information can be obtained from a database often determines its value to a user. In contrast to older database systems, relational database systems allow a rich class of questions to be posed easily; this feature has contributed greatly to their popularity. Consider the sample university database discussed earlier. Here are some questions a user might ask:
    1. What is the name of the student with student ID 123456?
    2. What is the average salary of professors who teach course CS564?
    3. How many students are enrolled in CS564?
    4. What fraction of students in CS564 received a grade better than B?
    5. Is any student with a GPA less than 3.0 enrolled in CS564?
    Such questions involving the data stored in a DBMS are called queries. A DBMS provides a specialized language, called the query language, in which queries can be posed. A very attractive feature of the relational model is that it supports powerful query languages.
    • Relational calculus is a formal query language based on mathematical logic, and queries in this language have an intuitive, precise meaning.  
    • Relational algebra is another formal query language, based on a collection of operators for manipulating relations, which is equivalent in power to the calculus.
       A DBMS takes great care to evaluate queries as efficiently as possible. Of course, the efficiency of query evaluation is determined to a large extent by how the data is stored physically. Indexes can be used to speed up many queries; in fact, a good choice of indexes for the underlying relations can speed up each query in the preceding list.
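To make the declarative style concrete, here is question 2 from the list posed in SQL through sqlite3 (the faculty rows and salaries are invented sample data). The query states *what* is wanted; the DBMS decides how to evaluate it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Faculty (fid TEXT, fname TEXT, sal REAL);
    CREATE TABLE Teaches (fid TEXT, cid TEXT);
    INSERT INTO Faculty VALUES ('F1', 'Gehrke',  80000);
    INSERT INTO Faculty VALUES ('F2', 'Bachman', 100000);
    INSERT INTO Teaches VALUES ('F1', 'CS564');
    INSERT INTO Teaches VALUES ('F2', 'CS564');
""")

# "What is the average salary of professors who teach course CS564?"
avg_sal = conn.execute("""
    SELECT AVG(F.sal)
    FROM Faculty F JOIN Teaches T ON F.fid = T.fid
    WHERE T.cid = 'CS564'
""").fetchone()[0]
print(avg_sal)  # 90000.0
```

No loops, file handling, or search code appear in the application; the join and aggregation are carried out entirely by the DBMS.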

       A DBMS enables users to create, modify, and query data through a data manipulation language (DML). Thus, the query language is only one part of the DML, which also provides constructs to insert, delete, and modify data. We will discuss the DML features of SQL. The DML and DDL are collectively referred to as the data sublanguage when embedded within a host language (e.g., C or COBOL).


    TRANSACTION MANAGEMENT

    Consider a database that holds information about airline reservations. At any given instant, it is possible (and likely) that several travel agents are looking up information about available seats on various flights and making new seat reservations. When several users access (and possibly modify) a database concurrently, the DBMS must order their requests carefully to avoid conflicts.
    • For example, when one travel agent looks up Flight 100 on some given day and finds an empty seat, another travel agent may simultaneously be making a reservation for that seat, thereby making the information seen by the first agent obsolete.
    • Another example of concurrent use is a bank's database. While one user's application program is computing the total deposits, another application may  transfer money from an account that the first application has just 'seen' to an account that has not yet been seen, thereby causing the total to appear larger than it should be. 
    Clearly, such anomalies should not be allowed to occur. However, disallowing concurrent access can degrade performance. Further, the DBMS must protect users from the effects of system failures by ensuring that all data (and the status of active applications) is restored to a consistent state when the system is restarted after a crash.
    • For example, if a travel agent asks for a reservation to be made, and the DBMS responds saying that the reservation has been made, the reservation should not be lost if the system crashes. On the other hand, if the DBMS has not yet responded to the request, but is making the necessary changes to the data when the crash occurs, the partial changes should be undone when the system comes back up.
    A transaction is any one execution of a user program in a DBMS. (Executing the same program several times will generate several transactions.) This is the basic unit of change as seen by the DBMS: Partial transactions are not allowed, and the effect of a group of transactions is equivalent to some serial execution of all transactions. We briefly outline how these properties are guaranteed, deferring a detailed discussion to later sections.
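The "no partial transactions" rule can be sketched with sqlite3 (account names and amounts are invented; the simulated crash is just a raised exception). A transfer that fails between its debit and credit steps leaves the database exactly as it was:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (acct TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO Accounts VALUES ('A', 100.0), ('B', 50.0)")
conn.commit()

# Transfer 30 from A to B as one transaction; a failure midway undoes the debit.
try:
    with conn:  # the connection as context manager = one transaction
        conn.execute("UPDATE Accounts SET balance = balance - 30 WHERE acct = 'A'")
        raise RuntimeError("simulated crash before the credit step")
        # This credit is never reached:
        conn.execute("UPDATE Accounts SET balance = balance + 30 WHERE acct = 'B'")
except RuntimeError:
    pass  # the with-block rolled the partial transaction back

balances = dict(conn.execute("SELECT acct, balance FROM Accounts"))
print(balances)  # {'A': 100.0, 'B': 50.0}: the partial debit was undone
```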

    Concurrent Execution of Transactions

    An important task of a DBMS is to schedule concurrent accesses to data so that each user can safely ignore the fact that others are accessing the data concurrently. The importance of this task cannot be overstated because a database is typically shared by a large number of users, who submit their requests to the DBMS independently and simply cannot be expected to deal with arbitrary changes being made concurrently by other users.
    A DBMS allows users to think of their programs as if they were executing in isolation, one after the other in some order chosen by the DBMS. 
    For example, if a program that deposits cash into an account is submitted to the DBMS at the same time as another program that debits money from the same account, either of these programs could be run first by the DBMS, but their steps will not be interleaved in such a way that they interfere with each other. A locking protocol is a set of rules to be followed by each transaction (and enforced by the DBMS) to ensure that, even though actions of several  transactions might be interleaved, the net effect is identical to executing all transactions in some serial order. A lock is a mechanism used to control access to database objects. Two kinds of locks are commonly supported by a DBMS:
    1. A shared lock on an object can be held by two different transactions at the same time.
    2. An exclusive lock on an object ensures that no other transaction holds any lock on this object.
    Suppose that the following locking protocol is followed:
    1. Every transaction begins by obtaining a shared lock on each data object that it needs to read and an exclusive lock on each data object that it needs to modify, then releases all its locks after completing all actions.  
    2. Consider two transactions T1 and T2 such that T1 wants to modify a data object and T2 wants to read the same object. 
    3. Intuitively, if T1's request for an exclusive lock on the object is granted first, T2 cannot proceed until T1 releases this lock, because T2's request for a shared lock will not be granted by the DBMS until then. Thus, all of T1's actions will be completed before any of T2's actions are initiated.
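The protocol above can be sketched in a few lines of Python. This is a toy illustration, not how a real lock manager is built: the `LockManager` class and its `acquire`/`release` methods are invented names, and waiting queues, lock upgrades, and deadlock detection are all ignored.

```python
class LockManager:
    """Toy lock table: object -> (mode, set of transactions holding the lock)."""

    def __init__(self):
        self.locks = {}

    def acquire(self, txn, obj, mode):
        """Return True if the lock is granted; False means txn must wait."""
        if obj not in self.locks:
            self.locks[obj] = (mode, {txn})
            return True
        held_mode, holders = self.locks[obj]
        if mode == "S" and held_mode == "S":
            holders.add(txn)          # shared locks are mutually compatible
            return True
        return False                  # any combination involving "X" conflicts

    def release(self, txn, obj):
        if obj in self.locks:
            _mode, holders = self.locks[obj]
            holders.discard(txn)
            if not holders:
                del self.locks[obj]   # lock becomes available again


lm = LockManager()
print(lm.acquire("T1", "A", "X"))  # True: T1 gets an exclusive lock on A
print(lm.acquire("T2", "A", "S"))  # False: T2 must wait until T1 releases it
lm.release("T1", "A")
print(lm.acquire("T2", "A", "S"))  # True: the lock is now available
```

As in the T1/T2 example in the text, T2's shared-lock request is simply refused while T1 holds its exclusive lock, so all of T1's actions complete before any of T2's begin.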

    Incomplete Transactions and System Crashes

       Transactions can be interrupted before running to completion for a variety of reasons, e.g., a system crash. A DBMS must ensure that the changes made by such incomplete transactions are removed from the database. For example, if the DBMS is in the middle of transferring money from account A to account B and has debited the first account but not yet credited the second when the crash occurs, the money debited from account A must be restored when the system comes back up after the crash.

       To do so, the DBMS maintains a log of all writes to the database. A crucial property of the log is that each write action must be recorded in the log (on disk) before the corresponding change is reflected in the database itself; otherwise, if the system crashes just after making the change in the database but before the change is recorded in the log, the DBMS would be unable to detect and undo this change. This property is called Write-Ahead Logging (WAL).
    To ensure this property, the DBMS must be able to selectively force a page in memory to disk. The log is also used to ensure that the changes made by a successfully completed transaction are not lost due to a system crash. Bringing the database to a consistent state after a system crash can be a slow process, since the DBMS must ensure that the effects of all transactions that completed prior to the crash are restored, and that the effects of incomplete transactions are undone. The time required to recover from a crash can be reduced by periodically forcing some information to disk; this periodic operation is called a checkpoint.
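The write-ahead rule can be sketched as follows. This is only an in-memory cartoon under simplifying assumptions: the `log` list stands in for the on-disk log, the record formats are invented, and real recovery managers use far more sophisticated algorithms (such as ARIES). The point is the ordering (log record first, database change second) and what recovery does with it.

```python
log = []                      # stands in for the on-disk log
db = {"A": 100, "B": 50}      # stands in for the database

def write(txn, key, new_value):
    log.append((txn, key, db[key], new_value))  # WAL: append the log record first...
    db[key] = new_value                          # ...then apply the change

def commit(txn):
    log.append((txn, "COMMIT"))

def recover():
    """After a crash: undo every write made by a transaction that never committed."""
    committed = {rec[0] for rec in log if rec[1:] == ("COMMIT",)}
    for txn, *rest in reversed(log):             # undo in reverse order
        if rest and rest[0] != "COMMIT" and txn not in committed:
            key, old_value, _new_value = rest
            db[key] = old_value

write("T1", "A", 90)   # T1 debits account A...
commit("T1")           # ...and commits
write("T2", "B", 60)   # T2 crashes before committing
recover()
print(db)              # {'A': 90, 'B': 50}: T1's change kept, T2's undone
```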

    Points to Note

    In summary, there are three points to remember with respect to DBMS support for concurrency control and recovery:
    1. Every object that is read or written by a transaction is first locked in shared or exclusive mode, respectively. Placing a lock on an object restricts its availability to other transactions and thereby affects performance.
    2. For efficient log maintenance, the DBMS must be able to selectively force a collection of pages in main memory to disk. Operating system support for this operation is not always satisfactory.
    3. Periodic checkpointing can reduce the time needed to recover from a crash. Of course, this must be balanced against the fact that checkpointing too often slows down normal execution.

    STRUCTURE OF A DBMS

    Figure 1.3 shows the structure (with some simplification) of a typical DBMS based on the relational data model.
    1. The DBMS accepts SQL commands generated from a variety of user interfaces, 
    2. produces query evaluation plans, 
    3. executes these plans against the database, and 
    4. returns the answers. (This is a simplification: SQL commands can be embedded in host-language application programs, e.g., Java or COBOL programs. We ignore these issues to concentrate on the core DBMS functionality.)
    When a user issues a query,
    1. the parsed query is presented to a query optimizer, which uses information about how the data is stored to produce an efficient execution plan for evaluating the query. 
    2. An execution plan is a blueprint for evaluating a query, usually represented as a tree of relational operators (with annotations that contain additional detailed information about which access methods to use, etc.). 
    3. Relational operators serve as the building blocks for evaluating queries posed against the data.
    4. The code that implements relational operators sits on top of the file and access methods layer. 
    5. This layer supports the concept of a file, which, in a DBMS, is a collection of pages or a collection of records. Heap files, or files of unordered pages, as well as indexes are supported. 
    6. In addition to keeping track of the pages in a file, this layer organizes the information within a page.
    7. The files and access methods layer code sits on top of the buffer manager, which brings pages in from disk to main memory as needed in response to read requests. 
    8. The lowest layer of the DBMS software deals with management of space on disk, where the data is stored. Higher layers allocate, deallocate, read, and write pages through (routines provided by) this layer, called the disk space manager.
       
    9. The DBMS supports concurrency and crash recovery by carefully scheduling user requests and maintaining a log of all changes to the database. DBMS components associated with concurrency control and recovery include the transaction manager, which ensures that transactions request and release locks according to a suitable locking protocol and schedules the execution of transactions; the lock manager, which keeps track of requests for locks and grants locks on database objects when they become available; and the recovery manager, which is responsible for maintaining a log and restoring the system to a consistent state after a crash. The disk space manager, buffer manager, and file and access method layers must interact with these components.
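The idea of an execution plan as a tree of relational operators can be sketched with Python generators; composing operators this way (each pulling rows from its child) is one common way the operator layer is organized, often called the iterator model. The operator functions and the sample table below are illustrative only.

```python
def scan(table):                       # leaf of the plan: read rows from a stored file
    yield from table

def select(child, predicate):          # relational selection: filter rows
    for row in child:
        if predicate(row):
            yield row

def project(child, fields):            # relational projection: keep only some columns
    for row in child:
        yield {f: row[f] for f in fields}

students = [
    {"name": "Jones", "gpa": 3.8},
    {"name": "Smith", "gpa": 2.9},
]

# A plan tree for: SELECT name FROM Students WHERE gpa > 3.0
plan = project(select(scan(students), lambda r: r["gpa"] > 3.0), ["name"])
print(list(plan))                      # [{'name': 'Jones'}]
```

A query optimizer's job, in these terms, is to choose among many equivalent trees (and access methods for the leaves) the one it estimates will be cheapest.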

    PEOPLE WHO WORK WITH DATABASES

    Quite a variety of people are associated with the creation and use of databases. Obviously, there are database implementors, who build DBMS software, and end users who wish to store and use data in a DBMS. Database implementors work for vendors such as IBM or Oracle. End users come from a diverse and increasing number of fields. As data grows in complexity and volume, and is increasingly recognized as a major asset, the importance of maintaining it professionally in a DBMS is being widely accepted. Many end users simply use applications written by database application programmers (see below) and so require little technical knowledge about DBMS software. Of course, sophisticated users who make more extensive use of a DBMS, such as writing their own queries, require a deeper understanding of its features.

    In addition to end users and implementors, two other classes of people are associated with a DBMS:
    1. application programmers and
    2. database administrators.
    Database application programmers develop packages that facilitate data access for end users, who are usually not computer professionals, using the host or data languages and software tools that DBMS vendors provide. (Such tools include report writers, spreadsheets, statistical packages, and the like.) Application programs should ideally access data through the external schema. It is possible to write applications that access data at a lower level, but such applications would compromise data independence.



    A personal database is typically maintained by the individual who owns it and uses it. However, corporate or enterprise-wide databases are typically important enough and complex enough that the task of designing and maintaining the database is entrusted to a professional, called the database administrator  (DBA). The DBA is responsible for many critical tasks:
    • Design of the Conceptual and Physical Schemas: The DBA is responsible for interacting with the users of the system to understand what data is to be stored in the DBMS and how it is likely to be used. Based on this knowledge, the DBA must design the conceptual schema (decide what relations to store) and the physical schema (decide how to store them).
    • The DBA may also design widely used portions of the external schema, although users probably augment this schema by creating additional views.
    • Security and Authorization: The DBA is responsible for ensuring that unauthorized data access is not permitted. In general, not everyone should be able to access all the data. In a relational DBMS, users can be granted permission to access only certain views and relations. For example, although you might allow students to find out course enrollments and who teaches a given course, you would not want students to see faculty salaries or each other's grade information. The DBA can enforce this policy by giving students permission to read only the Courseinfo view.
    • Data Availability and Recovery from Failures: The DBA must take steps to ensure that if the system fails, users can continue to access as much of the uncorrupted data as possible. The DBA must also work to restore the data to a consistent state. The DBMS provides software support for these functions, but the DBA is responsible for implementing procedures to back up the data periodically and maintain logs of system activity (to facilitate recovery from a crash).
    • Database Tuning: Users' needs are likely to evolve with time. The DBA is  responsible for modifying the database, in particular the conceptual and physical schemas, to ensure adequate performance as requirements change.


    INTRODUCTION TO DATABASE DESIGN


       The Entity-Relationship (ER) data model allows us to describe the data involved in a real-world enterprise in terms of objects and their relationships and is widely used to develop an initial database design. It provides useful concepts that allow us to move from an informal description of what users want from their database to a more detailed, precise description that can be implemented in a DBMS. Here, we introduce the ER model and discuss how its features allow us to model a wide range of data faithfully. Within the larger context of the overall design process, the ER model is used in a phase called conceptual database design.
        We note that many variations of ER diagrams are in use and no widely accepted standards prevail. The presentation in this section is representative of the family of ER models and includes a selection of the most popular features.


    DATABASE DESIGN AND ER DIAGRAMS

    We begin our discussion of database design by observing that this is typically just one part, although a central part in data-intensive applications, of a larger  software system design. Our primary focus is the design of the database, however, and we will not discuss other aspects of software design in any detail.
       The database design process can be divided into six steps. The ER model is
     most relevant to the first three steps.
    1. Requirements Analysis: The very first step in designing a database application is to understand what data is to be stored in the database, what applications must be built on top of it, and what operations are most frequent and subject to performance requirements. In other words, we must find out what the users want from the database. This is usually an informal process that involves discussions with user groups, a study of the current operating environment and how it is expected to change, analysis of any available documentation on existing applications that are expected to be replaced or complemented by the database, and so on. Several methodologies have been proposed for organizing and presenting the information gathered in this step, and some automated tools (from vendors such as Sybase and Oracle) have been developed to support this process.
    2. Conceptual Database Design: The information gathered in the requirements analysis step is used to develop a high-level description of the data to be stored in the database, along with the constraints known to hold over this data. This step is often carried out using the ER model and is discussed in the rest of this section. The ER model is one of several high-level, or semantic, data models used in database design. The goal is to create a simple description of the data that closely matches how users and developers think of the data (and the people and processes to be represented in the data). This facilitates discussion among all the people involved in the design process, even those who have no technical background. At the same time, the initial design must be sufficiently precise to enable a straightforward translation into a data model supported by a commercial database system (which, in practice, means the relational model).  
    3. Logical Database Design: We must choose a DBMS to implement our database design, and convert the conceptual database design into a database schema in the data model of the chosen DBMS. We will consider only relational DBMSs; therefore, the task in the logical design step is to convert an ER schema into a relational database schema. We discuss this step in detail later; the result is a conceptual schema, sometimes called the logical schema, in the relational data model.
    4. Schema Refinement: The fourth step in database design is to analyze the collection of relations in our relational database schema to identify potential problems, and to refine it. In contrast to the requirements analysis and conceptual design steps, which are essentially subjective, schema refinement can be guided by some elegant and powerful theory. We discuss the theory of normalizing relations, that is, restructuring them to ensure some desirable properties.
    5. Physical Database Design: In this step, we consider typical expected workloads that our database must support and further refine the database design to ensure that it meets desired performance criteria. This step may simply involve building indexes on some tables and clustering some tables,or it may involve a substantial redesign of parts of the database schema obtained from the earlier design steps.
    6. Application and Security Design: Any software project that involves a DBMS must consider aspects of the application that go beyond the database itself. Design methodologies like UML (Section 2.7) try to address the complete software design and development cycle. Briefly, we must identify the entities (e.g., users, user groups, departments) and processes involved in the application. We must describe the role of each entity in every process that is reflected in some application task, as part of a complete workflow for that task. For each role, we must identify the parts of the database that must be accessible and the parts of the database that must not be accessible, and we must take steps to ensure that these access rules are enforced. A DBMS provides several mechanisms to assist in this step.

    Beyond ER Design

       The ER diagram is just an approximate description of the data, constructed through a subjective evaluation of the information collected during requirements analysis. A more careful analysis can often refine the logical schema obtained at the end of Step 3. Once we have a good logical schema, we must consider performance criteria and design the physical schema. Finally, we must address security issues and ensure that users are able to access the data they need, but not data that we wish to hide from them. The remaining three steps of database design are briefly described next:
        In the implementation phase, we must code each task in an application language (e.g., Java), using the DBMS to access data. In general, our division of the design process into steps should be seen as a classification of the kinds of steps involved in design. Realistically, although we might begin with the six-step process outlined here, a complete database design will probably require a subsequent tuning phase in which all six kinds of design steps are interleaved and repeated until the design is satisfactory.


    ENTITIES, ATTRIBUTES, AND ENTITY SETS

    An entity is an object in the real world that is distinguishable from other objects. Examples include the following: the Green Dragonzord toy, the toy department, the manager of the toy department, the home address of the manager of the toy department. It is often useful to identify a collection of similar entities. Such a collection is called an entity set. Note that entity sets need not be disjoint; the collection of toy department employees and the collection of appliance department employees may both contain employee John Doe (who happens to work in both departments). We could also define an entity set called Employees that contains both the toy and appliance department employee sets.
       An entity is described using a set of attributes. All entities in a given entity set have the same attributes; this is what we mean by similar. (This statement is an oversimplification, as we will see when we discuss inheritance hierarchies, but it suffices for now and highlights the main idea.) Our choice of attributes reflects the level of detail at which we wish to represent information about entities. For example, the Employees entity set could use name, social security number (ssn), and parking lot (lot) as attributes. In this case we will store the name, social security number, and lot number for each employee. However, we will not store, say, an employee's address (or gender or age).
       For each attribute associated with an entity set, we must identify a domain of possible values. For example, the domain associated with the attribute name of Employees might be the set of 20-character strings. (To avoid confusion, we assume that attribute names do not repeat across entity sets. This is not a real limitation because we can always use the entity set name to resolve ambiguities if the same attribute name is used in more than one entity set.)
As another example, if the company rates employees on a scale of 1 to 10 and stores ratings in a field called rating, the associated domain consists of integers 1 through 10. Further, for each entity set, we choose a key. A key is a minimal set of attributes whose values uniquely identify an entity in the set. There could be more than one candidate key; if so, we designate one of them as the primary key. For now we assume that each entity set contains at least one set of attributes that uniquely identifies an entity in the entity set; that is, the set of attributes contains a key.
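The uniqueness half of the key definition can be checked mechanically. The sketch below uses an invented helper, `is_key`, and sample data echoing the Employees example; it tests only whether a set of attributes uniquely identifies every entity, not whether that set is minimal.

```python
def is_key(entities, attrs):
    """True if no two entities agree on all the attributes in attrs."""
    seen = set()
    for e in entities:
        value = tuple(e[a] for a in attrs)
        if value in seen:
            return False              # two entities share these attribute values
        seen.add(value)
    return True

employees = [
    {"ssn": "123", "name": "Attishoo",  "lot": 48},
    {"ssn": "231", "name": "Smiley",    "lot": 22},
    {"ssn": "131", "name": "Smethurst", "lot": 35},
]

print(is_key(employees, ["ssn"]))     # True: ssn identifies each employee
# lot happens to be unique in this sample, but adding an employee who
# shares a parking lot shows it is not a key:
print(is_key(employees + [{"ssn": "434", "name": "Guldu", "lot": 35}], ["lot"]))  # False
```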
       The Employees entity set with attributes ssn, name, and lot is shown in Figure  2.1. An entity set is represented by a rectangle, and an attribute is represented  by an oval. Each attribute in the primary key is underlined. The domain information could be listed along with the attribute name, but we omit this to keep the figures compact. The key is ssn.
       

    RELATIONSHIPS  AND RELATIONSHIP SETS

    A relationship is an association among two or more entities. For example, we
     may have the relationship that Attishoo works in the pharmacy department.





    Resources

    DATABASE MANAGEMENT SYSTEMS, THIRD EDITION (2003)
    By Raghu Ramakrishnan, Johannes Gehrke
    ISBN 0-07-123151-X

    Friday, February 12, 2010

    Programming languages --A high level point of view for beginners

    Trying to pick up programming indirectly from a text is like
    trying to learn judo by reading a pamphlet. In both 
    cases, you may glean a theoretical understanding
    of the subject, but until you actually practice
    your skill, you don’t know how
    much you really picked up.


    INTRO
    Why Learn Computer Programming?
    Computer programming is a skill that anyone can pick up, given enough practice, patience, and caffeinated beverages. Why you should learn it depends on your ultimate goals, but some common answers to consider are:
    • For fun: Programming a computer can prove fun because you can, for example, design simple programs that display your portfolio on the computer. More complex programs may even earn you money. A stand-up comedian wrote (in BASIC, for his own amusement, using a program known as CA-Realizer) a program known as Comedy Writer. Then he decided to sell the program to others.
    • To fill a need: Many people learn programming with no intention of becoming a full-time, professional programmer. They just want a program that solves a particular problem, but they can’t find a program that does it, so they write the program themselves. A psychologist who specialized in dream interpretation used his knowledge and a program known as ToolBook to create and sell DreamScape, a program that interprets the meaning of dreams. Whatever your interests, you can write a program to solve a specific problem that others may find useful as well.
    • For a new or second career: With computers taking over the world, you’re never unemployed for long if you know how to program a computer. Companies are always looking to create new programs, but you also find a growing market for programmers who can maintain and modify the millions of existing programs that do everything from storing hotel reservations to transferring bank deposits electronically. If you know how to program a computer, you’re in a much better position to earn a lot of money and live wherever you want. You may still want to keep your current job, but programming gives you a new way to expand and share your knowledge. A group of alternative health-care practitioners, for example, wrote IBIS, a program that provides information for treating a variety of ailments by using acupuncture, massage, diet, and homeopathy. They wrote IBIS by using a program known as MetaCard.
    • As an intellectual challenge: Many people find the sheer complexity of computers as fascinating as studying a mathematical puzzle. Not surprisingly, computers tend to attract people of above-average intelligence who enjoy programming a computer to pry into the thought processes of their own minds. To help turn a computer into a thinking tool, one programmer created the Axon Idea Processor by using Prolog, a popular programming language used for researching artificial intelligence. The goal was to create a program to help people manipulate ideas, concepts, and facts so that they can devise a variety of possible solutions while better understanding their own way of thinking in the process. If using a computer normally seems boring, try writing your own program to help you use your brain more effectively.
      If you aren’t doing what you truly enjoy, all the money in the world isn’t going to make your life better. Choose to learn programming because you want to — not because you think that it’s going to make you rich.
    How Does a Computer Program Work?
     To make the computer do something useful, you must give it instructions in either of the following two ways:
    • Write a program, which tells a computer what to do, step-by-step, much as you write out a recipe.
    • Buy a program that someone else has already written that tells the computer what to do.
    So, to get a computer to do something useful, you (or somebody else) must write a program. A program does nothing more than tell the computer how to
    • accept some type of input, 
    • manipulate that input, and 
    • spit it back out again in some form that humans find useful.
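That three-step view (accept input, manipulate it, spit out something useful) fits even the smallest program. Here is a minimal sketch in Python, with a made-up temperature-conversion task; the input is hard-coded to keep the example self-contained, but it could just as easily come from the keyboard or a file.

```python
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9                 # step 2: manipulate the input

temp_f = 98.6                                # step 1: accept some input
temp_c = fahrenheit_to_celsius(temp_f)
print(f"{temp_f}F is {temp_c:.1f}C")         # step 3: output something useful
```

Running it prints `98.6F is 37.0C`. Every program you will ever write, however large, is built out of this same input-manipulate-output pattern.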

    Programming is problem-solving
    Essentially, a program tells the computer how to solve a specific problem. The number and variety of programs that people can write for computers is practically endless.

    But to tell a computer how to solve one big problem, you usually must tell the computer how to solve a bunch of little problems that make up the bigger problem (the divide-and-conquer strategy). If you want to make your own video game, for example, you need to solve some of the following problems:
    • Determine how far to move a cartoon figure (such as a car, a spaceship, or a man) on-screen as the user moves a joystick.
    • Detect whether the cartoon figure bumps into a wall, falls off a cliff, or runs into another cartoon figure on-screen.
    • Make sure that the cartoon figure doesn’t make any illegal moves, such as walking through a wall.
    • Draw the terrain surrounding the cartoon figure and make sure that if the cartoon figure walks behind an object such as a tree, the tree realistically blocks the figure from sight.
    • Determine whether bullets that another cartoon figure fires are hitting the player’s cartoon figure. If so, determine the amount of damage, how it affects the movement of the damaged cartoon figure, and how the damage appears on-screen.
    The simpler the problem you need to solve, the more easily you can write a program that tells the computer what to do (divide et impera).
    A program that displays a simple Ping-Pong game with two stick paddles and a ball is much easier to write than a program that displays World War II fighter airplanes firing machine guns and dropping bombs on moving tanks, while  dodging anti-aircraft fire.

    Programming isn’t difficult; it’s just time-consuming
    If you can write step-by-step instructions directing someone to your house, you can write a program. The hardest part about programming is identifying all the little problems that make up the big problem that you’re trying to solve. Because computers are completely stupid, you need to tell them how to do everything. If you’re giving a friend instructions to get to your house, for example, you may write down the following information:
    1. Go south on Highway I-5.
    2. Get off at the Sweetwater Road exit.
    3. Turn right at the light.
    4. Turn into the second driveway on the left.
    Of course, if you try giving these instructions to a computer, the computer gets confused and wants to know the following additional information:
    1. Where do I start and exactly how far south do I drive down Highway I-5?
    2. How do I recognize the Sweetwater Road exit, and how do I get off at this exit?
    3. After I turn right at the light, how far to the right do I turn, and do you mean the traffic light or the street light on the corner?
    4. After I turn into the second driveway on the left, what do I do next? Park the car? Honk the horn? Gun the engine and accelerate through your garage door?
    You need to tell computers how to do everything, which can make giving them instructions as aggravating and frustrating as telling children what to do. Unless you specify everything that you want the computer to do and exactly how to do it, the computer just plain doesn’t do what you want it to do.
    If you ever start writing a program and feel like giving up before it ever works, know that you’re not alone.

    What Do I Need to Know to Program a Computer?
    If you’re the type who finds the idea of making a program (such as a video game) more exciting than actually using it, you already have everything you need to program a computer. If you want to learn computer programming, you need a healthy dose of the following three qualities:
    • Desire: If you want something badly enough, you tend to get it (although you may serve time in prison afterward if you do something illegal to get it). If you have the desire to learn how to program a computer, your desire helps you learn programming, no matter what obstacles may get in your way.
    • Curiosity: A healthy dose of curiosity can encourage you to experiment and continue learning about programming long after you finish reading these sentences. With curiosity behind you, learning to program seems less a chore and more fun. And as long as you’re having fun, you tend to learn and retain more information than does someone without any curiosity whatsoever.
    • Imagination: Computer programming is a skill, but imagination can give your skill direction and guidance. A mediocre programmer with lots of imagination always creates more interesting and useful programs than a great programmer with no imagination. If you don’t know what to do with your programming skill, your talent goes to waste without imagination prodding you onward. 
    Desire, curiosity, and imagination are three crucial ingredients that every programmer needs. If you possess these qualities, you can worry about trivial details such as learning a specific programming language (such as C++). Learning to program a computer may (initially) seem an impossible task, but don’t worry. Computer programming is relatively simple to understand; everything just tends to fall apart after you try to put a program into actual use.

    Programming Languages
    Because computers are functionally brain-dead, people must write instructions for a computer by using a special language, hence, the term programming language.
    A collection of instructions that tell the computer what to do is known as a program. The instructions, written in a specific programming language, are known as the source code.
    ---p19

    In general, the easier the programming language is to read and write, the slower and larger are the programs it creates.
    The Holy Grail of computer programming is to create programs that are easy to write, run as fast as possible, and take up as little space as possible.
    By using assembly language, programmers sacrifice readability for speed and size.
    A program that you write in C runs slower and creates larger program files than does an equivalent assembly language program. That’s because assembly language is closer to the native language of computers (which is machine code) than C.
    So C programs need to first get translated into assembly language code before finally being converted into machine language code.
    This two-step process tends to be less efficient than writing an equivalent assembly language program. C source code, however, is much easier to read,
    write, and modify than assembly language source code (and assembly language source code far easier to read, write, and modify than an equivalent machine-language source code).

    assembly > C
    As is true of translations between human languages, the simpler the language, the easier is the translation. Translating a children’s book from French
    into Japanese is much easier than translating a mathematics dissertation
    from French into Japanese, mainly because a children’s book uses simple
    words, while a mathematics dissertation uses more complicated words.
    Similarly, translating C into machine language code is more difficult than
    translating assembly language into machine language code.

    So the only way that you can run a C program on another computer is if
    someone’s already written a C compiler for that other computer.
    Because C is a simple language, writing C compilers for different computers is relatively easy, especially if you compare it with the same task for other programming languages, such as Ada or LISP.

    Because C compilers are fairly easy to write, you can find C compilers for
    almost every computer in the world. Theoretically, you can write a C program for the Macintosh, copy it to a computer running Windows XP, recompile it, and run the program with little or no modification.
    Given its power and portability, C has quickly become one of the most popular programming languages in the world. The majority of all programs are written in C, although most newer programs are now written in a C-derived language called C++. Some of the more famous (or infamous) programs that have been written in C or C++ include operating systems such as Microsoft Windows 95/98/Me/NT/2000/XP/Vista, Unix, and Linux, as well as major commercial programs such as Quicken, Netscape Navigator, and Microsoft Word.

    Although C is popular, it has its share of flaws:
    1. C creates larger and slower programs than equivalent assembly or machine-language programs.
    2. The C language gives programmers access to all parts of a computer, including the capability to manipulate the computer’s memory.
    Unfortunately, all this power can prove as dangerous as giving a hyperactive monkey a chainsaw and a hand grenade. If you don’t write your C programs carefully, they can accidentally wreck your computer’s memory, causing your program to crash your computer.
    In a desperate attempt to make C programming more reliable, programmers
    developed languages similar to C, such as C++, Java, Perl, Python, and C#. All
    of these C-derived languages add a special feature known as object-orientation,
    which encourages programmers to write small programs that they can easily
    reuse and modify.
    In addition, these other languages try to protect programmers from writing programs that can mess up the computer’s memory, as C programs can do, which decreases the chance of writing a program that crashes an entire computer.
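    The memory danger described above can be shown in a few lines. This is an illustrative sketch with invented names (`struct record`, `copy_name_unsafe`, `copy_name_safe`): the unsafe version has no bounds check, so a long input overwrites whatever sits next to the buffer in memory, which is the kind of bug that crashes programs.

    ```c
    #include <string.h>

    /* Illustrative sketch: C happily lets you write past the end of a
     * buffer.  copy_name_unsafe() has no bounds check, so a long input
     * corrupts whatever sits next to `name` in memory (here, `id`). */
    struct record {
        char name[8];
        int  id;        /* sits right after `name` in memory */
    };

    void copy_name_unsafe(struct record *r, const char *src)
    {
        strcpy(r->name, src);     /* no bounds check: overflow risk */
    }

    void copy_name_safe(struct record *r, const char *src)
    {
        /* copy at most 7 characters, then terminate explicitly */
        strncpy(r->name, src, sizeof r->name - 1);
        r->name[sizeof r->name - 1] = '\0';
    }
    ```

    Languages like Java and C# refuse to compile or run the unsafe pattern at all, which is the protection this paragraph refers to.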

    Indeed, the best language for almost any job is likely to be the one you
    already know; if you only have a hammer, everything looks like a nail.
    So the moral of the story is that you can make programming a lot easier if you just choose the right programming language for the problem at hand.

    The Tools of a Computer Programmer
    You need the following two crucial tools to write a program:
    • An editor (so that you can write your instructions to the computer).
    • A compiler, which converts your instructions into machine language so that the computer knows what you want it to do. Instead of using a compiler, many programming languages use an interpreter. The main difference between the two is that an interpreter converts your instructions into machine language and stores them in memory each time you run the program, whereas a compiler converts your instructions into machine language once and saves those instructions in an executable file (often called an EXE in the Windows world).

    You may want to use the following additional tools in writing a program:
    • A debugger (which helps identify problems, also called bugs, in your program).
    • A Help file authoring program (so that your program can provide Help on-screen instead of needing to supply the user with a decent manual).
    • An installation program (to copy your program to the user’s computer).

    If you buy a specific programming language, such as Visual Basic, Delphi, or Real Basic, you usually get an editor, compiler, and a debugger, which means you just need to buy a separate Help file authoring program and an installation program.


    Writing Programs in an Editor
    As you write a program, you must type your instructions in a text (or ASCII) file. Although you can use a word processor to create a text file, a word processor offers fancy formatting features (such as changing fonts or underlining text), which you don’t need to write a program. An ASCII file consists of nothing but characters that you can type from a keyboard. ASCII stands for American Standard Code for Information Interchange, which is simply a universal character-encoding standard that virtually any computer can read.
     A program consists of one or more instructions that tell the computer what to do. The instructions that make up a program are known as the program’s source code.
    Rather than struggle with a word processor, programmers created special programs for writing, editing, and printing the source code of a program. Almost no one writes a program correctly the first time, so the majority of a programmer’s time is spent editing and debugging the source code. As a result, the program that enables you to write, edit, and print a program is known as a text editor (or simply an editor).

    An editor looks like a word processor but may offer special features to make programming easier, such as:
    • automatically formatting your source code,
    • offering shortcuts for editing your source code,
    • or providing pop-up Help as you’re typing program commands (in the programming language you use).
    Anytime that you need to write or modify the source code of a program, you must use an editor. 

    Using a Compiler or an Interpreter
     After you type your instructions in an editor by using a programming language such as C++ or Java, guess what? The computer doesn’t have the slightest idea what you just created. 
    Computers understand only machine language, so you need to use another special program to convert your source code (the instructions that you write in C++ or Java) into machine language.

    You can use either of the following two types of programs to convert source code into machine language:
    • A compiler
    • An interpreter
     Compilers
    1. A compiler takes your source code,
    2. converts the entire thing into machine language, and
    3. then stores these equivalent machine language instructions in a separate file, often known as an executable file.
    The process is like having a translator study an entire novel written in Spanish and then translate it into Arabic. 
    Whenever a compiler converts source code into machine language, it’s called compiling a program.
    After you compile a program, you can just give away copies of the executable (machine-language) version of your program without giving away your source code version. 
    As a result, most commercial programs (such as Microsoft PowerPoint and Quicken) are compiled. 
    After you use a compiler to convert source code into machine language, you never need to use the compiler again (unless you make changes to your source code).
    A compiler creates machine language for a specific microprocessor, such as the PowerPC (which the Apple Macintosh uses) or the Intel Pentium family of microprocessors (including clone microprocessors, such as the AMD Athlon). 

    If you write a program in BASIC and want to run it on a Macintosh and a Windows computer, you need to compile your program twice:
    • once for the Macintosh and
    • once for the Windows environment.
    Not all compilers are equal, even when two different compilers convert the same language into machine language. Given identical C++ source code, for example, one C++ compiler may create a program that runs quickly, whereas a second C++ compiler may create a smaller file that runs much slower.
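    One source file can still serve several platforms: you compile it once per platform, and each compiler keeps only the branch written for it. This sketch assumes the widely used predefined macros `_WIN32` and `__APPLE__`, which Windows and Macintosh compilers respectively define; `platform_name` is an invented example function.

    ```c
    /* Sketch: one C source file, compiled separately on each platform.
     * The preprocessor macros _WIN32 and __APPLE__ are defined by
     * Windows and Macintosh compilers respectively, so each compiler
     * keeps only the branch for its own platform. */
    const char *platform_name(void)
    {
    #if defined(_WIN32)
        return "Windows";
    #elif defined(__APPLE__)
        return "Macintosh";
    #else
        return "some other platform";
    #endif
    }
    ```

    The point is that the source code stays identical; only the compiler (and the machine code it emits) changes per platform.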

    Interpreters
    A second, but less popular, way to convert source code into machine language is to use an interpreter. 
    An interpreter converts each line of your source code into machine language, one line at a time. The process is like giving a speech in English and having someone translate your sentences, one at a time, into another language (such as French).
    Unlike a compiler, an interpreter converts source code into machine language but stores the machine-language instructions only in the computer’s memory, never in a separate file. Every time that you turn off the computer, you lose the machine-language version of your program. To run the program again, you must feed the source code into the interpreter again.
    If anyone wants to run your program, that person needs both an interpreter and the source code for your program. Because your source code enables everyone to see how you wrote your program (and gives others a chance to copy or modify your program without your permission), very few commercial programs use an interpreter.
    Most Web-page programming languages, such as JavaScript and VBScript, use interpreters.
    Because many different kinds of computers can view Web pages, you can’t compile programs that you write in JavaScript or VBScript into the machine language of any single machine. Instead, your computer’s browser uses an interpreter to run a JavaScript or VBScript program.
    In the old days, when computers were slow and lacking in sufficient memory and storage space, interpreters were popular because they gave you instant feedback. The moment you typed an instruction into the computer, the interpreter told you whether that instruction would work and even showed you the results. 
    With an interpreter, you could write and test your program at the same time. Now, computers are so fast that programmers find using a compiler easier than using an interpreter.
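    The interpreter idea can itself be sketched in a few lines of C. This toy function (an invented miniature "language" of single-character commands, not any real scripting language) reads one instruction at a time and acts on it immediately, instead of translating the whole program up front the way a compiler does.

    ```c
    /* Toy sketch of interpretation: read one single-character
     * instruction at a time and act on it immediately, rather than
     * translating the whole "program" in advance. */
    int run_commands(const char *src)
    {
        int value = 0;
        for (; *src; src++) {
            switch (*src) {
            case '+': value++;   break;   /* increment */
            case '-': value--;   break;   /* decrement */
            case '0': value = 0; break;   /* reset */
            }
        }
        return value;
    }
    ```

    Running `run_commands("+++-")` interprets four instructions in order, ending with the value 2. Notice that the "source code" (the string) must be present every time the program runs, which is exactly why commercial software rarely ships interpreted.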
    P-code: A combination of compiler and interpreter
    Getting a program to run on different types of computers is often a big pain in the neck. Both Macintosh and Windows programs, for example, use pull-down menus and dialog boxes. You need to write one set of commands to create pull-down menus on the Macintosh, however, and a second set of commands to create the identical menus in Windows.

    Because one program almost never runs on multiple computers without extensive modification, programmers combined the features of a compiler with those of an interpreter to create something called p-code.
    Instead of compiling source code directly into machine language, a p-code compiler translates source code into a special intermediate file format. To run a program compiled into p-code, you use an interpreter. This two-step process means that after you compile your program into p-code, you can run it on any computer that has the right p-code interpreter.
    Java is the most popular programming language that uses p-code. After you compile a Java program into p-code, you can copy that p-code to a Macintosh, a Windows computer, or a Linux computer. As long as that computer uses a Java p-code interpreter, you can run the Java program on that computer without modification.
    Best of all, programs that you compile into p-code can run without the original source code, which means that you can protect your source code and still give your program away to others.
    Just in case you’re wondering, Liberty BASIC, which comes with this book, takes BASIC instructions and saves them in a separate file that uses p-code. 
    If you distribute any compiled programs that you create using Liberty BASIC, you need to also distribute a special run-time file that can translate your Liberty BASIC p-code on another computer.
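    The p-code idea can be sketched as a toy stack machine in C. This is a hypothetical miniature (nothing like the real Java or Liberty BASIC formats): the "compiled" program is just an array of numeric opcodes, and a small interpreter loop executes them on whatever computer it is built for.

    ```c
    /* Toy p-code sketch: the "compiled" program is an array of numeric
     * opcodes, executed by a small stack-machine interpreter loop.
     * Port the interpreter to a new computer and every p-code program
     * runs there unchanged -- the core idea behind Java's bytecode. */
    enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

    int run_pcode(const int *code)
    {
        int stack[64];
        int sp = 0;                 /* stack pointer */

        for (int pc = 0; ; ) {
            switch (code[pc++]) {
            case OP_PUSH: stack[sp++] = code[pc++];         break;
            case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
            case OP_HALT: return stack[sp - 1];
            }
        }
    }
    ```

    A "program" such as `{ OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT }` computes (2 + 3) × 4 and can be shipped without any source code, yet still needs this interpreter on every target machine.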
    Naturally, p-code has its own disadvantages. 
    1. Programs that you create by using p-code tend to run much slower than programs that you compile directly into machine language. 
    2. Although p-code programs can run without a copy of the original source code that you used to create them, you can also decompile p-code programs.
    Decompiling a p-code program can reveal the original source code that the programmer used to create the program. So if you write a program in Java and compile it into p-code, a rival can decompile your p-code program and see your original Java source code. Your rival then ends up with a nearly identical copy of your source code, essentially stealing your program. 
    You can actually decompile any program, including programs that you compile into machine language. 
    But unlike with decompiling p-code programs,  decompiling a machine-language version of a program never gets you the original high-level language source code that the programmer used to write the program.  
    If you compile a program into machine language, the original source code can be written in C++, COBOL, FORTRAN, BASIC, Ada, LISP, Pascal, or any other programming language in the world. 
    Because the decompiler has no idea what language the original source code was written in, it can only decompile a machine-language version of a program into equivalent assembly language. 
    After you decompile a program into assembly language source code, you can rewrite or modify that source code. Decompiling effectively allows you to steal the ideas of others.

    So what do I use?
    • If you want to write programs to sell, use a compiler, which protects your original source code.
    • If you want to write a program to run on your Web page, you can use either an interpreter or p-code.
    • If you want to write a program that can run on different types of computers, p-code may prove your only choice. As a safer but more cumbersome alternative, you can also use multiple compilers and modify your program to run on each different computer. 
    The language that you choose can determine whether you can use
    • a compiler,
    • an interpreter,
    • or p-code.
    You often convert Java programs into p-code, for example, although you can compile them directly into machine language. On the other hand, you usually compile C++ and rarely interpret or convert it into p-code.


    Squashing Bugs with a Debugger 
    Few computer programs work 100 percent correctly, which explains why computers crash, lose airline reservations, or just act erratically at times. Writing a program that works 100 percent correctly every time is practically impossible, because you can’t test it against every possible combination of computers, hardware, and additional software that may interfere with the way your program runs.
    A problem that keeps a program from working correctly is known as a bug. 
    In the early days, computers used mechanical relays and vacuum tubes instead of circuit boards and microprocessors. One day, a computer failed to work correctly. The scientists examined their program; it should have worked. So they next examined the computer itself and found that a moth had gotten smashed in a mechanical relay, preventing it from closing and thus keeping the computer from working correctly. The story stuck, and ever since, problems in computers have been known as bugs.
    Because writing a program that works 100% correctly all the time is virtually impossible, 
    1. operating systems (such as Windows XP) unavoidably contain bugs that keep them from working correctly. 
    2. When you convert your source code into machine language, you must use a compiler or interpreter, which is another program that contains its share of bugs. 
    3. Finally, your own program may contain bugs of its own.
    With so many places for bugs to creep in, you shouldn’t be surprised that bugs infest computers like cockroaches infest a cheap apartment complex.

    Although you can do little about bugs in other people’s programs (except not buy the programs), you can reduce (but not completely eliminate) bugs in your own program by using a debugger. 
    A debugger is a special program (which may also contain bugs) that can help you track down and wipe out bugs in programs that you write.

    A debugger provides several ways to track down bugs in your program:
    • Stepping: The debugger runs your program, line-by-line, so that you can see exactly which line may contain the bug. This process is like rereading written instructions to get to another person’s house if you’re lost. By going over these instructions, one by one, you can find out where you made a wrong turn.
    • Breakpoints: Rather than force you to step through an entire program, line-by-line, breakpoints enable you to specify where you want to start examining your program line-by-line. So if you were lost, instead of rereading the instructions to get to another person’s house from start to finish, you skip those instructions that you know you followed correctly and examine only the remaining instructions that you aren’t sure that you followed correctly. Similarly, by using breakpoints, you can selectively examine only parts of your program, line-by-line, and skip over the parts that you know already work.
    • Watching: Watching enables you to see what data your program is storing in memory and what that data is at each moment. If your program stores incorrect data (such as saving a name instead of a telephone number), you know exactly where in your program the bug is occurring. Each time the debugger examines a line in your program, it shows you how that particular line affects the value you’re watching. As soon as you see the value change, the debugger shows you exactly which line in your program caused that change. This process is like having someone tell you to drive 10 miles south down a certain road after turning right. The moment that you exceed 10 miles, a watchpoint alerts you so that you know exactly where you almost made a mistake and got lost.
    A debugger essentially shows you exactly how a computer is going to interpret the instructions in your program.
    Of course, if you fix one bug, you may introduce several new ones. That’s why writing bug-free programs is impossible.
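    Here is a concrete sketch of the kind of bug that stepping and watching expose. In the buggy version of this hypothetical function, the loop condition `i < n - 1` silently skips the last array element; watching `sum` line by line in a debugger shows it never receiving the final value.

    ```c
    /* Sketch: summing an array.  A buggy version with the loop
     * condition  i < n - 1  skips the last element; watching `sum`
     * in a debugger reveals the missing addition.  The corrected
     * loop is shown here. */
    int sum_array(const int *a, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)   /* buggy version used: i < n - 1 */
            sum += a[i];
        return sum;
    }
    ```

    With a breakpoint set on the loop and a watch on `sum`, the buggy version visibly stops one addition short, which is far faster than rereading the whole program.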

    Writing a Help File
    Nobody has trouble using a doorknob, a toaster, or a microwave oven, but people still complain that computers and VCRs are too hard to use. The problem with VCRs lies in the cryptic controls that aren’t easy to figure out just by looking at them.

    Similarly, the problem with computer programs is that they’re too complex to use at first glance. If you can make a program that’s genuinely easy to use, people can actually use it.
    Because computer programs are still being designed by programmers for other programmers, computers still mystify the average user. To help the poor befuddled user, most programs now offer Help files.
    A Help file provides instructions and explanations on-screen. 
    Theoretically, if the user experiences trouble with the program, he can browse through the Help file, find an explanation or step-by-step instructions, and continue using the program.
    Although Help files still can’t substitute for designing a program that’s easy to use in the first place, most programs offer Help files anyway. 
    To keep your program modern and up-to-date, include a Help file with your program. 
    To create a Help file, you can use a special Help file-authoring program, which simplifies the process of creating and organizing topics for your Help file.

    Creating an Installation Program
    After you 
    1. write your program, 
    2. test it, 
    3. debug it, and 
    4. write a Help file for it, 
    the final step is to give or sell copies of your program to others. Although you can copy your program onto a floppy disk or CD/DVD and force buyers to manually copy your program to their hard drive, doing so can cause problems. Users may not copy all the files correctly, for example. And forcing users to manually copy your program to their hard drive may prove such a nuisance that most people don’t bother even trying to use your program.

    To make copying a program to a user’s hard drive as easy as possible, most commercial programs include a special installation program. 
    Users run this installation program, which automatically copies a program and all necessary files to the appropriate location on the user’s hard drive. 
    By making the installation of a program practically foolproof, software publishers make sure that users install the programs correctly.
    So the final step to distributing your program to others is to use a special installation program, which can smash all your program files into a single file that can automatically install itself on another computer.

    Installation programs offer the following features for distributing programs to others:
    • File compression: Most programs are fairly large, which means that they can’t fit on a single floppy disk. Rather than force users to insert a series of floppy disks or compact discs (CDs) into their computer to install your program, an installation program smashes your files so that they can fit on as few floppy disks or compact discs as possible.
    • Display graphics and play sounds: Installing a program usually takes a few minutes while the computer copies files from the floppy or CD to its hard drive. Rather than force the user to stare into space, an installation program can display advertisements or messages to make the installation process mildly interesting. Other uses for graphics include displaying an hourglass icon or a status bar that shows how far the installation is complete, such as 54% complete. That way, users know how much longer they need to wait.
    • Simplify the copying process: Most important, an installation program simplifies copying your program to the user’s hard drive, making the entire process accurate, fast, and foolproof.
    The first impression that people get from your program is through its installation process, so an installation program helps give your program a professional appearance. 
    Of course, you need to make sure that your installation program installs your program correctly or the first impression that users get from your program will be a highly negative one.