Understanding what files are and choosing a Delphi file type - part 1

What is a File?  How are they stored? What format is best for my project? - The first part of a series by Philip Rayment


File, n, 1. A metal tool with numerous small cutting ridges or teeth on its surface, for smoothing or cutting metal and other substances. 2. A cabinet in which papers, etc., are arranged or classified for convenient reference. 3. Computers, a portion of a memory storage device allocated to a set of data.

If you want explanations of the first two definitions, sorry, you will have to look elsewhere. This article discusses the third definition, although of course a computer file is analogous to the cabinet of the second definition.

This article discusses what a file is by looking at the origins of files; at least in so far as they have been implemented on PCs.

Outline of this article:

  • Disk format of a file
  • File contents, part I
  • Language conventions and ASCII
  • Language conventions and machine code
  • File contents, part II
  • Delphi and files
  • Which file types should you use?

Disk format of a file

A file is a portion of a disk (or equivalent device) allocated to a set of data and referred to by a file name. With FAT file systems, disk space is allocated in blocks of (for example) 512, 1024, or 4096 bytes, depending on the capacity of the disk. A disk with 512-byte blocks will therefore allocate 512 bytes of storage for any file up to that size. If you create a file that only requires 6 bytes, 512 bytes will be allocated. If you create a file 513 bytes long, 1024 bytes (two blocks) will be allocated.

So how does the operating system know the actual size of the file? Each disk keeps a directory of files. The entry for each file includes the name of the file, the date and time the file was last written to, and the size of the file. It is an enhanced version of this directory that Windows Explorer presents in the 'Files' pane. This system has been around since the very first version of MS-DOS, and in fact was based on an even earlier operating system, known as CP/M.
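The block arithmetic above can be sketched in a few lines of Delphi. This is only an illustration: AllocatedSize is a hypothetical helper written for this example, not an operating system call, and it assumes space is always handed out in whole blocks.

```pascal
program AllocSizeDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal compile this too
{$APPTYPE CONSOLE}

// Hypothetical helper: disk space taken by a file of FileBytes bytes
// when space is allocated in whole blocks of BlockBytes bytes.
function AllocatedSize(FileBytes, BlockBytes: Int64): Int64;
begin
  // Round up to the next whole number of blocks.
  Result := ((FileBytes + BlockBytes - 1) div BlockBytes) * BlockBytes;
end;

begin
  WriteLn(AllocatedSize(6, 512));    // 512
  WriteLn(AllocatedSize(512, 512));  // 512
  WriteLn(AllocatedSize(513, 512));  // 1024
end.
```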

File contents, part I

So what goes into a file? Anything, actually. Files are sequences of bytes. A byte is of course 8 bits where each bit can, by definition, have one of two values, which can be represented as on and off, zero and one, or any other representation desired. Normally bits are represented by the numeric digits 0 and 1, and eight identical bits therefore can be represented as 00000000 or 11111111.

These are binary numbers, but are not convenient for most purposes, so are often combined into groups of four bits. Because a group of four bits can have any of 16 different values, these are normally represented by the ten numeric digits and the first six letters of the alphabet. This is known as hexadecimal. Of course decimal numbers can also represent 16 different values.

The table below shows these 16 different values represented as binary, hexadecimal, and decimal.

Binary   Hexadecimal   Decimal
0000     0             00
0001     1             01
0010     2             02
0011     3             03
0100     4             04
0101     5             05
0110     6             06
0111     7             07
1000     8             08
1001     9             09
1010     A             10
1011     B             11
1100     C             12
1101     D             13
1110     E             14
1111     F             15

As a byte is eight bits, two hexadecimal digits are used to represent the value, giving values from 00 to FF, equivalent to 00000000 to 11111111 in binary or 0 to 255 in decimal. Delphi distinguishes hexadecimal from other numbers by the dollar sign at the start of the number, thus 40 is a decimal number whereas $40 is a hexadecimal number (equivalent to 64 in decimal).

But of course files don't just contain numbers, do they? They can contain text, pictures, etc., as well. How do they do this? The answer lies in what can be termed language conventions.
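As a small illustration of the dollar-sign notation, the following console program prints the same values in both forms. The constant names are made up for this example; IntToHex and StrToInt are the standard SysUtils routines.

```pascal
program HexDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal compile this too
{$APPTYPE CONSOLE}
uses
  SysUtils;

const
  DecimalForty = 40;   // plain decimal literal
  HexForty     = $40;  // hexadecimal literal, 64 in decimal

begin
  WriteLn(DecimalForty);       // 40
  WriteLn(HexForty);           // 64
  WriteLn(IntToHex(255, 2));   // FF
  WriteLn(StrToInt('$7B'));    // 123
end.
```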

Language conventions and ASCII

What does the sequence of letters 'c', 'a', and 't' mean? To English-speaking people, it is a furry pet with claws. The letters themselves have no inherent meaning, but English speakers agree to apply a particular meaning to that particular sequence of letters. Similarly, the sequence 'g', 'i', 'f', and 't' means a present. But to German-speaking people that sequence means a poison. The same sequence of letters can mean different things to different people, and in fact any sequence of letters or other symbols can mean anything at all, as long as the writers and readers all understand the meaning.

English and similar languages use 26 letters, Morse code uses two, and DNA uses four. In the 1960s a language convention was adopted for computer data, known as ASCII (American Standard Code for Information Interchange). This convention allocated meanings to the first 128 of the 256 values a byte can have. There were already other conventions in use, and others again have modified or superseded ASCII, but ASCII was adopted by personal computers when they appeared and so it became quite widespread.

Under the ASCII standard or language convention, the value 01000001/$41/65 was given the job of representing the capital letter 'A'. Thus a file that contained the bytes $43, $41, and $54 will, if loaded into WordPad, display as 'CAT'. This is not because the file contains the word 'CAT', but because WordPad understands the bytes to represent the letters 'C', 'A', and 'T'. A different application may understand the same bytes differently. So if ASCII uses 128 different values and English only has 26 letters, what are all the others for?
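To see the convention at work, the sketch below takes the three byte values from the example and interprets each one as an ASCII character. BytesToText is a hypothetical helper written for this illustration.

```pascal
program AsciiDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal compile this too
{$APPTYPE CONSOLE}

// Hypothetical helper: interpret each byte value as an ASCII character.
function BytesToText(const Data: array of Byte): string;
var
  I: Integer;
begin
  Result := '';
  for I := Low(Data) to High(Data) do
    Result := Result + Chr(Data[I]);
end;

begin
  // The bytes are only numbers; reading them as 'CAT' is the convention.
  WriteLn(BytesToText([$43, $41, $54]));  // CAT
end.
```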

Well, English actually uses more than just the 26 letters. It uses both capital and lower-case letters, a space to separate words, and there are other symbols to help with clarity, such as commas, full stops, question marks, etc. ASCII uses 95 of the values to represent the ten numeric digits, 26 capital letters, 26 lower-case letters, various punctuation marks, the space character, and miscellaneous other symbols such as the dollar sign and '@' symbol. ASCII also defines 33 control characters (values $00 to $1F, plus $7F). These were originally designed for data transmission and similar uses, where specified values indicate the start and end of transmission, etc. Thus value 3 was ETX (End of Text) and 4 was EOT (End of Transmission). $A (10) is LF (Linefeed) and $D (13) is CR (Carriage Return). Most of these control characters are not used as such in PCs.

IBM also decided to allocate the remaining 128 characters (values $80 to $FF) to various mathematical symbols and foreign-language characters, but these are not part of the ASCII standard and under Windows different typefaces may allocate different symbols to these values.
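As a rough sketch of how the 256 possible byte values divide up, the program below counts them by category. The type and function names are invented for this example. It follows the ASCII standard in treating $00 to $1F and $7F (DEL) as control characters, and it lumps $80 to $FF together as "extended" because those values lie outside ASCII.

```pascal
program ByteClassDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal compile this too
{$APPTYPE CONSOLE}

type
  TByteClass = (bcControl, bcPrintable, bcExtended);

// Hypothetical helper: which group does a byte value fall into?
function ClassifyByte(B: Byte): TByteClass;
begin
  if (B < $20) or (B = $7F) then
    Result := bcControl       // ASCII control characters
  else if B < $80 then
    Result := bcPrintable     // space, digits, letters, punctuation
  else
    Result := bcExtended;     // not defined by ASCII
end;

var
  Counts: array[TByteClass] of Integer;
  C: TByteClass;
  B: Integer;
begin
  for C := Low(TByteClass) to High(TByteClass) do
    Counts[C] := 0;
  for B := 0 to 255 do
    Inc(Counts[ClassifyByte(Byte(B))]);
  WriteLn('control:   ', Counts[bcControl]);    // 33
  WriteLn('printable: ', Counts[bcPrintable]);  // 95
  WriteLn('extended:  ', Counts[bcExtended]);   // 128
end.
```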

Language conventions and machine code

There is another important language convention used on IBM-type PCs. This is the language convention of the processor itself. To the processor, the value $41 is not the letter 'A', but the instruction inc ecx (increment the ecx register). The processor understands the byte values to be instructions to perform, and these have no connection with the ASCII code at all. Thus the same "letters" can represent two or more completely unrelated ideas, just as gift means something totally different in English and German.

File contents, part II

So computer files contain sequences of bytes which may represent ASCII characters or machine code or something else altogether. So how does the Operating System (OS) know what the values represent? In a sense, it doesn't. It really doesn't matter to the OS what a file contains. A file is ANY sequence of byte values. If all it is asked to do is to copy, move, or delete the file, the contents don't matter at all.

If Explorer is told to open a file, it looks up a list (based on the filename extension) to see which application to pass the file to, starts that application, and passes the file to it. It has no idea whether the file actually contains what the application expects it to contain. About the only time the file contents matter to the OS (apart from its own files) is when it is asked to run the file as a program. In this case it will check to see if the filename extension is an appropriate one (.exe, .com, etc.), but in most cases it also checks the contents of the file to see if they have certain signature values.

Early .exe programs, for example, had to start with the bytes $4D and $5A. These bytes did not represent machine code, but were an indication (by yet another convention) that the file was a program. (The values $4D and $5A were arbitrarily chosen as in ASCII they represent the letters 'MZ', reputedly the initials of the programmer who designed the .exe file format!)

In CP/M days, files were saved in 128-byte blocks with no record of the exact file size. The actual end of a text file was marked with a byte with a (decimal) value of 26 (also known as Ctrl-Z).
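A signature check of this kind might be sketched as follows. HasMZSignature and the file name sigdemo.tmp are made up for this illustration, and the check covers only the first two bytes ('M' is $4D and 'Z' is $5A in ASCII), so it is far less than what the operating system really does. The program writes a tiny sample file first so that it can run on its own.

```pascal
program SignatureDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal compile this too
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: True if the file starts with the bytes $4D $5A ('MZ').
function HasMZSignature(const FileName: string): Boolean;
var
  F: file;
  Sig: array[0..1] of Byte;
  BytesRead, OldMode: Integer;
begin
  BytesRead := 0;
  OldMode := FileMode;
  FileMode := fmOpenRead;        // Reset should open the file read-only
  try
    AssignFile(F, FileName);
    Reset(F, 1);                 // record size of one byte
    try
      // The fourth parameter receives the number of records actually read.
      BlockRead(F, Sig, SizeOf(Sig), BytesRead);
    finally
      CloseFile(F);
    end;
  finally
    FileMode := OldMode;
  end;
  Result := (BytesRead = SizeOf(Sig)) and (Sig[0] = $4D) and (Sig[1] = $5A);
end;

const
  Sample: array[0..3] of Byte = ($4D, $5A, $90, $00);
var
  F: file;
begin
  // Create a four-byte sample file that begins with 'MZ'.
  AssignFile(F, 'sigdemo.tmp');
  Rewrite(F, 1);
  BlockWrite(F, Sample, SizeOf(Sample));
  CloseFile(F);

  WriteLn(HasMZSignature('sigdemo.tmp'));  // TRUE
  DeleteFile('sigdemo.tmp');
end.
```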

Delphi file types

Delphi provides several methods for handling files, including wrappers for Windows' own file-handling methods. I will not cover the latter here. Delphi categorises files as untyped, typed, and text. The most basic is the untyped file, with which Delphi treats the file merely as a sequence of byte values. This essentially is what is done with the following procedure, which makes a copy of a file.

procedure CopyFile(fromName, toName: string);
var
  infile, outfile: file;   {untyped files}
  buffer: pointer;
  fs: integer;
begin
  {open both files with a record size of one byte}
  assignFile(infile, fromName); reset(infile, 1);
  assignFile(outfile, toName); rewrite(outfile, 1);
  fs := FileSize(infile);            {size in one-byte records, i.e. bytes}
  getmem(buffer, fs);
  blockread(infile, buffer^, fs);    {read the whole file into memory}
  blockwrite(outfile, buffer^, fs);  {write it all out again}
  CloseFile(infile);
  CloseFile(outfile);
  Freemem(buffer, fs);
end;

This rather simple procedure reads the entire contents of the file into the memory allocated to buffer then writes the same data to a new file. It assumes nothing about the contents of the file.

Actually, for historical reasons (probably traceable back to the CP/M file record-size), Delphi assumes that an untyped file is composed of blocks of 128 bytes unless you specify a different size in the reset and rewrite procedures. In the code above, we have specified record sizes of one byte, then told Delphi to read and write fs "records". Unless you have a special reason for not doing so, you should always specify a record size of one byte when using untyped files.

With a typed file, you tell Delphi what the file contains. This may be sequences of bytes, words, booleans, or a user-defined type such as a record. This last one is often referred to as a file of record. The following procedure also copies a file, but tells Delphi that the file contains MyRecord records.

type MyRecord = packed record
  Surname: string[20];
  ChristianName: string[20];
  Birthdate: TDate;
end;   {MyRecord}

procedure CopyFile(FromName, ToName: string);
var
  InFile, OutFile: file of MyRecord;
  Rec: MyRecord;
begin
   AssignFile(InFile, FromName); reset(InFile);
   AssignFile(OutFile, ToName); rewrite(OutFile);
   while not eof(InFile) do begin
     read(InFile, rec);
     write(OutFile, rec);
   end;   {while}
   CloseFile(InFile);
   CloseFile(OutFile);
end;

Delphi knows that a MyRecord type occupies 50 bytes (21 for each string field and eight for the Birthdate field), so reads in and writes out 50 bytes at a time. If the file is not a multiple of 50 bytes, an error will occur when the end of the file is reached in the middle of reading a record. The following code does the same thing but uses an untyped file (it uses the same MyRecord as the previous example):

procedure CopyFile(FromName, ToName: string);
var
  InFile, OutFile: file;   {untyped file this time}
  Rec: MyRecord;
begin
  {specify "records" of the length of MyRecord}
  AssignFile(InFile, FromName); reset(InFile, sizeof(MyRecord));
  AssignFile(OutFile, ToName); rewrite(OutFile, sizeof(MyRecord));
  while not eof(InFile) do begin
    BlockRead(InFile, rec, 1);   {read one record}
    BlockWrite(OutFile, rec, 1); {write one record}
  end;   {while}
  CloseFile(InFile);
  CloseFile(OutFile);
end;
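The 50-byte record size quoted above can be confirmed with SizeOf, and the same figure gives a quick way to test whether a file length is a whole number of records before copying. IsWholeRecords is a hypothetical helper for this sketch; the record declaration repeats the one used in the examples.

```pascal
program RecordSizeDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal compile this too
{$APPTYPE CONSOLE}

type
  MyRecord = packed record
    Surname: string[20];        // 1 length byte + 20 characters
    ChristianName: string[20];  // 1 length byte + 20 characters
    Birthdate: TDate;           // stored as an 8-byte floating-point value
  end;

// Hypothetical helper: True if ByteCount bytes hold only complete records.
function IsWholeRecords(ByteCount: Int64): Boolean;
begin
  Result := (ByteCount mod SizeOf(MyRecord)) = 0;
end;

begin
  WriteLn(SizeOf(MyRecord));     // 50
  WriteLn(IsWholeRecords(150));  // TRUE
  WriteLn(IsWholeRecords(160));  // FALSE
end.
```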

The remaining file type that Delphi understands is TextFile. This indicates to Delphi that the file contains bytes conforming to the ASCII language convention, although it will accept non-ASCII characters, i.e. characters in the range $80 to $FF. Particularly, it does assume that the file contains lines of text separated by CR (Carriage Return) characters, possibly followed by LF (Line Feed) characters. The following procedure copies a text file:

procedure CopyFile(FromName, ToName: string);
var
  InFile, OutFile: textfile;
  S: string;
begin
  AssignFile(InFile, FromName); reset(InFile);
  AssignFile(OutFile, ToName); rewrite(OutFile);
  while not eof(InFile) do begin
    Readln(InFile, s); {read an entire line up to a CR character.  The CR (and LF) is skipped.}
    Writeln(OutFile, s); {write a line and append CR and LF}
  end;   {while}
  CloseFile(InFile);
  CloseFile(OutFile);
end;

A text file gives you other options. One option is to read and write partial lines (use Read and Write instead of ReadLn and WriteLn). Another is to automatically convert certain ASCII sequences to their numerical equivalents. For example, given "i" being declared as a byte, if the file contains the string '123 ', read(InFile, i) will convert the string into the numeric value $7B (123 in decimal).

Delphi also defines the TIniFile object which assumes that the file is a text file conforming to the layout of a Windows .ini file, wherein most lines are of the form <keyname>=<value>. Additionally the TStrings type has methods for reading and writing text files. Then there are database files, which are beyond the scope of this article (because I haven't used them and don't know much about them!).
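The numeric conversion can be tried with a short program such as the one below. It is only a sketch: the file name readdemo.txt is made up, and the program writes the sample text itself before reading it back.

```pascal
program TextReadDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal compile this too
{$APPTYPE CONSOLE}
uses
  SysUtils;

var
  F: TextFile;
  I: Byte;
begin
  // Create a text file containing the characters '1', '2', '3' and a space.
  AssignFile(F, 'readdemo.txt');
  Rewrite(F);
  WriteLn(F, '123 ');
  CloseFile(F);

  // Read converts the digit characters into a single numeric value.
  Reset(F);
  Read(F, I);
  CloseFile(F);

  WriteLn(I);               // 123
  WriteLn(IntToHex(I, 2));  // 7B
  DeleteFile('readdemo.txt');
end.
```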
