File Format
Each data file we host has a unique identifier in the format [XX]-YYYYY-ZZZZZZZZ.EXT. The parts of the identifier are:
- XX is a two-letter category code that is one of ED (election data), MD (matching data) or CD (combinatorial or rating data).
- YYYYY is a five-digit Series Code which specifies the source of the data. This is the dataset identifier.
- ZZZZZZZZ is an eight-digit Element Number for each individual file of a series. This is the data patch identifier.
- EXT is a file extension describing the type of data in the file. See the data types page for more details about the extensions.
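As a rough illustration of the naming scheme, the sketch below splits an identifier into its four parts. It assumes that the square brackets around XX are notation only and do not appear in real file names, and that extensions consist of letters; the function name `parse_file_id` and the sample name `ED-00001-00000001.soc` are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed shape: XX-YYYYY-ZZZZZZZZ.EXT, with no literal brackets in the name.
_ID_PATTERN = re.compile(r"(ED|MD|CD)-(\d{5})-(\d{8})\.([A-Za-z]+)")


class FileId(NamedTuple):
    category: str   # ED, MD or CD
    series: str     # five-digit Series Code, kept as text to preserve zeros
    element: str    # eight-digit Element Number, kept as text
    extension: str  # lower-cased file extension


def parse_file_id(name: str) -> FileId:
    """Split an identifier into its parts, or raise ValueError."""
    match = _ID_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid identifier: {name!r}")
    category, series, element, ext = match.groups()
    return FileId(category, series, element, ext.lower())


# Hypothetical file name, used only to exercise the function.
file_id = parse_file_id("ED-00001-00000001.soc")
print(file_id.category, file_id.series, file_id.element, file_id.extension)
```

The codes are kept as strings rather than integers so that leading zeros survive a round trip back to a file name.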
We have developed a small set of lightweight tools in Python3 for working with PrefLib and generating synthetic data. Please download the current version of the tools below and check the README for full details. PrefLib tools are covered under the BSD License and are available at the PrefLib-Tools GitHub Page.
We currently use three formats, which are described below.
Election Data
The format for all ranked preferences (orders over candidates or sets of candidates) is as follows, with each element on a new line. The file extensions SOC, SOI, TOC, TOI, TOG, MJG, WMG and PWG use this format.
- Number of Candidates
- 1, Candidate Name
- 2, Candidate Name
- ...
- Number of Voters, Sum of Vote Count, Number of Unique Orders
- Count, Preference list
- Count, Preference list
- ...
Votes are sorted by count in the individual data files. Each field is described below:
- Number of Candidates: the number of candidates or alternatives agents vote over.
- X, Candidate Name: the name of the candidate or the alternative labeled by the number X.
- Number of Voters: the number of actual ballots that were cast.
- Sum of Vote Count: the sum of weights. In most cases Number of Voters = Sum of Vote Count. The exception is when we have induced a relation, for example by generating a pairwise graph from a set of linear orders. In that case, with n voters over m alternatives, the sum of weights is n * (m choose 2), since each voter expresses a relation between each pair of alternatives.
- Number of Unique Orders: the number of distinct ballots that were cast.
- Count, Preference list: the preference list together with the number of agents who submitted it.
- Preference list: a strict ordering whose elements are separated by commas. Elements between which the agent is indifferent are grouped within curly braces, as in 3, {1, 2, 4}.
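The arithmetic behind the Sum of Vote Count for an induced pairwise relation can be checked with a short sketch. The helper name `pairwise_weight_sum` and the numbers used are illustrative assumptions, not values taken from any dataset.

```python
from math import comb


def pairwise_weight_sum(num_voters: int, num_alternatives: int) -> int:
    """Sum of weights when every voter compares every pair of alternatives."""
    # Each voter contributes one relation per unordered pair: (m choose 2).
    return num_voters * comb(num_alternatives, 2)


# Illustrative numbers: 10 voters over 4 alternatives give 10 * 6 relations.
print(pairwise_weight_sum(10, 4))  # 60
```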
Here are the first 25 lines of a data file of complete orders with ties (TOC), taken from the Debian election dataset.
4
1, Branden Robinson
2, Raphael Hertzog
3, Bdale Garbee
4, None Of The Above
475, 475, 41
60, 3, 1, 2, 4
50, 1, 3, 2, 4
40, 3, 1, 2, 4
34, 3, 2, 1, 4
31, 3, 2, 4, 1
29, 2, 3, 1, 4
29, 1, 3, 2, 4
24, 2, 1, 3, 4
22, 1, 2, 3, 4
20, 3, 2, 1, 4
15, 1, 3, 4, 2
14, 2, 3, 1, 4
11, 3, 1, 4, 2
9, 2, 3, 4, 1
9, 3, {1, 2, 4}
8, 1, 2, 3, 4
7, 1, {2, 3, 4}
5, 3, 4, {1, 2}
5, 3, 2, {1, 4}
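A file in this layout could be read along the following lines. This is a minimal sketch rather than the PrefLib tools' own parser: it assumes the file contains only the raw lines (no display line numbers), that counts are integers, and that ties are written with curly braces. The function names and the three-candidate sample are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# A ranking is a sequence of ranks; each rank is a group of tied candidates.
Ranking = Tuple[Tuple[int, ...], ...]


def parse_preference_list(text: str) -> Ranking:
    """Turn '3, {1, 2}' into ((3,), (1, 2))."""
    ranks: List[Tuple[int, ...]] = []
    group = None  # collects candidates while inside a {...} tie group
    for token in text.split(","):
        token = token.strip()
        if token.startswith("{"):
            group = []
            token = token[1:].strip()
        closes = token.endswith("}")
        if closes:
            token = token[:-1].strip()
        if group is not None:
            if token:
                group.append(int(token))
            if closes:
                ranks.append(tuple(group))
                group = None
        else:
            ranks.append((int(token),))
    return tuple(ranks)


def parse_election(lines: Iterable[str]) -> dict:
    """Parse the candidate block, the counts line and the ballots."""
    rows = [line.strip() for line in lines if line.strip()]
    num_candidates = int(rows[0])
    candidates: Dict[int, str] = {}
    for row in rows[1:1 + num_candidates]:
        number, name = row.split(",", 1)  # split once: names may contain commas
        candidates[int(number)] = name.strip()
    counts = rows[1 + num_candidates].split(",")
    num_voters, weight_sum, num_unique = (int(x) for x in counts)
    ballots: List[Tuple[int, Ranking]] = []
    for row in rows[2 + num_candidates:]:
        count, prefs = row.split(",", 1)
        ballots.append((int(count), parse_preference_list(prefs)))
    return {"candidates": candidates, "num_voters": num_voters,
            "weight_sum": weight_sum, "num_unique": num_unique,
            "ballots": ballots}


# Made-up sample in the same layout (not taken from any dataset).
SAMPLE = """3
1, Candidate A
2, Candidate B
3, Candidate C
5, 5, 2
3, 1, 2, 3
2, 3, {1, 2}
"""
election = parse_election(SAMPLE.splitlines())
print(election["ballots"])
```

Representing each rank as a tuple keeps strict orders and orders with ties in one structure: a strict order is simply a ranking in which every group has a single member.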
Weighted Matching Data
The format for all weighted matching data is as follows, with each element on a new line. Only the file extension WMD uses this format.
- Number of Nodes, Number of Edges
- 1, Node Name
- 2, Node Name
- ...
- Source, Destination, Weight
- Source, Destination, Weight
- ...
The edges are sorted by source so that all edges starting from the same source are grouped together. Each field is described below:
- Number of Nodes: the number of nodes in the graph.
- Number of Edges: the number of edges in the graph.
- X, Node Name: the name of the node labeled by the number X.
- Source, Destination, Weight: an edge represented by its source node, its destination node and its weight.
Here are the first 25 lines of a weighted matching data file (WMD), taken from the kidney matching dataset.
16, 26
1, Pair 1
2, Pair 2
3, Pair 3
4, Pair 4
5, Pair 5
6, Pair 6
7, Pair 7
8, Pair 8
9, Pair 9
10, Pair 10
11, Pair 11
12, Pair 12
13, Pair 13
14, Pair 14
15, Pair 15
16, Pair 16
0, 4, 1
2, 7, 1
2, 3, 1
3, 4, 1
5, 7, 1
5, 3, 1
6, 7, 1
6, 3, 1
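The weighted matching layout can be read in much the same way. The sketch below is an assumption-laden illustration, not the PrefLib tools' implementation: it treats weights as floats, expects raw lines without display line numbers, and uses a made-up three-node sample. The name `parse_weighted_matching` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[int, int, float]  # (source, destination, weight)


def parse_weighted_matching(lines: Iterable[str]) -> Tuple[Dict[int, str], List[Edge]]:
    """Return the node names and the edge list of a WMD-style file."""
    rows = [line.strip() for line in lines if line.strip()]
    num_nodes, num_edges = (int(x) for x in rows[0].split(","))
    nodes: Dict[int, str] = {}
    for row in rows[1:1 + num_nodes]:
        number, name = row.split(",", 1)  # split once: names may contain commas
        nodes[int(number)] = name.strip()
    edges: List[Edge] = []
    for row in rows[1 + num_nodes:]:
        source, destination, weight = (part.strip() for part in row.split(","))
        edges.append((int(source), int(destination), float(weight)))
    if len(edges) != num_edges:
        # An excerpt of a file will not match its header; a full file should.
        print(f"warning: header declares {num_edges} edges, found {len(edges)}")
    return nodes, edges


def adjacency(edges: List[Edge]) -> Dict[int, List[Tuple[int, float]]]:
    """Group edges by source node."""
    grouped: Dict[int, List[Tuple[int, float]]] = {}
    for source, destination, weight in edges:
        grouped.setdefault(source, []).append((destination, weight))
    return grouped


# Made-up sample in the same layout (not taken from any dataset).
SAMPLE = """3, 3
1, Pair 1
2, Pair 2
3, Pair 3
1, 2, 1
1, 3, 1
2, 3, 1
"""
nodes, edges = parse_weighted_matching(SAMPLE.splitlines())
print(adjacency(edges))
```

Because edges are sorted by source in the files, the adjacency grouping preserves the file order within each source.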
Extra Data File
When miscellaneous data are needed, we use the file extension DAT, which is always a simple CSV file with a header row.
- Item 1 Name, Item 2 Name, Item 3 Name, ..., Item N Name
- Item 1 Value, Item 2 Value, Item 3 Value, ..., Item N Value
Files with a DAT extension are generally paired with another file and provide more information than is expressible in the basic data formats.
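Since a DAT file is plain CSV with a header row, Python's standard csv module is enough to read one. The sketch below assumes that fields are separated by a comma followed by a space, as in the layout above; the function name `read_dat` and the column names in the sample are hypothetical.

```python
import csv
import io
from typing import Dict, List


def read_dat(text: str) -> List[Dict[str, str]]:
    """Read DAT content into one dictionary per data row, keyed by header."""
    # skipinitialspace drops the space that follows each comma.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


# Made-up content, used only to exercise the function.
SAMPLE = "Item 1 Name, Item 2 Name\nvalue a, value b\nvalue c, value d\n"
rows = read_dat(SAMPLE)
print(rows[0]["Item 2 Name"])
```

All values come back as strings; any conversion to numbers depends on the particular DAT file and is left to the caller.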