iematch
Title
iematch - Matching base observations towards target observations using a single continuous variable.
Syntax
For a more descriptive discussion of the intended usage and workflow of this command, please see the DIME Wiki.
iematch [if] [in] , grpdummy(varname) matchvar(varname) [ idvar(varname) m1 maxdiff(numlist) maxmatch(integer) seedok matchidname(string) matchdiffname(string) matchresultname(string) matchcountname(string) replace ]
options | Description |
---|---|
grpdummy(varname) | The group dummy variable where 1 indicates base observations and 0 target observations |
matchvar(varname) | The variable with a continuous value to match on |
idvar(varname) | The uniquely and fully identifying ID variable. Used to indicate which target observation a base observation is matched with. If omitted, an ID variable will be created. See below if you have multiple ID variables. |
maxdiff(numlist) | Set a maximum difference allowed in matchvar(). If a base observation has no match within this difference then it will remain unmatched |
m1 | Allows many-to-one matches. The default is to allow only one-to-one matches. See the description section. |
maxmatch(integer) | Sets the maximum number of base observations that each target observation is allowed to match with in a m1 (many-to-one) match. |
seedok | Suppresses the error message thrown when there are duplicates in matchvar(). When there are duplicates, the seed needs to be set in order to have a replicable match. The seed (see help set seed) should be set before this command. |
matchresultname(string) | Manually sets the name of the variable that indicates whether an observation was matched or, if not, the reason why. The default is _matchResult |
matchidname(string) | Manually sets the name of the variable that indicates which target observation each base observation is matched with. The default is _matchID |
matchdiffname(string) | Manually sets the name of the variable that indicates the difference in matchvar() in each match pair/group. The default is _matchDiff |
matchcountname(string) | Manually sets the name of the variable that indicates how many observations a target observation is matched with in a many-to-one match. The default is _matchCount |
replace | Replaces variables in memory if there are name conflicts when generating the output variables. |
Description
iematch matches base observations towards target observations in terms of nearest value in matchvar(). Base observations are observations with value 1 in grpdummy() and target observations are observations with value 0. For example, in a typical p-score match, base observations are treatment and target observations are control. There are, however, many other examples of matching where the roles could be different.
iematch bases its matching algorithm on the Stable Marriage algorithm. This algorithm was chosen because it always provides a solution and allows stable matching even if multiple observations have identical match values (see the seedok option for more details). One disadvantage of this algorithm is that it takes into account local efficiency only, and not global efficiency. This means that other matching algorithms might generate a more efficient match in terms of the sum of the differences across all matched pairs/groups.
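The local-versus-global point can be seen with four observations. The sketch below uses made-up values (the variable names id, tmt and score are hypothetical) and assumes ietoolkit is installed. Base values are 5 and 1, and target values are 4 and 10. The mutual-preference rule described in the next section would pair 5 with 4 first and then 1 with 10, for a total difference of 1 + 9 = 10, whereas pairing 5 with 10 and 1 with 4 would give 5 + 3 = 8.

```stata
* Hypothetical toy data, not taken from this help file
clear
input id tmt score
1 1 5
2 1 1
3 0 4
4 0 10
end

* One-to-one match on score; id identifies each observation
iematch , grpdummy(tmt) matchvar(score) idvar(id)

* Inspect the pairs and sum the differences over base observations
sort _matchID tmt
list id tmt score _matchID _matchDiff, sepby(_matchID)
summarize _matchDiff if tmt == 1
display "Total difference across pairs: " r(sum)
```

Comparing the displayed total with the alternative pairing computed by hand shows the cost, if any, of matching on local preferences only.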
One-to-one matching algorithm
The algorithm used in a one-to-one match starts by evaluating which target observation each base observation is closest to, and vice versa for each target observation. If a base and target observation pair mutually prefer each other, then these two observations are matched. The algorithm then repeats the two evaluation steps, excluding observations once they are matched, until all base observations are either matched or excluded due to the option maxdiff(). Matched observations end up in pairs of exactly one base observation and one target observation.
In a one-to-one match, there have to be at least as many target observations as there are base observations. A one-to-one match returns three variables. One variable indicates the result of the matching for each observation, such as matched, not matched, no match within the maximum difference, etc. The other two variables hold information on the matched pair. One is the ID of the target observation in the matched pair, and the other is the difference in matchvar() between the two observations in the matched pair. See the Variables generated section below for more details.
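As a minimal sketch of what a one-to-one match returns, the example below builds a small simulated data set and lists the three output variables. The data and the variable names id, tmt and p_hat are made up for illustration, and the sketch assumes ietoolkit is installed.

```stata
* Hypothetical toy data: 4 base and 6 target observations
clear
set seed 12345
set obs 10
gen id = _n
gen tmt = (_n <= 4)
gen p_hat = runiform()

* One-to-one match (the default)
iematch , grpdummy(tmt) matchvar(p_hat) idvar(id)

* The three variables created by a one-to-one match
tabulate _matchResult tmt, missing
sort _matchID tmt
list id tmt p_hat _matchID _matchDiff if _matchResult == 1, sepby(_matchID)
```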
Many-to-one matching algorithm
The algorithm used for many-to-one matching is the same as in a one-to-one match, with one difference. Instead of matching only when there is a mutual preference, all base observations are matched towards their preferred target observation in the first step, as long as the match is within the maximum difference if maxdiff() is used. Matched observations end up in groups with exactly one target observation and one or several base observations.
In a many-to-one match, there is no restriction on the number of base observations relative to the number of target observations. The many-to-one matching yields four variables. Three of them are the same as in one-to-one matching, except that the variable listing the difference is reported for base observations only. The fourth variable indicates how many base observations were matched towards each target observation. See the Variables generated section below for more details.
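A many-to-one match can be sketched the same way. The example below uses simulated data with more base than target observations, which is allowed with m1. The data and variable names are hypothetical, and the sketch assumes ietoolkit is installed.

```stata
* Hypothetical toy data: 7 base and 3 target observations
clear
set seed 12345
set obs 10
gen id = _n
gen tmt = (_n <= 7)
gen p_hat = runiform()

* Many-to-one match: several base observations may share a target
iematch , grpdummy(tmt) matchvar(p_hat) idvar(id) m1

* The four variables created by a many-to-one match
sort _matchID tmt
list id tmt p_hat _matchID _matchDiff _matchCount, sepby(_matchID)
```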
Variables generated
This section explains the variables that are generated by this command. All variables will be referred to by their default names, but those names can be manually set to something else (see the options for this command).
_matchResult
: This variable indicates the result of the match for all observations. Observations that ended up in a match pair/group have the value 1. Target observations not matched have the value 0. Base observations without a valid match due to maxdiff() are assigned the value .d. Observations excluded from the match using if/in are assigned the value .i. Observations excluded from the match due to a missing value in grpdummy() are assigned the value .g. Observations excluded from the match due to a missing value in matchvar() are assigned the value .m. All values have descriptive value labels.
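The codes above can be inspected after a match, as in the following sketch. The simulated data, the variable names (id, tmt, p_hat, baseline, matched) and the maxdiff() value are made up for illustration, and the sketch assumes ietoolkit is installed.

```stata
* Hypothetical toy data: 8 base and 12 target observations
clear
set seed 12345
set obs 20
gen id = _n
gen tmt = (_n <= 8)
gen p_hat = runiform()
replace p_hat = . in 20        // missing matchvar(), expected code .m
gen baseline = (_n != 19)      // observation 19 is excluded by if, expected code .i

iematch if baseline == 1 , grpdummy(tmt) matchvar(p_hat) idvar(id) maxdiff(.05)

* Show all result codes, including the extended missing values
tabulate _matchResult, missing

* Base observations left unmatched because of maxdiff()
count if _matchResult == .d

* Flag the observations that ended up in a pair
gen byte matched = (_matchResult == 1)
```

Because .d, .i, .g and .m are extended missing values, a condition such as _matchResult == 1 selects matched observations only.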
_matchID
: This variable indicates the ID of the target observations in each match pair/group. For matched target observations, the value in _matchID will be equal to the value in the ID variable. Since the values in the ID variable are unique and there is exactly one target observation in each matched pair/group, this variable functions as a unique pair/group ID. In addition to indicating which observations are included in the same pair/group, this variable can be used to include a pair/group fixed effect in a regression.
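One way to include a pair fixed effect is to absorb _matchID in the regression. The sketch below is only an illustration: the data are simulated, the variable outcome is hypothetical, and ietoolkit is assumed to be installed.

```stata
* Hypothetical toy data: 15 base and 25 target observations
clear
set seed 12345
set obs 40
gen id = _n
gen tmt = (_n <= 15)
gen p_hat = runiform()
gen outcome = 2*tmt + rnormal()

iematch , grpdummy(tmt) matchvar(p_hat) idvar(id)

* Pair fixed effects: absorb the pair identifier, matched observations only
areg outcome tmt if _matchResult == 1, absorb(_matchID)
```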
_matchDiff
: This variable indicates the difference in matchvar() between matched base observations and target observations. In a one-to-one match, this value is identical for the base observation and the target observation in each matched pair. In a many-to-one match, this value is only reported for base observations. It is missing for target observations, as there are potentially multiple matches and therefore multiple differences.
_matchCount
: In a many-to-one match, this variable indicates how many base observations were matched towards each matched target observation. This variable can be used as regression weights when controlling for the fact that some observations are matched towards multiple observations.
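The sketch below shows one possible way to build regression weights from _matchCount. It is an assumption, not a prescription from this help file: base observations get weight 1, matched target observations get their _matchCount, and the weight type (pweight here) should be chosen to suit the analysis. The data and the variables outcome and match_w are hypothetical, and ietoolkit is assumed to be installed.

```stata
* Hypothetical toy data: 30 base and 10 target observations
clear
set seed 12345
set obs 40
gen id = _n
gen tmt = (_n <= 30)
gen p_hat = runiform()
gen outcome = 2*tmt + rnormal()

iematch , grpdummy(tmt) matchvar(p_hat) idvar(id) m1

* Assumption: weight 1 for base, _matchCount for matched targets
gen match_w = cond(tmt == 1, 1, _matchCount) if _matchResult == 1

* Observations with a missing weight (unmatched) are dropped by regress
regress outcome tmt [pweight = match_w]
```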
Options
grpdummy(varname) is the dummy variable that indicates if an observation is a base or target observation. This variable must be numeric and is only allowed to have the values 1, 0 or missing. 1 indicates a base observation, 0 indicates a target observation, and observations with a missing value will be excluded from the matching.
matchvar(varname) is the variable used to compare observations when matching. This must be a numeric variable, and it is typically a continuous variable. Observations with a missing value will be excluded from the matching.
idvar(varname) indicates the variable that uniquely and fully identifies the observations in the data set. The values in this variable are the values that will be used in the variable that indicates which target observation each base observation is matched against. If this option is omitted, a variable called _ID will be generated. The observation in the first row is given the value 1, the second row the value 2, and so forth.
This command assumes a single ID variable, as that is the best practice this command follows (see the next paragraph for the exception of panel data sets). If the data set has more than one ID variable, there are two suggested solutions:
1. Do not use the idvar() option and, after the matching, copy the multiple ID variables yourself.
2. Combine your ID variables into one ID variable. Here are two examples of how that can be done (the examples work just as well when combining more than two ID variables into one):
egen new_ID_var = group(old_ID_var1 old_ID_var2)
gen new_ID_var = old_ID_var1 + "_" + old_ID_var2 //Works only with string vars
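Before passing a combined ID variable to idvar(), it may be worth confirming that it is unique. The following sketch uses made-up string IDs and Stata's isid command, which exits with an error if the variable does not uniquely identify the observations.

```stata
* Hypothetical toy data with two string ID variables
clear
input str3 old_ID_var1 str3 old_ID_var2
"A" "01"
"A" "02"
"B" "01"
end

* Combine the two string IDs into one
gen new_ID_var = old_ID_var1 + "_" + old_ID_var2

* Errors out if new_ID_var is not unique
isid new_ID_var
```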
Panel data sets are one of the few cases where multiple ID variables are good practice. However, in the case of matching, it is unlikely to be correct to include multiple time rounds for the same observation. That would lead to some base observations being matched to one target observation in the first round and to another in the second. In impact evaluations, matching is almost exclusively done on the baseline data only.
maxdiff(numlist) sets a maximum allowed difference between a base observation and a target observation for a match to be valid. Any base observation without a valid match within this difference will end up unmatched.
m1 sets the match to a many-to-one match (see description). This allows multiple base observations to be matched towards a single target observation. The default is the one-to-one match, where at most one base observation is matched towards each target observation. This option allows the number of base observations to be larger than the number of target observations.
maxmatch(integer) sets the maximum number of base observations a target observation is allowed to match with in an m1 (many-to-one) match. The integer in maxmatch() is the maximum number of base observations in a group. Since there is always one target observation in the group as well, a maxed-out match group contains maxmatch() plus one observations.
seedok suppresses the error message thrown when there are duplicates among the base observations or among the target observations in matchvar(). When there are duplicates, a random variable is used to decide which observation to match. Setting the seed makes this randomization replicable, which in turn makes the matching replicable. The seed (see help set seed) should be set by the user before this command. When the seed is set, or if replicable matching is not important, the option seedok can be used to suppress the error message. Duplicate pairs where one observation is a base observation and the other is a target observation are allowed.
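A minimal sketch of this workflow is shown below. The simulated data, the variable names and the seed value are hypothetical, ties are created on purpose by rounding, and ietoolkit is assumed to be installed.

```stata
* Set the seed first so that tie-breaking is replicable
clear
set seed 12345

* Hypothetical toy data: 8 base and 12 target observations
set obs 20
gen id = _n
gen tmt = (_n <= 8)
gen p_hat = round(runiform(), .1)   // rounding creates duplicate values

* seedok acknowledges that the seed has been set
iematch , grpdummy(tmt) matchvar(p_hat) idvar(id) seedok

tabulate _matchResult tmt, missing
```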
matchresultname(string) manually sets the name of the variable that reports the matching result for each observation. If omitted, the default name is _matchResult. The names _ID, _matchID, _matchDiff and _matchCount are not allowed.
matchidname(string) manually sets the name of the variable that lists the ID of the target observation in each match pair/group. If omitted, the default name is _matchID. The names _ID, _matchDiff, _matchResult and _matchCount are not allowed.
matchdiffname(string) manually sets the name of the variable that indicates the difference between the matched base observations and target observations. If omitted, the default name is _matchDiff. The names _ID, _matchID, _matchResult and _matchCount are not allowed.
matchcountname(string) manually sets the name of the variable that indicates the number of base observations that have been matched towards each matched target observation. This option may only be used in combination with option m1. If omitted, the default name is _matchCount. The names _ID, _matchID, _matchDiff and _matchResult are not allowed.
replace allows iematch to replace variables in memory when encountering name conflicts while creating the variables with the results of the matching.
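The naming options and replace can be combined when the command is run more than once on the same data. The sketch below is an illustration with simulated data. The names res01, id01 and diff01 and the maxdiff() values are made up, and ietoolkit is assumed to be installed.

```stata
* Hypothetical toy data: 8 base and 12 target observations
clear
set seed 12345
set obs 20
gen id = _n
gen tmt = (_n <= 8)
gen p_hat = runiform()

* First run creates _matchResult, _matchID and _matchDiff
iematch , grpdummy(tmt) matchvar(p_hat) idvar(id)

* Second run overwrites those variables instead of throwing an error
iematch , grpdummy(tmt) matchvar(p_hat) idvar(id) maxdiff(.05) replace

* Third run keeps the earlier results by writing to custom names
iematch , grpdummy(tmt) matchvar(p_hat) idvar(id) maxdiff(.01) ///
    matchresultname(res01) matchidname(id01) matchdiffname(diff01)
```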
Examples
Example 1
iematch , grpdummy(tmt) matchvar(p_hat)
In the example above, the observations with value 1 in tmt will be matched towards the nearest, in terms of p_hat, observations with value 0 in tmt.
Example 2
iematch if baseline == 1 , grpdummy(tmt) matchvar(p_hat) maxdiff(.001)
In the example above, the observations with value 1 in tmt will be matched towards the nearest, in terms of p_hat, observations with value 0 in tmt as long as the difference in p_hat is less than .001. Only observations that have the value 1 in variable baseline will be included in the match.
Example 3
iematch , grpdummy(tmt) m1 maxmatch(5) matchvar(p_hat) maxdiff(.001)
In the example above, the observations with value 1 in tmt will be matched towards the nearest, in terms of p_hat, observations with value 0 in tmt as long as the difference in p_hat is less than .001. So far this example is identical to example 2. However, in this example each target observation is allowed to match with up to 5 base observations. Hence, instead of a result with only pairs of exactly one target observation and one base observation in each pair, the result is instead match groups with one target observation and up to 5 base observations. If maxmatch() is omitted any number of base observations may match with each target observation.
Acknowledgements
I would like to acknowledge the help in testing and proofreading I received in relation to this command and help file from (in alphabetical order):
Luiza Cardoso De Andrade
Seungmin Lee
Mrijan Rimal
Feedback, bug reports and contributions
Please send bug reports, suggestions and requests for clarification, writing “ietoolkit iematch” in the subject line, to: dimeanalytics@worldbank.org
You can also see the code, comment on the code, see the version history of the code, and submit additions or edits to the code through the GitHub repository for ietoolkit.