# Reliable way to determine which package owns/updates a file

## jonathan183

I'm struggling to determine the packages which own or update files.

One example of a file with no owner identified by equery is /lib/cpp. I think it's owned by either gcc or gcc-config, but equery doesn't list it as a file in either package.

I have tried comparing the modify and change times reported by stat with the emerge log; the dates and times fall part way through the emerge of gcc.

I also have files which are owned by a package but whose md5sum does not match, for example /usr/share/vim/vim73/doc/tags, which is owned by app-editors/vim-core-7.3.762.

I'm trying to confirm files are correct before updating the tripwire database.

I have tried grep -r 'lib/cpp' /usr/portage which pulls up a few matches in Changelog information.

Can anyone give me a clue to a better way of identifying which packages modify files? I don't want to have to resort to running something like tripwire after every package ... but the detective work of trying to work out what's modifying particular files is also time consuming.

----------

## Anon-E-moose

Not sure why that doesn't show up, but cpp is part of gcc.

It doesn't even show up when you do an "equery f gcc". Not sure why; it has the same timestamp as the time gcc was emerged.

Something wrong with either portage or the gcc ebuild.

```
 ~ $ dir /usr/bin/cpp

-rwxr-xr-x 1 root root 10384 Jan 26 05:39 /usr/bin/cpp

 ~ $ dir /lib/cpp

-rwxr-xr-x 1 root root 10384 Jan 26 05:39 /lib/cpp
```

----------

## krinn

You'll only see files that are part of a package, any file not part of it won't show up.

So a configuration file created by the user, or in the case of /lib/cpp a file created by the package itself but not in the package list, won't show up.

----------

## Hu

That file is unowned because it is a copy of the installed gcc-config helper binary.  This could have been done as a symlink to the helper, which would be a bit clearer.

----------

## jonathan183

Thanks for the responses, so far it looks as though I'm going to be stuck with either a re-run of tripwire after every package update or a Columbo job on tracking down which package did the update ... are there really no more reliable/convenient methods of doing this?

----------

## Hu

For files not managed by the package manager, there is no general way to automatically verify them.  For most files, they ought to be under package manager control.  The ones you cited are not, for various reasons.

----------

## Anon-E-moose

/usr/bin/cpp and /lib/cpp are the same and should keep the same timestamp and md5 (or whatever) sig

----------

## jonathan183

I can understand things which I generate or where I have to take action after the package manager has finished ... things like the kernel and grub, but I was hoping something would be keeping track of things installed by packages - whether it's a copy of another file or not. There are also about 80 symbolic links in /usr/bin or /bin that get reported as having no package owner, things like /usr/bin/vi linked to vim.

----------

## cboldt

/lib/cpp is installed by gcc-config.  As reported above, it is a copy of /usr/bin/cpp.  The copy is created by a script at ...

/usr/portage/sys-devel/gcc-config/files/gcc-config-*

As for finding all the files that are installed "without a trace," good question!  gcruft.pl is designed to pick up files that aren't listed in the /var/db/pkg tree (which is another good place to run `grep -r [targetfile] /var/db/pkg/*`).   gcruft.pl uses the string "$libmap", which is plain old "lib" on my system.  /$libmap/cpp is listed as an expected unaccounted file, in /usr/share/gcruft/exceptions/sys-devel/gcc-config.pl
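A plain `grep -r` over /var/db/pkg also matches longer paths that merely contain the target. A minimal sketch of an exact lookup follows. `owner_of` and `PKGDB` are hypothetical names, and the sketch assumes the usual CONTENTS line layout (`obj <path> <md5> <mtime>`, `sym <path> -> <target> <mtime>`, `dir <path>`).

```shell
#!/bin/sh
# Hypothetical helper, not part of portage or gentoolkit.
# PKGDB defaults to the installed-package database discussed in this thread.
PKGDB="${PKGDB:-/var/db/pkg}"

# owner_of PATH: print category/package-version for every CONTENTS file
# that records exactly PATH (assumed line layout described above).
owner_of() {
    LOOKUP="$1" find "$PKGDB" -name CONTENTS -exec awk '
        # obj: the path sits between the type and the last two fields
        $1 == "obj" { path = $0; sub(/^obj /, "", path); sub(/ [^ ]+ [^ ]+$/, "", path) }
        # sym: the path sits between the type and " -> "
        $1 == "sym" { path = $0; sub(/^sym /, "", path); sub(/ -> .*$/, "", path) }
        # dir and anything else: the path is the rest of the line
        $1 != "obj" && $1 != "sym" { path = $0; sub(/^[^ ]+ /, "", path) }
        path == ENVIRON["LOOKUP"] {
            # .../category/package-version/CONTENTS
            n = split(FILENAME, parts, "/")
            print parts[n - 2] "/" parts[n - 1]
        }
    ' {} + | sort -u
}
```

Calling `owner_of /lib/cpp` would then print nothing if no installed package records that path, which is the situation described in this thread.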

Even with its built in list of exceptions (files expected to be unaccounted by portage), gcruft.pl threw up a list of some 250 files (not counting all the expected files in /usr/src and /lib/modules) that were not installed by a package, but do belong on my system.

How did you find the /usr/share/vim/vim73/doc/tags md5 mismatch?  I would wonder if that file is one that undergoes changes in some routine fashion (i.e., just use of vim), or if the vim-base documentation package and help file navigation might be modified by some vim-plugin package.

----------

## cboldt

Based on use of gcruft.pl (and before that, findcruft), I know that there is a substantial number of expected unaccounted files in a typical Gentoo installation, beyond the files that you know you are creating after a package is installed.  One can find how those files came to be, that's easy enough, albeit time consuming.  Given the allowed flexibility in ebuild->patch->install, I don't think Gentoo is designed so that all automatically installed files and links can even be accounted for in the /var/db/pkg tree.

If you thought all the installed files were accounted for, now you know the facts are otherwise.

----------

## khayyam

 *cboldt wrote:*   

> How did you find the /usr/share/vim/vim73/doc/tags md5 mismatch?  I would wonder if that file is one that undergoes changes in some routine fashion (i.e., just use of vim), or if the vim-base documentation package and help file navigation might be modified by some vim-plugin package.

 

cboldt ... the former, when you install say 'app-vim/eselect-syntax' or 'app-vim/gentoo-syntax' then additional $VIMRUNTIME/doc/*.txt are installed (actually, a sym-link but the result is the same). When vim is run the 'tags' for ':help' are updated (actually, as I remember it's the call to :help that runs :helptags). See :help helptags.

best ... khay

----------

## jonathan183

 *cboldt wrote:*   

> How did you find the /usr/share/vim/vim73/doc/tags md5 mismatch?  I would wonder if that file is one that undergoes changes in some routine fashion (i.e., just use of vim), or if the vim-base documentation package and help file navigation might be modified by some vim-plugin package.

 

The short answer is I run tripwire from a script and redirect output to files. The relevant parts of the script follow.

I redirect the tripwire scan output to a file 

```
sudo tripwire --check 1> tripwire-latest.txt 2> tripwire-latest-error.txt
```

Then when I want to investigate files which have issues I use the investigative bit of the script.

```
sudo equery -N -C check $(\
    equery -q -C b $(\
        cat tripwire-latest.txt | grep '"/' | tr -d '"' | tee tripwire_files_to_check_full_list.txt \
    ) | sort | uniq \
) 1> /dev/null 2> tripwire_files_to_check_failed_package_checks.txt

# and now we capture files which dont belong to a package ...
qfile b $( cat tripwire_files_to_check_full_list.txt | sort | uniq ) | awk '{ print $2 }' | tr -d '(' | tr -d ')' | sort | uniq > tripwire_files_to_check_owned_by_package.txt

# now lets list the differences
sort tripwire_files_to_check_full_list.txt tripwire_files_to_check_owned_by_package.txt | uniq -u > tripwire_files_to_check_no_package_owner.txt
```

I end up with three files containing lists: a list of all the files where tripwire identified issues (lines with / in the report), the redirected errors from the equery package check (md5sum failures etc.), and the difference between the full file list and the files which are owned by a package, i.e. the files with no package owner.

The vim file is updated by another vim package which does not claim to own it ... I'm not sure if it was gentoo-syntax or vim rather than vim-core which modifies the file.

Currently the full file list has 5815 lines, failed package checks has 17 lines, no package owner 2154 lines. For the no owner list, 1194 are in /lib/modules - since I installed ck kernel and a later gentoo-sources that's OK, then 299 for grub, 458 deleted files - leaving 203 to check of which 101 are symbolic links.
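The last step of the script can also be sketched with `comm`. `sort a b | uniq -u` computes a symmetric difference, so an unowned path that happens to appear twice in the full list would be dropped. The version below de-duplicates each list first and prints only paths present in the first list and absent from the second. `unowned_paths` is a hypothetical name.

```shell
#!/bin/sh
# unowned_paths FULL OWNED: print paths listed in FULL but not in OWNED.
# Both inputs are plain text files with one path per line.
unowned_paths() {
    full_sorted=$(mktemp) && owned_sorted=$(mktemp) || return 1
    # comm needs both inputs sorted in the same collation order
    LC_ALL=C sort -u "$1" > "$full_sorted"
    LC_ALL=C sort -u "$2" > "$owned_sorted"
    # -2 drops lines only in OWNED, -3 drops lines common to both
    LC_ALL=C comm -23 "$full_sorted" "$owned_sorted"
    rm -f "$full_sorted" "$owned_sorted"
}
```

With the file names from the script, that would be `unowned_paths tripwire_files_to_check_full_list.txt tripwire_files_to_check_owned_by_package.txt`.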

Ed: I switched to using read-only mounted root and separate partitions for /usr /var /tmp since I last updated the tripwire database which is likely to have identified a larger number of files changed (as inodes will have changed), but I don't think that should have any influence on whether files are owned by a package.

Last edited by jonathan183 on Mon Mar 03, 2014 10:53 pm; edited 1 time in total

----------

## cboldt

Just tinkering around with your script snippets, and will share some ways to speed things up.  Rather than start with the tripwire database to find out installed packages, this integrity check first finds all the installed packages by view of /var/db/pkg ...

```
equery -N -C check `ls -d /var/db/pkg/*/* | cut -d/ -f6` 2> failed-equery-check.txt
```

That code checks all the files for all the packages installed, and gives progress while running.  I'm sure that most of the files checked this way are -NOT- in the tripwire database, and as you found out, even this check does not disclose all of the files related to installed packages.

Rather than run `tripwire --check`, which gives variable results depending on which files fail the tripwire check, this snippet lists all the files in the tripwire database.  The tripwire database on the machine I am playing on right now has 8427 files in it.  I would guess fewer than 100 of these are directories.

```
twprint --print-dbfile | grep "^ /"
```

To find if a file is NOT owned by an installed package, first concatenate the contents of all the /var/db/pkg/*/*/CONTENTS files into one file.  The all-package-files.txt file, here, is about 23Mb in size ...

```
cat /var/db/pkg/*/*/CONTENTS > all-package-files.txt
```

Now, to find which of the files in the tripwire database is not owned by an installed package ...

```
# paths are assumed to contain no whitespace (the for loop splits on it)
for i in $(twprint --print-dbfile | grep "^ /"); do
# file name not found in any of CONTENTS (-F: match the name literally)
# file is not a directory
  if ! grep -qF " $i " all-package-files.txt && [ ! -d "$i" ]; then
# handling of softlinks
    if [ -h "$i" ]; then
# $j will be canonical name of linked file
# $j is searched for in all-package-files.txt
# if the linked file is found, skip report
      j=$(readlink -f "$i")
      grep -qF " $j " all-package-files.txt || echo "$i -> $j"
# handling of files that are not softlinks
    else echo "$i"
    fi
  fi
done
```

This method is quite a bit faster compared with `equery belongs`.

None of that resolves the issue of figuring out whether or not an unaccounted file belongs, and if it belongs, whether or not it is in good order.  I would think the snippets above give about the same list that your script does - a couple hundred mystery files.
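The loop above runs one `grep` per monitored file. A rough sketch of a single-pass alternative: extract every recorded path from the CONTENTS files once, then filter the candidate list against it with fixed-string, whole-line matching. `contents_paths` and `not_in_contents` are hypothetical names, the CONTENTS layout (`obj <path> <md5> <mtime>`, `sym <path> -> <target> <mtime>`, `dir <path>`) is assumed, and the symlink-target handling from the loop is left out.

```shell
#!/bin/sh
# contents_paths PKGDB: print every path recorded in any CONTENTS file.
contents_paths() {
    find "$1" -name CONTENTS -exec awk '
        # obj: strip the type, then the trailing md5 and mtime fields
        $1 == "obj" { sub(/^obj /, ""); sub(/ [^ ]+ [^ ]+$/, ""); print; next }
        # sym: strip the type, then everything from " -> " onwards
        $1 == "sym" { sub(/^sym /, ""); sub(/ -> .*$/, ""); print; next }
        # dir and anything else: strip only the type
        { sub(/^[^ ]+ /, ""); print }
    ' {} + | LC_ALL=C sort -u
}

# not_in_contents PKGDB: read one path per line on stdin and
# print the ones that no CONTENTS file records.
not_in_contents() {
    owned=$(mktemp) || return 1
    contents_paths "$1" > "$owned"
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -Fxv -f "$owned"
    rm -f "$owned"
}
```

This assumes each input line is exactly one path with no leading whitespace, so the `twprint` output would need its leading space stripped before being piped in. Directories are recorded in CONTENTS as `dir` lines, so they are matched here rather than skipped with a `-d` test.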

----------

## khayyam

 *cboldt wrote:*   

> 
> 
> ```
> equery -N -C check `ls -d /var/db/pkg/*/* | cut -d/ -f6` 2> failed-equery-check.txt
> ```
> ...

 

cboldt ... rather than search files in /var/db/pkg you could use app-portage/eix to provide the package list ... eg,

```
equery -N -C check $(eix -Ic --only-names) 2> failed-equery-check.txt
```

 *jonathan183 wrote:*   

> 
> 
> ```
> cat tripwire-latest.txt | grep '"/' | tr -d '"' | tee tripwire_files_to_check_full_list.txt
> ```
> ...

 

jonathan183 ... you should try to avoid using 'cat' like this (called UUoC, "useless use of cat"), it's an extra call and pipe, and as 'grep' can be provided the filename as input it's "useless".

```
grep '"/' tripwire-latest.txt | tr -d '"' | tee tripwire_files_to_check_full_list.txt
```

I love awk and so I'd probably do the following (though, I'm not entirely sure what the input is in the above) ...

```
awk '/\"\//{gsub(/\"/,""); print}' | tee tripwire_files_to_check_full_list.txt
```

HTH & best ... khay

----------

## steveL

 *khayyam wrote:*   

> 
> 
> ```
> grep '"/' tripwire-latest.txt | tr -d '"' | tee tripwire_files_to_check_full_list.txt
> ```
> ...

 

If sed can do it, sed is quicker, ime. 

```
sed -n '/"/{s///g;p;}' "$file"
```

----------

## cboldt

Good tip on use of eix to get a list of all installed packages.  It's fast too, only about a half second longer compared with the `ls | cut` combination.   On the eix command, `eix -I#` is enough, the -c switch isn't needed when the -# (--only-names) switch is used.

I think my concatenation of all the package-installed files into all-package-files.txt is unnecessary, too.  While `grep [string] /usr/portage/*/*/somefile` fails because the resulting argument is too long, `grep [string] /var/db/pkg/*/*/CONTENTS` works, at least it does on my systems.

I found it pretty easy to resolve the source of most of the mystery/orphan files, and the reason for the failed md5sum files too.  Many are configuration changes, some are added configuration files, some files are added or changed by normal operation of the associated programs.  I estimate the number of files that need additional inquiry (if one is determined to know where they came from) is 20-50.

Not that the study can be considered fully comprehensive.  Using the methods discussed, the list of mystery/orphan files is limited by the tripwire policy file, which decides the files to be monitored by tripwire; and the list of md5sum errors is limited to the list of files in /var/db/pkg/*/*/CONTENTS.  There are files outside of both of those sets, files in one list but not in the other, and files in both lists.

----------

## khayyam

 *steveL wrote:*   

>  *khayyam wrote:*   
> 
> ```
> awk '/\"\//{gsub(/\"/,""); print}' | tee tripwire_files_to_check_full_list.txt
> ```
> ...

 

steveL ... indeed, many cats to way a skin, but that's the least of my problems ... note I'm missing the input file in the above ... doh!

 *cboldt wrote:*   

> I think my concatenation of all the package-installed files into all-package-files.txt is unnecessary, too. While `grep [string] /usr/portage/*/*/somefile` fails because the resulting argument is too long, `grep [string] /var/db/pkg/*/*/CONTENTS` works, at least it does on my systems.

 

cboldt ... hmmmm ... why not use find in that case?

```
find /var/db/pkg/ -name "CONTENTS" -exec grep string {} +
```

Something similar came up in another thread where the md5sum was checked from CONTENTS ... maybe of some interest.

best ... khay

----------

## cboldt

`find -exec {} +` is a great tool.  The start of the script then would become:

```
for i in `twprint --print-dbfile | grep "^ /"`; do

   find /var/db/pkg/ -name "CONTENTS" -exec grep -q " $i " {} +

...
```

It's a little slower than `grep -q " $i " /var/db/pkg/*/*/CONTENTS`, a tenth of a second maybe, times however many files are in the tripwire database.  But the use of find would prevent failures in case the /var/db/pkg/*/*/CONTENTS argument got too long.

----------

## mv

 *cboldt wrote:*   

> Even with its built in list of exceptions (files expected to be unaccounted by portage), gcruft.pl threw up a list of some 250 files (not counting all the expected files in /usr/src and /lib/modules) that were not installed by a package, but do belong on my system.

 

OT, but since you mentioned it: You might want to have a look at find_cruft of the mv overlay. It has a default list of exceptions but allows for rather generic configuration files to modify/extend it to your purpose. The problem is that people consider different things as "cruft" (/usr/src or some files under /var are such examples). [advertisement] Also, I think that the configurable handling of symlinks (so that a file in /lib64/Bla.so but recorded in the database as /lib/Bla.so will not be considered as cruft if /lib is a symlink to /lib64) is a rather unique feature of find_cruft [/advertisement]

----------

## mv

 *cboldt wrote:*   

> Good tip on use of eix to get a list of all installed packages.

 

Be aware that eix -I is not always what you want (as described in the manpage): eix -I will show only those packages of its database which are simultaneously installed. If the package is no longer in the tree or in any installed overlay (or if the database is out-of-date) you will not see it (however, eix -tI will spit a warning in this case). Usually this is not a problem, but you should be aware of it.
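One way to sidestep that caveat is to read the installed set straight from /var/db/pkg, as the earlier `ls | cut` snippet did. A minimal sketch that keeps the category as well; `installed_packages` is a hypothetical name, and the two-level category/package-version layout of /var/db/pkg is assumed.

```shell
#!/bin/sh
# installed_packages [PKGDB]: print category/package-version, one per line,
# for every package directory found under PKGDB (default /var/db/pkg).
installed_packages() {
    db="${1:-/var/db/pkg}"
    for dir in "$db"/*/*/; do
        # an unmatched glob stays literal, so skip anything that is not a directory
        [ -d "$dir" ] || continue
        pkg=${dir%/}              # drop the trailing slash
        version=${pkg##*/}        # last component: package-version
        category=${pkg%/*}        # parent directory ...
        category=${category##*/}  # ... reduced to its last component
        printf '%s/%s\n' "$category" "$version"
    done
}
```

Whether the output can be handed to `equery check` unchanged has not been verified here; it is only meant to show that the list does not have to depend on the eix database being current.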

----------

## khayyam

 *cboldt wrote:*   

> 
> 
> ```
> for i in `twprint --print-dbfile | grep "^ /"`; do
> 
> ...

 

cboldt ... that will call find for each iteration of $i, it's probably a better idea to store the list in an array or some such ...

```
# bash: read one CONTENTS path per array element
mapfile -t filelist < <(find /var/db/pkg/ -name "CONTENTS")
for i in $(twprint --print-dbfile | grep "^ /") ; do
  grep -q " $i " "${filelist[@]}"
done
```

I say "probably" as its late for me and I'm about to collapse into oblivion :)

best ... khay

----------

## cboldt

I like the idea of putting the CONTENTS filenames into a variable, probably speeds things up a bit compared with building the list a few thousand times.  No need to use `find` to build the array, this assignment works:

```
filelist="/var/db/pkg/*/*/CONTENTS"
```

I don't see a need for an array variable, there is no reason to isolate any particular element or section of the string, and $filelist has the same contents as ${filelist[@]}.

I think `grep string ${filelist[@]}` bumps into exactly the same potential argument size/length limitation (that `find -exec {} +` invocation avoids) that `grep string /var/db/pkg/*/*/CONTENTS` bumps into.

Not that there is much risk, the expansion of /var/db/pkg/*/*/CONTENTS is safely clear of the argument length limit.

```
> ls -d /var/db/pkg/*/*/CONTENTS | wc

    952     952   45770

> getconf ARG_MAX

2097152
```

That's 952 packages, and an argument length of less than 46000 bytes.
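The same headroom check can be sketched as a small function, so a script could decide between a plain glob and `find -exec {} +`. `argv_bytes` is a hypothetical name. Because a shell function call does not go through exec, it can count an expansion that would be too long to pass to an external command.

```shell
#!/bin/sh
# argv_bytes ARG...: print the bytes the arguments would occupy in an exec call.
# The body runs in a subshell, "( ... )", so the LC_ALL change does not leak out.
argv_bytes() (
    LC_ALL=C    # make ${#arg} count bytes rather than characters
    total=0
    for arg in "$@"; do
        total=$((total + ${#arg} + 1))   # +1 for the terminating NUL
    done
    printf '%s\n' "$total"
)
```

A cautious caller might compare `argv_bytes /var/db/pkg/*/*/CONTENTS` against half of `getconf ARG_MAX`, since the environment counts against the same limit. For the listing above the result should correspond to the third column of the `wc` output, as each path is counted with one extra byte in both cases.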

Checking some fraction of /usr/portage is a whole 'nother situation!

```
> find /usr/portage -name '*.ebuild' | wc

  37605   37605 2230933
```

----------

## jonathan183

 *cboldt wrote:*   

> Just tinkering around with your script snippets, and will share some ways to speed things up.  Rather than start with the tripwire database to find out installed packages, this integrity check first finds all the installed packages by view of /var/db/pkg ...

 

What I thought I was doing was running a scan on the system for suspect files (files that have changed in some way since the last update of the tripwire database). Then using package manager information (via equery) to verify all the files that I could as being OK. This then left me with two sets of files to investigate - those which have md5sums that don't match expected, and those that the package manager doesn't think are owned by a package.

After checking everything on the system is OK I'd update the tripwire database so next time I run the script again I only have to look at things which have changed.

I currently have the pain of working out what has modified the few hundred files and symlinks (due to package updates, removals and switching to read only root with separate /var /tmp /usr since the last tripwire database update), but once I have done this and the tripwire database is updated I won't need to look at them again unless they change.

Your approach seems quite different, starting from a list of all files installed on the system, which then gets compared with the list of files in the tripwire database.

... by running the scan I let tripwire tell me what files have changed. Unless I am missing something you are treating files in the same way if they are in the tripwire database irrespective of whether tripwire thinks they are OK - which would either accept things that shouldn't be (which would be bad) or flag for checking things that don't need to be (which means I end up checking things time after time). I think it may be my understanding of the scripts which is incorrect, can you help me out with this ...

Regarding other suggestions re awk, grep and sed ... I'm sure you are right, thanks for pointing out the examples.

eix, I figured, was only going to access information I could obtain from equery and grep ... and to be honest I need the practice!

----------

## cboldt

 *Quote:*   

> This then left me with two sets of files to investigate - those which have md5sums that don't match expected, and those that the package manager doesn't think are owned by a package.

 

My impression was that you had three sets of "suspect" files.

Those that tripwire flagged.

Those that the package manager knows about, but md5sum mismatch.

Those that the package manager doesn't know about.

I was trying to expand the size of (at least) the third list, by finding all of the files that tripwire is checking that aren't known to the package manager; and maybe expand the size of the second list, by checking 100% of the files that the package manager knows about.

I also remarked that the third list might be expanded by using a cruft checker.

The list of files flagged by tripwire is what it is.  I think you said you had 17 of them.  Rather than check all of the files owned by the packages that threw those 17 errors, you could check just those 17 files, yes?  Once those flagged files were found to be (or made to be) acceptable, you could update the tripwire database using `tripwire --update`, which only acts on the files reported by `tripwire --check`.
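Checking just the flagged files could look roughly like the sketch below, which compares one file's current md5 with the value recorded for it in CONTENTS. `check_recorded_md5` is a hypothetical name, the `obj <path> <md5> <mtime>` line layout is assumed, and unlike `equery check` it looks only at the checksum.

```shell
#!/bin/sh
# check_recorded_md5 PKGDB PATH
# returns 0 if PATH's md5 matches a recorded obj entry,
#         1 if it is recorded but the md5 differs,
#         2 if no CONTENTS file records PATH as an obj.
check_recorded_md5() {
    recorded=$(WANT="$2" find "$1" -name CONTENTS -exec awk '
        $1 == "obj" {
            path = $0
            sub(/^obj /, "", path)            # strip the type
            sub(/ [^ ]+ [^ ]+$/, "", path)    # strip md5 and mtime
            if (path == ENVIRON["WANT"]) print $(NF - 1)
        }
    ' {} +)
    [ -n "$recorded" ] || return 2
    actual=$(md5sum < "$2" | cut -d' ' -f1)
    # a path may be recorded by more than one package; accept any matching entry
    printf '%s\n' "$recorded" | grep -qxF "$actual"
}
```

Looping over the flagged paths and printing those for which the function returns 1 or 2 would reproduce lists (the md5 mismatches and the unowned files) without re-checking every file of every affected package.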

You also mentioned that your "full file list" has 5815 lines.  That's a mighty short "full file list," and one that appears to originate from tripwire.  In comparison, here is the way that I used to count the "obj" entries in installed packages:

```
> cat /var/db/pkg/*/*/CONTENTS | grep "obj " | wc

194155  776985 20873849
```

That's some 194 thousand files, and that list is incomplete, compared with all the files that actually exist!

The package/md5sum check is different from the tripwire check, meaning that it is looking at a different group of files or different "full file list."  There is probably some overlap; there are files in the tripwire check that are missing from the package/md5sum check; and there are files in the package/md5sum check that are missing from the tripwire check.

I see the tripwire check and the package/md5sum check as independent, correlated only where both checks are looking at the same files.  My objective was to make the list of suspect files as big as possible.  Most of the suspect files are easily cleared, you will know at least why they have been changed, if not how.  There will be a few true mystery files (lots of them around gcc and gcc-config) to provide research entertainment   :Shocked: 

----------

## cboldt

Useless use of cat alert!  Dang.

`grep "obj " /var/db/pkg/*/*/CONTENTS | wc`

----------

## jonathan183

 *cboldt wrote:*   

> My impression was that you had three sets of "suspect" files.
> 
> Those that tripwire flagged.
> 
> Those that the package manager knows about, but md5sum mismatch.
> ...

 

I had suspect files identified by tripwire (the full file list, which is just what I extracted from the tripwire report). I was trying to use the package manager information to verify the majority of them, which resulted in two file lists needing further investigation:-

(a) those the package manager knows about but says they don't match expected information e.g. md5sum

(b) those the package manager does not think are owned by packages

Tripwire identifies 5815 items, of which 20 failed package checks (a) above, and 196 are not owned by packages (b) above.

I gave an example of (a) in the original post - vim, an example of (b) was /lib/cpp

Your script helps reduce list (b) by allowing me to eliminate directories and symlinks to package owned files   :Smile: 

Thanks for your explanation ... it certainly gives me a few things to think about   :Cool: 

----------

## jonathan183

 *khayyam wrote:*   

> jonathan183 ... you should try to avoid using 'cat' like this (called UUoC, "useless use of cat"), its an extra call and pipe, and as 'grep' can be provided the filename as input its "useless".

  guilty as charged ... and probably a few others on the awards page   :Wink: 

 *Quote:*   

> note I'm missing the input file in the above ... doh!

 

sample from file - it's just a redirect of tripwire output

```
Parsing policy file: /etc/tripwire/tw.pol

*** Processing Unix File System ***

Performing integrity check...

Wrote report file: /var/lib/tripwire/report/Desktop-PC-20140301-145726.twr

Open Source Tripwire(R) 2.4.2.2 Integrity Check Report

...

-------------------------------------------------------------------------------

Rule Name: Tripwire CFG and Data (/etc/tripwire/tw.cfg)

Severity Level: 100

-------------------------------------------------------------------------------

Modified:

"/etc/tripwire/tw.cfg"

-------------------------------------------------------------------------------

Rule Name: Tripwire CFG and Data (/etc/tripwire/tw.pol)

Severity Level: 100

-------------------------------------------------------------------------------

Modified:

"/etc/tripwire/tw.pol"

```

 *Quote:*   

> Something similar came up in another thread where the md5sum was checked from CONTENTS ... maybe of some interest. 

 

I had been copying /etc from one system to another and then modifying it. Identifying which files I actually modified, as suggested in the thread you linked to, is a better solution, thanks.

----------

## jonathan183

 *mv wrote:*   

> You might want to have a look at find_cruft of the mv overlay. It has a default list of exceptions but allows for rather generic configuration files to modify/extend it to your purpose. The problem is that people consider different things as "cruft" (/usr/src or some files under /var are such examples). [advertisement] Also, I think that the configurable handling of symlinks (so that a file in /lib64/Bla.so but recorded in the database as /lib/Bla.so will not be considered as cruft if /lib is a symlink to /lib64) is a rather unique feature of find_cruft [/advertisement]

 

interesting ... I'll take a look thanks   :Cool: 

----------

## jonathan183

OK this journey has been good for me   :Cool: 

I have come to the conclusion there is no automated tool for determining every file that belongs to a package. The data, as far as it goes, is contained in the /var/db/pkg tree. After eliminating the files (obj), directories (dir) and symbolic links (sym) recorded there, there are still quite a few files/directories/symlinks left as suspects.

It is possible to split things into deleted files/directories/symlinks, symbolic links, files in /lib/modules, and character special. This results in a number of suspect files which must be manually investigated and eliminated.

For me the sensible way of dealing with this is storing known good file lists together with md5sum values, symbolic links with targets etc so they can be compared with current lists to allow automated removal of suspect items not verified by the package manager.

Differences between previous and current scans can be used to focus effort on required areas to investigate, and will be needed to avoid multiple investigation of the same files each time a scan is completed.
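A minimal sketch of that known-good list idea: write one manifest line per path (type, md5 or link target, path) and compare a fresh manifest with the stored one. `snapshot` and `changed_since` are hypothetical names, and the line format is an illustration only; link targets or paths containing spaces would make it ambiguous.

```shell
#!/bin/sh
# snapshot LISTFILE: read one path per line from LISTFILE and print a sorted
# manifest with lines of the form "<type> <md5|target|-> <path>".
snapshot() {
    while IFS= read -r path; do
        if [ -L "$path" ]; then
            printf 'sym %s %s\n' "$(readlink "$path")" "$path"
        elif [ -f "$path" ]; then
            printf 'obj %s %s\n' "$(md5sum < "$path" | cut -d' ' -f1)" "$path"
        elif [ -e "$path" ]; then
            printf 'other - %s\n' "$path"
        else
            printf 'missing - %s\n' "$path"
        fi
    done < "$1" | LC_ALL=C sort
}

# changed_since OLD NEW: print manifest lines that appear only in NEW,
# i.e. entries that are new or differ from the known-good manifest OLD.
changed_since() {
    LC_ALL=C comm -13 "$1" "$2"
}
```

Only the lines printed by `changed_since` would need investigating after a scan; once they are cleared, the new manifest replaces the old one.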

The package management is kind of missing an http://boingboing.net/2007/08/20/flowchart-is-it-fcke.html category  :Rolling Eyes: 

----------

